Overcoming Disincentives to Process Understanding in the Pharmaceutical CMC Environment
Variability reduction through process understanding is a common element shared by quality and productivity improvement objectives. Many companies are not obtaining the benefits of a variability reduction system because of real or perceived regulatory hurdles. Several current practices of regulatory quality systems penalize a company in its attempt to gain better process understanding through the collection of more data. These include the use of individual results instead of averages as batch estimates, the misuse of 3σ control chart limits as acceptance criteria, the misinterpretation of compendial tests, and fixed acceptance criteria regardless of sample size. Some of these concerns have been published:
USP compendial limits are a go/no-go. The product either passes or fails. If a product sample fails, anywhere, any time, an investigation and corrective action may be required by the company or by FDA, or by both....These procedures should not be confused with statistical sampling plans ... extrapolations of results to larger populations are neither specified nor proscribed by the compendia (3).
Quality control procedures, including any final sampling and acceptance plans, need to be designed to provide a reasonable degree of assurance that any specimen from the batch in question, if tested by the USP procedure would comply (4).
Today's reality is a compliance-driven focus rather than a process understanding focus. For example, there is no established procedure to modify release standards to account for increased sample size. Additional data collected to ensure and improve process and product quality may increase the probability of rejecting an acceptable batch. This could potentially result in a delay to a clinical study or product approval, or it may result in a threat to product supply continuity for an ongoing clinical study or to a marketed product. All of these results can adversely affect a patient's access to a vital medicine. This reality appears to be strongly at odds with recent initiatives. The current system favors minimal testing strategies and presents a barrier to creating greater process understanding.
This article discusses some issues surrounding the disconnect between specifications and sample sizes and suggests defining acceptable quality as referring to the true batch parameter (e.g., average, relative standard deviation). Moreover, the statistical phenomenon of multiplicity is defined and its effects in the chemistry, manufacturing, and control (CMC) environment are explored. Situations are described in which this phenomenon can act as a testing disincentive and thus a barrier for the implementation of new and promising US Food and Drug Administration initiatives. Finally, remedies in thinking and procedures to avoid these undesirable effects are suggested.
Multiplicity and error in decision-making
Many statistical decisions are based upon a counterproof of sorts: One makes an assumption, collects data, and determines the probability of obtaining the results actually collected (given the assumption). If this probability is very low, either the results or the assumption must be wrong. Because one has confidence in the experiment, it can be concluded that the assumption was incorrect (because an "unlikely" outcome was obtained).
Typically, to simplify the task, acceptance limits for some computed value (e.g., the sample average, sample standard deviation, or number of results outside 75–125%) are determined in advance, and the outcome of the test is determined by comparing the observed computed value with the limits. If the observation is outside the limits, then the assumption is rejected, otherwise not. The limits are determined so they correspond to a selected low probability of rejecting a correct assumption. The probability used for the decision varies based on the circumstances; however, often probabilities of 0.05, 0.01, or 0.001 (5, 1, or 0.1%) are used. This probability, often referred to as the significance level, also can be thought of as the risk of drawing the wrong conclusion by rejecting an assumption that actually is correct. In quality control for batch release, this is the producer's risk and represents a risk of failing a batch with acceptable quality. As long as there is variation in the data (e.g., as a result of manufacturing or measurement process variability), there will always be a non-zero risk of rejecting a correct assumption.
Although this type of error cannot be completely removed, widening the acceptance limits will reduce it. This change, however, has the undesirable effect of increasing the risk of accepting a batch of inferior quality. In quality control for batch release, this is called the consumer's risk. Maintaining a low consumer's risk is a key goal of both regulators and the pharmaceutical industry. A major challenge in decision-making is to find a suitable balance between these two risks. One of the advantages of increased testing is that it allows for the simultaneous reduction of both types of risks, provided acceptance criteria are well designed.
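The trade-off between the two risks, and the way additional data can lower both, can be sketched numerically. The following Python illustration is not taken from the cited literature. It assumes normally distributed results, a decision based on the sample average, and hypothetical values: a target of 100, a standard deviation of 3, acceptance limits of 95–105 applied to the average, and an unacceptable true mean of 107.

```python
from statistics import NormalDist

def risks_for_average(n, sigma, lower, upper, good_mean, bad_mean):
    """Return (producer's risk, consumer's risk) for a test on the average of n results.

    Assumes independent, normally distributed results with standard deviation sigma.
    """
    se = sigma / n ** 0.5                      # standard error of the average
    good = NormalDist(good_mean, se)           # sampling distribution, acceptable batch
    bad = NormalDist(bad_mean, se)             # sampling distribution, unacceptable batch
    producer_risk = 1.0 - (good.cdf(upper) - good.cdf(lower))  # reject a good batch
    consumer_risk = bad.cdf(upper) - bad.cdf(lower)            # accept a bad batch
    return producer_risk, consumer_risk

# Hypothetical values chosen only for illustration.
for n in (1, 3, 10):
    producer, consumer = risks_for_average(n, 3.0, 95.0, 105.0, 100.0, 107.0)
    print(f"n={n:2d}  producer's risk={producer:.4f}  consumer's risk={consumer:.4f}")
```

Under these assumptions both risks shrink as n grows, because the decision is tied to an increasingly precise estimate of the batch mean rather than to individual results.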
The final decision about appropriate batch quality is made by using different tests, with the proper sample sizes and acceptance criterion, to protect both the consumer and producer. When these tests are performed more frequently than prescribed without adjusting the decision rule, the producer's risk increases. This phenomenon is called multiplicity. Moreover, this increase in producer's risk often occurs without any significant associated reduction of the consumer's risk. Further, if the producer's risk is too large, the manufacturing costs cannot be recovered, and a potentially beneficial drug product may not be made available to patients.
Contemporary examples of multiplicity issues
This section illustrates typical situations in which multiplicity acts as a disincentive to collecting more data and thereby causes poor decision-making.
Release testing. At batch release, a range of properties judged important for product quality is evaluated to verify acceptable batch performance. It is not uncommon to have 10–20 properties to test, some of which may be associated with multiple acceptance criteria. For some tests, replicates in addition to final (reportable) results may be compared with criteria. In total, the specification might contain more than 30 comparisons with acceptance criteria. A failure of any one of the 30 criteria will result in a failure of the batch being tested. In this case, if each test has a 1% risk of falsely exceeding its acceptance criteria for an acceptable batch, then the multiplicity phenomenon results in as much as a 26% risk of rejecting an acceptable batch as a result of at least one criterion failure (1 – [1 – 0.01]^30 = 0.26). As a result, manufacturers are motivated to reduce the number of properties being studied or the amount of testing supporting a given property.
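The arithmetic behind the 26% figure can be reproduced with a short Python sketch. It assumes the comparisons are statistically independent, which is the case that gives the "as much as" value quoted above. The function name is illustrative only.

```python
def risk_of_any_false_failure(per_test_risk, n_tests):
    """Probability that at least one of n_tests independent comparisons fails
    by chance for an acceptable batch."""
    return 1.0 - (1.0 - per_test_risk) ** n_tests

# 1% false-failure risk per comparison, growing number of comparisons.
for n_tests in (1, 10, 20, 30):
    risk = risk_of_any_false_failure(0.01, n_tests)
    print(f"{n_tests:2d} comparisons: {risk:.2f}")
```

With 30 comparisons the function returns approximately 0.26, matching the value in the text.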
Repeated release is another multiplicity issue related to release testing. For certain product types and territories, the manufacturer's release of a batch must be followed by additional regulatory testing and re-release of the same material. This results in multiple tests when the material additionally must be tested for importation into a region, then possibly tested a third time by local authorities. The situation is exacerbated when the same batch is exported to several countries or territories.
Stability testing. Consider a solution formulation for which the specification requires that the pH be 6.5–7.5. Rather than being based upon clinical fitness for use, these limits were established on the basis of some development data after computing process average ±3σ. This means, assuming data follow a normal distribution, that about 3 out of 1000 results are expected to be outside the specification limits because of pure chance. Now assume that the true batch pH is 7.0 and nearly all observed variability is a result of measurement error. Ten units are tested for a batch at each of seven stability time points. Even if the pH is stable during storage, the risk that at least one unit fails to comply with the specification at some time point is at most 1 – (1 – 0.003)^70 = 0.19. Clearly, a 19% risk of observing a false out-of-specification (OOS) result for an acceptable batch on stability is intolerable. This failure to conform to specifications also can lead to product recalls of acceptable material. When false OOS signals are obtained as a result of multiplicity, many resources are wasted in the subsequent investigation looking for a root cause that does not exist.
The overall risk of randomly failing somewhere in the stability program increases as more storage conditions and additional annual batches are studied. If multiplicity effects are not taken into account to manage this risk, then manufacturers could be led to a risk-managing approach of reducing the number of batches, conditions, time points, and tested properties as much as possible. Clearly this is at odds with one of the primary objectives of stability testing: to increase the knowledge of the product.
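The growth of this risk with the size of the stability program can be sketched in Python. The example assumes independent, normally distributed results and specification limits set at exactly ±3σ around the true value. The numbers of storage conditions shown are hypothetical. Note that the exact two-sided normal tail beyond 3σ is about 0.0027 rather than the rounded 0.003, so the computed risk for 70 results is about 17%, slightly below the upper figure of 19%.

```python
from statistics import NormalDist

def risk_of_false_oos(n_results, k_sigma=3.0):
    """Probability that at least one of n_results falls outside +/- k_sigma limits
    when the batch is on target and results are independent and normal."""
    p_single = 2.0 * (1.0 - NormalDist().cdf(k_sigma))  # two-sided tail, about 0.0027 for 3 sigma
    return 1.0 - (1.0 - p_single) ** n_results

units_per_time_point = 10   # from the pH example
time_points = 7             # from the pH example
for storage_conditions in (1, 2, 3):  # hypothetical program sizes
    n = units_per_time_point * time_points * storage_conditions
    print(f"{storage_conditions} condition(s), {n} results: "
          f"risk of a false OOS = {risk_of_false_oos(n):.2f}")
```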
Shelf-life estimation. The recommended approach to shelf life estimation includes a statistical test of batch slope homogeneity (i.e., determining whether there is evidence of a difference in the rate of change over time across batches) (5). It is well known that increasing either the number of batches, the amount of data per batch, or the precision of the measurement process will increase the probability of rejecting the hypothesis of batch slope homogeneity (6). When estimating product shelf life from a group of nonhomogeneous batch slopes, a worst-case strategy is prescribed such that the shelf life for future batches is based on the batch in the group with the shortest shelf life (5). Even when batch slopes are similar, as the amount or precision of data included in a stability program increases, the conditions prescribed to allow pooling of batches are usually less likely to be fulfilled. Therefore, a shorter shelf life is typically obtained. Thus, even though a larger number of batches could lead to a better understanding of product stability, there is a built-in disincentive for a developer to include more than the minimum number of batches (three) for shelf-life estimation.
Process analytical technology. One of the potential applications of PAT is comprehensive real-time evaluation of tablet content uniformity (nondestructive testing of hundreds or thousands of tablets). A challenge in such applications is to properly define the acceptance criteria appropriate for such a large sample. Traditionally, a sample of 10–30 tablets is assessed against criteria specified in the pharmacopeia for uniformity of dosage units (7). These pharmacopeial criteria are not directly applicable to large sample sizes, because they do not take into account the multiplicity risks associated with large samples. An acceptable batch, which passes the pharmacopeia test a high percentage of the time, may fail to have all PAT samples meet the pharmacopeia acceptance criteria. This failure may be attributed to the fact that the range of observed individual values increases with sample size or because the uncertainty in a computed result is based on a sample of different size than that for which the acceptance limit was developed. This issue has been highlighted as a deterrent for extended applications of PAT in this area (8).
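The effect of sample size on individual-value criteria can be illustrated with a simplified Python sketch. This is not the pharmacopeial test itself: it applies only a zero-tolerance rule that every unit must fall within 75–125% of label claim, and it assumes normally distributed content with a hypothetical batch mean of 100% and standard deviation of 6%.

```python
from statistics import NormalDist

def prob_any_unit_outside(n_units, mean, sd, lower=75.0, upper=125.0):
    """Probability that at least one of n_units independent results lies outside
    [lower, upper], assuming normally distributed unit content."""
    dist = NormalDist(mean, sd)
    p_inside = dist.cdf(upper) - dist.cdf(lower)
    return 1.0 - p_inside ** n_units

# Same hypothetical batch quality, increasing sample size.
for n_units in (30, 1000, 10000):
    p = prob_any_unit_outside(n_units, mean=100.0, sd=6.0)
    print(f"n={n_units:5d}: probability of at least one unit outside 75-125% = {p:.4f}")
```

The batch quality is identical in every row. Only the number of units examined changes, yet the chance of observing a nonconforming unit rises from well under 1% to more than 20%.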
Validation and investigations. Frequently in product development and other experimental investigations, including validation and optimization, a scientist may be interested in detecting smaller differences than would be of interest in routine release and stability testing, where the focus is to verify satisfactory product properties. A statistically established approach would be to collect more data than the standard release or stability test requires, possibly following an alternative sampling plan, to increase understanding. Because of multiplicity, however, the risk of obtaining individual results outside limits established for a smaller sample size will then increase as a result of common analytical and sampling variation. Because observing such aberrant values will result in "failed" validation, "verified" OOS, or a reduced design space, the consequence is that the size of the testing program is primarily driven by a risk-to-fail assessment rather than the need to obtain more reliable conclusions.
The impact of multiplicity effects in data-based decision making is well known. Some solutions have been devised in other pharmaceutical disciplines such as the clinical and safety areas as well as genomics and microarray screening (9–11).
Eight fundamental principles for improved decision-making
Eight fundamental principles encourage appropriate planning and evaluation of data to reduce the effect of multiplicity and to improve decision-making in the CMC environment.
Recognize that an observed result is only an estimate. Key in-vitro batch parameters are, or should be, associated with clinical and safety properties or characteristics, including potency and other important quality attributes (e.g., dissolution and preservative levels) as well as the uniformity of the active ingredient among individual dosage units. The fact is that one can never measure the true value of a parameter without error. This error, in conjunction with the multiplicity phenomenon, eventually will result in observing units or samples outside limits. Understanding that the observed result is only an estimate of, and not exactly equal to, the true value is of utmost importance.
Focus on the reliable estimation of batch parameters. Adequately addressing today's dilemma requires a fundamental change in philosophy with respect to the goal of data generation. From a statistical point of view, when fewer data are generated about a parameter, more uncertainty remains. By changing our philosophy to one of gathering the appropriate amount of information, there is a strong incentive to increase data generation. More data permit a more precise estimation of the true value of a parameter. The reduced uncertainty in the knowledge of the underlying batch parameter allows a more informed decision about the acceptability of the level of the associated batch parameter. When possible, the acceptable range of the underlying batch parameter should be determined on the basis of fitness-for-use. Furthermore, the requirements on the results collected should be developed to ensure, with high likelihood, that the batch parameter is within the defined acceptable range.
The acceptance question hereby changes from a simple "Is the observation a go/no-go decision?" to one of "Do I have sufficient confidence that the true, unknown batch parameter is within acceptable limits?" Increased knowledge of a batch parameter obtained from additional sampling and testing provides a clearer understanding of true batch quality and is no longer considered a risk. The risk of failing an acceptance criterion in this new scenario would now be linked to the true batch quality and appropriately tied to the sample size evaluated. This fundamental philosophical change is necessary to make the FDA vision of improved process understanding a successful and sustainable reality for years to come.
Understand the role and function of various types of limits. One must understand the differences among various limits, their appropriate use, and the risk for and consequences of mixing different concepts. One concern is the growing use of data-driven 3σ control chart limits as specification acceptance criteria rather than establishing specification limits that are based upon fitness-for-use, thereby ensuring proper performance of the batch.
According to Tougas, "The approach to setting acceptance criteria for end-product tests is based on a perception of what the process is capable of delivering, not on what limits are required to ensure performance (safety and efficacy). The net result is a tendency toward excessive tests and limits that result in excessive producer's risk (i.e., failing test on a batch with acceptable quality). This results in significant resources expended on activities that do not contribute to the quality of pharmaceuticals" (12).
The distinction between 3σ limits and acceptance criteria is important, because the appropriate consequences of not meeting the two types of limits are very different. The data-driven 3σ limits (also referred to as control chart limits) are tools to alert to a change or drift in a manufacturing process. Therefore, the appropriate action is to investigate a potential special cause and, when appropriate, adjust the process back to its optimum or remove the special cause (13). On the other hand, acceptance criteria, if correctly defined, are used to ensure batch suitability tied to fitness for use. When regulatory agencies require a manufacturer to adopt control-chart limits as acceptance criteria, they force the interpretation of alarm signals as criteria to determine batch disposition. Ultimately, this increases the risk of rejecting acceptable batches. Manufacturers need a margin between alarm signals and acceptance criteria to institute a fully effective risk-based quality system.
Recognize that development and end-product testing have a common goal. Development and end-product testing have a common goal: to ensure a manufacturing system produces product with characteristics within the safety and efficacy requirements. Testing must be viewed not as a risky activity required for compliance but as part of a scientific decision-making process.
Development activities support this goal by identifying the optimal settings and allowable variation around these (i.e., the design space) for various manufacturing process parameters. In-process control tests are developed to ensure that the process will remain robust to variation of incoming materials and manufacturing conditions. End-product test data can be used in a control system to holistically monitor and guarantee system performance and to control the consumer's risk.
Link sample size and acceptance criteria to manage risk. If sample size is not properly addressed in acceptance criteria, the increased information carries with it increased risk. The quality of information is directly related to the amount of data and the precision of the measurement tool. Acceptance criteria should acknowledge the risks to the manufacturer and to the customer. Consequently, they should be adjusted whenever there is a change to the amount of data being collected or the precision of the measurement device.
End-product testing is a type of acceptance sampling. Acceptance sampling is an established quality control tool with a firm statistical basis (14). A key premise of acceptance sampling is that the risk of not meeting an acceptance criterion should depend upon the batch quality and not be based upon the sample size evaluated. With traditional statistical acceptance sampling plans, the acceptance criteria vary as a function of sample size or of the number of tests conducted on a given batch. This is done to maintain established producer- and consumer-risk levels. Similarly, the acceptance limits for end-product tests should depend on sample size. Such situations may arise when nontraditional methods are applied to batch monitoring such as PAT. These methods may use sample sizes that are much larger than those of traditional release tests, thus providing much better estimates of true batch characteristics. For example, a statistically based approach that better characterizes the batch quality, while adequately controlling the risks and allowing for varying sample sizes, has been proposed (15).
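How an acceptance criterion can vary with sample size to hold risk levels steady can be sketched with a simple attributes plan in Python. This is a generic binomial illustration, not the procedure of any cited standard or proposal. The quality levels (1% nonconforming as acceptable, 5% as unacceptable) and the 5% producer's risk ceiling are hypothetical.

```python
from math import comb

def prob_accept(n, c, p):
    """Probability of finding at most c nonconforming units in a sample of n
    when the true nonconforming fraction is p (binomial model)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c + 1))

def smallest_acceptance_number(n, p_good, max_producer_risk):
    """Smallest acceptance number c that keeps the risk of rejecting a batch
    of acceptable quality p_good at or below max_producer_risk."""
    for c in range(n + 1):
        if 1.0 - prob_accept(n, c, p_good) <= max_producer_risk:
            return c
    return n

# Hypothetical quality levels: 1% nonconforming is acceptable, 5% is not.
for n in (30, 100, 300):
    c = smallest_acceptance_number(n, p_good=0.01, max_producer_risk=0.05)
    consumer_risk = prob_accept(n, c, 0.05)
    print(f"n={n:3d}: accept if at most {c} nonconforming; "
          f"consumer's risk at 5% nonconforming = {consumer_risk:.3f}")
```

In this sketch the acceptance number rises with the sample size so that the producer's risk stays bounded, while the larger samples reduce the consumer's risk.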
Recognize the value of additional testing. In many cases, data collected for continuous product development, improvement, and investigation should not be subjected to the same acceptance criteria applied to end-product testing. Additional data improve the precision in estimating the true level of a parameter, resulting in more informed decisions. Testing larger numbers of samples generally provides additional knowledge of batch parameters such as average and RSD, or measures of stratification and trends. However, multiplicity leads to a penalty for companies that attempt to use larger sample sizes for testing that must meet end-product testing criteria.
As part of an OOS investigation, it may be desirable to obtain additional test results from the batch in question or from other batches made using the process in an attempt to gain further insight about the batch. Furthermore, as part of a larger continuous process-improvement effort, additional data collection may be considered an extension of process-development activities. Additional data are typically required to make a better informed risk-based decision about a batch or an overall process.
Although it has sometimes been said that additional data may be used to "test into compliance," manufacturers should not be discouraged from acquiring such additional data in the interest of continuous process understanding. The knowledge obtained from these data may lead to improved processes with increased quality of produced batches as the ultimate goal.
Use averages where appropriate. It is important to understand that data-driven decisions surrounding traditional CMC business objectives such as stability testing, validation, and analytical investigations can be improved by the appropriate use of averaging rather than comparing each replicate with a specification. Additional test results should be used to obtain better estimates of the true batch characteristics by averaging or other data summarizing techniques. Although each individual test result estimates the true batch potency, the average of the results is a better estimator because the uncertainty in the estimate is reduced as the number of samples increases. Therefore, the average should be used to assess the batch's fitness for use and is the most appropriate quantity to compare against the specification. Thus, when the goal is estimation of batch characteristics, averaging will facilitate more informed and risk-based assessments of the true value for an analytical property.
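The benefit of averaging can be illustrated in Python with the pH example used earlier (limits of 6.5–7.5 set at ±3σ around a true value of 7.0, ten replicates). The sketch assumes independent, normally distributed measurement error. The function name is illustrative only.

```python
from statistics import NormalDist

def compare_individuals_and_average(n, lower, upper, true_value, sigma):
    """Return (risk that any of n individual results is outside the limits,
    risk that the average of the n results is outside the limits)."""
    single = NormalDist(true_value, sigma)
    p_single_out = 1.0 - (single.cdf(upper) - single.cdf(lower))
    p_any_individual_out = 1.0 - (1.0 - p_single_out) ** n
    average = NormalDist(true_value, sigma / n ** 0.5)  # standard error shrinks with sqrt(n)
    p_average_out = 1.0 - (average.cdf(upper) - average.cdf(lower))
    return p_any_individual_out, p_average_out

sigma = (7.5 - 7.0) / 3.0   # limits were set at the process average +/- 3 sigma
p_any, p_avg = compare_individuals_and_average(10, 6.5, 7.5, 7.0, sigma)
print(f"any of 10 individual results outside limits: {p_any:.4f}")
print(f"average of 10 results outside limits:        {p_avg:.2e}")
```

For an on-target batch, comparing each replicate with the specification gives a false-failure risk of a few percent, whereas the same ten results summarized as an average give a risk that is negligible.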
Make effective use of data through proper statistical analysis. Inappropriate interpretation of the data can result in increased risk to both manufacturers and customers. A proper statistical analysis of data relies on a clear statement of the objective and a statistical design that addresses the quality goal with appropriate attention to risk.
Thus for stability testing, if the goal is to estimate a characteristic of the batch throughout the shelf life of the product, regression analysis can be performed to estimate the product characteristic with better precision than looking at individual measurements, while the analysis can also be used to forecast that characteristic and predict potential product failure. The predicted value from a regression analysis uses the power of all measurements made on the batch, thus providing a superior estimate of the batch average and its performance over time.
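The precision gain from regression can be shown without any measured data. The Python sketch below assumes a straight-line change over time and independent measurement error with constant standard deviation. The time points (0, 3, 6, 9, 12, 18, and 24 months) are an assumed schedule chosen for illustration. It computes the standard error of the fitted value at each time point relative to the standard deviation of a single measurement.

```python
def relative_se_of_fitted_value(time_points, t0):
    """Standard error of the fitted regression line at time t0, expressed as a
    multiple of the standard deviation of one measurement (simple linear
    regression with one result per time point)."""
    n = len(time_points)
    t_bar = sum(time_points) / n
    sxx = sum((t - t_bar) ** 2 for t in time_points)
    return (1.0 / n + (t0 - t_bar) ** 2 / sxx) ** 0.5

months = [0, 3, 6, 9, 12, 18, 24]   # assumed stability schedule
for t in months:
    ratio = relative_se_of_fitted_value(months, t)
    print(f"month {t:2d}: SE of fitted value = {ratio:.2f} x SD of a single result")
```

Every ratio is below 1, so under these assumptions the fitted value at each time point is a more precise estimate than the single measurement taken there, including at the last time point where the gain is smallest.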
In a product optimization, a process validation, or an OOS investigation, similar issues are present. A proper analysis of data stemming from a multifactor experimental design can facilitate the acquisition of key process information.
Using statistical analysis for addressing multiplicity is standard practice in the clinical and preclinical testing of a product. Well-known examples include the definition of a primary end-point for clinical trials and the multiplicity correction of p-values used by safety toxicology assessment. Similar adjustments can be made in the CMC environment. For example, statistical concepts could be used to mitigate the risk of incorrectly concluding nonhomogeneous slopes among stability batches in shelf-life estimation to obtain a more accurate estimate of shelf life (16, 17). In addition, multiplicity adjustment might be performed in setting acceptance criteria for multiple release parameters or in establishing a shelf-life specification, where multiple measurements will be made on a batch throughout its shelf life.
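One simple form of multiplicity adjustment can be sketched in Python. The Šidák and Bonferroni corrections shown here are standard statistical techniques offered as an illustration, not the specific methods of the cited references. The Šidák form assumes independent comparisons, while the Bonferroni form is slightly more conservative and does not require independence. The target of a 1% overall risk across 30 comparisons is hypothetical.

```python
def sidak_per_test_risk(overall_risk, n_tests):
    """Per-comparison risk that yields the desired overall risk of at least one
    false failure across n_tests independent comparisons."""
    return 1.0 - (1.0 - overall_risk) ** (1.0 / n_tests)

def bonferroni_per_test_risk(overall_risk, n_tests):
    """Per-comparison risk that bounds the overall risk without assuming independence."""
    return overall_risk / n_tests

overall_target = 0.01   # hypothetical overall producer's risk
n_tests = 30            # number of comparisons, as in the release-testing example
per_test = sidak_per_test_risk(overall_target, n_tests)
achieved = 1.0 - (1.0 - per_test) ** n_tests
print(f"Sidak per-comparison risk:      {per_test:.6f}")
print(f"Bonferroni per-comparison risk: {bonferroni_per_test_risk(overall_target, n_tests):.6f}")
print(f"Overall risk achieved (Sidak):  {achieved:.4f}")
```

Acceptance limits for each comparison would then be set to correspond to the smaller per-comparison risk, so that the overall risk of rejecting an acceptable batch stays at the intended level.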
As industry begins to embrace the new quality paradigm for the 21st century introduced by FDA, manufacturers will obtain more data. Additional samples will be taken for current tests, and additional parameters will be tested to obtain a better understanding of products and processes. When large amounts of additional data are gathered, the underlying multiplicity issue creates a problem for the current system of specifications with zero tolerance for results outside limits (i.e., a go/no-go approach). This makes it increasingly difficult to focus on the true underlying quality of batches. Acting in the traditional manner will cause many OOS investigations to look for special causes when none exist. The degree and frequency of nonconforming results must be considered as part of the evaluation of the larger sets of data gathered in this new paradigm. This should not be at the expense of understanding the nature of the true batch quality, but rather requires improved estimates of batch parameters. Risk assessments are a part of judicious handling of large data sets, and extended use of statistical techniques and philosophy plays a key role in this. There are real costs associated with poor risk management. This is as true when not reacting to real signals as it is when overreacting to false signals.
The emphasis in the new paradigm must change from individual test results to improved estimates of the true batch parameters needed to support quality decisions. The conformance of a singlet determination under such circumstances is neither necessary nor sufficient to characterize whether a batch is poor or failing. Indeed, situations will exist in which batch parameters indicate that processes and products are satisfactory, whereas some individual singlet determinations may not comply on their own. Current requirements outlined in ICH Q1E are an example of this, because decisions are made based upon the batch average and not individual test results (5).
The acceptance criteria and the amount of data should be linked. Together they define the test characteristics to meet the objective of making a decision on batch quality. This must be recognized and continuously emphasized to successfully manage the risks. Examples of this thinking are becoming more common and represent the future desired state (8, 18–21). These ideas must be extended to include the composite testing case (i.e., assays and degradation products) and tests with correlated end-points (e.g., dissolution, particle-size distribution). The batch parameter estimates should take on greater importance than individual test results, and regulatory guidance should be updated to recognize the importance of these estimates.
It must be understood that the only way to guarantee that all units comply with a limit is to test every unit (100% testing) and have a perfect measurement system with no error in testing. A sample from the population, no matter how large, cannot provide this guarantee. Although nondestructive testing can help overcome the limits on the size of samples, it cannot be expected to operate under the same requirements for results (acceptance criteria) as for the small sample case. The current expectation that every test result and replicate determination made must meet acceptance criteria is no longer a useful concept in this environment. In the new paradigm, there should be rewards for developing criteria that allow for improved knowledge about the quality of products and processes. The penalties associated with historical practices must be removed.
Finally, in many cases, process understanding is best achieved through proper statistical analysis of the data. The statistical tools that are widely used in research and development are equally useful in the CMC setting. The statistical link between sample size and reliability creates an incentive, rather than a disincentive, for collecting more data.
Laura Foust* is a research scientist at Eli Lilly and Company, Lilly Corporate Center, Indianapolis, IN 46285, tel. 317.276.3007, fax
*To whom all correspondence should be addressed.
Submitted: Jan. 31, 2007. Accepted: Feb. 26, 2007.
1. US Food and Drug Administration, Pharmaceutical CGMPs for the 21st Century—A Risk-Based Approach (Sep. 2004), www.fda.gov/cder/gmp/gmp2004/GMP_finalreport2004.htm.
2. ICH, Q8: Pharmaceutical Development (Step 5, Feb. 2003).
3. L. Torbeck, "In Defense of USP Singlet Testing," Pharm. Technol. 28 (2), 105–106 (2005).
4. J.R. Murphy and K.L. Griffiths, "Zero-Tolerance Criteria Do Not Assure Product Quality," Pharm. Technol. 30 (1), 52–60 (2006).
5. ICH, Q1E: Evaluation for Stability Data (Step 5, Nov. 2005).
6. C. Wen-Jen and Y. Tsong, "Significance Levels for Stability Pooling Test: A Simulation Study," J. Biopharm. Stat. 13 (3), 355–374 (2003).
7. USP General Chapter ‹905› "Uniformity of Dosage Units," USP 28–NF 23 (USP, Rockville, MD 2005), 2505–2510.
8. D. Sandell et al., "Development of a Content Uniformity Test Suitable for Large Sample Sizes," Drug Information J. 40 (3), 337–344 (2006).
9. L. Toothaker, "Multiple Comparisons for Researchers," ICH E9 Points to Consider 49595, Fed. Regis. 63 (179) (Sept. 16, 1998, Notices, Sage Publications, Newbury Park, CA).
10. M.A. Black and R.W. Doerge, "Calculation of the Minimum Number of Replicate Spots Required for Detection of Significant Gene Expression Fold Change in Microarray Experiments," Bioinformatics 18 (12), 1609–1616 (2002).
11. Committee on Professional Ethics, Ethical Guidelines for Statistical Practice of the American Statistical Association, Section II.A.8 (1999), www.amstat.org.
12. T. Tougas, "Considerations of the Role of End Product Testing in Assuring the Quality of Pharmaceutical Products," J. Process Analytical Technology 3 (2), 13–17 (2006).
13. D. Montgomery, Introduction to Statistical Process Control (John Wiley & Sons, 2d ed., New York, NY, 1991), p.105.
14. ANSI–ASQC Z1.4-1993, American National Standard: Sampling Procedures and Tables for Inspection by Attributes (American Society for Quality Control, Milwaukee, WI, 1993).
15. D. LeBlond, T. Schofield, and S. Altan, "Revisiting the Notion of Singlet Testing Requirements," Pharm. Technol. 29 (6), 85–86 (2005).
16. S. Ruberg and J. Stegeman, "Pooling Data for Stability Studies: Testing the Equality of Batch Degradation Slopes," Biometrics 47, 1059–1069 (1991).
17. S. Ruberg and J. Hsu, "Multiple Comparison Procedures for Pooling Batches in Stability Studies," Technometrics 34 (4), 465–472 (1992).
18. IPAC–RS, "A Parametric Tolerance Interval Test for Improved Control of Delivered Dose Uniformity of Orally Inhaled and Nasal Drug Products" (2001), http://ipacrs.com/PDFs/IPAC-RS_DDU_Proposal.PDF.
19. PQRI BUWG, "The Use of Stratified Sampling of Blend and Dosage Units to Demonstrate Adequacy of Mix for Powder Blends," PDA J. Pharm. Sci. Technol. 57 (2), 64–74 (2003).
20. W. Hauck et al., "Oral Dosage Form Performance Tests: New Dissolution Approaches," Pharm. Res. 22 (2), 182–187 (2005).
21. R. Williams et al., "Content Uniformity and Dose Uniformity: Current Approaches, Statistical Approaches, and Presentation of an Alternative Approach, with Special Reference to Oral Inhalation and Nasal Drug Products," Pharm. Res. 19 (4), 359–366 (2002).