Consider a solution formulation for which the specification requires that the pH be 6.5–7.5. Rather than being based upon
clinical fitness for use, these limits were established from development data by computing the process average
±3σ. This means, assuming the data follow a normal distribution, that about 3 out of 1000 results are expected to be outside the
specification limits because of pure chance. Now assume that the true batch pH is 7.0 and nearly all observed variability
is a result of measurement error. Ten units are tested for a batch at each of seven stability time points, giving 70 results
in total. Even if the pH is stable during storage, the risk that at least one unit fails to comply with the specification at
some time point is at most 1 – (1 – 0.003)^70 ≈ 0.19. Clearly, a 19% risk of observing a false out-of-specification (OOS) result for an acceptable batch on stability
is intolerable. Such a failure to conform to specifications can also lead to recalls of acceptable material. When false
OOS signals are obtained as a result of multiplicity, many resources are wasted in the subsequent investigation looking for
a root cause that does not exist.
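The 19% figure can be reproduced with a short calculation. The following is a minimal Python sketch, assuming independent, normally distributed results; the function names are illustrative and do not come from the source, and the loop over multiple studies is an assumed extension of the example.

```python
from statistics import NormalDist


def single_result_oos_probability(k_sigma: float = 3.0) -> float:
    """Two-sided probability that one normal result falls outside mean +/- k*sigma."""
    return 2.0 * (1.0 - NormalDist().cdf(k_sigma))


def risk_of_any_oos(p_single: float, n_results: int) -> float:
    """Probability that at least one of n independent results is out of specification."""
    return 1.0 - (1.0 - p_single) ** n_results


# Ten units at each of seven time points, as in the example in the text.
n_results = 10 * 7

# The text rounds the +/-3 sigma tail probability to 0.003.
print(f"rounded p = 0.003: risk = {risk_of_any_oos(0.003, n_results):.2f}")

# Unrounded tail probability taken from the standard normal distribution.
p_exact = single_result_oos_probability(3.0)
print(f"exact p = {p_exact:.4f}: risk = {risk_of_any_oos(p_exact, n_results):.2f}")

# Assumed extension: each additional condition or batch adds another 70 results.
for n_studies in (1, 3, 6):
    risk = risk_of_any_oos(0.003, n_results * n_studies)
    print(f"{n_studies} study/studies: risk = {risk:.2f}")
```

Because the per-result probability is held fixed, the overall risk depends only on the total number of results, which is why adding conditions or batches raises it.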
The overall risk of randomly failing somewhere in the stability program increases as more storage conditions and additional
annual batches are studied. If multiplicity effects are not taken into account to manage this risk, then manufacturers could
be led to a risk-managing approach of reducing the number of batches, conditions, time points, and tested properties as much
as possible. Clearly this is at odds with one of the primary objectives of stability testing: to increase the knowledge of the product.
The recommended approach to shelf-life estimation includes a statistical test of batch slope homogeneity (i.e., determining
whether there is evidence of a difference in the rate of change over time across batches) (5). It is well known that increasing
the number of batches, the amount of data per batch, or the precision of the measurement process will increase the
probability of rejecting the hypothesis of batch slope homogeneity (6). When estimating product shelf life from a group of
nonhomogeneous batch slopes, a worst-case strategy is prescribed such that the shelf life for future batches is based on the
batch in the group with the shortest shelf life (5). Even when batch slopes are similar, as the amount or precision of data
included in a stability program increases, the conditions prescribed to allow pooling of batches are usually less likely to
be fulfilled. Therefore, a shorter shelf life is typically obtained. Thus, even though a larger number of batches could lead
to a better understanding of product stability, there is a built-in disincentive for a developer to include more than the
minimum number of batches (three) for shelf-life estimation.
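The effect of the worst-case rule can be illustrated with a small simulation. The sketch below is a simplification under stated assumptions, not the procedure in reference (5): every batch shares the same true slope, each batch's shelf life is taken as the time at which its fitted line (a point estimate, not a confidence bound) meets a lower limit, no homogeneity test is performed, and all numerical settings are hypothetical.

```python
import random
from statistics import mean

# All values below are assumed for illustration only.
TIMES = [0, 3, 6, 9, 12, 18, 24]  # stability time points, months
TRUE_INTERCEPT = 100.0            # % of label claim at release
TRUE_SLOPE = -0.15                # % per month, identical for every batch
NOISE_SD = 0.5                    # measurement standard deviation, %
LOWER_LIMIT = 95.0                # lower specification limit, %
MAX_MONTHS = 60.0                 # cap applied to any single estimate


def fit_line(times, values):
    """Return the ordinary least-squares (intercept, slope)."""
    t_bar, y_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope


def estimated_shelf_life(rng):
    """Simulate one batch; return the month where its fitted line meets the limit."""
    values = [TRUE_INTERCEPT + TRUE_SLOPE * t + rng.gauss(0.0, NOISE_SD) for t in TIMES]
    intercept, slope = fit_line(TIMES, values)
    if slope >= 0.0:
        return MAX_MONTHS  # no downward trend estimated
    crossing = (LOWER_LIMIT - intercept) / slope
    return min(max(crossing, 0.0), MAX_MONTHS)


def mean_worst_case_shelf_life(n_batches, n_simulations=2000, seed=1):
    """Average, over simulated programs, of the shortest per-batch shelf life."""
    rng = random.Random(seed)
    worst = [
        min(estimated_shelf_life(rng) for _ in range(n_batches))
        for _ in range(n_simulations)
    ]
    return mean(worst)


for n_batches in (1, 3, 6):
    print(n_batches, round(mean_worst_case_shelf_life(n_batches), 1))
```

Under these assumptions the batches are truly identical, so any shortening of the average worst-case shelf life as batches are added reflects only the minimum being taken over more noisy estimates.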
Process analytical technology.
One of the potential applications of PAT is comprehensive real-time evaluation of tablet content uniformity (nondestructive
testing of hundreds or thousands of tablets). A challenge in such applications is to properly define the acceptance criteria
appropriate for such a large sample. Traditionally, a sample of 10–30 tablets is assessed against criteria specified in the
pharmacopeia for uniformity of dosage units (7). These pharmacopeial criteria are not directly applicable to large sample
sizes, because they do not take into account the multiplicity risks associated with large samples. An acceptable batch, which
passes the pharmacopeial test a high percentage of the time, may fail to have all PAT samples meet the pharmacopeial acceptance
criteria. This failure may arise because the range of observed individual values increases with sample size,
or because the uncertainty in a computed result is based on a sample of a different size than that for which the acceptance
limit was developed. This issue has been highlighted as a deterrent for extended applications of PAT in this area (8).
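The first of these effects, the widening range of individual values, can be sketched numerically. The example below does not implement the pharmacopeial test in reference (7); it only computes the chance that every one of n independent, normally distributed tablet results falls inside fixed individual limits. The batch mean, standard deviation, and limits are hypothetical.

```python
from statistics import NormalDist


def probability_all_within(n_units, batch_mean, batch_sd, lower, upper):
    """Probability that all n independent normal results lie within [lower, upper]."""
    dist = NormalDist(batch_mean, batch_sd)
    p_single = dist.cdf(upper) - dist.cdf(lower)
    return p_single ** n_units


# Hypothetical batch: mean 100% of label claim, standard deviation 4%,
# with assumed individual limits of 85-115%.
for n_units in (10, 30, 100, 1000, 10000):
    p_pass = probability_all_within(n_units, 100.0, 4.0, 85.0, 115.0)
    print(f"n = {n_units:>5}: P(all units within limits) = {p_pass:.3f}")
```

The batch is the same in every row; only the number of tablets examined changes, which is why limits on individual values need to be restated for the sample size actually used.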
Validation and investigations.
Frequently in product development and other experimental investigations, including validation and optimization, a scientist
may be interested in detecting smaller differences than would be of interest in routine release and stability testing, where
the focus is to verify satisfactory product properties. A statistically established approach would be to collect more data
than the standard release or stability test requires, possibly following an alternative sampling plan, to increase understanding.
Because of multiplicity, however, the risk of obtaining individual results outside limits established for a smaller sample
size will then increase as a result of common analytical and sampling variation. Because observing such aberrant values will
result in "failed" validation, "verified" OOS, or a reduced design space, the consequence is that the size of the testing
program is primarily driven by a risk-to-fail assessment rather than the need to obtain more reliable conclusions.