Application of the Weisberg t-test for Outliers

October 1, 2003
Robert J. Seely, Louis Munyakazi, Thomas F. Curry, Heather Simmerman, W. Heath Rushing, John Haury
Pharmaceutical Technology Europe
Volume 15, Issue 10

Determining whether a data point is an "outlier" - a result that does not fit, is too high or too low, is extreme or discordant - is difficult when using small data sets, such as the data from three, four or five conformance runs. In this article, the authors demonstrate that the Weisberg t-test is a powerful tool for detecting deviations in small data sets.

The attempt to define an "outlier" has a long and diverse history. Despite many published definitions, statisticians in all fields are still interested in objectively determining whether a data point is consistent with the rest of the data set or is, after all, an outlier, signifying a deviation from the norm. For biopharmaceutical companies, the need to evaluate whether a data point is an outlier - inconsistent with the rest of a small set of data - is important in validating process consistency.

The power of a statistical tool increases as sample size (n) increases. So, low power outlier tests used on small data sets, such as the production data derived from 3-5 conformance runs, have relatively high beta (β) or Type 2 errors. These errors indicate there is a high chance of leaving deviant results undetected, making these tests inappropriate for pharmaceutical or biopharmaceutical applications.

The z-test is the most powerful outlier test (the most able to detect a discordant datum) if the data are normally distributed and the standard deviation is known or can be accurately estimated. But z-tests usually require large data sets. Conformance runs from early commercial lots usually produce small data sets, and the standard deviation of the population is not usually known. We show, using representative biopharmaceutical process validation data, that the Weisberg t-test is a powerful outlier test at small values of n. It has a low β error rate in detecting deviations from the mean. Therefore, the Weisberg t-test is suitable for objectively demonstrating consistency of production data.

Process validation data

During process validation, process consistency is typically demonstrated in 3-5 conformance runs. When historical data are available, even if those data are from a different production scale, statistical comparisons are relatively straightforward. Those data can be used to set acceptance criteria for validation runs at a new production scale.

Control charts. When approximately 15 lots have been produced at commercial scale, control charts, which illustrate a process and its variation with time, are useful for evaluating process stability. The authors' choice of 15 lots for calculating control limits is a balance between the extreme uncertainty of limits based on few data and the diminishing value of each new data point in further decreasing that uncertainty. An individuals control chart based on 15 lots has 8.9 effective degrees of freedom (df), which is sufficient to reduce the uncertainty in the limits to approximately ±23%. Achieving ±10% uncertainty requires approximately 45 degrees of freedom, which requires more than 70 individual values.1
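The uncertainty figures quoted above are roughly what a common rule of thumb gives: the relative uncertainty (coefficient of variation) of a standard-deviation estimate is about 1/sqrt(2 × df). The article cites Wheeler for the numbers and does not state a formula, so the use of this approximation, and the function name below, are assumptions made for illustration only.

```python
import math

def sigma_uncertainty(df: float) -> float:
    """Approximate relative uncertainty of a standard-deviation estimate
    with `df` degrees of freedom (rule of thumb, assumed here)."""
    return 1.0 / math.sqrt(2.0 * df)

# Degrees of freedom quoted in the text: 8.9 (15 lots) and 45.
for df in (8.9, 45.0):
    print(f"df = {df:4.1f}: about +/-{100 * sigma_uncertainty(df):.1f}%")
```

The two results come out close to the article's rounded figures, which suggests the diminishing return per added lot follows the usual square-root behaviour.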

Outlier tests and errors. Until there are 15 lots, the most useful method to statistically evaluate a data point that seems to be anomalous is the Weisberg t-test.2,3 The Weisberg t-test can be used for data sets larger than 15 values as well. The other tests evaluated for application in small data sets were the Dixon4 and the Grubbs.5 The latter is also known as the Extreme Studentized Deviate (ESD).6 The Weisberg t-test can distinguish between normal process variation and a process aberration that yields an outlier. For example, at an alpha (α) error (calling something an outlier when it is not one) of 0.05, the β error is only 0.31, and thus the Weisberg t-test is the most powerful test available for small data sets among those considered (Figure 1).

Figure 1: The operating characteristic curves for a variety of outlier tests are created by repeatedly drawing four values from a population of known mean and S, with a fifth value taken from a population with a known shift in mean.

Discordant observations

Until recently, the US Pharmacopeia (USP) did not address the treatment of chemical test data containing discordant observations. This "silence" was interpreted to mean a "prohibition" during the US versus Barr Laboratories, Inc. case.7 The judge's ruling in that case indicated the need for such guidance8 and in 1999, a new monograph was previewed in Pharmacopeial Forum.9 That monograph states that when appropriately used, outlier tests are valuable tools for analysing discordant observations.

The discussions in the Barr case and the USP monograph advocate using an outlier test to disregard a data point. In this article, the authors use the Weisberg t-test to objectively identify an outlier as part of a statistical evaluation of small data sets. For process validation purposes, if the Weisberg t-test identifies no outliers, the data can be claimed to be consistent based on an objective statistical method.

This article describes the application of the Weisberg t-test to data from five conformance runs. The authors examine the ability of the test to demonstrate process consistency. Subsequent uses of this test would include checking a suspect data point from lot six with the previous five, or lot seven from the previous six, for instance. The Weisberg t-test could also be used during a retrospective review of data sets. For example, an earlier value may stand out as a possible outlier, but only after subsequent data show a pattern that distinguishes it as a possible outlier. As standard practice, the authors advocate an investigation of the causes of such statistical differences.

At 15 data points, the individuals control chart for each point becomes the preferred tool for detecting discordant observations and for showing process consistency (defined as the absence of discordant observations). If the data are available in subgroups, then an averages control chart is preferred.

Testing for a single outlier

In this article, an outlier refers to a datum that appears not to belong to the same group as the rest of the data; that datum measurement may seem to be either too large or too small in relation to the general pattern of the rest of the data. The method proposed applies to a single outlier, and is similar to the traditional t-calculated (tcalc) form of the general t-test statistic (Equation 1).10,11

The test hypothesis (Ho or null) can be stated as: the suspected value is not an outlier. Its alternative (Ha or alternative) is stated as: the suspected value is an outlier.

Working with reduced data. The entire set of data should not be used to estimate the standard error (SE). Such estimates would be biased if the suspected outlier were included. The estimate of variation would be inflated and the estimate of the arithmetic mean would be biased toward the outlier.

The logic of the Weisberg t-test. After computing the estimates without the suspected outlier, the Weisberg t-test statistic for the suspected outlier (denoted by $y_1$) is given in Equation 2

$$t_{\mathrm{calc}} = \frac{y_1 - \bar{y}_{-1}}{s_{-1}} \sqrt{\frac{n-1}{n}} \qquad (2)$$

where n is the sample size, and $\bar{y}_{-1}$ and $s_{-1}$ denote the sample mean and the sample standard deviation computed after the withdrawal of $y_1$, the suspect outlier.

The logic of the Weisberg t-test is that the numerator ($y_1 - \bar{y}_{-1}$) compares the mean value $\bar{y}_{-1}$ with the suspected outlier value $y_1$. Furthermore, the denominator $s_{-1}$ is the classic sample standard deviation.

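In Equation 2, the factor denoted by

$$\sqrt{\frac{n-1}{n}}$$

adjusts the calculated t-value (tcalc) downward and is more conservative for small samples. Specifically, the above factor is identical to

$$\frac{1}{\sqrt{1+h}}$$

where h is the leverage of the suspect observation (from the leverage matrix);3 that is

$$h = \frac{1}{n-1}$$

because the data are reduced by one observation. The estimated SE of the mean $\bar{y}_{-1}$ is

$$SE(\bar{y}_{-1}) = \frac{s_{-1}}{\sqrt{n-1}}$$

which makes Equation 2 equivalent to

$$t_{\mathrm{calc}} = \frac{y_1 - \bar{y}_{-1}}{\sqrt{s_{-1}^{2} + SE(\bar{y}_{-1})^{2}}}$$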

The tcalc (as found above) is then compared with percentiles of a t-distribution at the α significance level with (n - 2) degrees of freedom.
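The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the statistic has the form (suspect value minus reduced-data mean) divided by the reduced-data standard deviation, multiplied by sqrt((n - 1)/n). The function name `weisberg_t` and the data values are hypothetical and are not taken from the article's tables.

```python
import math
from statistics import mean, stdev

def weisberg_t(values, suspect_index):
    """Weisberg t statistic for one suspected outlier.

    The mean and standard deviation are computed from the reduced data
    (every value except the suspect), so the suspect cannot inflate them.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values so that n - 2 >= 1")
    suspect = values[suspect_index]
    reduced = [v for i, v in enumerate(values) if i != suspect_index]
    y_bar = mean(reduced)   # mean without the suspect
    s = stdev(reduced)      # sample SD without the suspect (n - 2 df)
    return (suspect - y_bar) / s * math.sqrt((n - 1) / n)

# Hypothetical five-lot data set; the last value looks high.
lots = [10.0, 12.0, 11.0, 13.0, 20.0]
t_calc = weisberg_t(lots, suspect_index=4)
print(f"t_calc = {t_calc:.3f}")
```

For five values and α = 0.05 the article quotes a critical value of 2.353, so a `t_calc` above that figure would flag the hypothetical fifth lot for investigation.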

Probability and degrees of freedom. Table I shows the t-critical (tcrit) values at three different α values, in which the df are two less than the sample size. If the absolute value of tcalc is less than tcrit, the point is not an outlier. Table I can be generated by a spreadsheet such as Excel for other values of α (alpha error rate) and degrees of freedom using the inverse of the Student's t-distribution (TINV) function. The TINV function requires two arguments: "the probability associated with a two-tailed t-distribution" and the "degrees of freedom." The critical t-values in the body of Table I are derived using the Excel function: TINV(reference to value at the top of the column as a proportion, times two; and the value for the degrees of freedom from the first column where degrees of freedom is two less than the number of samples). In Excel, the TINV function gives the t-value for two tails, placing one-half of the α value in each tail, whereas the Weisberg t-test is a one-tail test. That is why the α values must be multiplied by two when using the TINV spreadsheet function. When using a published two-sided Student t-table, the results are obtained by shifting one column to the right; that is, by using (n - 2) degrees of freedom, as shown in Table I.
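Outside a spreadsheet, the same one-tailed critical values can be approximated with the Python standard library alone. The sketch below integrates the Student t density numerically and inverts it by bisection; the function names are hypothetical, and the search assumes the critical value lies below 100. In practice a statistics library would normally be used instead, for example `scipy.stats.t.ppf(1 - alpha, n - 2)`.

```python
import math

def t_pdf(x, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: Simpson's rule on [0, x] plus 0.5 by symmetry."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3

def t_crit(alpha, n):
    """One-tailed critical value with n - 2 degrees of freedom."""
    df = n - 2
    lo, hi = 0.0, 100.0          # assumed search bracket
    for _ in range(60):          # bisection on the CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# One-tailed alpha = 0.05 for sample sizes 3 to 6.
for n in range(3, 7):
    print(n, n - 2, round(t_crit(0.05, n), 3))
```

For n = 5 and α = 0.05 this agrees with the value of 2.353 quoted in the text, and it corresponds to the spreadsheet call TINV(2 × 0.05, 3).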

Table I: The tcrit values for three different levels of α error and the degrees of freedom (two less than the sample size).

Identifying a biotech outlier

As a practical example, the authors used a data set from monitoring a large chromatography column in a recombinant protein purification process. Table II presents a representative set of such data. Step yield (per cent recovery), target protein concentration, a purity assay, host cell protein (HCP) concentration and processing time are the primary indicators of step consistency. The data appear to be consistent across the five lots, except in Lot 3 - the HCP is apparently high and might be inconsistent with the other four data points.

The Weisberg tcalc for HCP in the authors' example is 1.796. Comparing that number with the tcrit values (Table I), for n = 5, α = 0.05, the tcalc is less than the tcrit (2.353); therefore, the data point is not an outlier and the five data points are consistent. So the subjective judgment used to decide that the data point might be discordant is followed by the application of a statistical tool to provide an objective assessment.

Table II: A representative set of data from the first five lots of a purification process for a recombinant protein in a large chromatography column; the parameters are the primary indicators of step consistency.

Choosing an α value of 0.05 means that when the process actually has no outliers, the authors are willing to accept a 5% chance of a false positive - a 5% chance that a point identified as discordant by the Weisberg t-test is not, actually, an outlier. Accepting that rate means accepting unnecessary investigations 5% of the time. If the α value is reduced to avoid those investigations, the β value rises, which means more false negatives - the test fails to identify a discordant value. In the authors' application, a β error occurs when the test fails to detect an outlier when one is actually present. The authors choose to set α at 0.05 and are willing to perform more frequent investigations (as a result of false positives) to keep the β error rate low. At α = 0.05, n = 5, the β error is a reasonable 0.31 for detecting a shift of three standard deviations (Figure 1). That rate is significantly less than that of the more commonly used outlier tests (Dixon and Grubbs), which yield β errors of approximately 0.77 (Figure 1).4,5

The set of operating characteristic (OC) curves in Figure 1 shows a variety of outlier tests constructed by simulating a data set of 5000 from which four samples were drawn. A fifth datum was randomly taken from a data set, which was shifted by a given number of standard deviations. To detect a standard deviation change of three, the z-test (the basis for control charts) is clearly the most sensitive, with a β error of 0.08. For the purposes of comparing outlier tests, the z-test is presented here as a one-sided test; the two-sided z-test is the basis for control charts. The control chart, however, requires a large data set or a good estimate of the variance of the data. When those conditions are available, an individuals control chart (or a control chart of averages) is the recommended method. When a large data set or a good estimate of the variance is not available - for five conformance runs with little relevant data from previous scales, for example - the Weisberg t-test is the next best available tool.
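The simulation behind an operating characteristic curve can be sketched as follows. This is an illustrative Monte Carlo under stated assumptions, not the authors' procedure: four values are drawn from a standard normal population, a fifth is drawn from a population shifted by a chosen number of standard deviations, and the fifth value is always the one tested (one-sided, with the critical value 2.353 quoted for n = 5 and α = 0.05). The function name, trial count and seed are hypothetical.

```python
import math
import random

def beta_error(shift_sd, t_crit=2.353, n=5, trials=20000, seed=1):
    """Estimate how often a one-sided Weisberg test misses a shifted value."""
    rng = random.Random(seed)
    factor = math.sqrt((n - 1) / n)
    misses = 0
    for _ in range(trials):
        base = [rng.gauss(0.0, 1.0) for _ in range(n - 1)]  # in-control values
        suspect = rng.gauss(shift_sd, 1.0)                  # shifted value
        y_bar = sum(base) / (n - 1)
        s = math.sqrt(sum((v - y_bar) ** 2 for v in base) / (n - 2))
        t_calc = (suspect - y_bar) / s * factor
        if t_calc < t_crit:      # not flagged, so the shift goes undetected
            misses += 1
    return misses / trials

# Miss rate for shifts of 0 to 4 standard deviations.
for shift in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(shift, round(beta_error(shift), 3))
```

With no shift the miss rate should sit near 1 - α = 0.95, and it falls as the shift grows. The estimate at three standard deviations may differ somewhat from the value read from Figure 1, since the details of the original simulation are not given here.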

The Bonferroni correction is used for some statistical comparisons. For example, it is used for multiple comparisons (the "family" of comparisons) by dividing the Type 1 error among all comparisons, so that the overall Type 1 error rate of the family does not exceed a desired level. In this article, a single hypothesis test for one visually suspected outlier is used, rather than testing a hypothesis of no outliers by performing multiple tests on every data point versus the remaining set. Because multiple comparisons are not contemplated in this example, the Bonferroni correction is not used.

One-sided or two. A final point must be made regarding the one-sided versus the two-sided Weisberg t-tests for outliers. Because an outlier is initially detected as being the farthest from the central tendency (the mean) of the data, the outlier will be either higher or lower than the mean. The Weisberg t-test determines whether the "outlier" is larger if it is to the right of the mean (on a number line) or smaller if it is to the left of the mean (on a number line). The test does not show the differences without reference to the direction of that difference; therefore, the Weisberg t-test is a one-sided test, and the resulting tcrit values in the table must reflect that.

Alternative methods. Two other methods can be used to obtain identical tcalc values. One uses regression (Alternative 1 in Table III), and one uses analysis of variance (ANOVA) (Alternative 2 in Table III). These methods are discussed in the "Alternative methods for determining tcalc" sidebar and the results from those tests are compared with the Weisberg tcalc values in Table III.
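The sidebar is not reproduced here, so the following is only one plausible reading of the regression alternative: fit a straight line to the responses against an indicator variable that equals 1 for the suspect observation and 0 otherwise, and take the t statistic of the slope. Under that assumption the slope t-value matches the Weisberg tcalc, as this sketch with hypothetical data and function names illustrates.

```python
import math

def weisberg_t(values, suspect_index):
    """Direct form: reduced-data mean and SD, scaled by sqrt((n - 1)/n)."""
    n = len(values)
    reduced = [v for i, v in enumerate(values) if i != suspect_index]
    y_bar = sum(reduced) / (n - 1)
    s = math.sqrt(sum((v - y_bar) ** 2 for v in reduced) / (n - 2))
    return (values[suspect_index] - y_bar) / s * math.sqrt((n - 1) / n)

def indicator_regression_t(values, suspect_index):
    """Slope t statistic for y = b0 + b1*d, where d = 1 only for the suspect."""
    n = len(values)
    d = [1.0 if i == suspect_index else 0.0 for i in range(n)]
    d_bar, y_bar = sum(d) / n, sum(values) / n
    sxx = sum((di - d_bar) ** 2 for di in d)
    sxy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, values))
    b1 = sxy / sxx                      # equals suspect minus reduced mean
    b0 = y_bar - b1 * d_bar
    sse = sum((yi - b0 - b1 * di) ** 2 for di, yi in zip(d, values))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)   # residual df = n - 2
    return b1 / se_b1

lots = [10.0, 12.0, 11.0, 13.0, 20.0]   # hypothetical data
print(weisberg_t(lots, 4), indicator_regression_t(lots, 4))
```

The regression route also makes the n - 2 degrees of freedom visible: two parameters (intercept and indicator slope) are estimated from n observations.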

Table III: A comparison of the Weisberg t-test with two other methods for obtaining identical tcalc values: the regression-based method (Alternative 1) and the ANOVA-based method (Alternative 2); the host cell protein (HCP) observations for testing step consistency are the responses being tested.

A superior tool

The Weisberg t-test has a low β error rate (particularly when used with a higher α error rate) for small data sets. It is a superior, objective tool for showing consistency within small data sets. As shown in this example, the test fits the needs for evaluating biotechnology process data.

The Weisberg t-test can be applied to determine the internal consistency of small data sets and can also be useful in process validation. When validating a process, a protocol with preapproved acceptance criteria is required. For key performance parameters, numerical limits for specific attributes must be defined and met. Typically, however, many secondary parameters may not have predefined numerical limits, but they are still expected to be internally consistent during the validation runs. For example, during scale-up, the mean of a given output parameter can shift up or down, but if that does not affect product quality, the variation may be perfectly acceptable. To validate that a process is performing consistently, the values of that parameter should be similar for three to five runs. The Weisberg t-test is a useful tool that adds statistical objectivity to the claim that a process is "consistent."

ANOVA table.

References

1. D.J. Wheeler, Advanced Topics in Statistical Process Control: The Power of Shewhart's Charts (SPC Press, Knoxville, Tennessee, USA, 1995) p 185.

2. R. Brandt, "Comparing Classical and Resistant Outlier Rules," J. Am. Stat. Assoc. 85, 1083-1090 (1990). Note: The error in the formula printed in this reference was corrected in T.F. Curry, "Corrections," J. Am. Stat. Assoc. 96(456), 1534 (2001).

3. S. Weisberg, Probability and Mathematical Statistics: Applied Linear Regression, 2nd Edition (John Wiley & Sons, New York, New York, USA, 1985).

4. W.J. Dixon, "Processing Data for Outliers," Biometrics 9, 74-89 (1953).

5. F.E. Grubbs, "Procedures for Detecting Outlying Observations in Samples," Technometrics 11, 1-21 (1968).

6. T.A. Bancroft, "Analysis and Inference for Incompletely Specified Models Involving the Use of Preliminary Test(s) of Significance," Biometrics 20, 427-442 (1964).

7. United States versus Barr Laboratories, Inc., 812 F. Supp. 458 (DNJ 1993).

8. S.S. Kuwahara, "Outlier Testing: Its History and Applications," BioPharm 10(4) 64-67 (1997).

9. "General Information: <1010> Analytical Data - Interpretation and Treatment," Pharmacopeial Forum 25(5), 8900-8909 (1999).

10. N.R. Draper and H. Smith, Applied Regression Analysis (John Wiley & Sons, New York, New York, USA, 1998) p 4.

11. R.D. Cook and S. Weisberg, Applied Regression Including Computing and Graphics (John Wiley & Sons, New York, New York, USA, 1999).

12. S.R. Searle, Linear Models (John Wiley & Sons, New York, New York, USA, 1971).

13. SAS Institute, Inc., SAS/STAT User's Guide, Version 8.01 (Cary, North Carolina, USA, 1999).