# Is a Sample Size of n=6 a 'Magic' Number?

Pharmaceutical Technology, Pharmaceutical Technology-06-02-2014, Volume 38, Issue 6

Statistical analysis shows how much testing is needed to deliver a reliable estimate.

Manufacturers rely on analytical chemists to analyze and test samples, both to ensure that products conform to specifications and to predict the properties of a batch or lot from the analysis of a sample. Suppose for a moment that a sample is representative of the batch or lot. How much testing of the sample is needed to obtain a reliable estimate that is applicable to the batch? This question has financial as well as technical implications. Businesses need, and regulators require, that sufficient testing is done to deliver an answer within a given level of confidence.

When testing a number of samples from a batch, analysts usually estimate the value for the batch by taking the average and gauge the confidence in that value by calculating the standard deviation. In other words, they adopt a mathematical model that describes the shape of the continuous analytical data and allows them to estimate two parameters that define the distribution: a measure of location (in this instance the mean) and a measure of dispersion (the variance or standard deviation). The distribution for analytical data of this type is usually assumed to be the normal or Gaussian distribution.

## Calculating the average isn’t real statistics, is it?

On this basis, the calculation of the humble mean or average takes on a new significance. Consider 10 measurements of a particular analytical parameter of interest, such as the density of a liquid sample. In the example shown in Table I, the average is calculated to be 1.503 and the standard deviation to be 0.0124, tasks easily accomplished with a pocket calculator or Microsoft Excel. How can these calculations relate to a mathematical model?

Table I: Residuals and their squares from the calculated mean density value.

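| n | Value | Residual | Residual squared |
|---|---|---|---|
| 1 | 1.498 | -0.005 | 0.00002809 |
| 2 | 1.482 | -0.021 | 0.00045369 |
| 3 | 1.517 | 0.014 | 0.00018769 |
| 4 | 1.516 | 0.013 | 0.00016129 |
| 5 | 1.508 | 0.005 | 0.00002209 |
| 6 | 1.487 | -0.016 | 0.00026569 |
| 7 | 1.516 | 0.013 | 0.00016129 |
| 8 | 1.503 | 0.000 | 0.00000009 |
| 9 | 1.496 | -0.007 | 0.00005329 |
| 10 | 1.510 | 0.007 | 0.00004489 |
| Mean (Value) / Sum (residual columns) | 1.503 | 0.000 | 0.00137810 |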

Looking more closely at the calculations in Table I, subtracting the mean value from each of the n individual values gives a residual. The sum of the residuals is zero, as expected. However, when the residuals are squared and summed, the answer is not zero. This sum of squares of residuals, SSR, is shown in Equation 1:

$$\mathrm{SSR} = \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 \tag{1}$$

By use of a little calculus, it is easy to show that, according to the principle of least squares (1), the best estimate of location will be when SSR is at a minimum. Hence, the differential is set to zero (Equation 2),

$$\frac{d\,\mathrm{SSR}}{d\bar{X}} = \frac{d}{d\bar{X}} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 = 0 \tag{2}$$

and, therefore, (Equation 3)

$$-2 \sum_{i=1}^{n} \left(X_i - \bar{X}\right) = 0 \tag{3}$$

On rearranging, this becomes Equation 4:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{4}$$

It is now apparent that the humble average, the measure of location, is the best least-squares estimator for the data model Y = X + ε.

In addition, the dispersion can be calculated. In our example, this is the sample standard deviation, $s_x$, obtained as the square root of the variance (Equation 5), which is simply the sum of squares of residuals divided by the number of degrees of freedom, n-1:

$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 \tag{5}$$

The divisor is n-1 rather than n because the data no longer contain 10 independent pieces of information but only 9, as the mean has already been calculated from the same data.
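The figures quoted for Table I, and the least-squares argument itself, can be checked numerically. The following is a minimal Python sketch using only the standard library; the variable names and the helper `ssr_about` are illustrative and do not come from the original article.

```python
import math
import statistics

# Density values from Table I.
values = [1.498, 1.482, 1.517, 1.516, 1.508,
          1.487, 1.516, 1.503, 1.496, 1.510]

n = len(values)
mean = sum(values) / n                  # least-squares estimate of location
residuals = [x - mean for x in values]
ssr = sum(r ** 2 for r in residuals)    # sum of squares of residuals
variance = ssr / (n - 1)                # divided by n - 1 degrees of freedom
std_dev = math.sqrt(variance)


def ssr_about(centre):
    """Sum of squared deviations of the data from an arbitrary centre."""
    return sum((x - centre) ** 2 for x in values)


print(f"mean               = {mean:.4f}")
print(f"sum of residuals   = {sum(residuals):.1e}")
print(f"SSR                = {ssr:.8f}")
print(f"standard deviation = {std_dev:.4f}")
print(f"statistics.stdev   = {statistics.stdev(values):.4f}")  # cross-check

# Moving the centre away from the mean in either direction increases the SSR.
for shift in (-0.002, 0.0, 0.002):
    print(f"SSR about mean {shift:+.3f} = {ssr_about(mean + shift):.8f}")
```

The last loop illustrates the minimum numerically: the SSR evaluated at the mean is smaller than at either shifted centre.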

However, the question remains: How many determinations are needed?

## Normal distribution or the t distribution?

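The traditional method for teaching statistics in analytical chemistry tends to focus on the properties of the normal distribution as the underlying data model (Equation 6).

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{6}$$

For a population mean, μ, of zero and a population standard deviation, σ, of 1, this simplifies to Equation 7 and is shown as the familiar standard normal (error) curve in Figure 1.

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right) \tag{7}$$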

The areas under this curve, as shown in Figure 1, give the probability of a value falling within ±1σ, ±2σ, or ±3σ of the population mean, μ. Because the true values of μ and σ for a real analytical data population are never known, analysts are forced to estimate them from sample data by the sample mean, X̄, and the sample standard deviation, s. If the standard normal distribution is used to calculate the 95% confidence interval, the error will be underestimated (Equation 8).

$$\bar{X} \pm 1.96\,\frac{s}{\sqrt{n}} \tag{8}$$

The underestimate arises because the error distribution actually depends upon the sample size and, in analysis, only a small number of data points are typically available. The correct distribution to use where n is less than 30 is the t distribution, not the normal distribution. Therefore, Equation 9 is used, where the 1.96 is replaced by the two-sided value from the t distribution at 95% confidence and n-1 degrees of freedom.

$$\bar{X} \pm t_{0.95,\,n-1}\,\frac{s}{\sqrt{n}} \tag{9}$$
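As a worked illustration, the two intervals can be compared for the ten density values of Table I. This is a sketch only: the value 2.262 is the standard tabulated two-sided 95% t value for 9 degrees of freedom, entered by hand rather than computed, and the variable names are illustrative.

```python
import math
from statistics import NormalDist, mean, stdev

# Density values from Table I.
values = [1.498, 1.482, 1.517, 1.516, 1.508,
          1.487, 1.516, 1.503, 1.496, 1.510]

n = len(values)
x_bar = mean(values)
s = stdev(values)                    # sample standard deviation (n - 1 divisor)
std_err = s / math.sqrt(n)

z_95 = NormalDist().inv_cdf(0.975)   # about 1.96, two-sided 95%
t_95 = 2.262                         # tabulated t value, 9 degrees of freedom (assumed input)

normal_half_width = z_95 * std_err
t_half_width = t_95 * std_err

print(f"normal: {x_bar:.4f} +/- {normal_half_width:.4f}")
print(f"t:      {x_bar:.4f} +/- {t_half_width:.4f}")
print(f"the t interval is wider by a factor of {t_half_width / normal_half_width:.2f}")
```

Even with ten results, the interval based on the normal distribution is noticeably narrower than the one based on the t distribution; the gap widens as n falls.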

The t distribution equation (Equation 10) (2) is a little more complicated than the equation for the normal distribution and involves the number of degrees of freedom, ν, and the gamma function, Γ.

$$f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} \tag{10}$$

The function is readily available in statistical packages, and it is a relatively simple matter to calculate the distribution curves in Excel, as shown in Figure 2.

Figure 2: The t distribution for values of n of 3, 6, and 30.

It should be noted that at infinite degrees of freedom the t distribution becomes the standard normal distribution. As can be seen, at values of n=30 and above the normal distribution is a good approximation to the t distribution.
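The curves of Figure 2 can also be tabulated outside a spreadsheet. The sketch below evaluates the t density with the standard library's `math.gamma` and sets it beside the standard normal density; the function names `t_pdf` and `normal_pdf` and the choice of evaluation points are illustrative assumptions.

```python
import math


def normal_pdf(x):
    """Standard normal density (mean 0, standard deviation 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def t_pdf(t, nu):
    """Student's t density with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)


points = (0.0, 1.0, 2.0, 3.0)
for n in (3, 6, 30):
    nu = n - 1  # degrees of freedom for n results
    row = "  ".join(f"f({x:.0f}) = {t_pdf(x, nu):.4f}" for x in points)
    print(f"t, n = {n:2d}: {row}")
row = "  ".join(f"f({x:.0f}) = {normal_pdf(x):.4f}" for x in points)
print(f"normal   : {row}")
```

For small n the t density is lower at the centre and heavier in the tails, which is why confidence intervals based on it are wider.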

Is there an optimum number of determinations, n, which will provide businesses and regulators a reliable answer within a level of confidence? The answer is described in the next section.

## Confidence interval of the mean

First, examine the requirements for the mean value. What is the effect of n in calculating the confidence interval for a mean value? One way to do this is to calculate the 95% confidence limits in terms of multiples of the sample standard deviation, s, so as to understand the effect of n.

This calculation is shown in Figure 3. It is apparent that the 95% confidence limits, in terms of multiples of s, rapidly tighten until n reaches 6. Obtaining a further factor of 2 improvement would need an additional 12 determinations (n=18), with the corresponding analytical effort and cost. The magic number of n=6 is known as the ICH compromise (3), whereby the 95% confidence limit of the mean is approximately ±s. For this reason, the number of retests carried out as part of out-of-specification investigations to isolate an outlier should be 5 or more (4).
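The multipliers plotted in Figure 3 are the two-sided 95% t value divided by the square root of n. A minimal sketch of that calculation follows, using only the standard library: the t quantile is obtained by integrating the t density with Simpson's rule and inverting by bisection, which is an implementation choice made here for self-containment, not the article's spreadsheet method. All function names are illustrative.

```python
import math


def t_pdf(t, nu):
    """Student's t density with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)


def t_cdf(t, nu, steps=1000):
    """P(T <= t) for t >= 0, by Simpson's rule on the density from 0 to t."""
    h = t / steps
    total = t_pdf(0.0, nu) + t_pdf(t, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + total * h / 3


def t_quantile(p, nu):
    """Value t with P(T <= t) = p, for p > 0.5, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mean_ci_multiplier(n, confidence=0.95):
    """Half-width of the confidence interval of the mean, in units of s."""
    p = 1 - (1 - confidence) / 2  # two-sided interval
    return t_quantile(p, n - 1) / math.sqrt(n)


for n in (3, 4, 5, 6, 10, 18, 30):
    print(f"n = {n:2d}: 95% confidence limits = +/- {mean_ci_multiplier(n):.2f} s")
```

The multiplier falls steeply up to n=6, where it is close to 1, and only halves again by n=18.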

Figure 3: Effect of the number of samples, n, on the 95% confidence interval of the mean expressed as multiples of the standard deviation.

Is n=6 a magic number for the standard deviation as well? The answer is yes, for the reasons described in the next section.

## Confidence interval of the standard deviation

An approach similar to the one used for the confidence interval of the mean can be applied to the confidence interval of the standard deviation. The 95% confidence interval can be expressed in terms of multiples of the sample standard deviation as a function of the number of data points, n. This time, however, remember that the standard deviation is governed not by the t or the normal distribution but by another distribution, the chi-squared (χ²) distribution. Again, it is a relatively simple matter to calculate the distribution curves in Excel. The upper and lower confidence limit distributions (5) are not symmetrical in this instance and are given by Equation 11 and Equation 12 shown in Figure 4.

From Figure 4, the upper 95% confidence limit, in terms of multiples of s, rapidly tightens until n reaches about 6, as was seen with the confidence interval of the mean. Expending more analytical effort, and hence cost, adds little advantage. Even if an additional 24 determinations (n=30) were made, the upper 95% confidence limit would improve only by a factor of 1.6. With the magic number of n=6, the upper 95% confidence limit of the standard deviation is approximately twice s. If only n=3 were used to calculate the value for s, the upper 95% confidence limit would be about 4.3s, which is far too large. If s were 1.5, we would only be 95% confident that the true standard deviation was less than 6.5, whereas if n=6 were used, the limit would be approximately 3.
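These multipliers can be approximated with the standard chi-squared limits for a standard deviation, in which each limit is s multiplied by the square root of (n-1) divided by a chi-squared quantile with n-1 degrees of freedom. The sketch below is an assumption-laden reconstruction, since Equations 11 and 12 are not reproduced in the text: it assumes a 5% tail on each side, chosen because it gives roughly twice s at n=6 and an improvement factor of about 1.6 between n=6 and n=30. Under that assumption the multiplier at n=3 comes out nearer 4.4 than the rounded figure quoted above. The function names are illustrative, and the chi-squared quantile is computed from a series for the incomplete gamma function to keep the example within the standard library.

```python
import math


def chi2_cdf(x, df):
    """P(chi-squared <= x), from the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = total = 1 / a
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= z / (a + k)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_quantile(p, df):
    """Value x with P(chi-squared <= x) = p, found by bisection."""
    lo, hi = 0.0, df + 10 * math.sqrt(2 * df) + 10
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sd_limits(n, tail=0.05):
    """(lower, upper) confidence limits of the standard deviation, in units of s.

    tail is the probability left outside each limit (an assumed 5% here).
    """
    df = n - 1
    lower = math.sqrt(df / chi2_quantile(1 - tail, df))
    upper = math.sqrt(df / chi2_quantile(tail, df))
    return lower, upper


for n in (3, 6, 30):
    lower, upper = sd_limits(n)
    print(f"n = {n:2d}: lower = {lower:.2f} s, upper = {upper:.2f} s")
```

The lower and upper multipliers are visibly asymmetrical about 1, and the upper one dominates the uncertainty at small n.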

Figure 4: Effect of the number of samples, n, on the 95% confidence interval of the standard deviation expressed as multiples of itself.

## Conclusion

For all statistical calculations, including calculating a mean and standard deviation, analysts need to adopt a data model and associated distribution. For small-number statistics (sample size n less than 30), this distribution is the t distribution, not the normal distribution. The t distribution takes the sample size into account through the number of degrees of freedom. Based upon the calculation of 95% confidence intervals as a function of the sample standard deviation, s, it was shown that a choice of n=6, the ICH compromise, allows the balance between business needs and regulatory requirements to be met.

## References

1. W. Edwards Deming, Statistical Adjustment of Data (Dover, NY, 1964).
2. J.W. Harris and H. Stocker, Handbook of Mathematics and Computational Science (Springer, NY, 1998), p. 81.
3. ICH, Q2(R1) Validation of Analytical Procedures: Text and Methodology (2005).
4. C. Burgess and B. Renger, ECA Standard Operating Procedure 01, Laboratory Data Management; Out of Specification (OOS) Results, Version 2 (August 2012).
5. Confidence Intervals for Variances and Standard Deviations, www.milefoot.com/math/stat/ci-variances.htm.