Pitfalls in Statistics - Pharmaceutical Technology

Pitfalls in Statistics
The hardest errors to spot are the ones that don't look like errors at all.


Pharmaceutical Technology
Volume 35, Issue 4, pp. 40-42


Lynn D. Torbeck
We all have a scrupulous eye for subjects we are passionate about. Foodies insist on using the most authentic ingredients for their favorite dishes. Classic-car collectors require the smallest details to be as close to the original as possible. And statisticians cajole nonstatisticians to avoid classic pitfalls in applied statistics. In this latter case, however, it is not a matter of taste or authenticity. Incorrect statistical practices can result in erroneous calculations and poor conclusions. Some errors are small, but others can be monumental, and since we never know which way the apple will fall, we should treat them all the same.

Although there is always a large scope for error in a statistical project, some mistakes are more common than others. Those that appear to be correct on the surface are what we call pitfalls.

The most common and deadliest pitfalls

This section highlights some of the most common challenges facing statisticians.

Reportable values. A common pitfall is failing to define the reportable value, or reportable result, for the data and the analysis (1). By definition, the reportable value is the end result of the complete measurement method as documented. It is the value compared with the specification and the official value most often used for statistical analysis. If different people or departments use different definitions, confusion reigns, and out-of-tolerance and out-of-specification investigations multiply.

Averages. The average of a set of averages is correct only if the sample sizes are the same. Otherwise, the averages must be weighted by their sample sizes (2). In addition, avoid averaging standard deviations, even when the sample sizes are the same. The variance is the standard deviation squared, and variances can be averaged when the sample sizes are the same; take the square root of the averaged variance to get back to a standard deviation. If the sample sizes are not the same, a weighting formula is used (2).
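The following Python sketch illustrates the point with made-up numbers. The function names and data are hypothetical, and the weighting shown for variances is the usual pooled-variance formula (weights of n - 1), which is assumed here rather than taken from reference 2.

```python
from math import sqrt

def grand_mean(means, sizes):
    """Average of group averages, weighted by each group's sample size."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)

def pooled_sd(sds, sizes):
    """Pool standard deviations by averaging the variances, not the SDs.

    Assumption: each variance is weighted by its degrees of freedom (n - 1).
    With equal sample sizes this reduces to a simple average of the variances.
    """
    dof = [n - 1 for n in sizes]
    pooled_var = sum(d * s ** 2 for d, s in zip(dof, sds)) / sum(dof)
    return sqrt(pooled_var)

# Hypothetical data: two groups with unequal sample sizes.
means = [10.0, 12.0]
sizes = [2, 8]
sds = [1.0, 2.0]

naive_mean = sum(means) / len(means)      # ignores the sample sizes
weighted_mean = grand_mean(means, sizes)  # matches the mean of all 10 values
naive_sd = sum(sds) / len(sds)            # averaging SDs directly: avoid
combined_sd = pooled_sd(sds, sizes)       # average variances, then take the root

print(f"naive mean {naive_mean:.2f} vs weighted mean {weighted_mean:.2f}")
print(f"averaged SD {naive_sd:.2f} vs pooled SD {combined_sd:.2f}")
```

With these numbers the naive average understates the overall mean because the larger group, which has the higher average, gets no extra weight.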

The percentage relative standard deviation. The percentage relative standard deviation (%RSD) is not a substitute for the standard deviation because they measure different aspects of variation. Report both, along with the sample size. Also, avoid averaging %RSDs or calculating %RSD on data expressed as percentage recovery. Such data are already percentages, so the average and the standard deviation will also be percentages.
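A small Python sketch, using hypothetical data and a hypothetical helper named describe, shows why the two statistics are not interchangeable: two data sets can have the same standard deviation and very different %RSDs.

```python
from statistics import mean, stdev

def describe(values):
    """Report the sample size, average, standard deviation, and %RSD together."""
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    return {"n": len(values), "mean": avg, "sd": sd, "rsd_pct": 100 * sd / avg}

# Hypothetical measurements: same absolute spread, different magnitudes.
low = describe([9.0, 10.0, 11.0])
high = describe([99.0, 100.0, 101.0])

for label, summary in (("low", low), ("high", high)):
    print(label, summary)
```

Both sets have a standard deviation of 1, but the %RSD is 10% for the first and 1% for the second, so reporting only one of the two hides part of the picture.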

Sample size. Always report the sample size. Remember the famous rat study where one third of the rats got better, one third got worse and the third rat ran away?

Summaries. Avoid gratuitous summary statistics without a clear purpose; they cloud the interpretation. Likewise, a massive printout of all possible summary statistics for all possible sets, subsets, and sub-subsets isn't worth the paper it is printed on.

Values and ranks. Give absolute values when looking at relative changes. For example, 2 out of 100 million versus 1 out of 100 million is a 100% increase, but it is still only 2 out of 100 million. Trust your reader to understand the practical implications of the data.
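The arithmetic in this example can be sketched in a few lines of Python. The function name and the returned fields are illustrative only.

```python
def describe_change(before, after, population):
    """Return the relative change alongside the absolute rates it is based on."""
    return {
        "relative_change_pct": 100 * (after - before) / before,
        "rate_before": before / population,
        "rate_after": after / population,
    }

# The example from the text: 1 versus 2 cases out of 100 million.
change = describe_change(before=1, after=2, population=100_000_000)

print(f"relative change: {change['relative_change_pct']:.0f}%")
print(f"absolute rate:   {change['rate_after']:.8f} (2 in 100 million)")
```

Reporting both numbers lets the reader see that a 100% increase here still describes a very rare event.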

Ranking anything without giving absolute values or some sense of practical importance can be misleading. For example, suppose we rank schools using a metric that puts one school at the top of the list and another at the bottom, and then learn that both schools have produced Nobel Prize winners. Does the ranking have any meaning?

Charts and graphs. Avoid pie charts unless your goal is to deliberately confuse your reader. Graph the data before starting a formal statistical analysis. Common graphs include histograms, time plots, and scatter plots. Attempt to get cause and effect on the same page (3).

Underlying data. Attempt to determine the underlying distribution of the data before starting a formal statistical analysis. While the normal distribution is the most common, many other distributions, such as the log-normal, are also found.

If the data are symmetrical around the average, use the average and the standard deviation. If the data are skewed, the median and interquartile range are preferred. This is not a hard and fast rule, just good practice.
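As a rough illustration of this practice, the Python sketch below picks a summary based on a simple skewness measure. The function names, the cutoff of 1.0, and the data are all assumptions made for the example; a numerical check of this kind supplements a graph of the data and does not replace it.

```python
from statistics import mean, median, quantiles, stdev

def skewness(values):
    """Moment-based skewness; 0 for perfectly symmetrical data."""
    n = len(values)
    avg = mean(values)
    m2 = sum((x - avg) ** 2 for x in values) / n
    m3 = sum((x - avg) ** 3 for x in values) / n
    if m2 == 0:  # all values identical, so there is no skew to measure
        return 0.0
    return m3 / m2 ** 1.5

def summarize_by_shape(values, skew_cutoff=1.0):
    """Use average and SD for roughly symmetrical data, median and IQR otherwise.

    Assumption: skew_cutoff is an arbitrary illustrative threshold,
    not an established standard.
    """
    if abs(skewness(values)) < skew_cutoff:
        return {"kind": "mean/sd", "center": mean(values), "spread": stdev(values)}
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    return {"kind": "median/iqr", "center": median(values), "spread": q3 - q1}

# Hypothetical data sets.
symmetric = [8, 9, 10, 11, 12]
skewed = [1, 1, 2, 2, 3, 50]

print(summarize_by_shape(symmetric))
print(summarize_by_shape(skewed))
```

In the skewed set the single large value pulls the average far above most of the data, while the median stays with the bulk of the observations.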


Source: Pharmaceutical Technology,