We all have a scrupulous eye for subjects we are passionate about. Foodies insist on using the most authentic ingredients
for their favorite dishes. Classic-car collectors require the smallest details to be as close to the original as possible.
And statisticians cajole nonstatisticians to avoid classic pitfalls in applied statistics. In this last case, however, it
is not a matter of taste or authenticity. Incorrect statistical practices can result in erroneous calculations and poor conclusions.
Some errors are small, but others can be monumental, and since we never know which way the apple will fall, we should treat
them all the same.
Lynn D. Torbeck
Although there is always a large scope for error in a statistical project, some mistakes are more common than others. Those
that appear to be correct on the surface are what we call pitfalls.
The most common and deadliest pitfalls
This section highlights some of the most common challenges facing statisticians.
Reportable values. A common pitfall is leaving the reportable value or result undefined for the data and the analysis (1). By definition, the reportable value is the end
result of the complete measurement method as documented. It is the value compared with the specification and the official
value most often used for statistical analysis. If different people or departments have different definitions, confusion reigns
and out-of-tolerance and out-of-specification investigations multiply.
Averages. The simple average of a set of averages is correct only if the sample sizes are the same. Otherwise, the averages must be weighted by
the sample sizes (2). In addition, avoid averaging standard deviations even when the sample sizes are the same. The variance
is the standard deviation squared. Variances can be averaged when the sample sizes are the same, and the square root of that
average gives the combined standard deviation. If the sample sizes are not the same, then a weighting formula is used (2).
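The arithmetic can be sketched in a few lines of Python. This is only an illustration: the numbers are made up, the function names are hypothetical, and the weighting shown for variances is the usual pooled variance (each variance weighted by its degrees of freedom, n - 1), which is assumed here rather than taken from reference 2.

```python
from math import sqrt

def weighted_mean(means, sizes):
    """Grand mean of subgroup means, each weighted by its sample size."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)

def pooled_variance(variances, sizes):
    """Pooled variance: each variance weighted by its degrees of freedom (n - 1).

    Assumption: this is the conventional pooled-variance formula, used here
    as a stand-in for the weighting formula the text refers to.
    """
    dof = [n - 1 for n in sizes]
    return sum(v * d for v, d in zip(variances, dof)) / sum(dof)

# Made-up subgroup summaries with unequal sample sizes.
means = [10.0, 12.0]
sds = [1.0, 3.0]
sizes = [2, 8]

naive_mean = sum(means) / len(means)        # ignores the sample sizes
grand_mean = weighted_mean(means, sizes)    # respects the sample sizes

variances = [s ** 2 for s in sds]           # variance = SD squared
pooled_var = pooled_variance(variances, sizes)
pooled_sd = sqrt(pooled_var)
naive_sd = sum(sds) / len(sds)              # averaging SDs directly: avoid

print(f"naive mean {naive_mean:.2f} vs weighted mean {grand_mean:.2f}")
print(f"averaged SD {naive_sd:.2f} vs pooled SD {pooled_sd:.2f}")
```

With these values the unweighted average of the averages and the directly averaged standard deviation both differ noticeably from their weighted counterparts, which is the pitfall in miniature.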
The percentage relative standard deviation. The percentage relative standard deviation (%RSD) is not a substitute for the standard deviation because they measure different
aspects of variation. Report both with the sample size. Also, avoid trying to average %RSDs or calculate %RSD on data expressed
as percentage recovery. The data are already percentages, so the average and the standard deviation will also be percentages.
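A small Python sketch shows why the two measures are not interchangeable. The data sets are invented for illustration and the `summarize` helper is a hypothetical name; it assumes the sample standard deviation (n - 1 denominator) and a nonzero average.

```python
from statistics import mean, stdev

def summarize(values):
    """Return sample size, average, sample SD, and %RSD for a data set."""
    avg = mean(values)
    sd = stdev(values)              # sample SD, n - 1 in the denominator
    return {
        "n": len(values),
        "mean": avg,
        "sd": sd,
        "pct_rsd": 100.0 * sd / avg,  # %RSD = 100 * SD / average
    }

# Two invented data sets with the same spread but different levels.
low = [9.0, 10.0, 11.0]
high = [99.0, 100.0, 101.0]

for name, data in (("low", low), ("high", high)):
    s = summarize(data)
    print(f"{name}: n={s['n']} mean={s['mean']:.1f} "
          f"sd={s['sd']:.2f} %RSD={s['pct_rsd']:.1f}")
```

Both sets have the same standard deviation, yet their %RSD values differ by a factor of ten because the averages differ. Reporting only one of the two statistics would hide that.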
Sample size. Always report the sample size. Remember the famous rat study where one third of the rats got better, one third got worse and
the third rat ran away?
Summaries. Avoid gratuitous summary statistics without a clear purpose; they cloud the interpretation. Likewise, printing out massive
lists of all possible summary statistics of all possible sets, subsets, and sub-subsets isn't worth the paper it is printed on.
Values and ranks. Give absolute values when looking at relative changes. For example, 2 out of 100 million versus 1 out of 100 million is a
100% increase, but it is still only 2 out of 100 million. Trust your reader to understand the practical implications of the
Ranking anything without giving absolute values and/or some sense of comparison to practical importance can be misleading.
For example, consider that we ranked schools using a metric that results in one school being at the bottom of the list and
another school at the top. But then, we realize both schools produced Nobel Prize winners. Does the ranking therefore have
Charts and graphs. Avoid pie charts unless your goal is to deliberately confuse your reader. Graph the data before starting a formal statistical
analysis. Common graphs include histograms, time plots, and scatter plots. Attempt to get cause and effect on the same page
Underlying data. Attempt to determine the underlying distribution of the data before starting a formal statistical analysis. While the normal
distribution is the most common, many other distributions, such as the log-normal, are also found.
If the data are symmetrical around the average, use the average and the standard deviation. If the data are skewed, the median
and interquartile range are preferred. This is not a hard and fast rule, just good practice.
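As a rough illustration of the difference, the following Python sketch computes both pairs of statistics for an invented, right-skewed data set. The `describe` helper is a hypothetical name, and the quartiles come from `statistics.quantiles` with its default method; other quartile conventions can give slightly different values.

```python
from statistics import mean, median, quantiles, stdev

def describe(values):
    """Return both summary pairs so they can be compared side by side."""
    q1, _, q3 = quantiles(values, n=4)   # quartile cut points
    return {
        "n": len(values),
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

# Invented data: mostly small values plus one large value (right-skewed).
skewed = [1, 2, 2, 3, 3, 3, 4, 4, 50]

d = describe(skewed)
print(f"n={d['n']}")
print(f"mean={d['mean']:.1f}  sd={d['sd']:.1f}")
print(f"median={d['median']:.1f}  IQR={d['iqr']:.1f}")
```

Here the single large value pulls the average well above the bulk of the data and inflates the standard deviation, while the median and interquartile range still describe where most of the values sit.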