Pitfalls in Statistics

The hardest errors to spot are the ones that don't look like errors at all.
Apr 01, 2011


Lynn D. Torbeck
We all have a scrupulous eye for subjects we are passionate about. Foodies insist on using the most authentic ingredients for their favorite dishes. Classic-car collectors require the smallest details to be as close to the original as possible. And statisticians cajole nonstatisticians to avoid classic pitfalls in applied statistics. In this latter case, however, it is not a matter of taste or authenticity. Incorrect statistical practices can result in erroneous calculations and poor conclusions. Some errors are small, but others can be monumental, and since we never know which way the apple will fall, we should treat them all the same.

Although there is always broad scope for error in a statistical project, some mistakes are more common than others. Those that appear correct on the surface are what we call pitfalls.

The most common and deadliest pitfalls

This section highlights some of the most common challenges facing statisticians.

Reportable values. A common pitfall is failing to define the reportable value or result for the data and the analysis (1). By definition, the reportable value is the end result of the complete measurement method as documented. It is the value compared with the specification and the official value most often used for statistical analysis. If different people or departments use different definitions, confusion reigns, and out-of-tolerance and out-of-specification investigations multiply.

Averages. The average of a set of averages is correct only if the sample sizes are the same; otherwise, each average must be weighted by its sample size (2). In addition, avoid averaging standard deviations, even when the sample sizes are the same. The variance is the standard deviation squared, and variances can be averaged when the sample sizes are the same. If the sample sizes are not the same, a weighting formula is used (2).
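To make the weighting concrete, here is a minimal sketch in plain Python; the two batches of data are invented purely for illustration. It contrasts the naive average of averages with the size-weighted average, and pools variances (weighted by degrees of freedom) rather than averaging standard deviations.

```python
# Invented example data: two groups with different sample sizes
groups = {
    "batch_A": [9.8, 10.1, 10.0],              # n = 3
    "batch_B": [10.4, 10.6, 10.5, 10.3, 10.7]  # n = 5
}

means = {k: sum(v) / len(v) for k, v in groups.items()}
sizes = {k: len(v) for k, v in groups.items()}

# Naive (wrong when sizes differ): simple average of the group means
naive_mean = sum(means.values()) / len(means)

# Correct: weight each group mean by its sample size
weighted_mean = sum(sizes[k] * means[k] for k in groups) / sum(sizes.values())

def sample_var(x):
    """Sample variance (divisor n - 1)."""
    m = sum(x) / len(x)
    return sum((xi - m) ** 2 for xi in x) / (len(x) - 1)

# Pool variances (not standard deviations), weighting by degrees of freedom
pooled_var = (sum((len(v) - 1) * sample_var(v) for v in groups.values())
              / sum(len(v) - 1 for v in groups.values()))

print(f"naive mean of means: {naive_mean:.3f}")
print(f"weighted grand mean: {weighted_mean:.3f}")
print(f"pooled SD:           {pooled_var ** 0.5:.3f}")
```

Weighting by sample size simply reproduces the average of all the raw values taken together, which is why it gives the correct grand mean.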

The percentage relative standard deviation. The percentage relative standard deviation (%RSD), 100 times the standard deviation divided by the average, is not a substitute for the standard deviation, because the two measure different aspects of variation. Report both, along with the sample size. Also, avoid averaging %RSDs or calculating a %RSD on data expressed as percentage recovery: such data are already percentages, so their average and standard deviation will also be in percentage units.
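As a small illustration, here is the %RSD calculation in Python on hypothetical assay values; note that the mean, standard deviation, and sample size are reported alongside it rather than replaced by it.

```python
import statistics

# Hypothetical assay results, in percent of label claim
data = [98.2, 99.1, 97.8, 98.6, 98.9]

mean = statistics.mean(data)
sd = statistics.stdev(data)    # sample standard deviation
pct_rsd = 100 * sd / mean      # %RSD = 100 * SD / mean

# Report all of these together, not the %RSD alone
print(f"n = {len(data)}, mean = {mean:.2f}, SD = {sd:.3f}, %RSD = {pct_rsd:.2f}")
```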

Sample size. Always report the sample size. Remember the famous rat study in which one-third of the rats got better, one-third got worse, and the third rat ran away?

Summaries. Avoid gratuitous summary statistics without a clear purpose; they cloud the interpretation. Likewise, printing out massive lists of all possible summary statistics for all possible sets, subsets, and sub-subsets isn't worth the paper it is printed on.

Values and ranks. Give absolute values when reporting relative changes. For example, going from 1 out of 100 million to 2 out of 100 million is a 100% increase, but it is still only 2 out of 100 million. Trust your reader to understand the practical implications of the data.

Ranking anything without giving absolute values, or some sense of comparison to practical importance, can be misleading. For example, suppose we rank schools using a metric that places one school at the bottom of the list and another at the top, and then we learn that both schools have produced Nobel Prize winners. Does the ranking still have any meaning?

Charts and graphs. Avoid pie charts unless your goal is to confuse your reader deliberately. Graph the data before starting a formal statistical analysis; common graphs include histograms, time plots, and scatter plots. Attempt to get cause and effect on the same page (3).
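A minimal sketch, assuming matplotlib is available, of the three plot types named above; the data are simulated solely for illustration.

```python
import random
import matplotlib.pyplot as plt

random.seed(1)
results = [random.gauss(100, 5) for _ in range(60)]   # e.g., assay values over time
inputs = [r + random.gauss(0, 2) for r in results]    # a possible causal variable

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3.5))

ax1.hist(results, bins=12)                            # histogram: shape of the data
ax1.set_title("Histogram")

ax2.plot(range(len(results)), results, marker="o")    # time plot: trends and shifts
ax2.set_title("Time plot")

ax3.scatter(inputs, results)                          # scatter: cause vs. effect
ax3.set_xlabel("possible cause")
ax3.set_ylabel("effect")
ax3.set_title("Scatter plot")

plt.tight_layout()
plt.show()
```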

Underlying data. Attempt to determine the underlying distribution of the data before starting a formal statistical analysis. While the normal distribution is the most common, many other distributions, such as the log-normal, are also found.

If the data are symmetrical around the average, use the average and the standard deviation. If the data are skewed, the median and interquartile range are preferred. This is not a hard-and-fast rule, just good practice.
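A short sketch on deliberately skewed (simulated log-normal) data shows why the distinction matters: the long right tail inflates the mean and standard deviation, while the median and interquartile range remain representative of the bulk of the data.

```python
import random
import statistics

random.seed(2)
# Simulated right-skewed (log-normal) data, for illustration only
data = [random.lognormvariate(0, 0.8) for _ in range(200)]

mean = statistics.mean(data)
sd = statistics.stdev(data)

median = statistics.median(data)
q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles (Python 3.8+)
iqr = q3 - q1

print(f"mean = {mean:.2f}, SD = {sd:.2f}")        # pulled up by the long tail
print(f"median = {median:.2f}, IQR = {iqr:.2f}")  # resistant to the skew
```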