Multiple Testing: What is it, why is it bad and how can we avoid it?

P-values play a central role in the analysis of many scientific experiments. But in 2015, the editors of the journal Basic and Applied Social Psychology prohibited the use of p-values in their journal. The primary reason for the ban was the proliferation of results obtained by so-called ‘p-hacking’, where a researcher tests a range of different hypotheses and publishes the ones which attain statistical significance while discarding the others. In this blog post, we’ll show how this can lead to spurious results and discuss a few things you can do to avoid engaging in this nefarious practice.

The Basics: What IS a p-value?

Under a Hypothesis Testing framework, a p-value associated with a dataset is defined as the probability of observing a result that is at least as extreme as the observed one, assuming that the null hypothesis is true. If the probability of observing such an event is extremely small, we conclude that it is unlikely the null hypothesis is true and reject it.
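To make that concrete, here is a minimal sketch (the coin example and its numbers are my own, not from the post): if a supposedly fair coin comes up heads 61 times in 100 flips, the one-sided p-value is the probability of seeing 61 or more heads under the null hypothesis that the coin is fair.

from scipy.stats import binomtest

# Hypothetical example: 61 heads in 100 flips of a supposedly fair coin.
# The p-value is the probability of a result at least this extreme
# under the null hypothesis that P(heads) = 0.5.
result = binomtest(k=61, n=100, p=0.5, alternative='greater')
print(f"p-value: {result.pvalue:.3f}")  # roughly 0.018, below the usual 0.05 threshold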

But therein lies the problem. Just because the probability of something is small, that doesn’t make it impossible. Using the standard significance threshold of 0.05, even if the null hypothesis is true, there is a 5% chance of obtaining a p-value below the threshold and therefore rejecting it. Such false positives are an inescapable part of research; there’s always a possibility that the sample you were working with isn’t representative of the wider population, and sometimes we make the wrong decision even though we analysed the data in a perfectly rigorous fashion.
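To see that 5% figure emerge, here is a small simulation of my own (a sketch, not code from the post) that repeatedly tests two groups drawn from the same distribution, so the null hypothesis is always true, and counts how often the p-value dips below 0.05.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the same distribution, so the null hypothesis is true.
    group_a = rng.normal(size=50)
    group_b = rng.normal(size=50)
    _, p_value = ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / n_experiments:.3f}")  # close to 0.05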

On the other hand, if there is a 5% chance of obtaining a positive result even when the null hypothesis is true, that makes it very easy to manufacture a positive result if you need one: Just keep running experiments until one comes back with p < 0.05.
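The arithmetic behind this is straightforward: if each test has a 5% false-positive rate and the tests are independent, the chance that at least one of n tests comes back ‘significant’ is 1 - 0.95^n, which grows quickly with n.

# Probability of at least one p-value below 0.05 among n independent tests
# when every null hypothesis is true: 1 - 0.95**n.
for n in (1, 5, 10, 21, 50):
    print(f"n = {n:2d}: P(at least one false positive) = {1 - 0.95 ** n:.2f}")
# At n = 21 (the number of crystal colours used below) this is already about 0.66.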

Obtaining a spurious result

To illustrate the above point, observe as I uncover a cure for COVID-19 based on healing crystals! Suppose we knew that the probability of someone dying of COVID (with no medical intervention at all) was 10%. To demonstrate the healing potential of the crystal, I tracked down 100 (simulated) people with a positive COVID test and gave each of them a healing crystal. I then tracked the proportion of these simulated patients who died within two weeks, aiming to show that exposure to the healing powers of the crystal reduced the probability of dying.

Conducting a hypothesis test to see if healing crystals can cure COVID-19
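The original code isn’t reproduced here, but a sketch of a test along these lines might look like the following (the seed and the choice of a one-sided binomial test are my own assumptions):

import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(42)  # seed chosen arbitrarily

# Under the null hypothesis the crystal does nothing, so P(death) stays at 10%.
p_death = 0.1
n_patients = 100

# Simulate 100 crystal-wielding patients whose true death probability is unchanged.
deaths = rng.binomial(n=1, p=p_death, size=n_patients).sum()

# One-sided test: is the observed death rate significantly below 10%?
result = binomtest(k=deaths, n=n_patients, p=p_death, alternative='less')
print(f"Deaths: {deaths}/{n_patients}, p-value: {result.pvalue:.3f}")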

In this case, as we set the probability of death after using the crystal to be the same as when the patient went completely untreated, it’s not surprising that we failed to reject the null hypothesis. However, I now claim that the colour of the crystal is essential in determining whether it has healing properties against COVID-19, so I ran the same experiment for lots of different colours (21 in total):

different_colours = ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Indigo', 'Violet', 'Pink', 'Magenta', 'Cyan', 'Purple', 'Gold', 'Silver', 'Bronze', 'Copper', 'Maroon', 'Grey', 'Black', 'Lavender', 'Apricot', 'Lime']
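A sketch of how that multi-colour experiment might be run (again, the details are my own assumptions; the key point is that the true death rate is 10% for every colour, so every null hypothesis is true):

import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(7)  # seed chosen arbitrarily
p_death, n_patients = 0.1, 100

significant_colours = []
for colour in different_colours:
    # Every colour has the same, unchanged death probability: every null is true.
    deaths = rng.binomial(n=1, p=p_death, size=n_patients).sum()
    p_value = binomtest(k=deaths, n=n_patients, p=p_death, alternative='less').pvalue
    if p_value < 0.05:
        significant_colours.append((colour, round(p_value, 3)))

print(significant_colours)  # with 21 tests, roughly a 2-in-3 chance this list is non-empty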

After waiting for two weeks and tallying up the number of deaths for all the different types of healing crystal, we made the astonishing discovery that Green healing crystals significantly reduced the number of deaths (p = 0.023).

Sadly, this is not a result you’ll be seeing in a reputable journal any time soon. One out of 21 experiments producing a p-value below 0.05 is almost exactly what you’d expect to see when the null hypothesis is true, and discarding the results of all the other experiments to report only the result associated with the Green crystals would be disingenuous in the extreme.

What can I do to avoid multiple testing?

Of course, in the real world, most instances of multiple testing are not as obviously nefarious as the one presented above. One way they can creep into an analysis unnoticed is via exploratory analysis. Suppose you were producing some simple visualisations of a response variable against a range of different explanatory variables and noticed a strong correlation between the response and one of them. If, on the basis of that visualisation, you then went on to conduct a significance test for an association between those two variables, you would be guilty of multiple testing even though you may only have computed a single p-value. Any scenario in which you use a dataset to derive a hypothesis and then test that hypothesis on the same dataset has the potential to be a multiple testing violation, so make sure to guard against this!
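One practical safeguard, offered here as a suggestion rather than something prescribed above, is to split your data before you start exploring: derive hypotheses on one portion and compute confirmatory p-values only on the held-out portion.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1,000 rows of a response plus some explanatory variables.
data = rng.normal(size=(1_000, 5))

# Split the rows once, before any exploration.
shuffled = rng.permutation(len(data))
explore, confirm = data[shuffled[:500]], data[shuffled[500:]]

# Browse plots and correlations on `explore` to come up with hypotheses,
# then compute the confirmatory p-value for the chosen hypothesis on `confirm` only.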

How can I avoid inflating the Type I error rate?

Multiple testing is bad because it inflates the Type I error rate, i.e. we reject the null hypothesis when it is true more often than the significance level suggests. If you’re testing n different hypotheses, the simplest way to avoid inflating the Type I error rate is to apply a Bonferroni correction. Instead of rejecting the null hypothesis when the p-value is less than your chosen significance level α, with a Bonferroni correction you reject the null hypothesis when the p-value is less than α/n, requiring stronger evidence to reject the null. A slightly more (statistically) powerful procedure is the Šidák correction, which works along similar lines but uses the threshold 1 - (1 - α)^(1/n) instead.
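As a sketch of what both corrections look like in practice, using the numbers from the crystal experiment:

alpha = 0.05
n_tests = 21  # one hypothesis test per crystal colour

bonferroni_threshold = alpha / n_tests               # ~0.0024
sidak_threshold = 1 - (1 - alpha) ** (1 / n_tests)   # ~0.0024, very slightly less strict

green_p = 0.023  # the 'significant' result from the crystal experiment
print(green_p < bonferroni_threshold)  # False: no longer significant after correction
print(green_p < sidak_threshold)       # False: no longer significant after correction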
