Why can a man not lift himself by pulling up on his bootstrap hypothesis test?

This blog post highlights a common mistake made when performing a bootstrap hypothesis test. Bootstrapping is a method of resampling data to estimate measures of variability, such as confidence intervals or variances.

In the simplest form of the bootstrap, assume you have a set of values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You want to estimate the mean and variability of the mean using these data. The recipe is as follows:

  1. Resample the data with replacement B times, each time drawing as many values as the original dataset contains.
  2. Compute the mean of each resampled dataset. This gives B means: m1, … , mB.
  3. Obtain a confidence interval directly from the empirical quantiles of m1, … , mB.
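The recipe above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes NumPy is available, and the function name `bootstrap_mean_ci`, the default B = 10,000, the 95% level and the fixed seed are all choices made for this example.

```python
import numpy as np

def bootstrap_mean_ci(data, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative sketch)."""
    rng = np.random.default_rng(seed)  # fixed seed only for reproducibility
    data = np.asarray(data, dtype=float)
    n = data.size
    # Step 1: draw B resamples of size n with replacement (one resample per row).
    resamples = rng.choice(data, size=(n_boot, n), replace=True)
    # Step 2: compute the mean of each resample, giving m1, ..., mB.
    boot_means = resamples.mean(axis=1)
    # Step 3: read the interval off the empirical quantiles of the B means.
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return data.mean(), lower, upper

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean, lower, upper = bootstrap_mean_ci(values)
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

To use a different statistic, replace the `mean(axis=1)` call with, for example, `np.median(resamples, axis=1)`; the rest of the procedure stays the same.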

This is pretty easy, and it requires no parametric assumptions. You can use any statistic in place of the mean: the procedure is identical even for very, very complicated statistics.

Now suppose I have two groups, Group 1: 1, 2, 3, 4, 5 and Group 2: 2, 3, 4, 5, 6, and I want to know whether the means of these groups are equal. The following naive approach is incorrect:

  1. Apply the bootstrap algorithm to Group 1 and compute a mean and a confidence interval.
  2. Check whether the empirical mean of Group 2 lies in the tails of Group 1's bootstrap distribution, that is, outside its confidence interval.

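The correct approach is slightly more involved. The naive method treats the mean of Group 2 as a fixed number, so it ignores that mean's own sampling variability as well as the potentially different sample sizes and variances of the two groups. The correct recipe is as follows: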

  1. Compute the t-statistic for Group 1 and Group 2 from the mean, variance and sample size of each group. Call this t.
  2. Call Group 1 x, Group 2 y, and the pooled data from both groups z.
  3. Generate a new Group 1 and a new Group 2, called x’ and y’, which share a common mean so that the null hypothesis of equal means holds:
    1. x’ = x – mean(x) + mean(z)
    2. y’ = y – mean(y) + mean(z)
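  4. Draw B bootstrap datasets (say B = 1000), each time resampling x’ and y’ separately with replacement.
  5. Compute the t-statistic for each bootstrap dataset to obtain t*1, … , t*B.
  6. The p-value is simply the proportion of t* values that are at least as extreme as t: t* ≥ t for a one-sided test, or |t*| ≥ |t| for a two-sided test.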
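The steps above can be sketched as follows. This is a hedged illustration, not a definitive implementation: it assumes NumPy, it takes the t-statistic to be the unpooled (Welch) form, and it reports a two-sided p-value by comparing absolute values, both of which are choices made here rather than prescribed by the recipe. The names `welch_t` and `bootstrap_t_test` are made up for this example.

```python
import numpy as np

def welch_t(x, y):
    """Two-sample t-statistic using each group's own mean, variance and size."""
    vx = x.var(ddof=1) / x.size
    vy = y.var(ddof=1) / y.size
    return (x.mean() - y.mean()) / np.sqrt(vx + vy)

def bootstrap_t_test(x, y, n_boot=1000, seed=0):
    """Bootstrap test of equal means (illustrative sketch, two-sided)."""
    rng = np.random.default_rng(seed)  # fixed seed only for reproducibility
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step 1: observed t-statistic.
    t_obs = welch_t(x, y)
    # Steps 2-3: shift both groups to the pooled mean so the null hypothesis holds.
    z_mean = np.concatenate([x, y]).mean()
    x_shift = x - x.mean() + z_mean
    y_shift = y - y.mean() + z_mean
    # Steps 4-5: resample each shifted group separately and recompute t.
    # Note: for tiny groups a resample can be constant in both groups, which
    # gives a zero denominator; this is rare and is not handled specially here.
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x_shift, size=x.size, replace=True)
        yb = rng.choice(y_shift, size=y.size, replace=True)
        t_boot[b] = welch_t(xb, yb)
    # Step 6: proportion of bootstrap statistics at least as extreme as t_obs.
    p_value = np.mean(np.abs(t_boot) >= abs(t_obs))
    return t_obs, p_value

group1 = [1, 2, 3, 4, 5]
group2 = [2, 3, 4, 5, 6]
t_obs, p_value = bootstrap_t_test(group1, group2)
print(f"t = {t_obs:.3f}, two-sided bootstrap p-value = {p_value:.3f}")
```

Resampling the two shifted groups separately keeps each group's own sample size and spread, which is exactly what the naive approach throws away.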

For more exotic examples, see B. Efron and R. Tibshirani, “Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy”, Statistical Science, 1986.
