HOW TO BE A BAYESIAN - ft. a completely ridiculous example

In an ideal world, whenever we performed an experiment we would be completely certain that it represented the ground truth. In reality, random fluctuations mean we can rarely be totally sure that our experimental observations do represent the truth.

A typical frequentist analysis pipeline might run as follows: declare a hypothesis; design an experiment which tests whether it is true; conduct the experiment; inspect the results and decide whether you have enough evidence to conclude that your hypothesis is probably true (or false).

Bayesian analyses differ from frequentist ones because we are allowed to incorporate our prior knowledge of the subject into the post-experiment inferences. This can be useful for a number of reasons. In this blog post we'll walk through a simple Bayesian analysis of a binary event. We'll discuss how to objectively quantify a prior distribution and combine the prior with our data to obtain a posterior distribution.

To get the most out of this post you should have a good idea of what a probability distribution is and know (in principle) how to combine a prior distribution and a likelihood to obtain a posterior. (I would like to apologise in advance to any mathematicians who stumble across this post in case the lack of mathematical rigour makes you feel ill.)

In [1]:
#Load in required modules
import pandas as pd 
from scipy.optimize import minimize_scalar
from scipy.stats import beta
from scipy.stats import binom
import numpy as np
import matplotlib.pyplot as plt

A binary experiment

To conduct this analysis, we need some data. Because in OPIG we only address the highly important questions, I asked the following:

"Once opened, should tomato ketchup be stored in the fridge or not?" (YES or NO)

I'm interested (stretching the meaning of that word to its very limits) in making inferences about the true value of p, the probability that a randomly chosen individual stores their ketchup in the fridge.

Specifying my prior distribution

The first step in any Bayesian analysis is writing down what you already know - i.e. specifying your prior distribution. This can seem like a daunting step because of the potential consequences for your analysis if you get it wrong (see below), but there are a number of constraints we can impose to make picking a suitable prior much more manageable.

Our prior distribution is a distribution on p - the probability that someone refrigerates their ketchup. p takes values between 0 and 1 so for each value in that region we need some notion of how likely it is that p takes that value.

The first constraint we will impose is that our prior must follow a beta distribution. The reason for choosing a beta prior for binary data is that it makes computing our posterior distribution much, much easier (Google "conjugate prior"). Our prior elicitation now boils down to choosing the hyperparameters (a, b) which control the shape of the beta distribution.

It can be shown that if p ~ Beta(a,b), then E(p) = a/(a + b). We want to choose a and b so that the expected value of p is sensible, given our prior knowledge:

Let's imagine I went away and did lots of research about ketchup refrigeration and assembled a panel of experts to advise me. I might come to the conclusion that somewhere around 55% of people store their ketchup in the fridge. It would then be reasonable to require that E(p) = a/(a + b) = 0.55. Rearranging, 0.45a = 0.55b, which gives a = 11b/9 (1).

Now we only have a single parameter to pick. Another distributional fact we might have an opinion about is the probability of p being below a certain value. For illustrative purposes, imagine that my cadre of ketchup experts told me it would be unlikely that p < 0.4 (say, a 20% chance). The beta CDF is parameterised by a and b, so using (1) we can find the value of b that gives P(p < 0.4 | b) = 0.2.

If I'm honest I didn't want to wade through the algebra to find the value of b so I used a function optimiser to find an approximate value for me (yes, I know that makes me a bad person).

In [2]:
#We want the value of b which minimises this function, i.e. which makes
#P(p < 0.4) as close as possible to 0.2 (with a fixed by constraint (1))
def hyperparam_grid_search(b):
    a = 11*b/9
    return np.absolute(beta.cdf(0.4, a, b, loc = 0, scale = 1) - 0.2)


min_param = minimize_scalar(hyperparam_grid_search, method = 'brent', options={'xtol' : 1e-8})

print("The hyperparameters which determine my prior are:")
print(min_param.x*11/9, min_param.x) #first a (= 11b/9), then b
The hyperparameters which determine my prior are:
4.285476474234375 3.5062989334644894
In [3]:
beta.cdf(0.4, 4.2854, 3.506) #Check that the parameters we got do give roughly the right value - should be ~=0.2
Out[3]:
0.19997809646079884
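
While we're at it, we can also check that constraint (1) is satisfied - by construction the mean of this prior should come out at (roughly) 0.55. A quick one-liner, just for peace of mind:

#The prior mean should be a/(a + b) = 0.55 by construction
print(beta.mean(min_param.x*11/9, min_param.x))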

Let's visualise our prior over the range on which it is defined:

In [4]:
fig = plt.figure()
ax = plt.axes()
plt.xlim(0,1)
x = np.linspace(0, 1, 1000)
ax.plot(x, beta.pdf(x, min_param.x*11/9, min_param.x));

The graph above represents our prior on p - i.e. our current belief about the value it might take.

I collected some data to get an indication of the preferences of some real people and to see whether reality ties in with our prior distribution. We'll then calculate the posterior distribution, which modifies the prior to take the new information into account.

In a shining example of scientific rigour, I obtained my data by badgering the other OPIGlets who happened to be in the office one morning. You can see the results below.

In [5]:
names_list = ['Eve', 'Constantin', 'Sarah', 'Angela', 'Anne', 'Dom', 'Claire', 'Fergus 1', 'Aleks', 'Dan', 'Mark 1', 'Clare', 'Carlos', 'Susan', 'Lucian', 'Tom', 'Lyuba', 'Marc 2', 'Mark 3', 'Garrett', 'Charlotte', 'Jack', 'Matt', 'Fergus 2', 'Catherine']
fridge_list = [1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,0,0,1,1,0]
dataset_dict = {'name': names_list, 'fridge': fridge_list}

data = pd.DataFrame(dataset_dict)
In [6]:
print(data)
          name  fridge
0          Eve       1
1   Constantin       1
2        Sarah       1
3       Angela       0
4         Anne       1
5          Dom       1
6       Claire       1
7     Fergus 1       1
8        Aleks       0
9          Dan       1
10      Mark 1       1
11       Clare       1
12      Carlos       1
13       Susan       1
14      Lucian       1
15         Tom       0
16       Lyuba       1
17      Marc 2       1
18      Mark 3       1
19     Garrett       1
20   Charlotte       0
21        Jack       0
22        Matt       1
23    Fergus 2       1
24   Catherine       0

So overall 19 OPIGlets store their ketchup in a fridge whilst 6 do not. Because a beta prior is conjugate to a binomial likelihood, all we need to do to calculate the posterior is update the prior hyperparameters: a increases by the number of YES responses and b by the number of NO responses. This is taken care of in the method update_beta_params().

In [7]:
class beta_dist:
    def __init__(self, a = 1, b = 1):
        self.a = a
        self.b = b

    #Evaluate the beta pdf on a grid of values covering [0, 1]
    def get_pdf(self):
        x = np.linspace(0, 1, 1000)
        fx = beta.pdf(x, self.a, self.b)
        dens_dict = {'x': x, 'fx': fx}
        return dens_dict

    #Conjugate update of the hyperparameters: add the successes to a and the failures to b
    def update_beta_params(self, n, num_successes):
        self.old_a = self.a
        self.old_b = self.b
        self.a = self.a + num_successes
        self.b = self.b + n - num_successes
        
        
In [8]:
KETCHUP = beta_dist(a = 4.285476474234375, b = 3.5062989334644894)
prior = KETCHUP.get_pdf()

#Obtain data to update hyperparameters
total_participants = data.shape[0]
how_many_chill_their_ketchup = sum(data['fridge']) 

#Update hyperparameters
KETCHUP.update_beta_params(n = total_participants, num_successes = how_many_chill_their_ketchup)
posterior = KETCHUP.get_pdf()
print("The updated hyperparameters are:")
print(KETCHUP.a, KETCHUP.b)
post_mean = KETCHUP.a/(KETCHUP.a + KETCHUP.b)
MLE = how_many_chill_their_ketchup/total_participants

print("The mean value of our posterior is %s" %post_mean)
print("Our Maximum Likelihood estimate is %s" %MLE)

#Plot prior and posterior
plt.plot(prior['x'], prior['fx'])
plt.plot(posterior['x'], posterior['fx'])
plt.axvline(x=how_many_chill_their_ketchup/total_participants, color = "red")
plt.axvline(x = post_mean, color = "green")
plt.legend(['Prior Distribution', 'Posterior Distribution','ML estimate of p', 'Posterior mean'], loc='upper left')

plt.show()
The updated hyperparameters are:
23.285476474234375 9.506298933464489
The mean value of our posterior is 0.7101011209282497
Our Maximum Likelihood estimate is 0.76
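
If you don't want to take the conjugacy result on faith, a rough sanity check is to build the posterior by brute force - prior density times binomial likelihood, normalised numerically on a grid - and compare it with the Beta(23.29, 9.51) density the conjugate update gave us. Something along these lines:

#Brute-force posterior: prior pdf x binomial likelihood, normalised numerically
x = np.linspace(0.001, 0.999, 999)
unnormalised = beta.pdf(x, 4.285476474234375, 3.5062989334644894) * binom.pmf(19, 25, x)
grid_posterior = unnormalised / np.trapz(unnormalised, x)
#Compare with the conjugate-update posterior - any difference is just numerical error
print(np.max(np.abs(grid_posterior - beta.pdf(x, KETCHUP.a, KETCHUP.b))))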

What does all this mean?

We started out with a prior distribution on p. The prior distribution represented my beliefs about how likely it was for p (the proportion of people who chill their Tomato Ketchup) to take certain values. I thought it was likely p was just over 0.5 but not by too much.

The results of our experiment indicated that, in OPIG at least, many more than 50% of people chill their ketchup (the estimated proportion was 0.76).

Taking into account both my prior uncertainty regarding p and the data, we arrived at a posterior which combined the two in an objective way. Using the posterior mean as an estimate, I now believe that 71% of people put their ketchup in the fridge (compared to the prior mean of 55%).
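
One way to see how the prior and the data get combined: the posterior mean is a weighted average of the prior mean and the maximum likelihood estimate, with the prior receiving weight (a + b)/(a + b + n). A quick check with our numbers:

#Posterior mean as a weighted average of prior mean and MLE
a0, b0 = 4.285476474234375, 3.5062989334644894
n, k = 25, 19
w = (a0 + b0)/(a0 + b0 + n)                #weight given to the prior
print(w * a0/(a0 + b0) + (1 - w) * k/n)    #~0.710, matching the posterior mean above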

But can't I just use my prior to bias my results?!

Well, yes and no. Let's see what would happen if I used an incredibly strong prior for p, with an expected value of 0.05.

In [9]:
KETCHUP_BAD_PRIOR = beta_dist(a = 5, b = 95)
prior = KETCHUP_BAD_PRIOR.get_pdf()

#Obtain data to update hyperparameters
total_participants = data.shape[0]
how_many_chill_their_ketchup = sum(data['fridge']) 

#Update hyperparameters
KETCHUP_BAD_PRIOR.update_beta_params(n = total_participants, num_successes = how_many_chill_their_ketchup)
posterior = KETCHUP_BAD_PRIOR.get_pdf()
print("The updated hyperparameters are:")
print(KETCHUP_BAD_PRIOR.a, KETCHUP_BAD_PRIOR.b)
post_mean = KETCHUP_BAD_PRIOR.a/(KETCHUP_BAD_PRIOR.a + KETCHUP_BAD_PRIOR.b)
MLE = how_many_chill_their_ketchup/total_participants

print("The mean value of our posterior is %s" %post_mean)
print("Our Maximum Likelihood estimate is %s" %MLE)

#Plot prior and posterior
plt.plot(prior['x'], prior['fx'])
plt.plot(posterior['x'], posterior['fx'])
plt.axvline(x=how_many_chill_their_ketchup/total_participants, color = "red")
plt.axvline(x = post_mean, color = "green")
plt.legend(['Prior Distribution', 'Posterior Distribution','ML estimate of p', 'Posterior mean'], loc='upper right')

plt.show()
The updated hyperparameters are:
24 101
The mean value of our posterior is 0.192
Our Maximum Likelihood estimate is 0.76

Despite the data coming down heavily on the side of p being closer to 1 than to 0, our posterior mean is 0.192.

In this case, the conclusions we would draw using a Bayesian approach would be markedly different from those of a frequentist approach, purely because of our selected prior.

Why do we bother with prior distributions then?

Before we answer that, let's keep our bad prior but, instead of using data from the 25 people I could grab lurking around the Stats department, pretend that we asked 2,000 people and got the same proportion of responses (1,520 refrigerators vs 480 room-temperaturers).

In [10]:
KETCHUP_BAD_PRIOR = beta_dist(a = 5, b = 95)
prior = KETCHUP_BAD_PRIOR.get_pdf()

#Obtain data to update hyperparameters
total_participants = 2000
how_many_chill_their_ketchup = 1520

#Update hyperparameters
KETCHUP_BAD_PRIOR.update_beta_params(n = total_participants, num_successes = how_many_chill_their_ketchup)
posterior = KETCHUP_BAD_PRIOR.get_pdf()
print("The updated hyperparameters are:")
print(KETCHUP_BAD_PRIOR.a, KETCHUP_BAD_PRIOR.b)
post_mean = KETCHUP_BAD_PRIOR.a/(KETCHUP_BAD_PRIOR.a + KETCHUP_BAD_PRIOR.b)
MLE = how_many_chill_their_ketchup/total_participants

print("The mean value of our posterior is %s" %post_mean)
print("Our Maximum Likelihood estimate is %s" %MLE)

#Plot prior and posterior
plt.plot(prior['x'], prior['fx'])
plt.plot(posterior['x'], posterior['fx'])
plt.axvline(x=how_many_chill_their_ketchup/total_participants, color = "red")
plt.axvline(x = post_mean, color = "green")
plt.legend(['Prior Distribution', 'Posterior Distribution','ML estimate of p', 'Posterior mean'], loc='upper left')

plt.show()
The updated hyperparameters are:
1525 575
The mean value of our posterior is 0.7261904761904762
Our Maximum Likelihood estimate is 0.76

We can see that, despite starting with a totally garbage prior, our posterior estimate has almost converged to the frequentist one. That is, with enough data, the likelihood overwhelms the prior and it matters very little what your prior distribution looks like.
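
To watch this happening, we can keep the Beta(5, 95) prior, fix the observed proportion at 0.76 and let the (hypothetical) sample size grow - the posterior mean creeps steadily towards the ML estimate:

#Posterior mean under the Beta(5, 95) prior as the sample size grows
#(keeping the observed proportion fixed at 0.76)
for n in [25, 250, 2500, 25000]:
    k = round(0.76 * n)
    print(n, (5 + k)/(5 + 95 + n))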

The benefit of using a well-selected prior distribution is that, if used properly, it acts like a regulariser on the experimental data. The more strongly you hold your current beliefs, the more compelling the evidence should be to change them. Using a Bayesian approach also gives us a full distribution on p, which can be useful if we're making decisions based on the estimate we obtained.
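
For example, with the posterior we got from the real data (Beta(23.29, 9.51)) we can read off a credible interval for p, or the posterior probability that a majority of people refrigerate their ketchup - something like:

#Direct probability statements about p from the posterior
print(beta.ppf([0.025, 0.975], KETCHUP.a, KETCHUP.b))  #95% equal-tailed credible interval
print(1 - beta.cdf(0.5, KETCHUP.a, KETCHUP.b))         #P(p > 0.5)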

For those who remain suspicious of Bayesian methods, that's OK. You can continue using standard frequentist tools which no one has ever misused, such as p-values \s.