Drawing Wavy Lines That Match Your Data, or, An Introduction to Kernel Density Estimation

One of the fundamental questions of statistics is “How likely is it that event X will occur, given what we’ve observed already?”. It’s a question that pops up in all sorts of different fields, and in our daily lives as well, so it’s well worth being able to answer rationally. Under the statistician’s favourite assumption that the observed data are independent and identically distributed (i.i.d.), we can use the data to construct a probability distribution; that is, if we’re about to observe a new data point, x*, we can say how likely it is that x* will take a specific value.

The most popular approach for specifying such probability distributions is known as Maximum Likelihood Estimation (MLE). MLE generally involves assuming that the unknown distribution which generated your data belongs to a particular parametric family, and then finding the parameter values under which the observed data are most likely. For example, we might conclude that the data are normally distributed and then set about finding the most plausible mean and variance values.
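As a concrete illustration, here is a minimal Python sketch of MLE under a normality assumption, using scipy.stats.norm.fit (which returns the maximum-likelihood estimates of the mean and standard deviation); the simulated data and parameter values are made up for the example.

```python
import numpy as np
from scipy import stats

# Simulated data for illustration only: 500 draws from N(2, 1.5^2)
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)

# MLE under the normality assumption: fit() returns the maximum-likelihood
# estimates of the mean (loc) and standard deviation (scale)
mu_hat, sigma_hat = stats.norm.fit(data)

# The fitted density can then be evaluated at any new point x*
x_star = 3.0
print(mu_hat, sigma_hat, stats.norm.pdf(x_star, loc=mu_hat, scale=sigma_hat))
```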

Figure: an example of data that would be difficult to associate with a particular family of distributions; in this case, standard parametric approaches would not accurately estimate the probability density (image from: http://pubs.sciepub.com/education/7/8/8/figure/14)

As always seems to be the case when we make assumptions in statistics, MLE works really well when our assumptions are (more or less) correct, but it can produce badly misleading estimates when they aren't. Non-parametric methods allow us to make fewer assumptions about the data and so are typically more robust than their parametric counterparts. In this blog post we'll look at Kernel Density Estimation, which allows us to estimate how likely an event is while assuming only that the data are i.i.d.

Kernel Density Estimation

Suppose we have a dataset, x = (x1, …, xn), whose elements are i.i.d. realisations of a random variable with density function f, and we want to estimate f. That is, for each possible value, x*, we want to accurately estimate the probability density f(x*). Our approximating function, f*, should fulfil two criteria:

  1. As with all probability densities, if we integrate f*(x) over the set of values x can take, the result must be 1
  2. In general, for a point x*, if x* is close to lots of points in x, then it should receive a higher density than if it is not close to any points in x.

To satisfy these criteria, we can specify f* to be of the following form:

$$f^*(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where K is a kernel function that is small for large positive and negative values, is large at zero and integrates to 1, and h > 0 is a bandwidth parameter that controls how far each observation's influence spreads. This formulation allows us to assign high density to regions where the dataset is well represented, and low density to areas which are not close to any points in the dataset.
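To make the definition concrete, here is a minimal NumPy sketch of this estimator; the function and argument names (kde, kernel, bandwidth) are my own choices for illustration.

```python
import numpy as np

def kde(x_star, data, kernel, bandwidth):
    """Evaluate the kernel density estimate f*(x*) at each point in x_star.

    x_star    : points at which to evaluate the density
    data      : the observed sample x_1, ..., x_n
    kernel    : a function K that peaks at zero and integrates to 1
    bandwidth : the smoothing parameter h > 0
    """
    x_star = np.atleast_1d(x_star)
    # For each evaluation point, average K((x* - x_i) / h) over all
    # observations and divide by h so the estimate integrates to 1.
    scaled_diffs = (x_star[:, None] - np.asarray(data)[None, :]) / bandwidth
    return kernel(scaled_diffs).mean(axis=1) / bandwidth
```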

There are lots of possible forms which K can take, but a popular one is the Gaussian kernel, which takes the form:

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)$$
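Plugging the Gaussian kernel into the kde sketch above might look like this; the bimodal sample and the bandwidth value of 0.4 are purely illustrative.

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density: peaks at zero and integrates to 1
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

# An illustrative bimodal sample that a single normal would fit poorly
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])

grid = np.linspace(-5, 7, 500)
density = kde(grid, data, kernel=gaussian_kernel, bandwidth=0.4)

# Sanity check on criterion 1: the estimate should integrate to roughly 1
print((density * (grid[1] - grid[0])).sum())
```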

Kernel density estimators are simple to implement in R or Python and are a robust alternative to parametric density estimation. They can also be extended, via Bayes’ Rule, to give a non-parametric regression method, known as the Nadaraya-Watson method. If you’re interested in making your analyses more robust, consider trying them out!
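For instance, in Python the scipy.stats.gaussian_kde class gives a ready-made kernel density estimator (R has the built-in density function), and a bare-bones Nadaraya-Watson regression can be written in a few lines; the data, bandwidth and function name below are made up for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Density estimation with SciPy's built-in KDE (Gaussian kernel,
# with the bandwidth chosen automatically by Scott's rule)
sample = rng.normal(0, 1, 400)
f_hat = gaussian_kde(sample)
print(f_hat([0.0, 1.0, 2.0]))  # density estimates at a few points

# A minimal Nadaraya-Watson sketch: a kernel-weighted average of the
# observed responses y around each query point.
def nadaraya_watson(x_query, x_obs, y_obs, bandwidth):
    weights = np.exp(-0.5 * ((x_query[:, None] - x_obs[None, :]) / bandwidth) ** 2)
    return (weights * y_obs).sum(axis=1) / weights.sum(axis=1)

x_obs = rng.uniform(0, 10, 200)
y_obs = np.sin(x_obs) + rng.normal(0, 0.3, 200)
x_query = np.linspace(0, 10, 5)
print(nadaraya_watson(x_query, x_obs, y_obs, bandwidth=0.5))
```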
