No labels, no problem! A quick introduction to Gaussian Mixture Models

Statistical Modelling Big Data Analytics™ is in vogue at the moment, and there’s nothing quite so fashionable as the neural network. Capable of capturing complex non-linear relationships and scaling to high-dimensional datasets, neural networks are here to stay.

For your garden-variety neural network, you need two things: a set of features, X, and a label, Y. But what do you do if labelling is prohibitively expensive or your expert labeller goes on holiday for 2 months and all you have in the meantime is a set of features? Happily, we can still learn something about the labels, even if we might not know what they are!

In this blog post, we’re going to take a quick look at the Linear Discriminant Analysis (LDA) model before looking at its unsupervised extension, the Gaussian Mixture Model. We won’t use any more algebra than is necessary to give some intuition as to what’s going on, so fair warning to any mathematician who might find my vigorous hand waving personally offensive!

Linear Discriminant Analysis

Let’s start by defining our dataset. For each observation we have a set of continuous features, X, and an associated categorical label, Y. Our task is to predict the correct value Y* for a new example X*. For the LDA model, we know exactly what the true label is for each observation in our training set and we make the following assumption:

  • For a given observation, the features X were generated by a ‘ground truth’ multivariate normal distribution. Each label is associated with a different normal distribution: if two observations have the same label, they were generated by the same distribution, and if they have different labels, they were generated by different distributions.

Given this assumption, we have a recipe for generating a new data point (X*, Y*). First we randomly sample the label Y* according to the label probabilities. Then we sample from the normal distribution associated with Y* to get X*.

An example of how we would model the generation of a dataset under the LDA model. The red numbers are the theoretical values used to generate the data – in practice we don’t know what they are and need to find estimates for these values.
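In code, that recipe looks something like the sketch below. The parameter values are made up purely for illustration – they play the role of the ‘red numbers’ in the figure above – and the two-dimensional, three-label setup simply mirrors the plots in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 'ground truth' parameters (illustrative values only)
label_probs = np.array([0.5, 0.3, 0.2])            # P(Y* = y) for y = 0, 1, 2
means = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 5.0]])
cov = np.eye(2)                                     # shared, known covariance

def generate_point():
    """Sample one (X*, Y*) pair under the LDA generative story."""
    y = rng.choice(len(label_probs), p=label_probs)  # pick the label first...
    x = rng.multivariate_normal(means[y], cov)       # ...then sample the features
    return x, y

samples = [generate_point() for _ in range(300)]
X = np.array([x for x, _ in samples])
Y = np.array([y for _, y in samples])
```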

Assuming this is the way the data were actually generated, we still don’t know what the true distribution parameters are (although for simplicity we’re assuming we do know the covariance matrix of each normal distribution). We need to estimate the underlying label probabilities, and then for each label we need to estimate the mean of the associated normal distribution. Once we have estimates for all of these parameters, we can predict the label of a new data point X* using the following procedure:

  • Calculate the conditional probability P(Y*=y|X*) for each possible label y
  • Whichever label has the highest probability is the one we assign X* to.
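Written out, the conditional probability in the first step comes straight from Bayes’ rule, where p_y is the unconditional probability of label y, m_y is the mean of the associated normal distribution, N(· ; m, Σ) is its density and Σ is the shared covariance matrix:

$$ P(Y^* = y \mid X^*) = \frac{p_y \, \mathcal{N}(X^* ; m_y, \Sigma)}{\sum_{k} p_k \, \mathcal{N}(X^* ; m_k, \Sigma)} $$

The denominator is the same for every label, so in practice we just pick the label with the largest numerator.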
Data generated using the process outlined above. The colour of a point represents the underlying value of Y – we can see most of the points with the same colour are clustered together, although there are some points which are situated between clusters.

Parameter estimation

For each distribution, the parameters are estimated using Maximum Likelihood Estimation. This involves some dense algebra if you want to prove it for yourself, but happily the end results are exactly what you might expect intuitively. As mentioned above, for the sake of simplicity, we’re going to make the additional assumption that all of the normal distributions share the same known covariance matrix – whether or not that’s a reasonable assumption depends on the specific use case (but the method works in much the same way).

Estimation of label probabilities: To estimate the probability of a point having label y (unconditional on the value of X), simply calculate the proportion of observations in the training set which have the label y.

p_i is our estimate for the probability that a randomly chosen point has label i. We calculate the proportion of examples in the training set with label i.
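In symbols (with n_i standing for the number of training observations with label i and N for the total number of observations – notation introduced here for convenience):

$$ \hat{p}_i = \frac{n_i}{N} $$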

Estimation of normal distribution parameters: To estimate the theoretical mean associated with a label, take all of the data points with that label and average their feature vectors.

m_i is our estimate for the mean of the ith normal distribution. It’s calculated by taking the mean of the Xs in the training set with label i.
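Or, in the same notation, where the sum runs over the training observations (x_j, y_j) whose label is i:

$$ \hat{m}_i = \frac{1}{n_i} \sum_{j \,:\, y_j = i} x_j $$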

Model fitting and evaluation

By fitting an LDA model to the data above, we can predict the label of each point in the training set (see results below). As the three clusters are separated quite clearly, we’re able to assign the correct label to the points in the dataset with high accuracy.

Left: Data plotted with the original labels. Right: Data coloured with the label assigned by the LDA model and the decision boundaries which partition the plane. All points in the same segment are assigned the same label. As a result we incorrectly labelled one of the points as yellow when the true label was brown.
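If you’d rather not code the estimators up by hand, scikit-learn’s LinearDiscriminantAnalysis fits this kind of model for you (it estimates the shared covariance matrix from the data rather than assuming it is known). A minimal sketch – make_blobs is just a convenient stand-in for the clustered data in the plots, not the actual dataset used above:

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy stand-in dataset: 300 points drawn from 3 Gaussian-ish clusters
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)                          # estimate label probabilities and class means

predicted = lda.predict(X)             # most probable label for each point
probabilities = lda.predict_proba(X)   # P(Y = y | X) for every label y
```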

Modelling unlabelled data

To estimate the parameters in the way described above, we needed to know the label of each point – without the labels, the algorithm doesn’t work.

A Gaussian Mixture Model makes much the same assumptions as the Linear Discriminant Analysis model – each data point is associated with a label, and each label is associated with a normal distribution which generates the data point. The difference is that for the Gaussian Mixture Model the label is a latent variable (denoted as Z below): we never actually know for sure what the label is, so it’s not possible to compute the maximum likelihood estimates in the way we did for LDA.

The key difference when fitting the model is that the true label (which identifies the correct group for an observation with 100% certainty) is replaced by a probability distribution over what the label is, conditional on the associated value of X. This means that every data point is included in the estimation of a given cluster’s mean, but if we believe there is only a small chance that a data point truly has that label given its value of X, then we weight that point so that it makes only a very small contribution to the estimate.

Q_ik is called the ‘responsibility’ of cluster k for point i. Given our current estimates of p_k and m_k, it represents the probability that data point i belongs to cluster k. Intuitively it makes sense that if we are confident a point belongs to a given cluster, it should have more of an impact on the parameter estimation for that cluster than the points we think belong to another cluster.

Parameter estimates for the GMM. Note that 1) the parameter estimates depend on Q_ik, which in turn depends on p_k and m_k (which is why we need an iterative approach to estimation), and 2) the binary 0/1 indicators from LDA have been replaced by Q_ik, the estimated probability that a data point belongs to a cluster. In this way, all data points contribute to the estimation of all parameters.
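Written out, the responsibilities and the weighted parameter estimates described in the captions above take the following form (same notation as before, with Σ the shared covariance matrix and N the number of data points):

$$ Q_{ik} = \frac{p_k \, \mathcal{N}(x_i ; m_k, \Sigma)}{\sum_{j} p_j \, \mathcal{N}(x_i ; m_j, \Sigma)}, \qquad \hat{p}_k = \frac{1}{N} \sum_{i=1}^{N} Q_{ik}, \qquad \hat{m}_k = \frac{\sum_{i=1}^{N} Q_{ik} \, x_i}{\sum_{i=1}^{N} Q_{ik}} $$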

The tricky part is working out the probability that a point belongs to a given cluster – if we knew that, we’d already have accomplished our goal! Instead we’re forced to adopt a two-stage iterative approximation scheme called the Expectation-Maximisation (EM) algorithm. The EM algorithm works as follows:

  1. Start by randomly initialising values for the unconditional label probabilities and normal distribution parameters
  2. Given the current values for the unconditional label probabilities and normal distribution parameters, calculate the probability that each point belongs to each cluster, conditional on its feature vector
  3. Use the calculated probabilities to weight the contribution of each point to each cluster and update the current values for the unconditional label probabilities and normal distribution parameters
  4. Repeat steps (2)-(3) until convergence
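Those four steps map directly onto code. Here’s a minimal sketch in Python, under the same simplifying assumption as the rest of the post that the covariance matrix is shared and known (taken to be the identity here); the function name, the fixed iteration count and the initialisation scheme are illustrative choices rather than anything prescribed above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(X, n_clusters, n_iter=100, seed=0):
    """Minimal EM for a GMM with a shared, known (identity) covariance matrix."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    cov = np.eye(d)                                   # assumed known covariance

    # Step 1: random initialisation
    p = np.full(n_clusters, 1.0 / n_clusters)         # unconditional label probabilities
    m = X[rng.choice(n, n_clusters, replace=False)]   # cluster means

    for _ in range(n_iter):                           # step 4: repeat (fixed count here)
        # Step 2 (E-step): Q[i, k] = P(point i belongs to cluster k | x_i)
        Q = np.column_stack([
            p[k] * multivariate_normal.pdf(X, mean=m[k], cov=cov)
            for k in range(n_clusters)
        ])
        Q /= Q.sum(axis=1, keepdims=True)

        # Step 3 (M-step): responsibility-weighted parameter updates
        p = Q.mean(axis=0)
        m = (Q.T @ X) / Q.sum(axis=0)[:, None]

    return p, m, Q
```

Calling fit_gmm(X, 3) on an unlabelled feature matrix returns the estimated label probabilities, the cluster means and the responsibility matrix Q.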

It can be proved that each repetition of steps (2)-(3) never decreases the likelihood of the data under the model (although the improvements will plateau eventually), so if you run the algorithm for long enough, the parameters and probabilities will settle at values that give good performance.
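For the curious, the quantity that is guaranteed not to decrease is the log-likelihood of the observed data under the current parameter estimates:

$$ \ell(p, m) = \sum_{i=1}^{N} \log \Bigl( \sum_{k} p_k \, \mathcal{N}(x_i ; m_k, \Sigma) \Bigr) $$

In practice this also gives a convergence check: stop iterating once this value barely changes from one iteration to the next.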

If you’re familiar with the K-means algorithm, we can intuitively think of the EM algorithm in this setting as being quite similar to K-means – assigning points to a cluster is analogous to updating the values of Q_ik (step 2 above), and updating the cluster centroids can be compared to recalibrating the parameter estimates (step 3 above). **That statement is devoid of any rigour whatsoever, please don’t hate me, mathematicians**

Fitting the mixture model

Left: Data with true labels; Centre: Data with labels and decision boundary generated by LDA; Right: Data with labels and decision boundary generated by the Gaussian Mixture Model. Note that the GMM has no concept of what the true labels are so it would be uncharitable to say it had misclassified most of the examples because they were the wrong colour – it has managed to identify three separate clusters of points which match up quite well with the original clusters.

Comparison to K-means algorithm

You’d be forgiven for wondering why, rather than putting ourselves through the pain of this approximation scheme, we don’t just use K-means, one of the simplest ML methods around. For the dataset we’ve presented, it is indeed hard to see what GMMs offer that K-means doesn’t.

However, Gaussian Mixture Models do offer some advantages over K-means:

  1. For our implementation of the GMM, we’ve assumed a shared, known, diagonal covariance matrix. This need not be the case, and relaxing that assumption allows the GMM to make non-linear partitions of the feature space, which would be necessary with more complex data than we’ve used.
  2. When predicting the label of a new example, both GMMs and K-means give us a predicted label, but GMMs also give us an associated probability distribution (see the sketch after this list). That means we can get an estimate of how confident the model is that a point belongs to the cluster we’ve assigned it to.
  3. The fact that all data points contribute to the estimation of every parameter in GMMs would mitigate the impact of having overlap between two clusters, whereas K-means could not account for that.
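If you’d like those advantages without writing the EM loop yourself, scikit-learn’s GaussianMixture covers points (1) and (2): by default it fits a separate full covariance matrix per cluster, and it exposes the cluster membership probabilities. A minimal sketch, again on a stand-in make_blobs dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabelled features

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

cluster_labels = gmm.predict(X)            # hard cluster assignment for each point
responsibilities = gmm.predict_proba(X)    # Q_ik: probability of each cluster per point
```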

Conclusion

That’s it for this post! We’ve seen how the Gaussian Mixture Model provides a framework for probabilistically identifying clusters in unlabelled datasets. Hopefully next time you’re faced with some unlabelled data, you’ll think about applying a probabilistic approach (even if for no other reason than it sounds fancier than using K-means!)


