An Overview of Clustering Algorithms

During the first six months of my DPhil, I worked on clustering antibodies, so I thought I would share what I learned about these algorithms. Clustering is an unsupervised data analysis technique that groups a data set into subsets of similar data points. Its main uses are exploratory data analysis, to find hidden patterns in the data, and data compression, e.g. when all data points in a cluster can be treated as a group. Clustering algorithms have many applications in computational biology, such as clustering antibodies by structural similarity. Actually, this is objectively the most important application and I don’t see why anyone would use it for anything else.

There are several types of clustering algorithms that offer different advantages.

Centroid-based Clustering

Centroid-based clustering is probably the most commonly used type of clustering algorithm. As the name suggests, algorithms of this type represent each cluster by a central point (centroid) and link every remaining data point to one of these centroids. The algorithms are designed to find optimal cluster centres and generally assign each data point to the nearest one. Because membership is determined by distance to a centre, the resulting clusters tend to be roughly spherical (circular in two dimensions).

K-means [1] is an example of a centroid-based algorithm. To run the algorithm, the number of clusters (K) has to be pre-specified. K centroids are selected at random and each data point is initially assigned to the nearest one. The positions of the centroids are then iteratively updated to minimise the variance within clusters, and the data points are reassigned to the nearest centroid after each update.
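As a rough illustration of how this looks in practice (a minimal sketch assuming scikit-learn and a small synthetic data set from make_blobs; the parameter values are arbitrary):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with three roughly spherical groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# The number of clusters (K) has to be specified up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # final centroid positions
print(kmeans.labels_[:10])      # cluster assignments of the first ten points
print(kmeans.inertia_)          # within-cluster sum of squared distances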

A downside of K-means is that the number of clusters has to be known in advance. Affinity propagation [2] offers a good alternative if this is not the case. The algorithm considers every data point as a potential cluster centre (exemplar). Messages quantifying how well each point would serve as an exemplar for another are passed between pairs of data points. Once a set of exemplars has emerged, the remaining points are assigned to them.
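A hedged sketch with scikit-learn's AffinityPropagation (the damping value below is illustrative, not tuned):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# No number of clusters is specified; it emerges from the message passing
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)

print(len(ap.cluster_centers_indices_))  # number of exemplars (clusters) found
print(ap.labels_[:10])                   # cluster assignments of the first ten points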

Hierarchical Clustering

Instead of simply partitioning the data set into clusters, hierarchical clustering algorithms construct a hierarchy of clusters. This type of clustering is best explained by the example of agglomerative clustering [3]. The algorithm starts with each data point assigned to its own cluster. During each iteration, the two closest clusters are merged. The result is a dendrogram that illustrates the hierarchical relationships between the clusters and the order and distance at which they were merged. The dendrogram can then be cut at any point to obtain a flat clustering; the cut point is usually chosen based on a threshold distance between the linked clusters or a specified number of clusters.
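One way to sketch the dendrogram-and-cut workflow, assuming SciPy's hierarchy module (scikit-learn's AgglomerativeClustering offers a similar distance_threshold cut); the threshold used here is arbitrary:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.7, random_state=0)

# Build the merge history; Ward linkage merges the pair of clusters that
# gives the smallest increase in within-cluster variance
Z = linkage(X, method="ward")

# Cut the dendrogram at an arbitrary distance threshold to obtain flat clusters
labels = fcluster(Z, t=5.0, criterion="distance")
print(np.unique(labels))

# dendrogram(Z) would plot the tree of merges (requires matplotlib)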

An alternative algorithm is divisive clustering, which essentially works in the opposite way. All points are initially considered part of a single cluster, which is then iteratively split into smaller clusters.

An advantage of hierarchical clustering is that clusters can have any shape and do not need to be circular.

Density-based Clustering

Density-based clustering algorithms group data points by detecting areas where points are highly concentrated, separated by sparser regions. An advantage of these algorithms is that outliers are recognised and not assigned to any cluster. Clusters produced by density-based clustering can have any shape.

The most popular density-based algorithm is DBSCAN [4]. DBSCAN takes two input parameters: epsilon and minPts. A neighbourhood of radius epsilon is considered around each data point, and points are then labelled as core points, boundary points or outliers. Core points are defined as points having a minimum number of neighbours (minPts) within epsilon. Boundary points have fewer than minPts neighbours but lie within epsilon of a core point. The remaining points are labelled as outliers. All core points that are directly connected (within each other’s neighbourhood) or connected through a chain of core points are assigned to the same cluster. Boundary points are added to the cluster of a core point in their neighbourhood.
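A minimal sketch with scikit-learn's DBSCAN; note that scikit-learn names the minPts parameter min_samples, and the eps value below is illustrative rather than tuned:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-spherical data with some noise added
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # outliers are labelled -1

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = int(np.sum(labels == -1))
print(n_clusters, n_outliers)

# Core points (points with at least min_samples neighbours within eps)
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True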

OPTICS [5] is another commonly used density-based algorithm. In contrast to DBSCAN, it orders the points by reachability distance rather than relying on a single global epsilon, which allows it to detect clusters of varying densities.
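A short hedged sketch with scikit-learn's OPTICS; by default it extracts clusters from the reachability plot (the so-called xi method) rather than using one global epsilon, and the data below mixes two groups of very different density (all values illustrative):

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two groups with very different densities (illustrative)
X1, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.3, random_state=0)
X2, _ = make_blobs(n_samples=200, centers=[[6, 6]], cluster_std=1.5, random_state=1)
X = np.vstack([X1, X2])

opt = OPTICS(min_samples=10).fit(X)

print(np.unique(opt.labels_))                # cluster labels; -1 again marks noise
print(opt.reachability_[opt.ordering_][:5])  # start of the reachability plot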

Implementations of all clustering algorithms discussed are available from scikit-learn.

References

[1] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics 5.1, 281–298

[2] Frey, B. J. and Dueck, D. (2007). Clustering by Passing Messages Between Data Points. Science 315, 972–976. doi:10.1126/science.1136800

[3] Murtagh, F. and Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. WIREs Data Mining and Knowledge Discovery 2, 86–97. doi:10.1002/widm.53

[4] Schubert, E., Sander, J., Ester, M., Kriegel, H. P., and Xu, X. (2017). DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Transactions on Database Systems 42, 19:1–19:21. doi:10.1145/3068335

[5] Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. ACM SIGMOD Record 28, 49–60. doi:10.1145/304181.304187