Distance matrix clustering | Oxford Protein Informatics Group

In Bioinformatics, we often deal with distance matrices such as:

Quantifying pairwise similarities between sequences
Structural similarity between proteins (RMSD?)
etc.

Next step is to study the groupings within the distance matrix using an appropriate clustering scheme. The obvious issue with most clustering methods is that you would need to specify the number of clusters beforehand (as for K-Means). Assuming that you do not know very much about the data and ‘plotting’ it is not an option, you might try non-parametric hierarchichal clustering such as linkage. The main difference between the two approaches is that using linkage you specify what the maximal distance within each cluster should be and thus the number of clusters will be adjusted accordingly. Par contre, using K-Means you do not have such a distance-guarantee within each cluster since the number of groups is predefined.

Here I will provide a short piece of python code that employs the hcluster library to perform linkage clustering.

Installation

Download hcluster, unpack it and inside the unpacked folder type:

python setup.py install

Alternatively, if you’re not an admin on your machine type:

python setup.py install --user

Example Code

The purpose of the example bit of code is to generate a random set of points within (0,10) in the 2D space and cluster them according to user’s euclidean distance cutoff.

import matplotlib.pyplot as plt
from matplotlib.pyplot import show
from hcluster import pdist, linkage, dendrogram
import numpy
import random
import sys

#Input: z= linkage matrix, treshold = the treshold to split, n=distance matrix size
def split_into_clusters(link_mat,thresh,n):
   c_ts=n
   clusters={}
   for row in link_mat:
      if row[2] < thresh:
          n_1=int(row[0])
          n_2=int(row[1])

          if n_1 >= n:
             link_1=clusters[n_1]
             del(clusters[n_1])
          else:
             link_1= [n_1]

          if n_2 >= n:
             link_2=clusters[n_2]
             del(clusters[n_2])
          else:
             link_2= [n_2]

	  link_1.extend(link_2)
          clusters[c_ts] = link_1
          c_ts+=1
      else:
          return clusters

#Size of the point matrix
rows = 10 #number of points
columns = 2 #number of dimensions - 2=2D, 3=3D etc.
samples = numpy.empty((rows, columns))

#Initialize a random points matrix with values between 0, 10 (all points in the upper right 0,10 quadrant)
for i in xrange(0,rows):
    for j in xrange(0,columns):
       samples[i][j] = random.randint(0,10)

#Show the points we have randomly generated
print "Samples:\n ", samples

#Create the distance matrix for the array of sample vectors.
#Look up 'squareform' if you want to submit your own distance matrices as they need to be translated into reduced matrices
dist_mat = pdist(samples,'euclidean')

#Perform linkage clustering - here we use 'average', but other options are possible which you can explore here: http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.cluster.hierarchy.linkage.html
z = linkage(dist_mat, method='average',metric='euclidean')

#Specify a cutoff that will define the clustering - command line argument: 
#python ClusterExample.py 3.0
cutoff = float(sys.argv[1])
clustering = split_into_clusters(z,cutoff,rows)
if clustering==None:
	print "No clusters! Most likely your distance cutoff is too high - all falls into the same bag!"
	quit()

#Print the potential singletons - in magenta
for i in xrange(0,rows):
	plt.plot(samples[i][0],samples[i][1], marker='o', color='m', ls='')
	plt.text(samples[i][0],samples[i][1], str(i), fontsize=12)
#If there are more clusters than these the code will fail!
colors = ['b','r','g','y','c','k']

cluster_num = 0
for cluster in clustering:
   print "Cluster: ",cluster_num
   cluster_num+=1
   for i in clustering[cluster]:
	print "-->",i
	plt.plot(samples[i][0],samples[i][1], marker='o', color=colors[cluster_num], ls='')

#Set the axis limits
plt.xlim([-1,12])
plt.ylim([-1,12])

show()
#Alternatively plot it as dendogram to see where your distance cutoff performed the tree cut
dendrogram(z)
show()

When I ran the code above (python [whatever you call the script].py 2.0) that’s what I got (colors correspond to clusters with ‘magenta’ being singletons):

And there is a dendogram command on the bottom of the script to see what the clustering has actually done and where it performed the cut according to your specified cutoff (colors DO NOT correspond to clusters here):

Hcluster library forms part of scipy with very useful methods for data analysis. You can modify the above code to use a variety of other hierarchichal clustering methods which you can further explore here.

Author

Konrad Krawczyk

View all posts