{"id":1872,"date":"2014-05-14T14:35:18","date_gmt":"2014-05-14T13:35:18","guid":{"rendered":"http:\/\/www.blopig.com\/blog\/?p=1872"},"modified":"2014-05-20T10:50:14","modified_gmt":"2014-05-20T09:50:14","slug":"distance-matrix-clustering","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2014\/05\/distance-matrix-clustering\/","title":{"rendered":"Distance matrix clustering"},"content":{"rendered":"<p>In Bioinformatics, we often deal with distance matrices such as:<\/p>\n<ul>\n<li>Quantifying pairwise similarities between sequences<\/li>\n<li>Structural similarity between proteins (RMSD?)<\/li>\n<li>etc.<\/li>\n<\/ul>\n<p>Next step is to study the groupings within the distance matrix using an appropriate clustering scheme. The obvious issue with most clustering methods is that you would need to specify the number of clusters beforehand (as for <a href=\"http:\/\/mathworld.wolfram.com\/K-MeansClusteringAlgorithm.html\">K-Means<\/a>). Assuming that you do not know very much about the data and &#8216;plotting&#8217; it is not an option, you might try non-parametric hierarchichal clustering such as <a href=\"http:\/\/mathworld.wolfram.com\/K-MeansClusteringAlgorithm.html\">linkage<\/a>. The main difference between the two approaches is that using linkage you specify what the maximal distance within each cluster should be and thus the number of clusters will be adjusted accordingly. Par contre, using K-Means you do not have such a distance-guarantee within each cluster since the number of groups is predefined.<\/p>\n<p>Here I will provide a short piece of python code that employs the <a href=\"https:\/\/pypi.python.org\/pypi\/hcluster\/0.2.0\">hcluster<\/a> library to perform linkage clustering.<\/p>\n<p><strong>Installation<\/strong><\/p>\n<p>Download <a href=\"https:\/\/pypi.python.org\/pypi\/hcluster\/0.2.0\">hcluster<\/a>, unpack it and inside the unpacked folder type:<\/p>\n<pre class=\"lang:sh decode:true\">python setup.py install<\/pre>\n<p>Alternatively, if you&#8217;re not an admin on your machine type:<\/p>\n<pre class=\"lang:default decode:true\">python setup.py install --user<\/pre>\n<p><strong>\u00a0Example Code<\/strong><\/p>\n<p>The purpose of the example bit of code is to generate a random set of points within (0,10) in the 2D space and cluster them according to user&#8217;s euclidean distance cutoff.<\/p>\n<pre class=\"lang:python decode:true\">import matplotlib.pyplot as plt\r\nfrom matplotlib.pyplot import show\r\nfrom hcluster import pdist, linkage, dendrogram\r\nimport numpy\r\nimport random\r\nimport sys\r\n\r\n#Input: z= linkage matrix, treshold = the treshold to split, n=distance matrix size\r\ndef split_into_clusters(link_mat,thresh,n):\r\n   c_ts=n\r\n   clusters={}\r\n   for row in link_mat:\r\n      if row[2] &lt; thresh:\r\n          n_1=int(row[0])\r\n          n_2=int(row[1])\r\n\r\n          if n_1 &gt;= n:\r\n             link_1=clusters[n_1]\r\n             del(clusters[n_1])\r\n          else:\r\n             link_1= [n_1]\r\n\r\n          if n_2 &gt;= n:\r\n             link_2=clusters[n_2]\r\n             del(clusters[n_2])\r\n          else:\r\n             link_2= [n_2]\r\n\r\n\t  link_1.extend(link_2)\r\n          clusters[c_ts] = link_1\r\n          c_ts+=1\r\n      else:\r\n          return clusters\r\n\r\n#Size of the point matrix\r\nrows = 10 #number of points\r\ncolumns = 2 #number of dimensions - 2=2D, 3=3D etc.\r\nsamples = numpy.empty((rows, columns))\r\n\r\n#Initialize a random points matrix with values between 0, 10 (all points in the upper right 0,10 quadrant)\r\nfor i in xrange(0,rows):\r\n    for j in xrange(0,columns):\r\n       samples[i][j] = random.randint(0,10)\r\n\r\n#Show the points we have randomly generated\r\nprint \"Samples:\\n \", samples\r\n\r\n#Create the distance matrix for the array of sample vectors.\r\n#Look up 'squareform' if you want to submit your own distance matrices as they need to be translated into reduced matrices\r\ndist_mat = pdist(samples,'euclidean')\r\n\r\n#Perform linkage clustering - here we use 'average', but other options are possible which you can explore here: http:\/\/docs.scipy.org\/doc\/scipy-0.13.0\/reference\/generated\/scipy.cluster.hierarchy.linkage.html\r\nz = linkage(dist_mat, method='average',metric='euclidean')\r\n\r\n#Specify a cutoff that will define the clustering - command line argument: \r\n#python ClusterExample.py 3.0\r\ncutoff = float(sys.argv[1])\r\nclustering = split_into_clusters(z,cutoff,rows)\r\nif clustering==None:\r\n\tprint \"No clusters! Most likely your distance cutoff is too high - all falls into the same bag!\"\r\n\tquit()\r\n\r\n#Print the potential singletons - in magenta\r\nfor i in xrange(0,rows):\r\n\tplt.plot(samples[i][0],samples[i][1], marker='o', color='m', ls='')\r\n\tplt.text(samples[i][0],samples[i][1], str(i), fontsize=12)\r\n#If there are more clusters than these the code will fail!\r\ncolors = ['b','r','g','y','c','k']\r\n\r\ncluster_num = 0\r\nfor cluster in clustering:\r\n   print \"Cluster: \",cluster_num\r\n   cluster_num+=1\r\n   for i in clustering[cluster]:\r\n\tprint \"--&gt;\",i\r\n\tplt.plot(samples[i][0],samples[i][1], marker='o', color=colors[cluster_num], ls='')\r\n\r\n#Set the axis limits\r\nplt.xlim([-1,12])\r\nplt.ylim([-1,12])\r\n\r\nshow()\r\n#Alternatively plot it as dendogram to see where your distance cutoff performed the tree cut\r\ndendrogram(z)\r\nshow()<\/pre>\n<p>When I ran the code above (python [whatever you call the script].py 2.0) that&#8217;s what I got (colors correspond to clusters with &#8216;magenta&#8217; being singletons):<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_1.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-1875\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_1.png?resize=300%2C226&#038;ssl=1\" alt=\"figure_1\" width=\"300\" height=\"226\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_1.png?resize=300%2C226&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_1.png?resize=624%2C470&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_1.png?w=812&amp;ssl=1 812w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>And there is a dendogram command on the bottom of the script to see what the clustering has actually done and where it performed the cut according to your specified cutoff (colors DO NOT correspond to clusters here):<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_2.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-1876\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_2.png?resize=300%2C226&#038;ssl=1\" alt=\"figure_2\" width=\"300\" height=\"226\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_2.png?resize=300%2C226&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_2.png?resize=624%2C470&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2014\/05\/figure_2.png?w=812&amp;ssl=1 812w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>Hcluster library forms part of <a href=\"http:\/\/docs.scipy.org\/doc\/scipy-0.13.0\/reference\/index.html\">scipy<\/a> with very useful methods for data analysis. You can modify the above code to use a variety of other hierarchichal clustering methods which you can \u00a0further explore <a href=\"http:\/\/docs.scipy.org\/doc\/scipy-0.13.0\/reference\/generated\/scipy.cluster.hierarchy.linkage.html\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In Bioinformatics, we often deal with distance matrices such as: Quantifying pairwise similarities between sequences Structural similarity between proteins (RMSD?) etc. Next step is to study the groupings within the distance matrix using an appropriate clustering scheme. The obvious issue with most clustering methods is that you would need to specify the number of clusters [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"ngg_post_thumbnail":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"ppma_author":[482],"class_list":["post-1872","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":482,"user_id":4,"is_guest":0,"slug":"konrad","display_name":"Konrad Krawczyk","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/fdb224fe7b0775e3c9a6956ae2a5ffd7c35ab8ce3ff99c5f6e0a51d45557cdd6?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Scientist","first_name":"Lucky","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/1872","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=1872"}],"version-history":[{"count":8,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/1872\/revisions"}],"predecessor-version":[{"id":1891,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/1872\/revisions\/1891"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=1872"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=1872"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=1872"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1872"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}