{"id":4467,"date":"2019-01-08T16:27:52","date_gmt":"2019-01-08T16:27:52","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=4467"},"modified":"2019-01-08T16:27:52","modified_gmt":"2019-01-08T16:27:52","slug":"graph-based-methods-for-cheminformatics","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2019\/01\/graph-based-methods-for-cheminformatics\/","title":{"rendered":"Graph-based Methods for Cheminformatics"},"content":{"rendered":"<p>In cheminformatics, there are many possible ways to encode chemical data represented by small molecules and proteins, such as SMILES, fingerprints, chemical descriptors etc. Recently, graph-based methods for machine learning have become more prominent. In this post, we will explore why representing molecules as graphs is a natural and suitable encoding.<!--more--><\/p>\n<p><strong>What is a graph?<\/strong><\/p>\n<p>First, we should discuss what we mean when we use the term graph. We are not referring to a plot of a function. Instead, a graph, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=G&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"G\" class=\"latex\" \/>, consists of two sets: a set of vertices, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=V&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"V\" class=\"latex\" \/>, and a set of edges, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=E&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"E\" class=\"latex\" \/> (I shall try to explain this using limited\/no mathematical notation). 
Each vertex, or node, in <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=V&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"V\" class=\"latex\" \/> is a point in the graph, while each element of <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=E&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"E\" class=\"latex\" \/> consists of two elements from <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=V&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"V\" class=\"latex\" \/> that are linked in some manner.<\/p>\n<p>This representation can be made richer by allowing different types of vertices, different types of edges, edges that only point in one direction (i.e. <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=a+%5Crightarrow+b&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"a &#92;rightarrow b\" class=\"latex\" \/> does not imply <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=b+%5Crightarrow+a&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"b &#92;rightarrow a\" class=\"latex\" \/>), and so on. In particular, vertices (and edges) can be labelled with any additional information you want (this will prove useful later).<\/p>\n<p><strong>Representing molecules as graphs<\/strong><\/p>\n<p>Molecules are simply atoms joined together by bonds. 
These atoms may well be of different types, and the bonds might also be different, but this sounds a lot like a graph where the atoms are the vertices and the bonds are the edges of our graph!<\/p>\n<div id=\"attachment_4473\" style=\"width: 635px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/molecule_graph.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" aria-describedby=\"caption-attachment-4473\" loading=\"lazy\" class=\"wp-image-4473 size-large\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/molecule_graph.png?resize=625%2C184&#038;ssl=1\" alt=\"\" width=\"625\" height=\"184\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/molecule_graph.png?resize=1024%2C301&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/molecule_graph.png?resize=300%2C88&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/molecule_graph.png?resize=768%2C226&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/molecule_graph.png?resize=624%2C183&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/molecule_graph.png?w=1670&amp;ssl=1 1670w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/molecule_graph.png?w=1250&amp;ssl=1 1250w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><p id=\"caption-attachment-4473\" class=\"wp-caption-text\">Figure 1: Left &#8211; Toluene in standard chemical notation; Right &#8211; Toluene in a visual graph format.<\/p><\/div>\n<p>Indeed, Figure 1 provides a simple visual example of the graph representation of a molecule, toluene (or methylbenzene). Since all heavy (non-hydrogen) atoms are carbon, it is possible to encode this molecule fully with a single atom type and two bond types (either single and aromatic, or single and double if kekulized). 
However, alternative atom typings are possible, for example using separate atom types for aromatic carbons and aliphatic carbons. This highlights both the flexibility of a graph encoding and the need to choose a representation, since some representations will be more useful than others for a specific task.<\/p>\n<p><strong>What does a computer see?<\/strong><\/p>\n<p>Computationally, a graph is represented as a set of matrices: one for the vertices, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=V&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"V\" class=\"latex\" \/>, and one or more for the edges, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=E&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"E\" class=\"latex\" \/>. The matrix for the vertices is <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n+%5Ctimes+h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n &#92;times h\" class=\"latex\" \/> dimensional, and the adjacency matrix for the edges (capturing the connections between the vertices) is <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=e+%5Ctimes+n+%5Ctimes+n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"e &#92;times n &#92;times n\" class=\"latex\" \/> dimensional, where <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> is the number of vertices, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h\" class=\"latex\" \/> is the length of the label associated with the vertex, and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=e&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"e\" class=\"latex\" \/> is the number of edge types 
(often vertices are one-hot encoded and there are multiple edge types, hence <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%2C+e+%3E+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h, e &gt; 1\" class=\"latex\" \/>. If not, then an adjacency matrix fully describes the graph).<\/p>\n<p>Using the example in Figure 1, taking two vertex types (aromatic &amp; aliphatic carbon) and two edge types (single &amp; aromatic), we can represent the graph by the matrices shown in Figure 2.<\/p>\n<div id=\"attachment_4478\" style=\"width: 635px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/matrix_repre_graph-1.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" aria-describedby=\"caption-attachment-4478\" loading=\"lazy\" class=\"wp-image-4478 size-large\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/matrix_repre_graph-1.png?resize=625%2C94&#038;ssl=1\" alt=\"\" width=\"625\" height=\"94\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/matrix_repre_graph-1.png?resize=1024%2C154&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/matrix_repre_graph-1.png?resize=300%2C45&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/matrix_repre_graph-1.png?resize=768%2C115&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/matrix_repre_graph-1.png?resize=624%2C94&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/matrix_repre_graph-1.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2019\/01\/matrix_repre_graph-1.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><p id=\"caption-attachment-4478\" class=\"wp-caption-text\">Figure 2: Matrix representation of toluene. 
Left &#8211; Vertex information; Right &#8211; Adjacency matrix.<\/p><\/div>\n<p><strong>Who cares?<\/strong><\/p>\n<p>We have described a simple, machine-readable format that captures all basic features of a molecule and is readily extendable to include any number of additional, user-defined vertex and edge features. Now that we have our molecular graph in matrix format, we can apply various graph-based machine learning methods to it. These are beyond the scope of this discussion, but could form the topic of a future blog post.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In cheminformatics, there are many possible ways to encode chemical data represented by small molecules and proteins, such as SMILES, fingerprints, chemical descriptors etc. Recently, graph-based methods for machine learning have become more prominent. In this post, we will explore why representing molecules as graphs is a natural and suitable encoding.<\/p>\n","protected":false},"author":50,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[187,10,14,189,188,201,15],"tags":[],"ppma_author":[535],"class_list":["post-4467","post","type-post","status-publish","format-standard","hentry","category-cheminformatics","category-groupmeetings","category-howto","category-machine-learning","category-networks","category-small-molecules","category-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":535,"user_id":50,"is_guest":0,"slug":"fergus2","display_name":"Fergus 
Imrie","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/19c18fa7f4d0a2aecc5f69760c6a9f2fc9b493dfe45b1fd333ccb447db9d6a90?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4467","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/50"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=4467"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4467\/revisions"}],"predecessor-version":[{"id":4482,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4467\/revisions\/4482"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=4467"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=4467"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=4467"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=4467"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}