{"id":13214,"date":"2025-12-10T15:22:34","date_gmt":"2025-12-10T15:22:34","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=13214"},"modified":"2025-12-10T15:22:36","modified_gmt":"2025-12-10T15:22:36","slug":"some-thoughts-on-molecular-similarity","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2025\/12\/some-thoughts-on-molecular-similarity\/","title":{"rendered":"Some thoughts on molecular similarity"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Molecular similarity is a tricky concept, mostly because there are many ways to define and measure similarity. For example, two molecules could be considered similar because they have the same biological effect, or because they have identical molecular weight, or because they contain the same functional groups, etc., etc. A natural follow-on question from this is &#8220;what is the correct way to measure molecular similarity?&#8221; and the answer, unfortunately, is that it depends.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As an example of these complexities, Greg Landrum has a <a href=\"https:\/\/greglandrum.github.io\/rdkit-blog\/posts\/2025-07-17-naming-similarity-metrics.html\">great blog post<\/a> on how Tanimoto similarity changes depending on how you vectorise a molecule, and the need for authors to clarify the vectorisation method used. Variation in Tanimoto similarities is also something \u00cdsak has written about on <a href=\"https:\/\/www.blopig.com\/blog\/2024\/09\/tanimoto-similarity-of-ecfps-with-rdkit-common-pitfalls\/\">blopig<\/a>.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h3 class=\"wp-block-heading\">Why is molecular similarity important in drug discovery?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Molecular similarity is important in drug discovery for many reasons. In lead optimisation, for example, known binders to a target protein might be used to search chemical libraries for similar molecules with greater binding affinities to the target. 
Similarity is also crucial for understanding <a href=\"https:\/\/www.sciencedirect.com\/science\/article\/abs\/pii\/S1359644614000361\">activity cliffs<\/a>, where structurally similar molecules have very different binding affinities for the same target protein.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For molecular property prediction with machine learning, molecular similarity between the training and testing sets can <a href=\"https:\/\/practicalcheminformatics.blogspot.com\/2024\/11\/some-thoughts-on-splitting-chemical.html\" data-type=\"link\" data-id=\"https:\/\/practicalcheminformatics.blogspot.com\/2024\/11\/some-thoughts-on-splitting-chemical.html\">bias the performance outcomes of a model<\/a>. Understanding and mitigating these similarity biases is necessary, as a large part of drug discovery involves finding novel therapeutic molecules that do not exist yet (i.e., molecules dissimilar to those already known).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Some different ways of measuring molecular similarity<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">As I&#8217;ve said, there are many ways to measure molecular similarity. Here are a few example approaches with practical applications (although there are others too!).<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Tanimoto similarity<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The Tanimoto coefficient (also known as the Jaccard coefficient) measures the similarity between two sets and is very widely used in cheminformatics. 
For two sets, A and B, the Tanimoto coefficient is defined as<\/p>\n\n\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+T%28A%2CB%29+%3D+%5Cfrac%7B%7CA+%5Ccap+B%7C%7D%7B%7CA%7C+%2B+%7CB%7C+-+%7CA+%5Ccap+B%7C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle T(A,B) = &#92;frac{|A &#92;cap B|}{|A| + |B| - |A &#92;cap B|}\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In other words, the Tanimoto coefficient is a ratio between the intersection of two sets and the total size of both sets minus their overlap. We can apply this concept to calculate the Tanimoto coefficient between binary vectors. For two vectors,<\/p>\n\n\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+%5Cmathbf%7Bx%7D+%3D+%28x_1%2C+x_2%2C...%2Cx_d%29%2C+x_i+%5Cin+%5C%7B0%2C1%5C%7D%2C%5Cquad+%5Cmathbf%7By%7D+%3D+%28y_1%2C+y_2%2C...%2Cy_d%29%2C+y_i+%5Cin+%5C%7B0%2C1%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle &#92;mathbf{x} = (x_1, x_2,...,x_d), x_i &#92;in &#92;{0,1&#92;},&#92;quad &#92;mathbf{y} = (y_1, y_2,...,y_d), y_i &#92;in &#92;{0,1&#92;}\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">the Tanimoto coefficient is calculated as<\/p>\n\n\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+T%28%5Cmathbf%7Bx%7D%2C%5Cmathbf%7By%7D%29+%3D+%5Cfrac%7B%5Csum_%7Bi%3D1%7D%5E%7Bd%7D%7Bx_i+y_i%7D%7D%7B%5Csum_%7Bi%3D1%7D%5E%7Bd%7Dx_i+%2B+%5Csum_%7Bi%3D1%7D%5E%7Bd%7Dy_i+-+%5Csum_%7Bi%3D1%7D%5E%7Bd%7D%7Bx_i+y_i%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle T(&#92;mathbf{x},&#92;mathbf{y}) = &#92;frac{&#92;sum_{i=1}^{d}{x_i y_i}}{&#92;sum_{i=1}^{d}x_i + &#92;sum_{i=1}^{d}y_i - &#92;sum_{i=1}^{d}{x_i y_i}}\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To apply this to chemistry, we need to 
express molecules as binary vectors; there are a lot of ways to do this.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Physicochemical descriptor vectors (PDV), containing values like molecular weight (MW), can be binarised using pre-defined thresholds, e.g., set a bit to 1 if MW &gt; 200 Da, else 0. Substructural keys, such as <a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/ci010132r\">MACCS<\/a>, produce binary vectors based on the presence or absence of pre-defined topological features or substructures in a molecule.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A common method for binarising molecules is topological fingerprints, which denote the presence or absence of algorithmically enumerated substructures in a molecule. For example, <a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/ci100050t\">Extended-Connectivity Fingerprints<\/a> (ECFPs) enumerate local circular atomic neighbourhoods, which are hashed into integer identifiers (32-bit in the original ECFP paper) and folded into a fixed-length bit vector (e.g., 1024 or 2048 bits). Folding is typically performed by a second hashing function, but there are alternative methods like <a href=\"http:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-024-00932-y\">Sort and Slice<\/a>. Functional-Class Fingerprints (FCFPs) are similar to ECFPs but treat atoms that play a similar functional role, such as halogens, as the same. Path-based methods, such as the RDKit Fingerprint, enumerate linear paths rather than circular substructures.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Topological fingerprints can also vary based on user-defined parameters, such as bit-length, maximum radius for ECFPs and FCFPs, and maximum path length for the RDKit fingerprints.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Cosine similarity<\/h4>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>Cosine similarity measures the cosine of the angle between two vectors; orthogonal vectors have a similarity of 0, and aligned vectors have a similarity of 1. 
For two vectors <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbf%7Bx%7D%2C+%5Cmathbf%7By%7D+%5Cin+%5Cmathbb%7BR%7D%5Ed&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbf{x}, &#92;mathbf{y} &#92;in &#92;mathbb{R}^d\" class=\"latex\" \/>, the Cosine similarity between them is<\/p>\n<\/div>\n\n\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=cosine%28%5Cmathbf%7Bx%7D%2C%5Cmathbf%7By%7D%29+%3D+%5Cfrac%7B%5Cmathbf%7Bx%7D+%5Ccdot+%5Cmathbf%7By%7D%7D%7B%5C%7C%5Cmathbf%7Bx%7D%5C%7C%5C%7C%5Cmathbf%7By%7D%5C%7C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"cosine(&#92;mathbf{x},&#92;mathbf{y}) = &#92;frac{&#92;mathbf{x} &#92;cdot &#92;mathbf{y}}{&#92;|&#92;mathbf{x}&#92;|&#92;|&#92;mathbf{y}&#92;|}\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Cosine similarity is less common in cheminformatics than Tanimoto similarity, but it is effective for measuring similarity between continuous-valued vectors and <a href=\"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-015-0069-3\">can still be useful for delineating between molecules<\/a>. The most common usage of Cosine similarity that I&#8217;ve seen recently is to evaluate the quality of learned embeddings. For example, the authors of <a href=\"https:\/\/www.nature.com\/articles\/s41467-024-53751-y\">MolE<\/a> applied Cosine similarity to learned embeddings for a comparison with ECFPs with Tanimoto similarity.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Maximum Common Edge Subgraph<\/h4>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>I got the idea for this blog post from a paper on coverage bias in molecular datasets by <a href=\"https:\/\/www.nature.com\/articles\/s41467-024-55462-w\">Kretschmer et al. (2025)<\/a>, where the authors used Maximum Common Edge Subgraph (MCES) as a method for measuring distance between molecules.  
The MCES between two molecules can be thought of as the largest subgraph that the molecules share if you were to align their skeletal structures.<\/p>\n<p>The authors defined the MCES distance between two molecular graphs, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=G_1%3D%28V_1%2C+E_1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"G_1=(V_1, E_1)\" class=\"latex\" \/> and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=G_2%3D%28V_2%2C+E_2%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"G_2=(V_2, E_2)\" class=\"latex\" \/>, as:<\/p>\n<\/div>\n\n\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+%5Cmathrm%7BMCES_%7Bdist%7D%7D+%3D+%7CE_1%7C+%2B+%7CE_2%7C+-+2%7CE_C%7C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle &#92;mathrm{MCES_{dist}} = |E_1| + |E_2| - 2|E_C|\" class=\"latex\" \/><\/p>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>where <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7CE_1%7C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"|E_1|\" class=\"latex\" \/> is the number of bonds in the first molecule, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7CE_2%7C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"|E_2|\" class=\"latex\" \/> is the number of bonds in the second molecule, and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7CE_C%7C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"|E_C|\" class=\"latex\" \/> is the number of bonds in the MCES.<\/p>\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">This is useful for measuring differences in global 2D shape, where a higher MCES distance means two molecules are more structurally dissimilar. 
However, I think MCES distance would be more informative if expressed as a ratio (by using the Tanimoto coefficient again!):<\/p>\n\n\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+%5Cmathrm%7BMCES_%7Bsim%7D%7D+%3D+%5Cfrac%7B%7CE_C%7C%7D%7B%7CE_1%7C+%2B+%7CE_2%7C+-+%7CE_C%7C%7D%2C+%5Cquad+%5Cmathrm%7BMCES_%7Bdist%7D%7D+%3D+1+-+%5Cmathrm%7BMCES_%7Bsim%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;displaystyle &#92;mathrm{MCES_{sim}} = &#92;frac{|E_C|}{|E_1| + |E_2| - |E_C|}, &#92;quad &#92;mathrm{MCES_{dist}} = 1 - &#92;mathrm{MCES_{sim}}\" class=\"latex\" \/><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There are a couple of reasons for using a ratio. Firstly, the ratio normalises for molecular size; a difference of just a few bonds is proportionally more significant for smaller molecules than for larger ones, and a ratio reflects this. Secondly, a ratio limits the metric to between 0 (no common edges) and 1 (identical molecules), making comparing MCES similarity scores for different molecular pairs more straightforward.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Rather than capturing similarities in local topologies, like Tanimoto similarity with ECFPs, MCES similarity captures similarities in global 2D topology. Both have their merits; high similarity in local topologies indicates a similar functional group context, and high similarity in global 2D topology indicates a shared overall structure.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Which measure of molecular similarity should you use?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Each method has pros and cons. Tanimoto similarity can be fast and straightforward to interpret with the right featurisation approach, which is great for evaluating large datasets. However, there are many binary vectorisation methods to choose from, some of which are susceptible to bit collisions (e.g., ECFPs). 
Cosine similarity is useful for investigating the quality of learned embeddings and can be used with PDVs, but highly correlated features can overemphasise similarity. MCES is useful for comparing global 2D structural similarity (and can be used with Tanimoto similarity), but exact computation is expensive; approximations, such as myopic MCES, can reduce runtime.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The key is to be consistent and explicit about your setup.<\/strong> When claiming two molecules are \u201csimilar,\u201d state the representation (e.g., Morgan radius, bit length, chirality\/aromaticity settings), the similarity measure (Tanimoto, Cosine, etc.), and any parameters or thresholds used.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Molecular similarity is a tricky concept, mostly because there are many ways to define and measure similarity. For example, two molecules could be considered similar because they have the same biological effect, or because they have identical molecular weight, or because they contain the same functional groups, etc., etc. 
A natural follow-on question from this [&hellip;]<\/p>\n","protected":false},"author":118,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"ngg_post_thumbnail":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[187],"tags":[],"ppma_author":[760],"class_list":["post-13214","post","type-post","status-publish","format-standard","hentry","category-cheminformatics"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":760,"user_id":118,"is_guest":0,"slug":"sam","display_name":"Sam Money-Kyrle","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/784870e2ed5304f12f11366dad56cbf1c0b9aa63bd80021ae235ba5f30536a12?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Money-Kyrle","first_name":"Sam","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13214","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/118"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=13214"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13214\/revisions"}],"predecessor-version":[{"id":13822,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13214\/revisions\/13822"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-js
on\/wp\/v2\/media?parent=13214"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=13214"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=13214"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=13214"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}