{"id":2779,"date":"2016-01-07T00:51:22","date_gmt":"2016-01-07T00:51:22","guid":{"rendered":"http:\/\/www.blopig.com\/blog\/?p=2779"},"modified":"2016-01-07T00:53:11","modified_gmt":"2016-01-07T00:53:11","slug":"we-can-model-everything-right","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2016\/01\/we-can-model-everything-right\/","title":{"rendered":"We can model everything, right&#8230;?"},"content":{"rendered":"<h3>First, happy new year to all our Blopig fans, and we all hope 2016 will be\u00a0awesome!<\/h3>\n<p>A couple of months ago, I was covering this <a href=\"http:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/prot.24916\/full\">article<\/a>\u00a0by Shalom Rackovsky.\u00a0The big question that jumps out of the paper is, <em>has modelling reached its limits?<\/em> Or, in other words, can bioinformatics techniques be used to model <em>every<\/em> protein? The author argues that protein structures have an inherent level of variability that cannot be fully captured by computational methods; thus, he expresses\u00a0some scepticism about what modelling can achieve. This isn&#8217;t entirely news; competitions such as CASP show that there&#8217;s still lots to work on in this field. What makes this article interesting is that\u00a0Rackovsky uses a theoretical basis to justify his claim.<\/p>\n<p>For a pair of proteins\u00a0<em>P\u00a0<\/em>and\u00a0<em>Q,\u00a0<\/em>Rackovsky defines their relationship in terms of their sequence identity and structural similarity. If\u00a0<em>P\u00a0<\/em>and\u00a0<em>Q\u00a0<\/em>share a high level of sequence identity but have little\u00a0structural resemblance,\u00a0<em>P\u00a0<\/em>and\u00a0<em>Q\u00a0<\/em>are considered to be a\u00a0<em>conformational switch<\/em>. 
Conversely, if\u00a0<em>P\u00a0<\/em>and\u00a0<em>Q\u00a0<\/em>share a low level of sequence identity but have high structural resemblance, they are considered to be\u00a0<em>remote homologues.\u00a0<\/em><\/p>\n<div id=\"attachment_2781\" style=\"width: 499px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/dnap.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" aria-describedby=\"caption-attachment-2781\" loading=\"lazy\" class=\"size-full wp-image-2781\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/dnap.png?resize=489%2C420&#038;ssl=1\" alt=\"Case of a conformational switch - two DNAPs with 100% seq identity but 5.3A RMSD.\" width=\"489\" height=\"420\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/dnap.png?w=489&amp;ssl=1 489w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/dnap.png?resize=300%2C258&amp;ssl=1 300w\" sizes=\"auto, (max-width: 489px) 100vw, 489px\" \/><\/a><p id=\"caption-attachment-2781\" class=\"wp-caption-text\">Case of a conformational switch &#8211; two DNAPs with 100% seq identity but 5.3A RMSD.<\/p><\/div>\n<div id=\"attachment_2782\" style=\"width: 447px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/hbb.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" aria-describedby=\"caption-attachment-2782\" loading=\"lazy\" class=\"size-full wp-image-2782\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/hbb.png?resize=437%2C391&#038;ssl=1\" alt=\"Haemoglobins are 'remote homologues' - despite 19% sequence identity, these two proteins have 1.9A RMSD.\" width=\"437\" height=\"391\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/hbb.png?w=437&amp;ssl=1 437w, 
https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/hbb.png?resize=300%2C268&amp;ssl=1 300w\" sizes=\"auto, (max-width: 437px) 100vw, 437px\" \/><\/a><p id=\"caption-attachment-2782\" class=\"wp-caption-text\">Haemoglobins are &#8216;remote homologues&#8217; &#8211; despite 19% sequence identity, these two proteins have 1.9A RMSD.<\/p><\/div>\n<p>From here on comes the complex maths. Rackovsky&#8217;s work here (and in prior papers, for <a href=\"http:\/\/www.pnas.org\/content\/95\/15\/8580.abstract\">example<\/a>) assumes that there are periodicities in properties of proteins, and thus applies Fourier transforms to compare protein sequences and structures.<\/p>\n<p>When comparing protein sequences, instead of treating a sequence\u00a0as a string of letters, each protein sequence is characterised by an\u00a0<em>N\u00a0<\/em>x 10 matrix. <em>N\u00a0<\/em>represents\u00a0the number of amino acids in protein\u00a0<em>P\u00a0<\/em>(or\u00a0<em>Q<\/em>), and each amino acid has 10 <a href=\"http:\/\/www.pnas.org\/content\/107\/19\/8623\/T1.expansion.html\">biophysical properties<\/a>. The matrix then undergoes Fourier Transformation (FT), and the resulting sine and cosine coefficients for proteins\u00a0<em>P\u00a0<\/em>and\u00a0<em>Q\u00a0<\/em>are used to calculate the Euclidean distance between the two proteins.<\/p>\n<p>When comparing structures, proteins are initially truncated into length-L fragments, and the dihedral angle, bond length and bond angle for each fragment are collected into a matrix. The distribution of matrices allows us to project proteins onto a pre-parameterised principal components space. 
The Euclidean distance between the newly-projected proteins is then used to quantify protein structural similarity.<\/p>\n<p>Both the sequence and structure distances are normalised and centred around 0,0 by calculating the average distance between\u00a0<em>P\u00a0<\/em>and its\u00a0<em>M-nearest\u00a0<\/em>neighbours, and then adjusting by the global average. Effectively, if a protein has average sequence and structural distances, it will tend toward 0,0.<\/p>\n<p>The author uses a dataset of 12000 proteins from the CATH set to generate the following diagram; the Y-axis represents sequence similarity and the X-axis represents structural similarity. Since these axes are scaled to the mean, the closer a protein is to 0, the closer it is to the global average sequence or structure distance.<\/p>\n<div id=\"attachment_2785\" style=\"width: 907px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/rackovskyplot.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" aria-describedby=\"caption-attachment-2785\" loading=\"lazy\" class=\"wp-image-2785 size-full\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/rackovskyplot.png?resize=625%2C599&#038;ssl=1\" alt=\"rackovskyplot\" width=\"625\" height=\"599\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/rackovskyplot.png?w=897&amp;ssl=1 897w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/rackovskyplot.png?resize=300%2C287&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2016\/01\/rackovskyplot.png?resize=624%2C598&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><p id=\"caption-attachment-2785\" class=\"wp-caption-text\">The four quadrants: along the diagonal is a typical linear relationship (greater sequence identity = more structural similarity). 
The lower-right quadrant represents proteins with LOW sequence similarity yet HIGH structural similarity. In the upper-left quadrant, proteins have LOW structural similarity but HIGH sequence similarity.<\/p><\/div>\n<p>Rackovsky argues that, while remote homologues and conformational switches seem like rare phenomena, they account for approximately 50% of his dataset. Although he does account for the high density of proteins around 0,0, the paper does not clearly address the meaning of these new metrics. In other words, the author does not\u00a0translate these values to something we&#8217;re more familiar with (e.g. RMSD for structural distance, and sequence identity % for sequence distance). Although the whole idea is that his method is supposed to be alignment-free, it&#8217;s still difficult to\u00a0draw relationships to what we already use as the gold standard in traditional protein structure prediction problems.\u00a0Also, note that the structure distance spans between -0.1 and 0.1 units whereas the sequence distance spans between -0.3 and 0.5. The differences in scale are also not covered &#8211; i.e., is a difference of 0.01 units an expected value for protein structure distance, and why are the jumps in protein structure distance so much smaller than jumps in sequence space?<\/p>\n<p>The author makes more interesting observations in the dataset (e.g. \u03b1\/\u03b2\u00a0mixed proteins are more tolerant to mutations in comparison to \u03b1- or \u03b2-only proteins) but the observations are not discussed in depth. If\u00a0\u03b1\/\u03b2-mixed proteins are indeed more resilient to mutations,\u00a0<em>why\u00a0<\/em>is this the case? Conversely, if small mutations change\u00a0\u03b1- or \u03b2-only proteins&#8217; structures to make new folds, having any speculation on the underlying mechanism (e.g. 
maybe \u03b1-only proteins are only\u00a0sensitive to radically different amino acid substitutions, such as ALA-&gt;ARG)\u00a0will only\u00a0help our prediction methods. Overall I had the impression that the author was a bit too pessimistic about what modelling can achieve. Though we definitely cannot model all proteins that are out there at present,\u00a0I believe\u00a0the surge of new sources of data (e.g. cryo-EM structures) will provide an alternative inference route for better prediction methods in the future.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>First, happy new year to all our Blopig fans, and we all hope 2016 will be\u00a0awesome! A couple of months ago, I was covering this article\u00a0by Shalom Rackovsky.\u00a0The big question that jumps out of the paper is, has modelling reached its limits? Or, in other words, can bioinformatics techniques be used to model every protein? [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"ppma_author":[511],"class_list":["post-2779","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":511,"user_id":22,"is_guest":0,"slug":"jinwoo","display_name":"Jinwoo 
Leem","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/65d338dc0b03d3026aa9a98f5e43889ca6c9ac9d0f45fe65ea5931207597ce2d?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/2779","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=2779"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/2779\/revisions"}],"predecessor-version":[{"id":2787,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/2779\/revisions\/2787"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=2779"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=2779"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=2779"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2779"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}