{"id":9893,"date":"2023-05-23T11:27:36","date_gmt":"2023-05-23T10:27:36","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=9893"},"modified":"2023-05-23T16:14:49","modified_gmt":"2023-05-23T15:14:49","slug":"pairwise-sequence-identity-and-tanimoto-similarity-in-pdbbind","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2023\/05\/pairwise-sequence-identity-and-tanimoto-similarity-in-pdbbind\/","title":{"rendered":"Pairwise sequence identity and Tanimoto similarity in PDBbind"},"content":{"rendered":"\n<p>In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.<\/p>\n\n\n\n<p>A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the <em>BulkTanimotoSimilarity<\/em> function in RDKit. The following code should do the trick:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from rdkit.Chem import AllChem, MolFromMol2File\nfrom rdkit.DataStructs import BulkTanimotoSimilarity\nimport numpy as np\nimport os\n\nfps = []\nfor pdb in pdbs:\n    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))\n    fps.append(AllChem.GetMorganFingerprint(mol, 3))\n\nsims = []\nfor i in range(len(fps)):\n    sims.append(BulkTanimotoSimilarity(fps[i],fps))\n\narr = np.array(sims)\nnp.savez_compressed('data\/tanimoto_similarity.npz', arr)<\/pre>\n\n\n\n<p>Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call <em>QUERY.fasta<\/em>. This is what the first few lines look like:<\/p>\n\n\n\n<!--more-->\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>3zzf_A\nNGFSATRSTVIQLLNNISTKREVEQYLKYFTSVSQQQFAVIKVGGAIISDNLHELASCLA\nFLYHVGLYPIVLHGTGPQVNGRLEAQGIEPDYIDGIRITDEHTMAVVRKCFLEQNLKLVT\nALEQLGVRARPITSGVFTADYLDKDKYKLVGNIKSVTKEPIEASIKAGALPILTSLAETA\nSGQMLNVNADVAAGELARVFEPLKIVYLNEKGGIINGSTGEKISMINLDEEYDDLMKQSW\nVKYGTKLKIREIKELLDYLPRSSSVAIINVQDLQKELFTDSGAGTMIRRGY\n>3gww_A\nREHWATRLGLILAMAGNAVGLGNFLRFPVQAAENGGGAFMIPYIIAFLLVGIPLMWIEWA\nMGRYGGAQGHGTTPAIFYLLWRNRFAKILGVFGLWIPLVVAIYYVYIESWTLGFAIKFLV\nGLVPEPPPDSILRPFKEFLYSYIGVPKGDEPILKPSLFAYIVFLITMFINVSILIRGISK\nGIERFAKIAMPTLFILAVFLVIRVFLLETPNGTAADGLNFLWTPDFEKLKDPGVWIAAVG\nQIFFTLSLGFGAIITYASYVRKDQDIVLSGLTAATLNEKAEVILGGSISIPAAVAFFGVA\nNAVAIAKAGAFNLGFITLPAIFSQTAGGTFLGFLWFFLLFFAGLTSSIAIMQPMIAFLED\nELKLSRKHAVLWTAAIVFFSAHLVMFLNKSLDEMDFWAGTIGVVFFGLTELIIFFWIFGA\nDKAWEEINRGGIIKVPRIYYYVMRYITPAFLAVLLVVWAREYIPKIMEETHWTVWITRFY\nIIGLFLFLTFLVFLAERRRNH\n>1w8l_A\nMVNPTVFFDIAVDGEPLGRVSFELFADKVPKTAENFRALSTGEKGFGYKGSCFHRIIPGF\nMCQGGDFTRHNGTGGKSIYGEKFEDENFILKHTGPGILSMANAGPNTNGSQFFICTAKTE\nWLDGKHVVFGKVKEGMNIVEAMERFGSRNGKTSKKITIADCGQLE<\/pre>\n\n\n\n<p>Most of the of the speed gains that MMseqs2 achieves in many-against-many sequence searching comes from its prefiltering stage, where it removes most target sequences for a given query that don&#8217;t fulfil certain requirements. Since we want to calculate the sequence identity for every pair, we have to do a fake prefiltering step to enable all-vs-all alignments and sequence identity calculations. As described in the user guide, we can create a shell function that performs this step:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">fake_pref() {\nQDB=\"$1\"\nTDB=\"$2\"\nRES=\"$3\"\n# create link to data file which contains a list of all targets that should be\n\u21aa aligned\nln -s \"${TDB}.index\" \"${RES}\"\n# create new index repeatedly pointing to same entry\nINDEX_SIZE=\"$(echo $(wc -c &amp;lt; \"${TDB}.index\"))\"\nawk -v size=$INDEX_SIZE '{ print $1\"\\t0\\t\"size; }' \"${QDB}.index\" &amp;gt; \"${RES}.index\"\n# create dbtype (7)\nawk 'BEGIN { printf(\"%c%c%c%c\",7,0,0,0); exit; }' &amp;gt; \"${RES}.dbtype\"\n}<\/pre>\n\n\n\n<p>Now we have all we need to calculate the sequence identity. We first create query and target databases, which are the same for us, followed by the fake prefiltering step. In the alignment step we prevent any sequences from being ignored with the <em>-e inf<\/em> option. Also, make sure to give the option <em>&#8211;alignment-mode 3, <\/em>in order to get the sequence identity. By default the sequence identity is calculated as the number of identical residues divided by the alignment length, however, the length of the shorter sequence for instance can be used intead with the option <em>&#8211;seq-id-mode 1<\/em>. Finally, the results are combined into <em>results.m8<\/em> with the <em>convertalis<\/em> command.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">mmseqs createdb QUERY.fasta targetDB --shuffle 0 \nmmseqs createdb QUERY.fasta queryDB --shuffle 0\nfake_pref queryDB targetDB allvsallpref\nmmseqs align queryDB targetDB allvsallpref allvsallaln -e inf --alignment-mode 3 --seq-id-mode 1\nmmseqs convertalis queryDB targetDB allvsallaln results.m8<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations. A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve [&hellip;]<\/p>\n","protected":false},"author":116,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[348,29,361,296,587,227,201],"tags":[344,24,152,129,134],"ppma_author":[718],"class_list":["post-9893","post","type-post","status-publish","format-standard","hentry","category-bash","category-code","category-data-science","category-hints-and-tips","category-macos","category-python-code","category-small-molecules","tag-bash","tag-bioinformatics","tag-python","tag-rdkit","tag-small-molecules"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":718,"user_id":116,"is_guest":0,"slug":"isak","display_name":"\u00cdsak Valsson","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/e335be777e21e786472b602b6d40c02b157226069db9a9665c60acba0e612f86?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/9893","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/116"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=9893"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/9893\/revisions"}],"predecessor-version":[{"id":9913,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/9893\/revisions\/9913"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=9893"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=9893"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=9893"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9893"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}