{"id":10958,"date":"2024-02-09T14:42:41","date_gmt":"2024-02-09T14:42:41","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=10958"},"modified":"2025-10-03T23:05:57","modified_gmt":"2025-10-03T22:05:57","slug":"working-with-pdb-structures-in-pandas","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2024\/02\/working-with-pdb-structures-in-pandas\/","title":{"rendered":"Working with PDB Structures in Pandas"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/pandas.pydata.org\/\">Pandas<\/a> is one of my favourite data analysis tools working in Python! The data frames offer a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Tools for the job<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">There are a couple of available tools to load PDB data into a pandas dataframe. <a href=\"https:\/\/biopandas.github.io\/biopandas\/\">BioPandas<\/a> is an open source library for this very task and makes the process easy. I also have developed my own library, <a href=\"https:\/\/github.com\/benjiemc\/PythonPDB\">PythonPDB<\/a>, that can load PDB files into pandas dataframes (I develop the library the to bridge a gap between the BioPython OOP style data model and the BioPandas dataframe data model).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Example<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>import nglview\nview = nglview.show_file('data\/6eqb.pdb')\nview<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.22.45%E2%80%AFPM.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"945\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.22.45%E2%80%AFPM.png?resize=625%2C945&#038;ssl=1\" alt=\"\" class=\"wp-image-10977\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.22.45%E2%80%AFPM.png?resize=677%2C1024&amp;ssl=1 677w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.22.45%E2%80%AFPM.png?resize=198%2C300&amp;ssl=1 198w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.22.45%E2%80%AFPM.png?resize=768%2C1161&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.22.45%E2%80%AFPM.png?resize=624%2C943&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.22.45%E2%80%AFPM.png?w=895&amp;ssl=1 895w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now let&#8217;s load this molecules into a pandas dataframe and do some analysis. The PDB data is loaded into columns in a similar way that PDB files are formatted as columns.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from python_pdb.parsers import parse_pdb_to_pandas\n\nwith open('data\/6eqb.pdb', 'r') as fh:\n    df = parse_pdb_to_pandas(fh.read())\n\ndf<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.01.36%E2%80%AFPM.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"250\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.01.36%E2%80%AFPM.png?resize=625%2C250&#038;ssl=1\" alt=\"\" class=\"wp-image-10976\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.01.36%E2%80%AFPM.png?resize=1024%2C410&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.01.36%E2%80%AFPM.png?resize=300%2C120&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.01.36%E2%80%AFPM.png?resize=768%2C307&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.01.36%E2%80%AFPM.png?resize=624%2C250&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.01.36%E2%80%AFPM.png?w=1105&amp;ssl=1 1105w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To start things off, let&#8217;s clean up this structure by highlighting one of the most powerful aspects of this approach: querying. As you can seen from the data frame above- there are water molecules in the structure that we might not care about. Let&#8217;s remove them\u2026<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df_clean = df.query(\"record_type == 'ATOM'\")  # or \"residue_name != 'HOH'\" would have worked as well\ndf_clean.tail()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">By using pandas&#8217; built in querying function- we can easily get rid of the HETATMs in the file that we might not care about. We can also use this querying to just select the TCR or pMHC to perform analysis on this molecule seperately. In this example, the TCR $\\alpha$- and $\\beta$ chain is labelled as chains D and E repectively. The MHC molecule is chain A (B- representing the $\\beta_2$m) and C the peptide.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>!head -n5 data\/6eqb.pdb<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>REMARK   5 IMGT RENUMBERED STRUCTURE 6EQB GENERATED BY STCRDAB\nREMARK   5 TCR CHAINS ARE RENUMBERED IN THE VARIABLE REGIONS ONLY\nREMARK   5 MHC CHAINS ARE RENUMBERED IN THE G DOMAINS OR FOR B2M-GLOBULIN\nREMARK   5 NON-TCR AND NON-MHC CHAINS ARE LEFT WITH RESIDUE IDS AS IN PDB\nREMARK   5 PAIRED_ABTCR BCHAIN=E ACHAIN=D MHCCHAINS=AB AGCHAIN=C AGTYPE=PEPTIDE\n\n<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>tcr_df = df_clean.query(\"chain_id == 'D' or chain_id == 'E'\")\ntcr_df<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.16%E2%80%AFPM.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"204\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.16%E2%80%AFPM.png?resize=625%2C204&#038;ssl=1\" alt=\"\" class=\"wp-image-10978\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.16%E2%80%AFPM.png?resize=1024%2C334&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.16%E2%80%AFPM.png?resize=300%2C98&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.16%E2%80%AFPM.png?resize=768%2C250&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.16%E2%80%AFPM.png?resize=624%2C203&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.16%E2%80%AFPM.png?w=1292&amp;ssl=1 1292w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>mhc_df = df_clean.query(\"chain_id == 'A'\")\nmhc_df<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.47%E2%80%AFPM.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"205\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.47%E2%80%AFPM.png?resize=625%2C205&#038;ssl=1\" alt=\"\" class=\"wp-image-10979\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.47%E2%80%AFPM.png?resize=1024%2C336&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.47%E2%80%AFPM.png?resize=300%2C98&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.47%E2%80%AFPM.png?resize=768%2C252&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.47%E2%80%AFPM.png?resize=624%2C205&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.25.47%E2%80%AFPM.png?w=1290&amp;ssl=1 1290w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>peptide_df = df_clean.query(\"chain_id == 'C'\")\n\npeptide_residues_df = peptide_df.groupby(&#091;'residue_seq_id', 'residue_insert_code'], dropna=False)\nprint(f'The peptide is a {len(peptide_residues_df)}-mer!')<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>The peptide is a 9-mer!<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Another advantage of using pandas is we can add new columns to annotate properties in the structure that we care about. In this example, since the TCR is from STCRDab and has been renumbered using the <a href=\"https:\/\/www.imgt.org\/IMGTScientificChart\/Nomenclature\/IMGT-FRCDRdefinition.html\">IMGT numbering convention<\/a> we can easily identify the complimentary determining regions based on their <code>residue_seq_id<\/code> property.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>IMGT_CDR1 = set(range(27, 38 + 1))\nIMGT_CDR2 = set(range(56, 65 + 1))\nIMGT_CDR3 = set(range(105, 117 + 1))\n\n\ndef assign_cdr_number(seq_id: int) -&gt; int | None:\n    '''\n    Map imgt_id to CDR domains, return number associated with domain or return None if input is not in a CDR\n    domain.\n    '''\n    if seq_id in IMGT_CDR1:\n        return 1\n\n    if seq_id in IMGT_CDR2:\n        return 2\n\n    if seq_id in IMGT_CDR3:\n        return 3\n\n    return None\n\ntcr_df = tcr_df.copy()  # Doing this on a copy of the dataframe since it is originally a slice of df!\ntcr_df&#091;'cdr'] = tcr_df&#091;'residue_seq_id'].map(assign_cdr_number)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We can also annotations for the&nbsp;\u03b1-&nbsp;and \u03b2-chain since these are defined by the STCRDab header.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tcr_df&#091;'chain_type'] = tcr_df&#091;'chain_id'].map(lambda chain_id: 'alpha' if chain_id == 'D' else 'beta')<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now we can easily get rich information about the TCR CDR loops with ease.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tcr_cdrs_df = tcr_df.query('cdr.notnull()')\n\ncdr_lengths = tcr_cdrs_df&#091;\n    &#091;'chain_type', 'cdr', 'residue_seq_id', 'residue_insert_code']\n].drop_duplicates().groupby(&#091;'chain_type', 'cdr'], dropna=False).size()\n\ncdr_lengths<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>chain_type  cdr\nalpha       1.0     6\n            2.0     6\n            3.0     9\nbeta        1.0     6\n            2.0     5\n            3.0    13\ndtype: int64<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We can also easily properties such as b-factors in the TCR variable domain.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import seaborn as sns\ntcr_variable_df = tcr_df.query('residue_seq_id &lt;= 129')\nsns.lineplot(x=tcr_variable_df&#091;'residue_seq_id'], y=tcr_variable_df&#091;'b_factor'])<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/image.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"563\" height=\"433\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/image.png?resize=563%2C433&#038;ssl=1\" alt=\"\" class=\"wp-image-10980\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/image.png?w=563&amp;ssl=1 563w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/image.png?resize=300%2C231&amp;ssl=1 300w\" sizes=\"auto, (max-width: 563px) 100vw, 563px\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, computing things like contacting residues between the TCR CDR loops and the peptide is a breeze.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\nCONTACT_DISTANCE = 5  # Angstroms (\u00c5)\n\ndef euclidean_distance(x1, y1, z1, x2, y2, z2):\n    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2  + (z2 - z1)**2)\n\n\ninterface = tcr_cdrs_df.merge(peptide_df, how='cross', suffixes=('_tcr', '_peptide'))\ninterface&#091;'atom_distances'] = euclidean_distance(interface&#091;'pos_x_tcr'], interface&#091;'pos_y_tcr'], interface&#091;'pos_z_tcr'],\n                                                 interface&#091;'pos_x_peptide'], interface&#091;'pos_y_peptide'], interface&#091;'pos_z_peptide'])\n\ncontacting_atoms = interface&#091;interface&#091;'atom_distances'] &lt;= CONTACT_DISTANCE]\ncontacting_residues = contacting_atoms&#091;&#091;'chain_id_tcr', 'residue_seq_id_tcr', 'residue_insert_code_tcr', 'cdr', 'chain_type',\n                                        'residue_seq_id_peptide', 'residue_insert_code_peptide']].drop_duplicates()\n\ncontacting_residues<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.30.39%E2%80%AFPM.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"402\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.30.39%E2%80%AFPM.png?resize=625%2C402&#038;ssl=1\" alt=\"\" class=\"wp-image-10981\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.30.39%E2%80%AFPM.png?resize=1024%2C659&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.30.39%E2%80%AFPM.png?resize=300%2C193&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.30.39%E2%80%AFPM.png?resize=768%2C494&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.30.39%E2%80%AFPM.png?resize=624%2C402&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/02\/Screenshot-2024-02-09-at-2.30.39%E2%80%AFPM.png?w=1170&amp;ssl=1 1170w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The full notebook can be <a href=\"https:\/\/github.com\/benjiemc\/PythonPDB\/blob\/main\/examples\/Working%20with%20PDB%20Structures%20in%20Pandas.ipynb\">found here<\/a>!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Pandas is one of my favourite data analysis tools working in Python! The data frames offer a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how [&hellip;]<\/p>\n","protected":false},"author":109,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[361,14,186,228,221,227],"tags":[5,152],"ppma_author":[714],"class_list":["post-10958","post","type-post","status-publish","format-standard","hentry","category-data-science","category-howto","category-immunoinformatics","category-protein-structure","category-python","category-python-code","tag-protein-structure","tag-python"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":714,"user_id":109,"is_guest":0,"slug":"benjamin","display_name":"Benjamin McMaster","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/d08cff80235bc80063c59072381da602325bce09b5774c2bc0db69545176f359?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"McMaster","first_name":"Benjamin","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/10958","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/109"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=10958"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/10958\/revisions"}],"predecessor-version":[{"id":10986,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/10958\/revisions\/10986"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=10958"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=10958"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=10958"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=10958"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}