Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools working in Python! The data frames offer a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.

Tools for the job

There are a couple of available tools to load PDB data into a pandas dataframe. BioPandas is an open source library for this very task and makes the process easy. I also have developed my own library, PythonPDB, that can load PDB files into pandas dataframes (I develop the library the to bridge a gap between the BioPython OOP style data model and the BioPandas dataframe data model).

Example

import nglview
view = nglview.show_file('data/6eqb.pdb')
view

Now let’s load this molecules into a pandas dataframe and do some analysis. The PDB data is loaded into columns in a similar way that PDB files are formatted as columns.

from python_pdb.parsers import parse_pdb_to_pandas

with open('data/6eqb.pdb', 'r') as fh:
    df = parse_pdb_to_pandas(fh.read())

df

To start things off, let’s clean up this structure by highlighting one of the most powerful aspects of this approach: querying. As you can seen from the data frame above- there are water molecules in the structure that we might not care about. Let’s remove them…

df_clean = df.query("record_type == 'ATOM'")  # or "residue_name != 'HOH'" would have worked as well
df_clean.tail()

By using pandas’ built in querying function- we can easily get rid of the HETATMs in the file that we might not care about. We can also use this querying to just select the TCR or pMHC to perform analysis on this molecule seperately. In this example, the TCR $\alpha$- and $\beta$ chain is labelled as chains D and E repectively. The MHC molecule is chain A (B- representing the $\beta_2$m) and C the peptide.

!head -n5 data/6eqb.pdb
REMARK   5 IMGT RENUMBERED STRUCTURE 6EQB GENERATED BY STCRDAB
REMARK   5 TCR CHAINS ARE RENUMBERED IN THE VARIABLE REGIONS ONLY
REMARK   5 MHC CHAINS ARE RENUMBERED IN THE G DOMAINS OR FOR B2M-GLOBULIN
REMARK   5 NON-TCR AND NON-MHC CHAINS ARE LEFT WITH RESIDUE IDS AS IN PDB
REMARK   5 PAIRED_ABTCR BCHAIN=E ACHAIN=D MHCCHAINS=AB AGCHAIN=C AGTYPE=PEPTIDE

tcr_df = df_clean.query("chain_id == 'D' or chain_id == 'E'")
tcr_df
mhc_df = df_clean.query("chain_id == 'A'")
mhc_df
peptide_df = df_clean.query("chain_id == 'C'")

peptide_residues_df = peptide_df.groupby(['residue_seq_id', 'residue_insert_code'], dropna=False)
print(f'The peptide is a {len(peptide_residues_df)}-mer!')
The peptide is a 9-mer!

Another advantage of using pandas is we can add new columns to annotate properties in the structure that we care about. In this example, since the TCR is from STCRDab and has been renumbered using the IMGT numbering convention we can easily identify the complimentary determining regions based on their residue_seq_id property.

IMGT_CDR1 = set(range(27, 38 + 1))
IMGT_CDR2 = set(range(56, 65 + 1))
IMGT_CDR3 = set(range(105, 117 + 1))


def assign_cdr_number(seq_id: int) -> int | None:
    '''
    Map imgt_id to CDR domains, return number associated with domain or return None if input is not in a CDR
    domain.
    '''
    if seq_id in IMGT_CDR1:
        return 1

    if seq_id in IMGT_CDR2:
        return 2

    if seq_id in IMGT_CDR3:
        return 3

    return None

tcr_df = tcr_df.copy()  # Doing this on a copy of the dataframe since it is originally a slice of df!
tcr_df['cdr'] = tcr_df['residue_seq_id'].map(assign_cdr_number)

We can also annotations for the α- and β-chain since these are defined by the STCRDab header.

tcr_df['chain_type'] = tcr_df['chain_id'].map(lambda chain_id: 'alpha' if chain_id == 'D' else 'beta')

Now we can easily get rich information about the TCR CDR loops with ease.

tcr_cdrs_df = tcr_df.query('cdr.notnull()')

cdr_lengths = tcr_cdrs_df[
    ['chain_type', 'cdr', 'residue_seq_id', 'residue_insert_code']
].drop_duplicates().groupby(['chain_type', 'cdr'], dropna=False).size()

cdr_lengths
chain_type  cdr
alpha       1.0     6
            2.0     6
            3.0     9
beta        1.0     6
            2.0     5
            3.0    13
dtype: int64

We can also easily properties such as b-factors in the TCR variable domain.

import seaborn as sns
tcr_variable_df = tcr_df.query('residue_seq_id <= 129')
sns.lineplot(x=tcr_variable_df['residue_seq_id'], y=tcr_variable_df['b_factor'])

Finally, computing things like contacting residues between the TCR CDR loops and the peptide is a breeze.

import numpy as np
CONTACT_DISTANCE = 5  # Angstroms (Å)

def euclidean_distance(x1, y1, z1, x2, y2, z2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2  + (z2 - z1)**2)


interface = tcr_cdrs_df.merge(peptide_df, how='cross', suffixes=('_tcr', '_peptide'))
interface['atom_distances'] = euclidean_distance(interface['pos_x_tcr'], interface['pos_y_tcr'], interface['pos_z_tcr'],
                                                 interface['pos_x_peptide'], interface['pos_y_peptide'], interface['pos_z_peptide'])

contacting_atoms = interface[interface['atom_distances'] <= CONTACT_DISTANCE]
contacting_residues = contacting_atoms[['chain_id_tcr', 'residue_seq_id_tcr', 'residue_insert_code_tcr', 'cdr', 'chain_type',
                                        'residue_seq_id_peptide', 'residue_insert_code_peptide']].drop_duplicates()

contacting_residues

The full notebook can be found here!

Author