Checking your PDB file for clashing atoms

Detecting atom clashes in protein structures can be useful in a number of scenarios. For example if you are just about to start some molecular dynamics simulation, or if you want to check that a structure generated by a deep learning model is reasonable. It is quite straightforward to code, but I get the feeling that these sort of functions have been written from scratch hundreds of times. So to save you the effort, here is my implementation!!!

from Bio import PDB
import numpy as np

parser = PDB.PDBParser(QUIET=True)
downloader = PDB.PDBList()

# Atomic radii for various atom types. 
# You can comment out the ones you don't care about or add new ones
atom_radii = {
#    "H": 1.20,  # Who cares about hydrogen??
    "C": 1.70, 
    "N": 1.55, 
    "O": 1.52,
    "S": 1.80,
    "F": 1.47, 
    "P": 1.80, 
    "CL": 1.75, 
    "MG": 1.73,
}

def count_clashes(structure, clash_cutoff=0.63):
    # Set what we count as a clash for each pair of atoms
    clash_cutoffs = {i + "_" + j: (clash_cutoff * (atom_radii[i] + atom_radii[j])) for i in atom_radii for j in atom_radii}

    # Extract atoms for which we have a radii
    atoms = [x for x in structure.get_atoms() if x.element in atom_radii]
    coords = np.array([a.coord for a in atoms], dtype="d")

    # Build a KDTree (speedy!!!)
    kdt = PDB.kdtrees.KDTree(coords)

    # Initialize a list to hold clashes
    clashes = []

    # Iterate through all atoms
    for atom_1 in atoms:
        # Find atoms that could be clashing
        kdt_search = kdt.search(np.array(atom_1.coord, dtype="d"), max(clash_cutoffs.values()))

        # Get index and distance of potential clashes
        potential_clash = [(a.index, a.radius) for a in kdt_search]

        for ix, atom_distance in potential_clash:
            atom_2 = atoms[ix]

            # Exclude clashes from atoms in the same residue
            if atom_1.parent.id == atom_2.parent.id:
                continue

            # Exclude clashes from peptide bonds
            elif (atom_2.name == "C" and atom_1.name == "N") or (atom_2.name == "N" and atom_1.name == "C"):
                continue

            # Exclude clashes from disulphide bridges
            elif (atom_2.name == "SG" and atom_1.name == "SG") and atom_distance > 1.88:
                continue

            if atom_distance < clash_cutoffs[atom_2.element + "_" + atom_1.element]:
                clashes.append((atom_1, atom_2))

    return len(clashes) // 2

ChatGPT explains my code

I gave ChatGPT my code and asked it to explain. I think it does a good enough job!!!

ChatGPT explanation:

The count_clashes function analyses a PDB structure and determines the number of atom clashes present. It takes the structure and an optional clash cutoff value as input. The clash cutoff values are established for each pair of atom types based on their van der Waals radii. An atom pair is considered clashing if their distance is within a specified value times the sum of their radii.

To begin, the function extracts the coordinates of atoms with defined radii, filtering them based on their element type. A KDTree, a spatial searching data structure, is then constructed using the coordinates of the filtered atoms. This enables efficient searching.

Next, the function iterates through each atom in the filtered atom list. For each atom, a KDTree search is performed to identify potential clashes with other atoms. The function applies several conditions to exclude false positives: clashes within the same residue, clashes involving peptide bonds (C-N or N-C), and clashes between atoms participating in disulphide bridges (SG-SG) beyond a specific distance threshold.

If the distance between two atoms falls below the clash cutoff value for their respective atom types, they are considered a clash and added to the clashes list. Finally, the function returns the count of clashes divided by 2. This adjustment is necessary because each clash is detected twice (once for atom A with atom B and vice versa), providing an accurate count of clashes.

Feel free to customise the function according to your specific requirements, such as calculating contacts between protein chains or incorporating additional criteria to identify specific types of interactions.

Usage

Let’s consider an example where we have downloaded a PDB file with the ID “1GIG” using the PDB downloader. We want to analyse this structure and check for atom clashes using the count_clashes function. Here’s an example usage:

file = downloader.retrieve_pdb_file("1gig", pdir=".", file_format="pdb")
structure = parser.get_structure(file, file)
num_clashes = count_clashes(structure, clash_cutoff=0.63)
print("Number of atom clashes:", num_clashes)

When running the code, it will display the number of atom clashes detected in the structure:

Number of atom clashes: 5

So there you go!! A very well explained function to count the number of clashing atoms in a PDB file. Many thanks to my bestie ChatGPT for helping me out!!! If you have read this far I hope this was not a complete waste of your time.

Author