Monthly Archives: November 2016

How to Calculate PLIFs Using RDKit and PLIP

Protein-Ligand interaction fingerprints (PLIFs) are becoming more widely used to compare small molecules in the context of a protein target. A fingerprint is a bit vector that is used to represent a small molecule. Fingerprints of molecules can then be compared to determine the similarity between two molecules. Rather than using the features of the ligand to build the fingerprint, a PLIF is based on the interactions between the protein and the small molecule. The conventional method of building a PLIF is that each bit of the bit vector represents a residue in the binding pocket of the protein. The bit is set to 1 if the molecule forms an interaction with the residue, whereas it is set to 0 if it does not.

Constructing a PLIF therefore consists of two parts:

  1. Calculating the interactions formed by a small molecule from the target
  2. Collating this information into a bit vector.

Step 1 can be achieved by using the Protein-Ligand Interaction Profiler (PLIP). PLIP is an easy-to-use tool, that given a pdb file will calculate the interactions between the ligand and protein. This can be done using the online web-tool or alternatively using the command-line tool. Six different interaction types are calculated: hydrophobic, hydrogen-bonds, water-mediated hydrogen bonds, salt bridges, pi-pi and pi-cation. The command-line version outputs an xml report file containing all the information required to construct a PLIF.

Step 2 involves manipulating the output of the report file into a bit vector. RDKit is an amazingly useful Cheminformatics toolkit with great documentation. By reading the PLIF into an RDKit bit vector this allows the vector to be manipulated as an RDKit fingerprint. The fingerprints can then be compared using RDKit functionality very easily, for example, using Tanimoto Similarity.

EXAMPLE:

Let’s take 3 pdb files as an example. Fragment screening data from the SGC is a great sort of data for this analysis, as it contains lots of pdb structures of small hits bound to the same target. The data can be found here. For this example I will use 3 protein-ligand complexes from the BRD1 dataset: BRD1A-m004.pdb, BRD1A-m006.pdb and BRD1A-m009.pdb.

brd1_sgc

1.PLIP First we need to run plip to generate a report file for each protein-ligand complex. This is done using:


 

plipcmd -f BRD1A-m004.pdb -o m004 -x

plipcmd -f BRD1A-m006.pdb -o m006 -x

plipcmd -f BRD1A-m009.pdb -o m009 -x

 


A report file (‘report.xml’) is created for each pdb file within the directory m004, m006 and m009.

2. Get Interactions: Using a python script the results of the report can be collated using the function “generate_plif_lists” (shown below) on each report file. The function takes in the report file name, and the residues already found to be in the binding site (residue_list). “residue_list” must be updated for each molecule to be compared as the residues used to define the binding site can vary betwen each report file. The function then returns the updated “residue_list”, as well as a list of residues found to interact with the ligand: “plif_list_all”.

 


import xml.etree.ElementTree as ET

################################################################################

def generate_plif_lists(report_file, residue_list, lig_ident):

    #uses report.xml from PLIP to return list of interacting residues and update list of residues in binding site

        plif_list_all = []

        tree = ET.parse(report_file)

        root = tree.getroot()

        #list of residue keys that form an interaction

        for binding_site in root.findall('bindingsite'):

                nest = binding_site.find('identifiers')

                lig_code = nest.find('hetid')

                if str(lig_code.text) == str(lig_ident):

                        #get the plifs stuff here

                        nest_residue = binding_site.find('bs_residues')

                        residue_list_tree = nest_residue.findall('bs_residue')

                        for residue in residue_list_tree:

                                res_id = residue.text

                                dict_res_temp = residue.attrib

                                if res_id not in residue_list:

                                        residue_list.append(res_id)

                                if dict_res_temp['contact'] == 'True':

                                        if res_id not in plif_list_all:

                                                plif_list_all.append(res_id)

        return plif_list_all, residue_list

###############################################################################

plif_list_m006, residue_list = generate_plif_lists('m006/report.xml',residue_list, 'LIG')

plif_list_m009, residue_list = generate_plif_lists('m009/report.xml', residue_list, 'LIG')

plif_list_m004, residue_list = generate_plif_lists('m004/report.xml', residue_list, 'LIG')


3. Read Into RDKit: Now we have the list of binding site residues and which residues are interacting with the ligand a PLIF can be generated. This is done using the function shown below (“generate_rdkit_plif”):


from rdkit import Chem,  DataStructs

from rdkit.DataStructs import cDataStructs

################################################################################

def generate_rdkit_plif(residue_list, plif_list_all):

    #generates RDKit plif given list of residues in binding site and list of interacting residues

    plif_rdkit = DataStructs.ExplicitBitVect(len(residue_list), False)

    for index, res in enumerate(residue_list):

        if res in plif_list_all:

            print 'here'

            plif_rdkit.SetBit(index)

        else:

            continue

    return plif_rdkit

#########################################################################

plif_m006 = generate_rdkit_plif(residue_list, plif_list_m006)

plif_m009 = generate_rdkit_plif(residue_list, plif_list_m009)

plif_m004 = generate_rdkit_plif(residue_list, plif_list_m004)


4. Play! These PLIFs can now be compared using RDKit functionality. For example the Tanimoto similarity between the ligands can be computed:


def similarity_plifs(plif_1, plif_2):

    sim = DataStructs.TanimotoSimilarity(plif_1, plif_2)

    print sim

    return sim

###################################################################

print similarity_plifs(plif_m006, plif_m009)

print similarity_plifs(plif_m006, plif_m004)

print similarity_plifs(plif_m009, plif_m004)


The output is: 0.2, 0.5, 0.0.

All files used to generate the PLIFs cound be found here. Happy PLIF-making!

End of an era?

The Era of Crystallography ends…

For over 100 years, crystallography has been used to determine the atom arrangements of molecules; specifically, it has become the workhorse of routine macromolecular structure solution, being responsible for over 90% of the atomic structures in the PDB. Whilst this achievement is impressive, in some ways it has come around despite the crystallographic method, rather than because of it…

The problem, generally, is this: to perform crystallography, you need crystals. Crystals require the spontaneous assembly of billions of molecules into a regular repeated arrangement. For proteins — large, complex, irregularly shaped molecules — this is not generally a natural state for them to exist in, and getting a protein to crystallise can be a difficult process (the notable exception is Lysozyme, which it is difficult NOT to crystallise, and there are subsequently currently ~1700 crystal structures of it in the PDB). Determining the conditions under which proteins will crystallise requires extensive screening: placing the protein into a variety of difference solutions, in the hope that in one of these, the protein will spontaneously self-assemble into (robust, homogeneous) crystals. As for membrane proteins, which… exist in membranes, crystallisation solutions are sort of ridiculous (clever, but ridiculous).

But even once a crystal is obtained (and assuming it is a “good” well-diffracting crystal), diffraction experiments alone are generally not enough to determine the atomic structure of the crystal. In a crystallographic experiment, only half of the data required to solve the structure of the crystal is measured — the amplitudes. The other half of the data — the phases — are not measured. This constitutes the “phase problem” of crystallography, and “causes some problems”: developing methods to solve the phase problem is essentially a field of its own.

…and the Era of Cryo-Electron Microscopy begins

Cryo-electron microscopy (cryo-EM; primers here and here), circumnavigates both of the problems with crystallography described above (although of course it has some of its own). Single-particles of the protein (or protein complex) are deposited onto grids and immobilised, removing the need for crystals altogether. Furthermore, the phases can be measured directly, removing the need to overcome the phase problem.

Cryo-EM is also really good for determining the structures of large complexes, which are normally out of the reach of crystallography, and although cryo-EM structures used to only be determined at low resolution, this is changing quickly with improved experimental hardware.

Cryo-Electron Microscopy is getting better and better every day. For structural biologists, it seems like it’s going to be difficult to avoid it. However, for crystallographers, don’t worry, there is hope.

Start2Fold: A database of protein folding and stability data

Hydrogen/deuterium exchange (HDX) experiments are used to probe the tertiary structures and folding pathways of proteins. The rate of proton exchange between a given residue’s backbone amide proton and the surrounding solvent depends on the solvent exposure of the residue. By refolding a protein under exchange conditions, these experiments can identify which regions quickly become solvent-inaccessible, and which regions undergo exchange for longer, providing information about the refolding pathway.

Although there are many examples of individual HDX experiments in the literature, the heterogeneous nature of the data has deterred comprehensive analyses. Start2Fold (Start2Fold.eu) [1] is a curated database that aims to present protein folding and stability data derived from solvent-exchange experiments in a comparable and accessible form. For each protein entry, residues are classified as early/intermediate/late based on folding data, or strong/medium/weak based on stability data. Each entry includes the PDB code, length, and sequence of the protein, as well as details of the experimental method. The database currently includes 57 entries, most of which have both folding and stability data. Hopefully, this database will grow as scientists add their own experimental data, and reveal useful information about how proteins refold.

The folding data available in Start2Fold is visualised in the figure below, with early, intermediate and late folding residues coloured light, medium and dark blue, respectively.

start2foldpng

[1] Pancsa, R., Varadi, M., Tompa, P., Vranken, W.F., 2016. Start2Fold: a database of hydrogen/deuterium exchange data on protein folding and stability. Nucleic Acids Res. 44, D429-34.