Addressing the Role of Conformational Diversity in Protein Structure Prediction

For my journal club last week, I chose to look at a recent paper entitled “Addressing the Role of Conformational Diversity in Protein Structure Prediction”, by Palopoli et al. [1]. In the study of proteins, structures are incredibly useful tools, offering information about how proteins carry out their function, and allowing informed decisions to be made in many areas (e.g. drug design). Since experimental structure determination is difficult, however, the computational prediction of protein structures has become very important (and a number of us here at OPIG work on this!).

A problem, however, in both experimental structure determination and computational structure prediction, is that proteins are generally treated as static – the output of an X-ray crystallography experiment is a single structure, and in the majority of cases the goal of structure prediction is to produce one model that closely resembles the native structure. The accuracy of structure prediction algorithms is also normally measured by comparing the resulting model to a single, known experimentally-determined structure. The issue here is that proteins are not static – they are constantly moving and may adopt a number of different conformations; the structure observed experimentally is just a snapshot of that motion. The dynamics of a protein may even play an important role in its function; an example is haemoglobin, which after binding to oxygen changes conformation to increase affinity for further binding. It may be more appropriate, then, to represent a protein as an ensemble of structures, and not just one.

Conformational diversity helps the protein haemoglobin carry out its function (the transportation of oxygen in the blood). Haemoglobin has four subunits, each containing a haem group, shown in red. When oxygen binds to this group (blue), a histidine residue moves, shifting the position of an alpha helix. This movement is propagated throughout the entire structure, and increases the affinity for oxygen of the other subunits – binding therefore becomes increasingly easy (this is known as co-operative binding). Gif shown is from the PDB-101 Molecule of the Month series: S. Dutta and D. Goodsell, doi:10.2210/rcsb_pdb/mom_2003_5

How, though, could this be incorporated into protein structure prediction? This is the question being considered by the authors of this paper. They consider conformational diversity by looking at different conformers of the same protein – there are many proteins whose structures have been solved experimentally multiple times, and as such have a number of structures available in the PDB. Information about this is stored in a useful database called CoDNaS [2], which was developed by some of the authors of the paper under discussion. In some cases, there are model (or decoy) structures available for these proteins, generated by various structure prediction algorithms – for example, all models submitted for the CASP experiments [3], where the current accuracy of structure prediction is monitored through blind prediction, are freely available for download. The authors curated a collection of decoy sets for 91 different proteins for which multiple experimental structures are present in the PDB.

As mentioned previously, the accuracy of a model is normally evaluated by measuring its structural similarity to one known (or reference) structure – only one conformer of the protein is considered. The authors show that the model rankings achieved by this are highly dependent on the chosen reference structure. If the possible choices (i.e. the observed conformers) are quite similar the effect is small, but if there is a large difference, then two completely different decoys could be designated as the most accurate depending on which reference structure is used.

The key figure from this paper, in my opinion, is the one shown below. For the two most dissimilar experimentally-observed conformers for each protein in the set, the RMSD of the best decoy in relation to one conformer is plotted against the RMSD of the best decoy when measured against the other:

The straight line on this graph indicates what would be observed if there are decoys in the set that equally represent the two conformers; for example, if the best decoy with reference to conformer 1 has an RMSD of 1 Å, then there is also a decoy that is 1 Å away from conformer 2. Most points are on or near this line – this means that the sets of decoy structures are not biased towards one of the conformers. Therefore, structure prediction algorithms seem to be able to generate models for multiple conformations of proteins, and so the production of an ensemble of models is not an impossible dream. Several obstacles remain, however – although of equal distance to both conformers, the decoys could still be of poor quality; and decoy selection is often inaccurate, and so finding these multiple conformations amongst all others is a challenge.
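To make the comparison behind this plot concrete, here is a minimal sketch in Python (using NumPy) of how the best decoy can be found against each of two conformers. It is not the authors' code: the function names, the synthetic coordinates and the choice of a Kabsch superposition before computing the RMSD are my own assumptions, made purely for illustration.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    P = P - P.mean(axis=0)  # remove the translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)  # SVD of the covariance matrix
    # guard against an improper rotation (a reflection)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

def best_decoy(decoys, reference):
    """Return (index, RMSD) of the decoy closest to the reference conformer."""
    rmsds = [kabsch_rmsd(decoy, reference) for decoy in decoys]
    best = int(np.argmin(rmsds))
    return best, rmsds[best]

# Synthetic stand-ins for real coordinates (hypothetical data)
rng = np.random.default_rng(0)
conformer_1 = rng.normal(size=(20, 3))
conformer_2 = conformer_1.copy()
conformer_2[10:] += [3.0, 0.0, 0.0]  # the second half of the chain has moved

decoys = [
    conformer_1 + rng.normal(scale=0.1, size=(20, 3)),  # resembles conformer 1
    conformer_2 + rng.normal(scale=0.1, size=(20, 3)),  # resembles conformer 2
]

best_vs_1 = best_decoy(decoys, conformer_1)
best_vs_2 = best_decoy(decoys, conformer_2)
print(best_vs_1, best_vs_2)
```

In this toy set a different decoy comes out on top depending on the reference conformer, which is exactly the ranking dependence described above; plotting the two best RMSDs against each other for every protein would give a figure of the same kind as the one in the paper.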

[1] – Palopoli, N., Monzon, A. M., Parisi, G., and Fornasari, M. S. (2016). Addressing the Role of Conformational Diversity in Protein Structure Prediction. PLoS One, 11, e0154923.

[2] – Monzon, A. M., Juritz, E., Fornasari, S., and Parisi, G. (2013). CoDNaS: a database of conformational diversity in the native state of proteins. Bioinformatics, 29, 2512–2514.

[3] – Moult, J., Pedersen, J. T., Judson, R., and Fidelis, K. (1995). A Large-Scale Experiment to Assess Protein Structure Prediction Methods. Proteins, 23, ii–iv.

Transgenic Mosquitoes

At the meeting on November 15 I covered a paper by Gantz et al. describing a method for creating transgenic mosquitoes that express antibodies hindering the development of malaria parasites.

The immune system is commonly divided into two categories: innate and adaptive. The innate immune system consists of non-specific defence mechanisms such as epithelial barriers, macrophages etc. The innate system is present in virtually every living organism. The adaptive immune system is responsible for the invader-specific defence response. It consists of B and T lymphocytes and encompasses antibody production. As only vertebrates possess the adaptive immune system, mosquitoes do not naturally produce antibodies, which hinders their ability to defend themselves against pathogens such as malaria.

In the study by Gantz et al. the authors inserted transgenes expressing three single-chain Fvs (m4B7, m2A10 and m1C3) into the previously-characterised chromosomal docking sites.

Figure 1: The RT-PCR experiments showing the scFv expression in different mosquito strains

RT-PCR was used to detect scFv transcripts in RNA isolated from the transgenic mosquitoes (see Figure 1). The experiments showed that the attP 44-C recipient line allowed expression of the transgenes coding for the scFvs.

The authors evaluated the impact of the modifications on the fitness of the mosquitoes. It was shown that the transgene expression does not reduce the lifespan of the mosquitoes, or their ability to procreate.

Expression of the scFvs targeted the parasite at both the early and late development stages. The transgenic mosquitoes displayed a significant reduction in the number of malaria sporozoites per infected female, in most cases completely inhibiting the sporozoite development.

Overall, the study showed that it is possible to develop transgenic mosquitoes that are resistant to malaria. If this method were combined with a mechanism for spreading the gene through a population, the malaria-resistant mosquitoes could be released into the environment, helping to fight the spread of this disease.

Interesting Antibody Papers

De Novo H3 prediction by C-terminal kink-biasing (Gray Lab) [here].

The authors introduce an improvement to the prediction of CDR-H3 in the form of a constraint for de novo decoy generation. Working from the observation that 80% of CDR-H3 loops have a kinked C-terminus (Weitzner et al., 2015, Structure), they bias the loops to assume this conformation (they show that it does not force ALL loops to do so!). The constraint is in the form of a pseudo bond angle between the Cα atoms of the three C-terminal residues and a pseudo dihedral angle for the three C-terminal residues and one adjacent residue in the framework. The bias takes the form of a penalty score if the generated angle falls outside mean ± 1σ. They use a quite stringent H3 loop benchmark of only 49 loops. Using this constraint on this dataset improves prediction for the majority of the loops. They also demonstrate the utility of the score for full Fv homology modeling and Ab-Ag docking.
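As a rough illustration of how such a constraint could be evaluated, the sketch below computes a pseudo bond angle and a pseudo dihedral from Cα coordinates and applies a penalty when a value lies more than one standard deviation from a target mean. This is not the authors' implementation: the function names, the flat penalty value and any means or standard deviations passed in are hypothetical placeholders.

```python
import numpy as np

def pseudo_angle(a, b, c):
    """Angle (degrees) at b formed by three points, e.g. consecutive C-alpha atoms."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def pseudo_dihedral(a, b, c, d):
    """Signed dihedral (degrees) defined by four points."""
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    b0, b1, b2 = a - b, c - b, d - c
    b1 = b1 / np.linalg.norm(b1)
    # project the outer bonds onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return float(np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))))

def kink_penalty(value, mean, sd, penalty=1.0):
    """Return 0 inside mean +/- sd, otherwise a flat (hypothetical) penalty."""
    # wrap the difference into [-180, 180) so periodic dihedrals are handled
    diff = (value - mean + 180.0) % 360.0 - 180.0
    return 0.0 if abs(diff) <= sd else penalty

# Toy coordinates, not taken from any real antibody
angle = pseudo_angle((1, 0, 0), (0, 0, 0), (0, 1, 0))
dihedral = pseudo_dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
print(angle, dihedral)
```

A flat penalty is the simplest possible choice; a scoring function used in practice might instead grow with the size of the deviation.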

Therapeutic vs synthetic vs natural antibodies (Ofran Lab) [here].

The authors analyzed 137 Ab-Ag complexes from the PDB. Those from hybridomas were classified as ‘natural’ and those coming from synthetic libraries as ‘synthetic’. They demonstrate that synthetic libraries overuse H3 in the number of contacts the antibody forms with the antigen, whereas natural constructs share the paratope with H1 & H2 to a larger extent. This, together with their tool, CDRs analyzer (analysis of structural & biochemical properties of Ab-Ag complexes), can be a useful method to inform the design of antibodies.

From the past: TABHU, tools for antibody humanization (Tramontano Lab) [here].

The authors have created a tool to aid antibody humanization. Given the sequence of an antibody, the system looks for the most suitable template from their extensive sequence databases (DIGIT) and germline sequences from IMGT. The templates are assessed on sequence similarity to the query and on the similarity of the ‘binding’ mode, which is assessed by their paratope prediction tool proABC. After the template has been chosen, the user can produce a structural model of the sequence.

The Emerging Disorder-Function Paradigm

It’s rare to find a paper that connects all of the diverse areas of research of OPIG, but “The rules of disorder or why disorder rules” by Gsponer and Babu (2009) is one such paper. Protein folding, protein-protein interaction networks, protein loops (Schlessinger et al., 2007), and drug discovery all play a part in this story. What’s great about this paper is that it gives numerous examples of proteins and the evidence supporting that they are partially or completely unstructured. These are the so-called intrinsically unstructured proteins or IUPs, although more recently they are also being referred to as intrinsically disordered proteins, or IDPs. Intrinsically disordered regions (IDRs) “are polypeptide segments that do not contain sufficient hydrophobic amino acids to mediate co-operative folding” (Babu, 2016).

Such proteins contradict the classic “lock and key” hypothesis of Fischer, and challenge …

How to Calculate PLIFs Using RDKit and PLIP

Protein-Ligand interaction fingerprints (PLIFs) are becoming more widely used to compare small molecules in the context of a protein target. A fingerprint is a bit vector that is used to represent a small molecule. Fingerprints of molecules can then be compared to determine the similarity between two molecules. Rather than using the features of the ligand to build the fingerprint, a PLIF is based on the interactions between the protein and the small molecule. The conventional method of building a PLIF is that each bit of the bit vector represents a residue in the binding pocket of the protein. The bit is set to 1 if the molecule forms an interaction with the residue, whereas it is set to 0 if it does not.

Constructing a PLIF therefore consists of two parts:

1. Calculating the interactions formed by a small molecule with the target.
2. Collating this information into a bit vector.

Step 1 can be achieved by using the Protein-Ligand Interaction Profiler (PLIP). PLIP is an easy-to-use tool that, given a pdb file, will calculate the interactions between the ligand and the protein. This can be done using the online web-tool or alternatively using the command-line tool. Six different interaction types are calculated: hydrophobic, hydrogen bonds, water-mediated hydrogen bonds, salt bridges, pi-pi and pi-cation. The command-line version outputs an xml report file containing all the information required to construct a PLIF.

Step 2 involves manipulating the output of the report file into a bit vector. RDKit is an amazingly useful cheminformatics toolkit with great documentation. Reading the PLIF into an RDKit bit vector allows it to be manipulated as an RDKit fingerprint. The fingerprints can then be compared very easily using RDKit functionality, for example, using Tanimoto similarity.
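Before bringing in RDKit, it may help to see what the Tanimoto similarity of two PLIFs actually measures: the number of residues contacted by both ligands divided by the number contacted by at least one of them. The sketch below is a plain-Python illustration only; the residue labels and bit values are made up, and the helper name `tanimoto` is my own.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two equal-length 0/1 vectors."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    # define the similarity of two empty fingerprints as 0
    return both / either if either else 0.0

# Hypothetical binding pocket: one bit per residue, in a fixed order
pocket = ["ASP25", "GLY27", "ILE50", "VAL82"]
plif_x = [1, 1, 0, 1]  # ligand X contacts ASP25, GLY27 and VAL82
plif_y = [1, 0, 0, 1]  # ligand Y contacts ASP25 and VAL82

print(tanimoto(plif_x, plif_y))  # 2 shared contacts / 3 contacts in total
```

Because both vectors are indexed by the same residue list, the comparison only makes sense when every PLIF is built against a common set of binding-site residues.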

EXAMPLE:

Let’s take 3 pdb files as an example. Fragment screening data from the SGC is a great source of data for this analysis, as it contains lots of pdb structures of small hits bound to the same target. The data can be found here. For this example I will use 3 protein-ligand complexes from the BRD1 dataset: BRD1A-m004.pdb, BRD1A-m006.pdb and BRD1A-m009.pdb.

1. PLIP: First we need to run PLIP to generate a report file for each protein-ligand complex. This is done using:


 
plipcmd -f BRD1A-m004.pdb -o m004 -x
plipcmd -f BRD1A-m006.pdb -o m006 -x
plipcmd -f BRD1A-m009.pdb -o m009 -x

A report file (‘report.xml’) is created for each pdb file, within the directories m004, m006 and m009 respectively.

2. Get Interactions: Using a python script, the results of each report can be collated by calling the function “generate_plif_lists” (shown below) on each report file. The function takes in the report file name, the residues already found to be in the binding site (“residue_list”) and the identifier of the ligand (“lig_ident”). “residue_list” must be updated for each molecule to be compared, as the residues used to define the binding site can vary between report files. The function then returns a list of residues found to interact with the ligand (“plif_list_all”), as well as the updated “residue_list”.



import xml.etree.ElementTree as ET
################################################################################
def generate_plif_lists(report_file, residue_list, lig_ident):
    # uses report.xml from PLIP to return the list of interacting residues
    # and update the list of residues in the binding site
    plif_list_all = []
    tree = ET.parse(report_file)
    root = tree.getroot()
    for binding_site in root.findall('bindingsite'):
        # only look at the binding site of the requested ligand
        nest = binding_site.find('identifiers')
        lig_code = nest.find('hetid')
        if str(lig_code.text) == str(lig_ident):
            nest_residue = binding_site.find('bs_residues')
            residue_list_tree = nest_residue.findall('bs_residue')
            for residue in residue_list_tree:
                res_id = residue.text
                dict_res_temp = residue.attrib
                # keep track of every residue seen in any binding site
                if res_id not in residue_list:
                    residue_list.append(res_id)
                # residues flagged as contacts form an interaction
                if dict_res_temp['contact'] == 'True':
                    if res_id not in plif_list_all:
                        plif_list_all.append(res_id)
    return plif_list_all, residue_list
###############################################################################
# start from an empty binding site; it grows as each report is read
residue_list = []
plif_list_m006, residue_list = generate_plif_lists('m006/report.xml', residue_list, 'LIG')
plif_list_m009, residue_list = generate_plif_lists('m009/report.xml', residue_list, 'LIG')
plif_list_m004, residue_list = generate_plif_lists('m004/report.xml', residue_list, 'LIG')

3. Read Into RDKit: Now that we have the list of binding site residues and know which residues interact with each ligand, a PLIF can be generated. This is done using the function shown below (“generate_rdkit_plif”):


from rdkit import DataStructs
################################################################################
def generate_rdkit_plif(residue_list, plif_list_all):
    # generates an RDKit PLIF given the list of residues in the binding site
    # and the list of interacting residues
    plif_rdkit = DataStructs.ExplicitBitVect(len(residue_list), False)
    for index, res in enumerate(residue_list):
        if res in plif_list_all:
            # switch on the bit for each residue that contacts the ligand
            plif_rdkit.SetBit(index)
    return plif_rdkit
#########################################################################
plif_m006 = generate_rdkit_plif(residue_list, plif_list_m006)
plif_m009 = generate_rdkit_plif(residue_list, plif_list_m009)
plif_m004 = generate_rdkit_plif(residue_list, plif_list_m004)

4. Play! These PLIFs can now be compared using RDKit functionality. For example the Tanimoto similarity between the ligands can be computed:


def similarity_plifs(plif_1, plif_2):
    # Tanimoto similarity between two RDKit bit vectors
    sim = DataStructs.TanimotoSimilarity(plif_1, plif_2)
    return sim
###################################################################
print(similarity_plifs(plif_m006, plif_m009))
print(similarity_plifs(plif_m006, plif_m004))
print(similarity_plifs(plif_m009, plif_m004))

The output is: 0.2, 0.5, 0.0.

All files used to generate the PLIFs can be found here. Happy PLIF-making!

End of an era?

The Era of Crystallography ends…

For over 100 years, crystallography has been used to determine the atom arrangements of molecules; specifically, it has become the workhorse of routine macromolecular structure solution, being responsible for over 90% of the atomic structures in the PDB. Whilst this achievement is impressive, in some ways it has come about despite the crystallographic method, rather than because of it…

The problem, generally, is this: to perform crystallography, you need crystals. Crystals require the spontaneous assembly of billions of molecules into a regular repeated arrangement. For proteins (large, complex, irregularly shaped molecules) this is not generally a natural state to exist in, and getting a protein to crystallise can be a difficult process (the notable exception is lysozyme, which is difficult NOT to crystallise, and there are consequently ~1700 crystal structures of it currently in the PDB). Determining the conditions under which proteins will crystallise requires extensive screening: placing the protein into a variety of different solutions, in the hope that in one of these, the protein will spontaneously self-assemble into (robust, homogeneous) crystals. As for membrane proteins, which… exist in membranes, crystallisation solutions are sort of ridiculous (clever, but ridiculous).

But even once a crystal is obtained (and assuming it is a “good” well-diffracting crystal), diffraction experiments alone are generally not enough to determine the atomic structure of the crystal. In a crystallographic experiment, only half of the data required to solve the structure of the crystal is measured — the amplitudes. The other half of the data — the phases — are not measured. This constitutes the “phase problem” of crystallography, and “causes some problems”: developing methods to solve the phase problem is essentially a field of its own.
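A toy calculation shows why losing the phases matters. The sketch below treats a one-dimensional "electron density" as a stand-in for a crystal, takes its Fourier transform, and tries to rebuild the density with and without the phases. It is only an analogy built on made-up numbers, not a description of real crystallographic software.

```python
import numpy as np

# A made-up 1D "electron density" with three atoms of different weights
density = np.zeros(64)
density[[5, 20, 41]] = [1.0, 2.0, 1.5]

# The Fourier transform plays the role of the structure factors
structure_factors = np.fft.fft(density)
amplitudes = np.abs(structure_factors)   # what a diffraction experiment measures
phases = np.angle(structure_factors)     # what it does not

# With both halves of the data, the density is recovered exactly
with_phases = np.fft.ifft(amplitudes * np.exp(1j * phases)).real

# With amplitudes alone (all phases set to zero), it is not
without_phases = np.fft.ifft(amplitudes).real

print(np.allclose(with_phases, density))
print(np.allclose(without_phases, density))
```

The amplitude-only reconstruction is a symmetric function peaked at the origin rather than three peaks at the atomic positions, which is why so much effort goes into recovering the phases by other means.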

…and the Era of Cryo-Electron Microscopy begins

Cryo-electron microscopy (cryo-EM; primers here and here) circumvents both of the problems with crystallography described above (although of course it has some of its own). Single particles of the protein (or protein complex) are deposited onto grids and immobilised, removing the need for crystals altogether. Furthermore, the phases can be measured directly, removing the need to overcome the phase problem.

Cryo-EM is also really good for determining the structures of large complexes, which are normally out of the reach of crystallography, and although cryo-EM structures used to only be determined at low resolution, this is changing quickly with improved experimental hardware.

Cryo-Electron Microscopy is getting better and better every day. For structural biologists, it seems like it’s going to be difficult to avoid it. However, for crystallographers, don’t worry, there is hope.

Start2Fold: A database of protein folding and stability data

Hydrogen/deuterium exchange (HDX) experiments are used to probe the tertiary structures and folding pathways of proteins. The rate of proton exchange between a given residue’s backbone amide proton and the surrounding solvent depends on the solvent exposure of the residue. By refolding a protein under exchange conditions, these experiments can identify which regions quickly become solvent-inaccessible, and which regions undergo exchange for longer, providing information about the refolding pathway.
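As a rough illustration of why exchange rates carry structural information, the sketch below assumes simple first-order kinetics, in which the fraction of amide protons that have not yet exchanged decays exponentially with time. Both the model and the rate constants are assumptions chosen for illustration and are not taken from any particular experiment.

```python
import math

def fraction_unexchanged(k_ex, t):
    """Fraction of amide protons not yet exchanged after time t (first-order model)."""
    return math.exp(-k_ex * t)

# Hypothetical exchange rate constants (per second), for illustration only
rates = {
    "exposed loop residue": 1.0,
    "buried core residue": 0.001,
}

for label, k_ex in rates.items():
    remaining = fraction_unexchanged(k_ex, t=10.0)
    print(f"{label}: {remaining:.3f} of protons still unexchanged after 10 s")
```

Under this model a residue that becomes protected early in refolding keeps its label far longer than one that stays exposed, which is the kind of difference these experiments are designed to read out.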

Although there are many examples of individual HDX experiments in the literature, the heterogeneous nature of the data has deterred comprehensive analyses. Start2Fold (Start2Fold.eu) [1] is a curated database that aims to present protein folding and stability data derived from solvent-exchange experiments in a comparable and accessible form. For each protein entry, residues are classified as early/intermediate/late based on folding data, or strong/medium/weak based on stability data. Each entry includes the PDB code, length, and sequence of the protein, as well as details of the experimental method. The database currently includes 57 entries, most of which have both folding and stability data. Hopefully, this database will grow as scientists add their own experimental data, and reveal useful information about how proteins refold.

The folding data available in Start2Fold is visualised in the figure below, with early, intermediate and late folding residues coloured light, medium and dark blue, respectively.

[1] Pancsa, R., Varadi, M., Tompa, P., Vranken, W.F., 2016. Start2Fold: a database of hydrogen/deuterium exchange data on protein folding and stability. Nucleic Acids Res. 44, D429-34.

A global genetic interaction network maps a wiring diagram of cellular function

In our last group meeting, I talked about a recent paper which presents a vast amount of genetic interaction data as well as some spatial analysis of the created data. Costanzo et al. used temperature-sensitive mutant alleles to measure the interactions of ~6000 genes in the yeast Saccharomyces cerevisiae [1]. A typical way to analyse such data would be the use of community detection to find groups of genes with similar interaction patterns; see for example [2] for a review. Instead, the authors of this paper created a two-dimensional embedding of the network with a spring layout, which places nodes close to each other if they show similar interaction patterns.
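To give a feel for what "similar interaction patterns" can mean in practice, the sketch below compares genes by the correlation of their interaction profiles. The gene names and scores are invented, and the use of the Pearson correlation as the similarity measure is an assumption made here for illustration.

```python
import numpy as np

# Hypothetical genetic interaction scores: one row per gene,
# one column per interaction partner
genes = ["geneA", "geneB", "geneC"]
scores = np.array([
    [-0.5,  0.0,  0.3, -0.8],
    [-0.4,  0.1,  0.2, -0.7],
    [ 0.6, -0.2, -0.3,  0.1],
])

# Pearson correlation between every pair of interaction profiles
similarity = np.corrcoef(scores)

for i in range(len(genes)):
    for j in range(i + 1, len(genes)):
        print(genes[i], genes[j], round(float(similarity[i, j]), 2))
```

A layout algorithm can then treat highly correlated pairs as connected by short springs, so that genes with similar profiles, such as geneA and geneB here, end up close together in the embedding.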

The network layout is then compared with Gene Ontology by applying a spatial analysis of functional enrichment (SAFE) [3]. The enriched clusters are associated, for example, with cell polarity, protein degradation, and ribosomal RNA. By filtering the network for different similarities, they find a hierarchical organisation of genetic function, with small dense modules of pathways or complexes at the bottom and sparse clusters representing different cell compartments at the top.

In this extensive paper, they then go into further detail to quantify gene pleiotropy, predict gene function, and examine how the interaction structure differs between essential and non-essential genes. They also provide the data online at http://thecellmap.org/costanzo2016/ .

(Left) A network of genetic interaction was embedded into a two-dimensional space using a spring-layout (Right) This embedding was compared with Gene Ontology terms to find regions of spatial enrichment.
Image from [1]

References:

[1] Costanzo, Michael, et al. “A global genetic interaction network maps a wiring diagram of cellular function.” Science 353.6306 (2016): aaf1420.

[2] Fortunato, Santo. “Community detection in graphs.” Physics Reports 486.3 (2010): 75-174.

[3] Baryshnikova, Anastasia. “Systematic Functional Annotation and Visualization of Biological Networks.” Cell Systems (2016).

Viewing 3D molecules interactively in Jupyter iPython notebooks

Greg Landrum, curator of the invaluable open source cheminformatics API, RDKit, recently blogged about viewing molecules in a 3D window within a Jupyter-hosted iPython notebook (as long as your browser supports WebGL, that is).

The trick is to use py3Dmol. It’s easy to install:

pip install py3Dmol

This is built on the object-oriented, WebGL-based JavaScript library for online molecular visualization, 3Dmol.js (Rego & Koes, 2015); here's a nice summary of the capabilities of 3Dmol.js. Its features include:

support for pdb, sdf, mol2, xyz, and cube formats
parallelized molecular surface computation
sphere, stick, line, cross, cartoon, and surface styles
atom property based selection and styling
labels
clickable interactivity with molecular data
geometric shapes including spheres and arrows

I tried a simple example and it worked beautifully:

import py3Dmol
view = py3Dmol.view(query='pdb:1hvr')
view.setStyle({'cartoon':{'color':'spectrum'}})
view

The 3Dmol.js website summarizes how to view molecules, along with how to choose representations, how to embed it, and even how to develop with it.

References

Nicholas Rego & David Koes (2015). “3Dmol.js: molecular visualization with WebGL”.
Bioinformatics, 31 (8): 1322-1324. doi:10.1093/bioinformatics/btu829

SMEs in Research

Scientific research relies on collaboration, between academics across the world, between big and small groups of people, and between companies and universities. Many people’s first thoughts of companies involved in academia will be large multinational corporations, such as pharmaceutical or aerospace companies. However, mutually beneficial relationships also exist between smaller companies and academia. Scientific spin-out companies, such as Oxford Nanopore Technologies, Theta Technologies or Reinnervate, are one way in which research is developed from an idea held by researchers at a university into a commercial product or service. However, there are many other ways and reasons for scientists to collaborate with small companies.

Small and medium sized enterprises (SMEs) are independent businesses with up to 249 people. They represent over 99% of all UK and EU businesses, and 45-50% of turnover and employment. But why would researchers be interested in involvement with SMEs? The reasons include access to unique intellectual property and innovation in the SMEs, as well as access to funding targeted at fostering research-led projects in SMEs. Innovate UK supports growth by enabling and funding innovative opportunities. SMEs can access this support through Knowledge Transfer Partnerships (KTPs), a three-way partnership between a graduate, an academic institution and a business. These KTPs can last between 12 months and 3 years, and provide financial support for academics to monitor the graduate’s work. KTP projects lead to an average of 2 papers per project, and are attractive to SMEs as they only fund 33% of the project cost.

The impact of research outside academia is becoming increasingly important to identify, as impact case studies form a part of the Research Excellence Framework (REF), making many sources of funding contingent on providing a strong case for the wider impact of the research. Work with innovation-focused SMEs can provide a focus for research impact in engagement and economic terms. For example, one REF case study highlights the impact of spintronics research on the development of non-contact sensors, through a spin-out company, Salunda.

Oxford Protein Informatics Group

or "OPIG" to friends