Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

Network Representations of Allostery

Allostery is the process by which action at one site, such as the binding of an effector molecule, causes a functional effect at a distant site. Allosteric mechanisms are important for the regulation of cellular processes, altering the activity of a protein, or the whole biosynthetic pathway. Triggers for allosteric action include binding of small molecules, protein-protein interaction, phosphorylation events and modification of disulphide bonds. These triggers can lead to changes in accessibility of the active site, through large or small motions, such as hinge motion between two domains, or the motion of a single side chain.

Figure 1 from [1]: Rearrangement of a residue–residue interaction in phosphofructokinase. Left panel: interaction between E241 and H160 of chain A in the inactive state; right: this interaction in the active state. Red circles mark six atoms unique to the residue–residue interface in the I state, green circles mark four atoms unique to the A state, and yellow circles mark three atoms present in both states. In these two residues, there are a total of 19 atoms, so the rearrangement factor R(i,j) = max(6, 4)/19 = 0.32

One way to consider allostery is as signal propagation from one site to another, as a change in residue to residue contacts. Networks provide a way to represent these changes. Daily et al [1] introduce the idea of contact rearrangement networks, constructed from a local comparison of the protein structure with and without molecules bound to the allosteric site. These are referred to as the active and inactive structures respectively. To measure the whether a residue to residue contact is changed between the active and inactive states, the authors use a rearrangement factor (R(i,j)). This is the ratio of atoms which are within a threshold distance (5 angstroms) in only one of the active or inactive states (whichever is greater), to the total number of atoms in the two residues.The rearrangement factor is distributed such that the large majority of residues have low rearrangement factors (as they do not change between the active and inactive state). To consider when a rearrangement is significant the authors use a benchmark set of non allosteric proteins to set a threshold for the rearrangement factor. The residues above this threshold form the contact rearrangement network, which can be analysed to assess whether the allosteric and functional sites are linked by residue to residue contacts. In the paper 5/15 proteins analysed are found to have linked functional and allosteric sites.

Adaption of Figure 2 from [1]. Contact rearrangement network for phosphofructokinase. Circles in each graph represent protein residues, and red and green squares represent substrate and effector molecules, respectively. Lines connect pairs of residues with R(i,j) ≥ 0.3 and residues in the graph with any ligands which are adjacent (within 5.0 Å) in either structure. All connected components which include at least one substrate or effector molecule are shown.

Collective rigid body domain motion was not initially analysed by these contact rearrangement networks, however a later paper [2], discusses how considering these motions alongside the contact rearrangement networks can lead to a detection of allosteric activity in a greater number of proteins analysed. These contact rearrangement networks provide a way to assess the residues that are likely to be involved in allosteric signal propagation. However this requires a classification of allosteric and non-allosteric proteins, to undertake the thresholding for significance of the change in contacts, as well as multiple structures that have and do not have a allosteric effector molecule bound.

Figure 1 from [3]. (a) X-ray electron density map contoured at 1σ (blue mesh) and 0.3σ (cyan mesh) of cyclophilin A (CYPA) fit with discrete alternative conformations using qFit. Alternative conformations are colored red, orange or yellow, with hydrogen atoms added in green. (b) Visualizing a pathway in CYPA: atoms involved in clashes are shown in spheres scaled to van der Waals radii, and clashes between atoms are highlighted by dotted lines. This pathway originates with the OG atom of Ser99 conformation A (99A) and the CE1 atom of Phe113 conformation B (113B), which clash to 0.8 of their summed van der Waals radii. The pathway progresses from Phe113 to Gln63, and after the movement of Met61 to conformation B introduces no new clashes, the pathway is terminated. A 90° rotation of the final panel is shown to highlight how the final move of Met61 relieves the clash with Gln63. (c) Networks identified by CONTACT are displayed as nodes connected by edges representing contacts that clash and are relieved by alternative conformations. The node number represents the sequence number of the residue. Line thickness between a pair of nodes represents the number of pathways that the corresponding residues are part of. The pathway in b forms part of the red contact network in CYPA. (d) The six contact networks comprising 29% of residues are mapped on the three-dimensional structure of CYPA.

Alternatively, Van den Bedem et al [3] define contact networks of conformationally coupled residues, in which movement of an alternative conformation of a residue likely influences the conformations of all other residues in the contact network. They utilise qFit, a tool for exploring conformational heterogeneity in a single electron density map of a protein, by fitting alternate conformations to the electron density. For each conformation of a residue, it assesses whether it is possible to reduce steric clashes with another residue, by changing conformations. If a switch in conformations reduces steric clashes, then a pathway is extend to the neighbours of the residue that is moved. This continued until no new clashes are introduced. Pathways that share common members are considered as conformationally coupled, and grouped into a single contact network. As this technique is suitable for a single structure, it is possible to estimate residues which may be involved in allosteric signalling without prior knowledge of the allosteric binding region.

These techniques show two different ways to locate and annotate local conformational changes in a protein, and determine how they may be linked to one another. Considering whether these, and similar techniques highlight the same allosteric networks within proteins will be important in the integration of many data types and sources to inform the detection of allostery. Furthermore, the ability to compare networks, for example finding common motifs, will be important as the development of techniques such as fragment based drug discovery present crystal structures with many differently bound fragments.

[1] Daily, M. D., Upadhyaya, T. J., & Gray, J. J. (2008). Contact rearrangements form coupled networks from local motions in allosteric proteins. Proteins: Structure, Function and Genetics. http://doi.org/10.1002/prot.21800

[2] Daily, M. D., & Gray, J. J. (2009). Allosteric communication occurs via networks of tertiary and quaternary motions in proteins. PLoS Computational Biology. http://doi.org/10.1371/journal.pcbi.1000293

[3] van den Bedem, H., Bhabha, G., Yang, K., Wright, P. E., & Fraser, J. S. (2013). Automated identification of functional dynamic contact networks from X-ray crystallography. Nature Methods, 10(9), 896–902. http://doi.org/10.1038/nmeth.2592

Addressing the Role of Conformational Diversity in Protein Structure Prediction

For my journal club last week, I chose to look at a recent paper entitled “Addressing the Role of Conformational Diversity in Protein Structure Prediction”, by Palopoli et al [1]. In the study of proteins, structures are incredibly useful tools, offering information about how they carry out their function, and allowing informed decisions to be made in many areas (e.g. drug design). Since the experimental determination is difficult, however, the computational prediction of protein structures has become very important (and a number of us here at OPIG work on this!).

A problem, however, in both experimental structure determination and computational structure prediction, is that proteins are generally treated as static – the output of an X-ray crystallography experiment is a single structure, and in the majority of cases the goal of structure prediction is to produce one model that closely resembles the native structure. The accuracy of structure prediction algorithms is also normally measured by comparing the resulting model to a single, known experimentally-determined structure. The issue here is that proteins are not static – they are constantly moving and may adopt a number of different conformations; the structure observed experimentally is just a snapshot of that motion. The dynamics of a protein may even play an important role in its function; an example is haemoglobin, which after binding to oxygen changes conformation to increase affinity for further binding. It may be more appropriate, then, to represent a protein as an ensemble of structures, and not just one.

Conformational diversity helps the protein haemoglobin carry out its function (the transportation of oxygen in the blood). Haemoglobin has four subunits, each containing a haem group, shown in red. When oxygen binds to this group (blue), a histidine residue moves, shifting the position of an alpha helix. This movement is propagated throughout the entire structure, and increases the affinity for oxygen of the other subunits – binding therefore becomes increasingly easy (this is known as co-operative binding). Gif shown is from the PDB-101 Molecule of the Month series: S. Dutta and D. Goodsell, doi:10.2210/rcsb_pdb/mom_2003_5

How, though, could this be incorporated into protein structure prediction? This is the question being considered by the authors of this paper. They consider conformational diversity by looking at different conformers of the same protein – there are many proteins whose structures have been solved experimentally multiple times, and as such have a number of structures available in the PDB. Information about this is stored in a useful database called CoDNaS [2], which was developed by some of the authors of the paper under discussion. In some cases, there are model (or decoy) structures available for these proteins, generated by various structure prediction algorithms – for example, all models submitted for the CASP experiments [3], where the current accuracy of structure prediction is monitored through blind prediction, are freely available for download. The authors curated a collection of decoy sets for 91 different proteins for which multiple experimental structures are present in the PDB.

As mentioned previously, the accuracy of a model is normally evaluated by measuring its structural similarity to one known (or reference) structure – only one conformer of the protein is considered. The authors show that the model rankings achieved by this are highly dependent on the chosen reference structure. If the possible choices (i.e. the observed conformers) are quite similar the effect is small, but if there is a large difference, then two completely different decoys could be designated as the most accurate depending on which reference structure is used.

The key figure from this paper, in my opinion, is the one shown below. For the two most dissimilar experimentally-observed conformers for each protein in the set, the RMSD of the best decoy in relation to one conformer is plotted against the RMSD of the best decoy when measured against the other:

The straight line on this graph indicates what would be observed if there are decoys in the set that equally represent the two conformers; for example, if the best decoy with reference to conformer 1 has an RMSD of 1 Å, then there is also a decoy that is 1 Å away from conformer 2. Most points are on or near this line – this means that the sets of decoy structures are not biased towards one of the conformers. Therefore, structure prediction algorithms seem to be able to generate models for multiple conformations of proteins, and so the production of an ensemble of models is not an impossible dream. Several obstacles remain, however – although of equal distance to both conformers, the decoys could still be of poor quality; and decoy selection is often inaccurate, and so finding these multiple conformations amongst all others is a challenge.

[1] – Palopoli, N., Monzon, A. M., Parisi, G., and Fornasari, M. S. (2016). Addressing the Role of Conformational Diversity in Protein Structure Prediction. PLoS One, 11, e0154923.

[2] – Monzon, A. M., Juritz, E., Fornasari, S., and Parisi, G. (2013). CoDNaS: a database of conformational diversity in the native state of proteins. Bioinformatics, 29, 2512–2514.

[3] – Moult, J., Pedersen, J. T., Judson, R., and Fidelis, K. (1995). A Large-Scale Experiment to Assess Protein Structure Prediction Methods. Proteins, 23, ii–iv.

The Emerging Disorder-Function Paradigm

It’s rare to find a paper that connects all of the diverse areas of research of OPIG, but “The rules of disorder or why disorder rules” by Gsponer and Babu (2009) is one such paper. Protein folding, protein-protein interaction networks, protein loops (Schlessinger et al., 2007), and drug discovery all play a part in this story. What’s great about this paper is that it gives numerous examples of proteins and the evidence supporting that they are partially or completely unstructured. These are the so-called intrinsically unstructured proteins or IUPs, although more recently they are also being referred to as intrinsically disordered proteins, or IDPs. Intrinsically disordered regions (IDRs) “are polypeptide segments that do not contain sufficient hydrophobic amino acids to mediate co-operative folding” (Babu, 2016).

Such proteins contradict the classic “lock and key” hypothesis of Fischer, and challenge Continue reading →

How to Calculate PLIFs Using RDKit and PLIP

Protein-Ligand interaction fingerprints (PLIFs) are becoming more widely used to compare small molecules in the context of a protein target. A fingerprint is a bit vector that is used to represent a small molecule. Fingerprints of molecules can then be compared to determine the similarity between two molecules. Rather than using the features of the ligand to build the fingerprint, a PLIF is based on the interactions between the protein and the small molecule. The conventional method of building a PLIF is that each bit of the bit vector represents a residue in the binding pocket of the protein. The bit is set to 1 if the molecule forms an interaction with the residue, whereas it is set to 0 if it does not.

Constructing a PLIF therefore consists of two parts:

Calculating the interactions formed by a small molecule from the target
Collating this information into a bit vector.

Step 1 can be achieved by using the Protein-Ligand Interaction Profiler (PLIP). PLIP is an easy-to-use tool, that given a pdb file will calculate the interactions between the ligand and protein. This can be done using the online web-tool or alternatively using the command-line tool. Six different interaction types are calculated: hydrophobic, hydrogen-bonds, water-mediated hydrogen bonds, salt bridges, pi-pi and pi-cation. The command-line version outputs an xml report file containing all the information required to construct a PLIF.

Step 2 involves manipulating the output of the report file into a bit vector. RDKit is an amazingly useful Cheminformatics toolkit with great documentation. By reading the PLIF into an RDKit bit vector this allows the vector to be manipulated as an RDKit fingerprint. The fingerprints can then be compared using RDKit functionality very easily, for example, using Tanimoto Similarity.

EXAMPLE:

Let’s take 3 pdb files as an example. Fragment screening data from the SGC is a great sort of data for this analysis, as it contains lots of pdb structures of small hits bound to the same target. The data can be found here. For this example I will use 3 protein-ligand complexes from the BRD1 dataset: BRD1A-m004.pdb, BRD1A-m006.pdb and BRD1A-m009.pdb.

1.PLIP First we need to run plip to generate a report file for each protein-ligand complex. This is done using:


 
plipcmd -f BRD1A-m004.pdb -o m004 -x
plipcmd -f BRD1A-m006.pdb -o m006 -x
plipcmd -f BRD1A-m009.pdb -o m009 -x

A report file (‘report.xml’) is created for each pdb file within the directory m004, m006 and m009.

2. Get Interactions: Using a python script the results of the report can be collated using the function “generate_plif_lists” (shown below) on each report file. The function takes in the report file name, and the residues already found to be in the binding site (residue_list). “residue_list” must be updated for each molecule to be compared as the residues used to define the binding site can vary betwen each report file. The function then returns the updated “residue_list”, as well as a list of residues found to interact with the ligand: “plif_list_all”.



import xml.etree.ElementTree as ET
################################################################################
def generate_plif_lists(report_file, residue_list, lig_ident):
    #uses report.xml from PLIP to return list of interacting residues and update list of residues in binding site
        plif_list_all = []
        tree = ET.parse(report_file)
        root = tree.getroot()
        #list of residue keys that form an interaction
        for binding_site in root.findall('bindingsite'):
                nest = binding_site.find('identifiers')
                lig_code = nest.find('hetid')
                if str(lig_code.text) == str(lig_ident):
                        #get the plifs stuff here
                        nest_residue = binding_site.find('bs_residues')
                        residue_list_tree = nest_residue.findall('bs_residue')
                        for residue in residue_list_tree:
                                res_id = residue.text
                                dict_res_temp = residue.attrib
                                if res_id not in residue_list:
                                        residue_list.append(res_id)
                                if dict_res_temp['contact'] == 'True':
                                        if res_id not in plif_list_all:
                                                plif_list_all.append(res_id)
        return plif_list_all, residue_list
###############################################################################
plif_list_m006, residue_list = generate_plif_lists('m006/report.xml',residue_list, 'LIG')
plif_list_m009, residue_list = generate_plif_lists('m009/report.xml', residue_list, 'LIG')
plif_list_m004, residue_list = generate_plif_lists('m004/report.xml', residue_list, 'LIG')

3. Read Into RDKit: Now we have the list of binding site residues and which residues are interacting with the ligand a PLIF can be generated. This is done using the function shown below (“generate_rdkit_plif”):


from rdkit import Chem,  DataStructs
from rdkit.DataStructs import cDataStructs
################################################################################
def generate_rdkit_plif(residue_list, plif_list_all):
    #generates RDKit plif given list of residues in binding site and list of interacting residues
    plif_rdkit = DataStructs.ExplicitBitVect(len(residue_list), False)
    for index, res in enumerate(residue_list):
        if res in plif_list_all:
            print 'here'
            plif_rdkit.SetBit(index)
        else:
            continue
    return plif_rdkit
#########################################################################
plif_m006 = generate_rdkit_plif(residue_list, plif_list_m006)
plif_m009 = generate_rdkit_plif(residue_list, plif_list_m009)
plif_m004 = generate_rdkit_plif(residue_list, plif_list_m004)

4. Play! These PLIFs can now be compared using RDKit functionality. For example the Tanimoto similarity between the ligands can be computed:


def similarity_plifs(plif_1, plif_2):
    sim = DataStructs.TanimotoSimilarity(plif_1, plif_2)
    print sim
    return sim 
###################################################################
print similarity_plifs(plif_m006, plif_m009)
print similarity_plifs(plif_m006, plif_m004)
print similarity_plifs(plif_m009, plif_m004)

The output is: 0.2, 0.5, 0.0.

All files used to generate the PLIFs cound be found here. Happy PLIF-making!

A global genetic interaction network maps a wiring diagram of cellular function

In our last group meeting, I talked about a recent paper which presents a vast amount of genetic interaction data as well as some spatial analysis of the created data. Constanzo et al. used temperature-sensitive mutant alleles to measure the interaction of ~6000 genes in the yeast Saccharomyces cerevisiae [1]. A typical way to analyse such data would be the use of community detection to find groups of genes with similar interaction pattern, see for example [2] for a review. Instead, the authors of this paper created a two-dimensional embedding of the network with a spring-layout, which places nodes close to each other if they show similar interaction pattern.

The network layout is then compared with Gene Ontology by applying a spatial analysis of functional enrichment (SAFE) [3]. Clusters enriched are associated for example with cell polarity, protein degradation, and ribosomal RNA. By filtering the network for different similarities they find a hierarchical organisation of genetic function with small dense modules of pathways or complexes at the bottom and sparse clusters representing different cell compartments at the top.

In this extensive paper, they then go further into detail to quantify gene pleiotropy, predict gene function, and how the interaction structure differs between essential and non-essential genes. They also provide that data online under http://thecellmap.org/costanzo2016/ .

(Left) A network of genetic interaction was embedded into a two-dimensional space using a spring-layout (Right) This embedding was compared with Gene Ontology terms to find regions of spatial enrichment.
Image from [1]

References:

[1] Costanzo, Michael, et al. “A global genetic interaction network maps a wiring diagram of cellular function.” Science 353.6306 (2016): aaf1420.

[2] Fortunato, Santo. “Community detection in graphs.” Physics Reports 486.3 (2010): 75-174.

[3] Baryshnikova, Anastasia. “Systematic Functional Annotation and Visualization of Biological Networks.” Cell Systems (2016).

SMEs in Research

Scientific research relies on collaboration, between academics across the world, between big and small groups of people, and between companies and universities. Many people’s first thoughts of companies involved in academia will be large multinational corporations, such as pharmaceutical or aerospace companies. However mutually beneficial relationships exist between smaller companies and academia. Scientific spin out companies, such as Oxford Nanopore Technologies, Theta Technologies or Reinnervate, are one way research is expanded from an idea held by researchers at a university and expanded into a commercial product or service. However, there are many other ways and reasons for scientists collaborating with small companies.

Small and medium sized enterprises (SMEs) are independent businesses with up to 249 people, they represent over 99% of all UK and EU businesses, and 45-50% of turnover and employment. But why would researchers be interested in involvement with SMEs? Access to unique intellectual property and innovation in the SMEs, as well as access to funding targeted at fostering research led projects in SMEs. Innovate UK, support growth by enabling and funding innovative opportunities. SMEs can access this support through Knowledge Transfer Partnerships (KTP), a three-way partnership between a graduate, academic institution and a business. These KTPs can last between 12 months and 3 years, and provide financial support for academics to monitor the graduate’s work. KTP projects lead to an average of 2 papers per project, and are attractive to SMEs as they only fund 33% of the project cost.

Impact of research outside academia is becoming increasingly important to identify as impact case studies form a part of the Research Excellence Framework (REF), making many sources of funding contingent on providing a strong case for the wider impact of the research. Work with innovation focused SMEs can provide a focus for research impact in engagement and economic terms. For example, a REF case study highlighting the impact of spintronics research in the development of non-contact sensors, through a spin out company, Salunda.

Is contacts-based protein-protein affinity prediction the way forward?

The binding affinity of protein interactions is useful information for a range of protein engineering and protein-protein interaction (PPI) network challenges. Obvious applications include the development of therapeutic antibodies to given drug targets or the engineering of novel interfaces for synthetic protein complexes. An accurate model would furthermore allow us to predict a large proportion of affinities in existing PPI networks, and enable the identification of new PPIs, which is critical for our ability to model protein network dynamics effectively.

“The design of an ideal scoring function for protein−protein docking that would also predict the binding affinity of a complex is one of the challenges in structural proteomics.” Adapted from Kastritis, Panagiotis L., and Alexandre MJJ Bonvin. Journal of proteome research 9.5 (2010): 2216-2225.

In last week’s paper a new binding-affinity prediction method based on interfacial contact information was described. Contacts have long been used to in docking methods but surprisingly this was the first time that binding affinity was predicted with them. Largely, this was due to the lack of a suitable benchmark data set that contained structural as well as affinity data . In 2011, however, Kastritis et al. presented a curated database of 144 non-redundant protein–protein complexes with experimentally determined Kd (ΔG) as well as x-ray structures.
Using this data set they trained and validated their method, compared it against others and concluded that interfacial contacts `can be considered the best structural property to describe binding strength`. This claim may be true but as we discussed in the meeting there is still some work to do before we take this model an run with it. A number of flags were raised:

Classification of experimental methods into reliable and non-reliable is based on what gives the best results with their method. Given that different types of protein complexes are often measured with different methods, some protein classes for which contact-based predictions are less effective may be excluded.
Number of parameters for model 6 is problematic without exact AIC information. As Lyuba righlty pointed out, the intercept in model 6 `explodes`. It is no surprise that the correlation improves with more parameters. Despite their AIC analysis, overfitting is still a worry due to the lack of details presented in the paper.

Comparison against other methods is biased in their favour; their method was trained on the same data set, the others were not. In order to ensure a fair comparison all methods should be trained on the same data set. Of course this is hard to do in practice, but the fact remains that a comparison of methods that has been trained on different data sets will be flawed.

Paper: Vangone, A., Bonvin, A. M. J. J., Alberts, B., Aloy, P., Russell, R., Andrusier, N., … Zhou, Y. (2015). Contacts-based prediction of binding affinity in protein-protein complexes. eLife, 4, e07454. http://doi.org/10.7554/eLife.07454

Improving the precision of local community detection

Communities are everywhere. You can find groups of connected people on Facebook who went to university together, communities of people on youtube that like the same band, sets of products on amazon that tend to be purchased together, or modules of proteins that act in the same biological process in the human body. With the amount of data we can collect on these networks (facebook friendship, protein interactions, etc), you would think we can easily recover these groups. You would think wrong…

There are a lot of methods out there to find communities in networks. As a paper that compares community detection methods by Lancichinetti and Fortunato put it: “A thorough analysis of all existing methods would require years of work (during which many new methods would appear), and we do not believe it is feasible”. Thus, clearly we have not found the answer.

So what do people generally do to find these communities? Well… very different things. To explain a few of the approaches, it helps to break them down into broad categories such as global and local methods.

Global Methods

Global community detection methods generally aim to optimized an objective function over the entire network. For example, I could decide that I will score the network partition based on a function I will call “greatness”, G. I will define the greatness of a community by the sum of the number of nodes each node is connected with that are in the community (the sum of the node in-degree). Or simply:

Here, A_ij is the adjacency matrix that contains a 1 where nodes i and j interact, and 0 otherwise. The sigmas in the equation correspond to the community identifiers of nodes i and j respectively, and thus the delta function is only 1 when sigma_i and sigma_j are equal, meaning nodes i and j are in the same community, and 0 otherwise.

If you now come to the conclusion that this is a pretty stupid thing to define, then you’ve understood the concept of a global method (alternatively… get rid of that negativity!). A global community detection method would now look how to partition the network so that we can maximize G. In the case of greatness defined here, everything will be put into the same community, as there’s no penalty for putting two nodes together that shouldn’t belong together. Adding a further node to a community can never decrease the greatness score!

If you want a real overview over these and other methods, check out this 100-page review article.

Local Methods

In contrast to global methods that attempt to gather information on the whole network, local community detection methods dive into the network to try to optimize the community partitioning. This means on the one hand they can be very fast at assigning nodes to communities, as they take into account less information. However, on the other hand using less information in this assignment means they can be less precise.

An example of a local method is link clustering, which I described in an earlier post. Link clustering assesses the similarity of two edges that share a node, rather than comparing the nodes themselves. The similarity of these edges is calculated based on how similar the surroundings of the nodes on the endpoints are. Using this score, all edges that are more similar than a cutoff value, S, can be clustered together to define the edge-communities at resolution S. This partitioning translates to nodes by taking the nodes that the edges in the edge-communities connect.

Improving local community detection

Link clustering is a pretty nice and straightforward example of how you can successfully find communities. However, the authors admit it works better on dense than on sparse networks. This is an expected side-effect of taking into account less information to be faster as mentioned above. So you can improve it with a pretty simple idea: how much more information can I take into account without basically looking at the whole network?

Local community detection methods such as link clustering are generally extended by including more information. One approach is to include information other than network topology in the local clustering process (like this). This approach requires information to be present apart from the network. In my work, I like using orthogonal data sets to validate community detection, rather than integrate it into the process and be stuck without enough of an idea if it actually worked. So, instead I’m trying to include more of the local surroundings of the node to increase the sensitivity of the similarity measure between edges. At some point, this will blow up the complexity of the problem and turn it into a global method. So I’m now walking a line between precision and speed. A line that everyone always seems to be walking. On this line, you can’t really win… but maybe it is possible to optimize the trade-off to suit my purposes. And besides… longer algorithm run times give me more time for lunch breaks ;).

Advances in Conformer Generation: ETKDG and ETDG

Predicting the possible shapes a small molecule can adopt is essential to understanding its chemistry and the possible biological roles of candidate drugs. Thus, conformer generation, the process of converting a topological description of a molecule into a set of 3D positions of its constituent atoms, is an essential component of computational drug discovery.

A former member of OPIG, Dr Jean-Paul Ebejer, myself, and Prof. Charlotte Deane, compared the ability of freely-available conformer generation methods:

“(i) to identify which tools most accurately reproduce experimentally determined structures;
(ii) to examine the diversity of the generated conformational set; and
(iii) to benchmark the computational time expended.”

in Ebejer et al., 2012. JP assembled a set of 708 crystal structures of drug-like molecules from the OMEGA validation set and the Astex Diverse Set with which to test the various methods. We found that RDKit, combining its Distance Geometry (DG) algorithm with energy minimization using the MMFF94 force field proved to be a “valid free alternative to commercial, proprietary software”, and was able to generate “a diverse and representative set of conformers which also contains a close conformer to the known structure”.

Following on from our work at InhibOx, and building on the same benchmark set JP assembled, Greg Landrum and Sereina Riniker recently described (Riniker & Landrum, 2015) two new conformer generation methods, ETDG and ETKDG, that improve upon the classical distance geometry (DG) algorithm. They do this by combining DG with knowledge of preferred torsional angles derived from experimentally determined crystal structures (ETDG), and also by further adding constraints from chemical knowledge, such as ‘aromatic rings are be flat’, or ‘bonds connected to triple bonds are colinear’ (ETKDG). They compared DG, ETDG, ETKDG, and a knowledge-based method, CONFECT, and found:

“ETKDG was found to outperform standard DG and the knowledge-based conformer generator CONFECT in reproducing crystal conformations from both small-molecule crystals (CSD data set) and protein−ligand complexes (PDB data set). With ETKDG, 84% of a set of 1290 small-molecule crystal structures from the CSD could be reproduced within an RMSD of 1.0 Å and 38% within an RMSD of 0.5 Å. The experimental torsional-angle preferences or the K terms alone each performed better than standard DG but were not sufficient to obtain the full performance of ETKDG.

Comparison of ETKDG with the DG conformers optimized using either the Universal Force Field (UFF) or the Merck Molecular Force Field (MMFF) showed different results for the two data sets. While FF-optimized DG performed better on the CSD data set, the two approaches were comparable for the PDB data set.”

They also showed (Fig. 13) that their ETKDG method was faster than DG followed by energy minimization, but not quite as accurate in reproducing the crystal structure.

ETKDG takes 3 times as long as DG. The addition of the K terms, i.e., generating ETKDG embeddings instead of ETDG embeddings, increases runtime by only 10% over ETDG (results not shown). Despite the longer runtime per conformer, ETKDG requires on average one-quarter of the number of conformers to achieve performance similar to DG (Figure S12 and Table S4 in the Supporting Information). This results in a net performance improvement, at least when it comes to reproducing crystal conformers.

As measured by performance in reproducing experimental crystal structures, ETKDG is a viable alternative to plain DG followed by a UFF-optimization, so it is of interest how their runtimes compare. Figure 13 (right) plots the runtime for ETKDG versus the runtime for DG + UFF-optimization. The median ratio of the DG + UFF optimization and ETKDG runtimes is 1.97, i.e., DG + UFF optimization takes almost twice as long as ETKDG. Thus, although ETKDG is significantly slower than DG on a per-conformer basis, when higher-quality conformations are required it can provide structures that are the equivalent of those obtained using DG + UFF-optimization in about half the time.

ETKDG looks like a great addition to the RDKit toolbox for conformer generation (and it was great to see JP thanked in the Acknowledgments!).

References

Ebejer, J. P., G. M. Morris and C. M. Deane (2012). “Freely available conformer generation methods: how good are they?” J Chem Inf Model, 52(5): 1146-1158. 10.1021/ci2004658.

Riniker, S. and G. A. Landrum (2015). “Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation.” J Chem Inf Model, 55(12): 2562-2574. 10.1021/acs.jcim.5b00654.

Structure prediction through sequence physical properties

Today, protein structure is mainly predicted by aligning the unknown amino acid sequence against all sequences for which we already know the physical structure. Whilst sequences differing in length can be readily catered for by inserting or deleting (with or without affine gap penalties) the odd amino acid, there will frequently be cases where there are mutations. To compensate for this, the likes of the BLOSUM matrix is used to score the likelihood of one amino acid having mutated into another. For example, a hydrophobic residue is more likely to be swapped for something similar, than it is to be replaced with something strongly hydrophilic.

Whilst this is an entirely reasonable basis, there are many other physical property factors which which can be considered. In fact, the Amino Acid Index currently lists 544 physicochemical and biochemical properties of amino acids. A paper by Yi He et al. PNAS 2015;112:5029-5032 recently made use of a subset of these properties to predict structure. Their work shown below shows target T0797 (B) from CASP 11 compared with a purely physical structure predicted using their method (A) and the PSI BLAST candidate for the same sequence (C).

Even though their structure only had three residues in common with the target sequence, it is plainly more similar than the PSI BLAST attempt. The RMSD between structures A and B is also reported as being 0.73 Å, whilst PSI BLAST returns an RMSD of 2.09 Å.

Oxford Protein Informatics Group

or "OPIG" to friends