Conformational Variation of Protein Loops

Something many structural biologists (including us here in OPIG!) are guilty of is treating proteins as static, rigid structures. In reality, proteins are dynamic molecules, transitioning constantly from  conformation to conformation. Proteins are therefore more accurately represented by an ensemble of structures instead of just one.

In my research, I focus on loop structures, the regions of a protein that connect elements of secondary structure (α-helices and β-sheets). There are many examples in the PDB of proteins with identical sequences, but whose loops have different structures. In many cases, a protein’s function depends on the ability of its loops to adopt different conformations. For example, triosephosphate isomerase, which is an important enzyme in the glycolysis pathway, changes conformation upon ligand binding, shielding the active site from solvent and stabilising the intermediate compound so that catalysis can occur efficiently. Conformational variability helps triosephosphate isomerase to be what is known as a ‘perfect enzyme’; catalysis is limited only by the diffusion rate of the substrate.

Structure of the triosephosphate isomerase enzyme. When the substrate binds, a loop changes from an ‘open’ conformation (pink, PDB entry 1TPD) to a ‘closed’ one (green, 1TRD), which prevents solvent access to the active site and stabilises the intermediate compound of the reaction.

Structure of the triosephosphate isomerase enzyme. When the substrate binds, a loop changes from an ‘open’ conformation (pink, PDB entry 1TPD) to a ‘closed’ one (green, 1TRD), which prevents solvent access to the active site and stabilises the intermediate compound of the reaction.

An interesting example, especially for some of us here at OPIG, is the antibody SPE7. SPE7 is multispecific, meaning it is able to bind to multiple unrelated antigens. It achieves this through conformational diversity. Four binding site conformations have been found, two of which can be observed in its unbound state in equilibrium – one with a flat binding site, and another with a deep, narrow binding site [1].

An antibody that exists as two different structures in equilibrium - one with a shallow binding site (left, blue, PDB code 1OAQ) and one with a deep, narrow cleft (right, green, PDB 1OCW). Complementary determining regions are coloured in each case.

SPE7; an antibody that exists as two different structures in equilibrium – one with a shallow binding site (left, blue, PDB code 1OAQ) and one with a deep, narrow cleft (right, green, PDB 1OCW). Complementary determining regions are coloured in each case.

So when you’re dealing with crystal structures, beware! X-ray structures are averages – each atom position is an average of its position across all unit cells. In addition, thanks to factors such as crystal packing, the conformation that we see may not be representative of the protein in solution. The examples above demonstrate that the sequence/structure relationship is not as clear cut as we are often lead to believe. It is important to consider dynamics and conformational diversity, which may be required for protein function. Always bear in mind that the static appearance of an X-ray structure is not the reality!

[1] James, L C, Roversi, P and Tawfik, D S. Antibody Multispecificity Mediated by Conformational Diversity. Science (2003), 299, 1362-1367.

“Identifying Allosteric Hotspots with Dynamics”: Using molecular simulation to find critical sites for the functional motions in proteins

Allosteric (allo-(“other”) + steric (repulsion of atoms due to closeness or arrangement)) sites regulate protein function from a position other than the active site or binding site. Consider the latch on a pair of gardening scissors (Figure 1): depending on the position of the latch (allosteric site) the blades are prevented from cutting things at the other end (active site).

secateurs-garden-shears-012

Figure 1 Allostery explained: A safety latch in gardening scissors.

Due to the non-trivial positions of allosteric sites in proteins their identification has been challenging. Selected well characterised systems such as GPCRs have known allosteric sites that are being used as targets in drug development. However, large scale identification of allosteric sites across the Protein Data Base (PDB) has not yet been feasible, partly because of the lack of tools.

To tackle this problem the Gerstein Lab developed a computational protocol based on various molecular simulations and network methods to find allosteric hotspots in proteins across the PDB. They introduce two different pipelines; one for identifying allosteric residues on the surface (surface-critical) and one for buried residues (interior-critical).

For the search of exterior-critical residues they us MC simulations to repeatedly probe the surface of the protein with short Monte Carlo (MC) with a short peptide. Based on hard spheres and simple energy calculations this method seems to be an efficient way of detecting possible binding pockets. Once the binding pockets have been found, the collective motions of the structure are simulated using an elastic mass-and-spring network (an anisotropic network model [ANM]). Binding pockets that undergo significant deformation during these simulations are considered to be surface-critical.

For interior-critical residues they start by weighting residue-residue contacts on the basis of collective movement. Communities within the weighted network are then identified and the residues with the highest betweenness interactions between communities are chosen as interior-critical residues. Thus, interior-critical residues have the highest information flow between two densely inter-connected groups of residues.

The protocol as been implemented in STRESS (STRucturally identified ESSential residues) and is freely available at stress.molmovdb.org.

Publication: http://www.ncbi.nlm.nih.gov/pubmed/27066750

Comp Chem Kitchen

I recently started  “Comp Chem Kitchen” with Richard Cooper and Rob Paton in the Department of Chemistry here in Oxford. It’s a regular forum and seminar series  for molecular geeks and hackers, in the original, untarnished sense of the word: people using and developing computational methods to tackle problems in chemistry, biochemistry and drug discovery. Our hope is that we will share best practices, even code snippets and software tools, and avoid re-inventing wheels.

In addition to local researchers, we invite speakers from industry and non-profits from time to time, and occasionally organize software demos and tutorials.

We also provide refreshments including free beer. (We are grateful to Prof. Phil Biggin and the MRC Proximity to Discovery Fund for offering to support CCK.)

CCK-1

Our first meeting, CCK-1,  was held in the Abbot’s Kitchen on May 24, 2016, at 5 pm, and was a great success—standing room only, in fact! The Abbot’s Kitchen—originally a laboratory—is a beautiful stone building built in 1860 in the Victorian Gothic style, alongside the Natural History Museum, at a time when Chemistry was first recognized as a discipline.

Abbot's Kitchen, Oxford, Watercolor, Detailed, 512x683We heard a fascinating talk from Jerome Wicker from the Department of Chemistry who spoke about “Machine learning for classification of solid form data extracted from CSD and ZINC”, and described a method that could successfully discriminate (~80%) whether a small molecule would crystallize or not. The software tools he discussed included RDKit, CSD, and scikit-learn. There were also two lightning talks, each 5 minutes long, one from OPIG member Hannah Patel,  from the Department of Statistics, on “Novelty Score: Prioritising compounds that potentially form novel protein-ligand interactions and novel scaffolds using an interaction centric approach”, who briefly described her Django-based web interface to her RDKit-based tool to analyse structures of ligands bound to proteins and help guide future medicinal chemistry to find novel compounds. We also had a talk from Dr Michael Charlton from InhibOx spoke about “Antibacterial Drug Discovery and Machine Learning”.

CCK-2

Our next Comp Chem Kitchen, CCK-2, will be held next Tuesday (June 14th, 2016), and you can register free for CCK-2 here.

We will have talks from:

 

Hope to see you there!  (Did I say free beer?)

RDKit User Group Meeting 2016

If you’re interested in small molecules and cheminformatics, you might like to know that registration for the 2016 RDKit User Group Meeting is now open. The meeting will be held from the 26th-28th of October in Basel, Switzerland. From their announcement:

This year’s RDKit User Group Meeting will take place from 26-28 October at the Novartis Campus in Basel, Switzerland and is co-organized by people from Roche and Novartis.

Registration for the RDKit UGM is free:

https://www.eventbrite.com/e/5th-rdkit-user-group-meeting-tickets-22539677783

The previous years’ format seemed to work pretty well and the feedback was positive, so we will stick to the same format this year:

Days 1 and 2: Talks, lightning talks, roundtable(s), discussion, and talktorials. For those who haven’t attended before, talktorials are somewhere between a talk and a tutorial, they cover something interesting done with the RDKit and include the code used to do the work. During the presentation you’ll give an overview of what you did and also show the pieces of the code that are central to the work. The idea is to mix the science up with the tutorial aspects.

Day 3 will be a sprint: those who choose to stay will spend an intense day working in small groups to produce useful artifacts: new bits of code, KNIME nodes, KNIME workflows, tutorials, documentation, IPython notebooks, etc. We will once again try to structure this a bit by collecting a bunch of ideas for things to work on in advance.

There will also be, of course, social activities.

Looks like fun!

Advances in Conformer Generation: ETKDG and ETDG

Predicting the possible shapes a small molecule can adopt is essential to understanding its chemistry and the possible biological roles of candidate drugs. Thus, conformer generation, the process of converting a topological description of a molecule into a set of 3D positions of its constituent atoms, is an essential component of computational drug discovery.

A former member of OPIG, Dr Jean-Paul Ebejer, myself, and Prof. Charlotte Deane, compared the ability of freely-available conformer generation methods:

“(i) to identify which tools most accurately reproduce experimentally determined structures;
(ii) to examine the diversity of the generated conformational set; and
(iii) to benchmark the computational time expended.”

in Ebejer et al., 2012.  JP assembled a set of 708 crystal structures of drug-like molecules from the OMEGA validation set and the Astex Diverse Set with which to test the various methods. We found that RDKit, combining its Distance Geometry (DG) algorithm with energy minimization  using the MMFF94 force field proved to be a “valid free alternative to commercial, proprietary software”, and was able to generate “a diverse and representative set of conformers which also contains a close conformer to the known structure”.

Following on from our work at InhibOx, and building on the same benchmark set JP assembled, Greg Landrum and Sereina Riniker recently described (Riniker & Landrum, 2015) two new conformer generation methods, ETDG and ETKDG, that improve upon the classical distance geometry (DG) algorithm. They do this by combining DG with knowledge of preferred torsional angles derived from experimentally determined crystal structures (ETDG), and also by further adding constraints from chemical knowledge, such as ‘aromatic rings are be flat’, or ‘bonds connected to triple bonds are colinear’ (ETKDG). They compared DG, ETDG, ETKDG, and a knowledge-based method, CONFECT, and found:

“ETKDG was found to outperform standard DG and the knowledge-based conformer generator CONFECT in reproducing crystal conformations from both small-molecule crystals (CSD data set) and protein−ligand complexes (PDB data set). With ETKDG, 84% of a set of 1290 small-molecule crystal structures from the CSD could be reproduced within an RMSD of 1.0 Å and 38% within an RMSD of 0.5 Å. The experimental torsional-angle preferences or the K terms alone each performed better than standard DG but were not sufficient to obtain the full performance of ETKDG.

Comparison of ETKDG with the DG conformers optimized using either the Universal Force Field (UFF) or the Merck Molecular Force Field (MMFF) showed different results for the two data sets. While FF-optimized DG performed better on the CSD data set, the two approaches were comparable for the PDB data set.”

They also showed (Fig. 13) that their ETKDG method was faster than DG followed by energy minimization, but not quite as accurate in reproducing the crystal structure.

ETKDG takes 3 times as long as DG. The addition of the K terms, i.e., generating ETKDG embeddings instead of ETDG embeddings, increases runtime by only 10% over ETDG (results not shown). Despite the longer runtime per conformer, ETKDG requires on average one-quarter of the number of conformers to achieve performance similar to DG (Figure S12 and Table S4 in the Supporting Information). This results in a net performance improvement, at least when it comes to reproducing crystal conformers.

As measured by performance in reproducing experimental crystal structures, ETKDG is a viable alternative to plain DG followed by a UFF-optimization, so it is of interest how their runtimes compare. Figure 13 (right) plots the runtime for ETKDG versus the runtime for DG + UFF-optimization. The median ratio of the DG + UFF optimization and ETKDG runtimes is 1.97, i.e., DG + UFF optimization takes almost twice as long as ETKDG. Thus, although ETKDG is significantly slower than DG on a per-conformer basis, when higher-quality conformations are required it can provide structures that are the equivalent of those obtained using DG + UFF-optimization in about half the time.

ETKDG looks like a great addition to the RDKit toolbox for conformer generation (and it was great to see JP thanked in the Acknowledgments!).

 

References

Ebejer, J. P., G. M. Morris and C. M. Deane (2012). “Freely available conformer generation methods: how good are they?” J Chem Inf Model, 52(5): 1146-1158. 10.1021/ci2004658.

Riniker, S. and G. A. Landrum (2015). “Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation.” J Chem Inf Model, 55(12): 2562-2574. 10.1021/acs.jcim.5b00654.

Structure prediction through sequence physical properties

Today, protein structure is mainly predicted by aligning the unknown amino acid sequence against all sequences for which we already know the physical structure. Whilst sequences differing in length can be readily catered for by inserting or deleting (with or without affine gap penalties) the odd amino acid, there will frequently be cases where there are mutations. To compensate for this, the likes of the BLOSUM matrix is used to score the likelihood of one amino acid having mutated into another. For example, a hydrophobic residue is more likely to be swapped for something similar, than it is to be replaced with something strongly hydrophilic.

Whilst this is an entirely reasonable basis, there are many other physical property factors which which can be considered. In fact, the Amino Acid Index currently lists 544 physicochemical and biochemical properties of amino acids. A paper by Yi He et al. PNAS 2015;112:5029-5032 recently made use of a subset of these properties to predict structure. Their work shown below shows target T0797 (B) from CASP 11 compared with a purely physical structure predicted using their method (A) and the PSI BLAST candidate for the same sequence (C).

em structure phys properties

Even though their structure only had three residues in common with the target sequence, it is plainly more similar than the PSI BLAST attempt. The RMSD between structures A and B is also reported as being 0.73 Å, whilst PSI BLAST returns an RMSD of 2.09 Å.

Visualising Biological Data, Pt. 2

Here’s a little quick round-up of some of the tools/algorithms that I’ve seen in VIZBI, which I believe can be useful for many. For more details, I strongly advise you check out the posters page (vizbi.org/Posters/2016). There were a few that I would’ve liked to re-visit, but the webapps weren’t available (e.g. MeshCloud from the Human Genome Center, Tokyo), so maybe I’ll come back with a part 3. Here are my top five:

1. Autodesk’s Protein Viewer* (shout-outs to @_merrywang on Twitter)
As a structural bioinformatician, I’m going to be really biased here, and say that Autodesk’s Molecule viewer was the best tool that was showcased in the conference. It combines not only the capacity to visualise millions of molecules from the PDB (or your own PDB files), it also allows annotation and sharing, effectively, “snapshots” of your workspace for collaboration (see this if you want to know what I mean). AND it’s free! It’s not the fastest viewer on the planet, nor the easiest thing to use, but it is effective.

2. Vectorbase
Not related to protein structures, but a really interesting visualisation that shows information on, for example, insecticide resistance. With mosquitoes being such a huge part of today’s news, this kind of information is vital for fighting and understanding the distribution of insects across the globe.

3. Phandango
This is a genome browser which, from a one-man effort, could be a game-changer. The UI needs a little bit of work I think, but otherwise, a really valuable tool for crunching lots of genomic data in a quick fashion.

4. i-PV Circos
This is a neat circular browser that helps users view protein sequences in a circularised format. With this visualisation format becoming more popular as the days go by, I think this has the potential to be a leader in the field. At the moment the website’s a bit dark and not the most user-friendly, but some of the core functionality (e.g. highlighting residues and association of domains) is a real plus!

5. Storyline visualisation
Possibly my favourite/eye-opener tool from the entire conference. Storyline visualisation helps users understand how things progress in realtime — this has been used for movie plot data (e.g. Star Wars character and plot progression) but the general concept can be useful for biological phenomena – for example, how do cells in diseased states progress over time? How does it compare to healthy states? Can we also monitor protein dynamics using a similar concept? I think the fact that it gives a very intuitive, big-picture overview of the micro-scale dynamics was the reason why I’ve been incredibly interested in Kwan-Liu Ma’s work, and I recommend checking out his website/publications list to grab insight on improving data visualisation (in particular, network visualisation when you want to avoid hairballs!)

The list isn’t ranked in any way, and do check these out! There were other tools I would’ve really liked to review (e.g. Minardo, made by David Ma @frostickle on Twitter), but I suppose I can go on and on. At the end of the day, visualisation tools like these are meant to be quick, and help us to not only EXPLORE our data, but to EXPLAIN it too. I think we’re incredibly fortunate to have some amazing minds out there who are willing to not only create these tools, but also make them available for all.

Convergent affinity maturation.

Antibodies are the first line of defense of our organisms against noxious substances. They are the proteins which we ‘train’ to recognize noxious substances when we get immunized. Therefore understanding the immune response after being presented with an antigen is instrumental in developing novel vaccines.

One hypothesis relating to immune response to an antigen is that different organisms are likely to raise similar or even identical antibodies against the same antigen. Testing this hypothesis has become more realistic recently with the advent of Next Generation Sequencing technologies (NGS). Using NGS techniques it is possible to interrogate the sequential makeup of a large set of B-cells.

Such a study was conducted not that long time ago by Trueck et al. They have analysed antibody repertoires of five individuals, pre and post immunization to check if the immune systems converged on similar antibody sequences. The five individuals were immunized with a conjugate vaccine of HiB, MenC and TT. The antibodies were sequenced from cells extracted pre-vaccination and seven days after vaccination.

Firstly, the antibody repertoire appeared to reflect the fact that an organism was mounting an immune response as the clonality post vaccination was higher than before vaccination (more cells producing similar antibodies). Secondly, authors focused on identifying sequences from the public repertoire — those antibodies that are shared between individuals. This analysis focused on the CDR3 only, of which 47 were shared between at least two of the five individuals. Quite a large proportion of those sequences were known to be specific towards HiB and the enrichment of these was muhch higher in the post-vaccination sample. Only one sequence in this set was known previously to target TT. Nevertheless, relaxing the sequence similarity condition, a lot of sequences related to those known to be TT-specific were found among the five individuals. Most importantly, the number of such sequences was much higher in the post-vaccination samples, indicating that these might indeed have been raised in response to TT stimulation. The same was not true for MenC as hardly any sequences related to this antigen were found in the immune response of the five individuals.

Therefore, authors claim that looking at the enrichment of such shared sequences can be an indicator of the effectiveness of the immune response. They correlate statistics coming from looking at the number of shared sequences which appear to have moderate correlation to the antibody avidity data (even though p-values in some cases are quite high). This indicates that even in such a small set of individuals, antibodies are capable of converging on similar solutions. This might provide clues as to the characteristics antibodies that recognize specific antigens and thus facilitate novel vaccine design.

 

 

 

Is “fragment-based” still the way forward in template-free protein structure prediction?

Out of the many questions surrounding the notion that you can predict a protein’s structure from its sequence, there is one in particular that I decided to tackle during last group meeting.

Protein structure prediction is a hard problem (do I sound repetitive?). One of the many cop outs employed by the structure prediction community is the idea that you can break down known structures into fragments and use these protein pieces to perform predictions. This is known as fragment-assembly or fragment-based template-free protein structure prediction.

As absurd as the idea may seem, there is robust evidence that suggests that this is actually a viable strategy. There is a notion that the fragment space is complete; you can reconstruct the backbone of any known structure based on the torsion angles of fragments from other structures. In less technical jargon, you can effectively use fragments and combine them to re-create any of the protein structures that we know and to a fairly acceptable level of precision.

So, technically, it is possible to predict a protein structure using fragments from other structures. In practice, you are still left with the problem of choosing the right fragments to model your sequence of interest. How easy do you think that is?

We can look at this question in light of observations that were made back in the early 80s. Kabsch and Sander reported that two protein fragments having exactly the same sequence can present completely different structures [1]. This complies with the notion that global properties can affect and even define local structure, which in turn suggests that selecting the right fragments to assemble a structure is not necessarily a straightforward process.

The starting point for protein structure prediction is a sequence. Since we are talking about template-free protein structure prediction, it is safe to assume that there is no good global sequence match to your target with a known structure (otherwise you would use that match/structure as a template). Hence, fragment selection is restricted to local sequence similarity, which, as suggested in the previous paragraph, is not necessarily ideal.

On the other hand, we are becoming increasingly more accurate in inferring one-dimensional properties from a protein’s sequence. These properties can and often are used to enhance our fragment-selection capabilities. Yet, even using the state-of-the-art in secondary structure and torsion angle prediction, fragment selection is still fairly imprecise.

During group meeting I highlighted a possible contrast between practical fragment space and general (or possible) fragment space. My premise is simple.  I define practical fragment space as the fragments that we can accurately select from the possible fragment space to model protein structures. In my opinion, it would be extremely interesting to quantify the difference between the two. This would answer the fundamental question of how useful fragment-assembly actually is. More importantly, it would help the community make an educated decision in regards to whether template-free structure prediction strategies should shift from fragment-based to ones based on distance constraints, an approach that is gaining popularity due to the success of contact predictions.

I am very keen to investigate this further. Maybe for my next blog post, we will have an answer! Stay tuned.

[1] Kabsch, Wolfgang, and Christian Sander. “On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations.” Proceedings of the National Academy of Sciences  81.4 (1984): 1075­1078.

Journal Club: “Discriminative Chemical Patterns: Automatic and Interactive Design”

For Journal Club this week I decided to discuss the following paper by M. Rarey et al., which describes a method of using SMARTS patterns to discriminate between two sets of molecules. Link to paper here.

Given two sets of molecules can one generate a pattern that discriminates between two sets? This relates to a key question in drug design: can we predict whether molecules bind or not given a set of binders and a set of non-binders. The method is of particular interest because it makes use of data available, unlike conventional methods. However, for this technique to work, the correct molecular classification is required to discriminate between the two sets of molecules.

Originally molecules were classified using physiochemical properties for example, molecular weight or log P. However these classifications are too general and do not encompass enough molecular detail for accurate discrimination. An alternative is to using topological fingerprints. These encode a set the presence of a set of topological features using a series of bits. One of the limitations with this classification is that it is restricted by the predefined set of structures and features. This method makes use of chemical patterns which advantageously can can classify a chemical feature that cannot be sufficiently described by molecular substructure.

SMARTS (a molecular description language based on SMILES) allow description of structures with varying levels of specificity. For example one can specify atomic element, whether the atom is a subset of elements, whether it is aliphatic or aromatic, or whether it is in a ring. The method makes use of this description of molecules as the group have already developed some software to visualise SMARTS strings and modify them: the SMARTSeditor.

The method involves combining automatic pattern generation and visualisation to form SMARTSminer. Given two distinct molecule sets, the algorithm derives connected chemical patterns to differentiate both sets by using a sub-graph mining technique: solutions are extended by single elements iteratively.

The SMARTSminer was then used to test a series of test cases using the DUD (Database of Useful Decoys) data set. This seems strange when the data set has been shown to be inaccurate and perhaps there are more accurate test sets available, such as DUDe (Database of Useful Decoys enchanced). Let us look a couple of these case sets in more detail.

  1. Discrimination between Active Molecules on Similar Targets

The first case set looks at discriminating between molecules that are active for COX-1 and COX-2. COX proteins are cyclooxygenase that are involved in inflammatory reaction. These proteins are targeted by inhibitors such as aspirin and ibuprofen for the relief of inflammatin and pain. Both COX-1 and COX-2 are similar targets with similar molecular weight and 65% sequence identity. Selective inhibition is only due to a difference in residue at position 523.

Separation of the sets of molecules was possible with a pattern identified that hit 21/25 of the molecules active for COX-1 and 15/348 of molecules of molecules active for COX-2. When the positive and negative set are reversed a pattern is identified that matched 313/348 of COX-2 actives but only 1 of the COX-1 ligands. The group state that perfect separation is not possible as there is an overlap of 2 molecules.

It is interesting that patterns were identified that could discriminate between the two sets. However, there is no discussion of how to use this information. Additionally the pattern determined has not been tested on any molecules outside of the training set – there are no blind tests. This seems strange as a blind test could emphasise the usefulness of this method if it was successful.

2. Discrimination between Active and Inactive Molecules

The second case investigates determining whether a pattern can be generated that discriminates between active and inactive targets. The test case used target SAHH (S-adenosyl-homocysteine hydrolase). A pattern was generated that matched all active molecules and only 1% of inactives. What is particularly exciting is that the pattern found contains part of the interaction network hydrogen bonding partners of the ligand, as shown in the figure below (the pattern identified is highlighted in green).

pattern

I find it very surprising that the group did not follow up with blind tests of molecules not used in the training set – especially as the pattern identified a key part of the binding mechanism.

To summarise a new method, SMARTSminer, calculates discriminative patterns between two sets of molecules using the SMARTS language. The authors state that the method has shown applicability in several use cases covering the application of actives vs decoys, kinase classifications, analysis of data sets and characterisation of reaction centers. However, I’m not sure I can agree with that statement. I believe further blind tests would be required to prove the applicability of the method once the pattern has been found. I also believe that an analysis of whether the pattern is over fitted to the training data is also required.