Get PDB intermolecular protein contacts and interface residues

Very often in structural biology it is necessary to determine the contacts between two molecules. Most of us in the group have written a snippet of code to compute precisely that, adapted the Biopython functionality, or used one of the tools in pdbtools. The Python code presented here builds on Biopython: it gives you the intermolecular contacts and annotates the interface neighborhood. An example of the program output is given in the figure below:

The complex between an antibody and an antigen is shown on the left without the annotation. On the right it is shown with intermolecular contacts annotated in red (within 4.5A) and the interface neighborhood shown in green (within 10A of any contact residue).

Download

You can download it from here. There are three files inside:

  1. GetInterfaces.py – the main source/runnable file
  2. README.txt – instructions, very similar to this post (quite a lot copy/pasted)
  3. 1A2Y.pdb – the PDB file used in the example to practice on.

Requirements:

You need Biopython. (if you are from OPIG or any other Bioinformatics group, most likely it is already installed on your machine). You can download it from here.

How to use it?

As the bare minimum, you need to provide the PDB file(s) and the chains whose contacts you want to examine.

Input options:

  • --f1 : first pdb file [Required]
  • --f2 : second pdb file (if the contacts are to be calculated within the same molecule, just submit the same pdb in both cases) [Required]
  • --c1 : chains to be used for the first molecule [Required]
  • --c2 : chains to be used for the second molecule [Required]
  • --c : contact cutoff for intermolecular contacts (optional, set to 4.5A if not supplied on input)
  • --i : interface neighbor cutoff for the intramolecular neighborhood of the contacting interface (optional, set to 10.0A if not supplied on input). Set this option to zero (0.0) if you only want the intermolecular contacts in the interface, without the interface neighborhood.
  • --jobid : name for the output folder (set to out_<random number> if not supplied on input)

An example which you can just copy paste and run when in the same directory as the python script:

python GetInterfaces.py --f1 1A2Y.pdb --f2 1A2Y.pdb --c1 AB --c2 C --c 4.5 --i 10.0 --jobid example_output

The above command calculates the contacts between the antibody in 1A2Y (chains A and B) and the antigen (chain C). The contact distance is defined as 4.5A and the interface distance as 10A. All the output files are saved in the folder out_example_output.

Output

The output folder is placed in the current directory. If you specify the output folder name (--jobid) it will be saved under the name ‘out_[whateveryoutyped]’; otherwise it will be ‘out_[randomgeneratednumber]’. The program tells you at the end where it saved all the files.

Output files:

  • molecule_1.pdb – the first supplied molecule with b-factor fields of contacts set to 100 and interface neighborhood set to 50
  • molecule_2.pdb – the second supplied molecule with b-factor fields of contacts set to 100 and interface neighborhood set to 50
  • molecule_1.txt – whitespace-delimited file specifying the contacts and interface neighborhood in the first molecule in the format: [chain] [residue id] [contact ‘C’ or interface residue ‘I’]
  • molecule_2.txt – whitespace-delimited file specifying the contacts and interface neighborhood in the second molecule in the format: [chain] [residue id] [contact ‘C’ or interface residue ‘I’]
  • molecule_1_constrained.pdb – the first molecule, which is constrained only to the residues in the interface.
  • molecule_2_constrained.pdb – the second molecule, which is constrained only to the residues in the interface.
  • parameters.txt – the contact distance and neighborhood distance used for the particular run.
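The core of the contact calculation is simple: flag a residue as a contact if any of its atoms lies within the cutoff of any atom in the other molecule. A minimal pure-Python sketch of that logic (the coordinates here are made up for illustration; the real script uses Biopython to parse the PDB):

```python
from math import dist  # Python 3.8+

CONTACT_CUTOFF = 4.5  # Angstroms, matching the --c default

def contact_residues(mol1, mol2, cutoff=CONTACT_CUTOFF):
    """Return residue ids in mol1 with any atom within `cutoff`
    of any atom in mol2. Each molecule is a dict mapping a
    (chain, residue id) key to a list of (x, y, z) atom coordinates."""
    contacts = set()
    for res_id, atoms in mol1.items():
        if any(dist(a, b) <= cutoff
               for a in atoms
               for res2 in mol2.values()
               for b in res2):
            contacts.add(res_id)
    return contacts

# Toy example: one residue pair within 4.5 A, one far away
mol1 = {("A", 1): [(0.0, 0.0, 0.0)], ("A", 2): [(50.0, 0.0, 0.0)]}
mol2 = {("C", 7): [(3.0, 0.0, 0.0)]}
print(contact_residues(mol1, mol2))  # {('A', 1)}
```

The interface neighborhood works the same way, except the distance check runs within a molecule, against the contact residues found in the first pass.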

Network Analysis

Why networks?

Individual expression could be thought of as a phenomenon regulated mostly by the individual, but it is also modified by interactions with the surroundings. Can the response of the individual be predicted from the group? (See the following video of an experiment conducted by Asch: https://www.youtube.com/watch?v=FnT2FcuZaYI)


Most common types of network analysis

  • Basic network summary statistics (description)
  • Clustering methods (extract information)
  • Random graphs (description, inference and modelling network topology)
  • Machine learning methods (prediction)

Random Graphs and the topology structure

Depending on the structure of the desired network, different random models can be of use. For example, if the goal is a sparse, weakly connected network, an ER (Erdős–Rényi) model may be appropriate (this model assigns edges between nodes at random). If the goal is exactly the opposite (a very highly connected network), a geometric graph may be of use (this model randomly assigns positions in an n-dimensional space and then places edges between nodes closer than a given distance).
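As a concrete illustration, an ER graph G(n, p) can be generated in a few lines of plain Python (networkx provides the same thing as `erdos_renyi_graph`; the underlying logic is just an independent coin flip per node pair):

```python
import random
from itertools import combinations

def er_graph(n, p, seed=None):
    """Erdos-Renyi G(n, p): each of the n*(n-1)/2 possible edges
    is included independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

print(len(er_graph(10, 0.0)))  # 0 edges
print(len(er_graph(10, 1.0)))  # 45 edges (the complete graph on 10 nodes)
```

The expected number of edges is p * n * (n - 1) / 2, which is why small p gives the sparse networks mentioned above.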

Is there already a random model?

According to our recent results, we suspect there is no null model yet for PPIs. For some virus PPIs, some of the random models do seem to fit very well; however, these viral PPI networks are much smaller (around 300 nodes and up to 500 edges) than the networks of model organisms (usually more than 2000 nodes and 5000 edges) such as yeast, human, fruit fly and Escherichia coli, among others.
We will soon be publishing our article with details about this.

[Publication] Arginine Methylation-Dependent Reader-Writer Interplay Governs Growth Control by E2F-1

If you are familiar with the reader, writer & eraser concepts, or you are passionate about epigenetics and arginines, this recent publication might be of interest to you. The study addresses the transcription factor E2F-1, which plays a crucial role in the control of the cell cycle and is linked with cancer. Like yin and yang, it has opposing functional roles: to promote cell-cycle progression and to induce apoptosis. The results demonstrate that the biological outcome of E2F-1 activity is affected by arginine methylation marks. While asymmetric arginine methylation causes apoptosis, symmetric methylation results in proliferation. This reader-writer interplay determined by the two types of marks governs the function of E2F-1 and potentially the fate of the cell.

Remove all LaTeX generated files

I am going to leave this here and bookmark it, because I am fed up with looking this up every time, not finding it, and having to `history | fgrep rm`. To be used if you want to delete all LaTeX-generated (and pdfLaTeX-generated) files.

rm *.aux *.out *.toc *.log *.synctex.gz *.bbl *.blg *.pdf

Use at your own risk!

 

Progress report: Molecular Dynamics simulations of TCRpMHC

Last time I gave a presentation about different papers describing Molecular Dynamics simulations of TCRpMHC (http://blopig.com/blog/?p=648&preview=true). This time I extended this to more of my own work. The focus of this talk was the motivation for choosing a certain system and the requirements it must fulfil for reliable conclusions. In short they are:

1) A larger number (>100) of experimental values (e.g. some kind of binding, immunogenicity, etc) of a certain TCRpMHC system. These values should be provided by:
– the same group
– the same HLA/MHC batches
– the same binding conditions
– in the ideal case in the same manuscript

The values should NOT be from a database, since different experimental groups come to extremely different results for the same complexes, as several examples show, e.g.:
the binding IC50 affinity values of PKYVKQNTLKLAT to DRB1*01:01 range from 1 nM to 600 nM (IEDB entries).

2) Structural data exactly matching the systematic experimental data mentioned above. If multiple structures are involved they should be published by:

– the same group (in the ideal case even published together with the above mentioned data by the same group)
– be contemporary

 

Data which fulfil the above criteria are quite hard to find, since most biologists are mainly interested in fancy but rather anecdotal examples, and rarely in systematic data production.

Once one has found such a system a large number of Molecular Dynamics simulations can be performed which will yield systematic differences not just differences originating from random events or data bias.

 

 

Django for scientific applications

In my current work I am developing a cheminformatics tool using structural and activity data to investigate protein-ligand binding. I have only ever properly used Python, and I listen to Saulo, so I decided to use Django to develop my application. I didn’t understand what it was or why it might be useful before I started using it, but below I discuss a few of the features that I have found useful and that might encourage others to use it.

Firstly I will outline how Django works. I wanted to download all the PDB structures for CDK2 and store the information in a data structure that is robust and easily used. We have a Target and a Protein. A Target is associated to a particular UniProt accession. Cyclin-dependent kinase 2 (CDK2) is a Target. A Protein is a set of 3D coordinates, so 1AQ1 is a Protein.

class Target(models.Model):
    """A Django model to define a given protein target"""
    UniProt = models.CharField(max_length=20,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    Title = models.CharField(max_length=10)

In the above Target model I have three different fields. The first field denotes the UniProt accession for the Target and is “unique”. This means that only one Target can have any given UniProt accession in my data structure. If I try to add another with the same value in the UniProt field it will throw an exception. The second field denotes the time and date that the model was created. This means I can check back to when the target was created. The third is the Title I would like to use for this, for example CDK2.

I can then make a new Target object by:

new_target = Target()
new_target.Title = "CDK2"
new_target.UniProt = "P24941"

and save it to the database by:

new_target.save() # Django takes care of the required SQL

The next model is for the Protein molecules:

class Protein(models.Model):
    """A Django model to define a given protein"""
    Code = models.CharField(max_length=6,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    TargetID = models.ForeignKey(Target)
    Apo = models.BooleanField()
    PDBInfo = models.FileField(upload_to='pdb')

The model contains the PDB Code, e.g. 1AQ1, and the date it was added to the database. It also consists of a foreign key, relating it to its Target and a boolean indicating if the structure is apo or holo. Finally there is a file field relating this entry to the appropriate file path where the PDB information is stored.

Once the data has been added to the database, Django then deals with all SQL queries from the database:

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
CDK2_prots = Protein.objects.filter(TargetID__Title="CDK2") # All PDB entries associated to CDK2, as a query set, behaving similarly to a list
CDK2_list = [x for x in CDK2_prots] # Now exactly like a list

The “__” in the above query allows one to span the foreign key relationship, so it is searching for the Title of the Target not the Title of the Protein. Finally I can then access the PDB files for each of these proteins.

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
print my_prot.Code # prints "1aq1"
# my_prot.PDBInfo has the behaviour of a file handle
pdb_lines = my_prot.PDBInfo.readlines() # Reads the lines of the file

There, you’ve made a queryable database, where Django deals with all the hard stuff and everything is native to Python. Obviously in this example it might not be difficult to imagine alternative ways of achieving the same thing using directory structures, but as the structure of your data becomes more complex, Django remains easy to manipulate, and as it grows it utilises the speed advantages of modern databases.

What is a kink?

If you ask someone who works with membrane protein structures what they think a helix kink is, you will probably get a fairly vague answer. They would probably tell you something like a kink is a point in the helix where there is a sharp change in the direction of the helix axis. Like the helix on the right in the picture.

Two example helices


They might also tell you that they frequently (or always) involve Proline, and that they feature the absence of some of the backbone hydrogen bonds that you would otherwise expect to observe in a helix. But that is about as far as the consensus goes.

There are a number of published methods to identify kinks in helices (ProKink, Helanal, MC-Helan, and TMKink, to name a few). The detail of identifying kinks (and hence the definition of kinks) differs a lot between the methods. That would be fine if they all identified the same helices as having kinks, and the same positions in each helix as kinked, but they do not. Even the number of helices that are kinked differs a lot – anything from 6 to 60%. It would be easy if all α-helices were like the two in the figure – clearly kinked or straight. Generally the methods agree on these ‘easy’ helices. But there are lots of helices that are difficult to classify.

Why do we want to computationally identify them? Well, kinks are functionally important in membrane proteins. Membrane proteins are both medicinally important and hard to experimentally determine the structure of. So, modelling their structures is a useful thing to do, and their structure includes kinks. But also, they are a structural motif. If we can write down what we think a kink is, use that definition to identify kinks, and then use information from those kinks to (reliably) predict the existence of other kinks, we will know that we fully understand the motif.

There are a remarkably large number of perfectly sensible ways to identify a sharp change in the axis of an α-helix, given the coordinates of the backbone atoms.
The published methods (mentioned above) are a bit cumbersome (they require you to create input files and parse at least one, if not many, output files), so in my experience people tend to make their own methods. (That is not just me with Kink Finder; many of the people who need to identify kinks that I have spoken to at conferences do not use a published method.)

A short summary of ways to identify kinks:

  1. Fit axes to short sections of helix (4-8 residues). Give each residue an angle calculated from the angle between the axis of the section before it, and the axis of the section after it. If the largest angle in the helix is greater than a threshold, annotate the helix as kinked, at the residue with the largest angle
  2. Methods similar to (1), but using a measure other than an angle. For example, the curvature of a line fitted to the helix, or the distance between the ith and (i+4)th Cα residue
  3. Identify sections of the helix that are ‘straight’. Where a helix contains more than one section, it is kinked
  4. Look at them, and classify

Even for (1), there are lots of ways to calculate a helix axis. You can use least-squares (very sensitive to the number of residues used in the fit), fit a cylinder (better, but slower), or use a vector method based on 4 Cαs. Similarly for (2), you could fit a spline, or just a polynomial. This is before you decide (e.g.) how many residues to fit axes to (6? 7? 8?), or what smoothing parameter to use in the spline fit, or what order of polynomial.
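To make (1) concrete, here is a minimal sketch of the angle-per-residue idea, using the crudest possible axis estimate (the vector from the first to the last Cα of a short window; a real method would fit the axis properly):

```python
from math import acos, degrees

def window_axis(ca, start, length=6):
    """Crude axis of a helix window: vector from the first to the last CA."""
    a, b = ca[start], ca[start + length - 1]
    return tuple(bi - ai for ai, bi in zip(a, b))

def angle_between(u, v):
    """Angle in degrees between two 3D vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    nu = sum(ui * ui for ui in u) ** 0.5
    nv = sum(vi * vi for vi in v) ** 0.5
    return degrees(acos(max(-1.0, min(1.0, dot / (nu * nv)))))

def kink_angles(ca, length=6):
    """Angle at each position between the window before and after it."""
    return [angle_between(window_axis(ca, i - length, length),
                          window_axis(ca, i, length))
            for i in range(length, len(ca) - length + 1)]

# A perfectly straight axis gives angles of ~0 degrees at every position
straight = [(float(i), 0.0, 0.0) for i in range(14)]
print(max(kink_angles(straight)))  # 0.0
```

A kink-finding method of type (1) would then call the helix kinked wherever the largest angle exceeds some threshold, which is exactly where the methods start to disagree.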

How should we classify this helix?

The point is, there are lots of ways to do it, but there is no consensus definition. You can use one method, or 2, or more, but they will give you different answers. Because we have no concise definition, it is hard to say which helix classification is ‘right’ and which one is ‘wrong’. Take this helix, it is not perfectly straight, but is it not straight enough to be kinked? Or is the change in axis over a number of turns, making it curved rather than kinked?

There are a few ways round this problem, where your computer program is struggling to ‘correctly’ classify helices. Rather than using a computer, you could manually curate a set, which has been done. However, there are a few issues here. First, their definition of a kink was a bit vague, and certainly difficult to turn into a computer program. Second, humans are very good at seeing patterns, even when they are not really there. Third, the end aim, at least for me, is to incorporate this into structure prediction, where a computational formulation of a kink is necessary. Another solution is to accept that your program is deficient, put the borderline (i.e. difficult) cases into a third group, and remove them from your analysis. This does help to improve your signal, but is a little unsatisfactory.

To conclude, there are not two types of helices; there is more of a spectrum, from very straight, through a bit kinked, to very kinked. Classifying them is a hard problem, with lots of different approaches available. The methods give different answers (particularly in identifying the position of the kink in the helix), and they do not indicate the same causes of kinks, so there is still work to be done.

3035 – OPIG Beer & Cycling

Other than my customary Saturday morning hangover, today I woke up with aching limbs – so I am feeling rather poor indeed.  The source of this pain, other than the five and a half pints of premium quality Lager, is the OPIG beer and cycling tour of Oxford (aptly named, “Le Tour de Farce”).  What can I say, over here they make you work for your beer.


View OPIG Oxford Beer and Cycling in a larger map

We started off at the Medawar building, where we sit on weekdays and occasionally on weekends. Then we cycled to The Fishes, The White Hart, The Trout (the Carne pizza is a must), yet another White Hart, and the vegetarian and vegan Gardener Arms (I know, I know – I was drunk and they forced me in here).  For those not familiar with Oxford custom – these are some of the most beautiful (and pricey) pubs this land has to offer.

OPIG members love a laugh and their beer on the cycling trip

OPIG members love a laugh and their beer on the cycling trip

The bike hike (devised by Charlotte) was 9.535 miles.  On 5.5 pints of beer (Nick spilt his pint, like an amateur – so I gave him half of mine), that means I run on 13.86 miles per gallon  (9.535 / 0.6875).  So I roughly have the fuel economy of a 2013 Ferrari 458 Spider.  Say what you want about these blog posts, but you cannot say I am not thorough about the research which goes on behind them.

9.5 miles in Malta

This is how far yesterday’s bike ride would have got me in my country (Malta).

This morning I have also noticed my calves are an inch thicker in diameter, which of course means I will go and throw away all my socks and make that long-overdue trip to Bicester Village. Perhaps I will cycle there.

A very long introductory post about protein structure prediction

If you are a protein informatician, bioinformatician, biochemist, biologist or simply a person well informed about science, you have probably heard about protein structure prediction. If that is the case, you might be wondering what all the fuss is about, right? If you have never heard these terms before, don’t panic! You are about to find out what protein structure prediction is all about!

Based on my group meeting’s presentation last Wednesday, this blog entry will discuss why protein structure prediction is important and the potential limitations of existing methods. I will also discuss how the quality of input may be a potential source for lack of accuracy in existing software.

First, let us remember a little biology: our genetic code encrypts the inner-works of a complicated cellular machinery tightly regulated by other (macro)molecules such as proteins and RNAs. These two types of macromolecules are agents that perform the set of instructions codified by DNA. Basically, RNAs and proteins are involved in a series of processes that regulate cellular function and control how the genetic code is accessed and used.

For that reason, a huge chunk of genomic data is not that useful if considered on its own. Scientists around the globe have invested millions of moneys and a huge chunk of time in order to amass piles and piles of genome sequencing data. To be fair, this whole “gotta sequence ’em all” mania did not provide us with the fundamental answers everybody was hoping for. Cracking the genetic code was like watching an episode of Lost, in which we were left with more questions than answers. We got a very complicated map that we can’t really understand just yet.

For that reason, I feel obliged to justify myself: protein structures ARE useful. If we know a protein structure, we can formulate a very educated guess about that protein’s function. Combine that with empirical data (e.g. where and when the protein is expressed) and it can help us unveil a lot of info about the protein’s role in cellular processes. Basically, it can answer some of the questions about the (genomic) map. If only we could do that with Lost…

There is also evidence that knowing a protein’s structure can help us design specific drugs to target and inhibit that protein. Although the evidence of such biomedical application is sparse, I believe that with development of the field, there is a trend for protein structures to become more and more important in drug discovery protocols.

Still, if we look at the number of known genome sequences and known protein structures, and at the growth of those figures over the past decade, we see a drastic scenario:

Growth of Sequences vs Structures


There is a tendency for the gap between the number of protein sequences and protein structures to increase. Hence, we are getting more and more questions and little to no answers. Observe how the green line (the protein sequences associated with a known or predicted function) is very close to the red line (the number of known protein structures). However, there is a growing gap between the red and the blue line (the number of protein sequences). Source: http://gorbi.irb.hr/en/method/growth-of-sequence-databases/

Well, gathering protein structure data is just as important, if not more important, than gathering sequence data. This motivated the creation of Structural Genomics Consortiums (SGC), facilities that specialize in solving protein structures.

I am sorry to tell you that this is all old news. We have known this for years. Nonetheless, the graph above hasn’t changed. Why? The cost limitations and the experimental difficulties associated with protein structure determination are holding us back. Solving protein structures in the lab is hard and time consuming and we are far from being as efficient at structure determination as we are at genome sequencing.

There is a possible solution to the problem: you start with a protein sequence (a sequential amino acid list) and you try to predict its structure. This is known as protein structure prediction or protein structure modelling. Well, we have a limited number of building blocks (20) and a good understanding of their physicochemical properties, so it shouldn’t be that hard, right?

Unfortunately, modelling protein structure is not as simple as calculating how fast a block slides on an inclined plane. Predicting protein structure from sequence is a very hard problem indeed! It has troubled a plethora of minds throughout the past decades, making people lose many nights of sleep (I can vouch for that).

We can attribute that to two major limitations:

1- There are so many possible ways one can combine 20 “blocks” in a sequence of hundreds of amino acids, and each amino acid can also assume a range of conformations. We are looking at a massive combinatorial problem. The conformational space (the space of valid conformations a protein with a given sequence can assume) is so large that if you could check a single conformation every nanosecond, it would still take longer than the age of the universe to probe all possible conformations.
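To put a rough number on that (a back-of-the-envelope sketch; the figure of 3 conformations per residue is an illustrative assumption, since real backbone torsion space is continuous):

```python
# Back-of-the-envelope: 3 conformations per residue, 100 residues,
# one conformation checked per nanosecond.
conformations = 3 ** 100
seconds = conformations * 1e-9
years = seconds / (60 * 60 * 24 * 365)
age_of_universe_years = 13.8e9

print(f"{conformations:.2e} conformations")
print(f"{years / age_of_universe_years:.2e} ages of the universe")
```

Even with these conservative assumptions, exhaustive search would take on the order of 10^21 ages of the universe, which is the essence of Levinthal's paradox.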

2- Our physics (and our statistics) are inaccurate. We perform so many approximations in order to make the calculations feasible with current computers that we end up with very inaccurate models.

Ok! So now you should know what protein structure prediction is, why it is important and, more importantly, why it is such a hard problem to solve. I am going to finish off by giving you a brief overview of the two most commons approaches to perform protein structure prediction: template-based modelling (also known as homology modelling) and de novo structure prediction.

There is a general understanding that if two proteins have very similar sequences (namely, if they are homologs), then they will have similar structures. So, we can use known structures of homologs as templates to predict other structures. This is known as homology modelling.

One can do a lot of fancy talk to justify why this works. There is the evolutionary argument: “selective pressure acts on the phenotype level (which can encompass a protein structure) rather than the genotype level. Hence protein structures tend to be more conserved than sequence. For that reason and considering that sequence alone is enough to determine structure, similar sequences will have even more similar structures.”

One can also formulate some sort of physics argument: “a similar aminoacid composition will lead to a similar behaviour of the interacting forces that keep the protein structure packed together. Furthermore, the energy minimum where a certain protein structure sits is so stable that it would take quite a lot of changes in the sequence to disturb that minimum energy conformation drastically.”

Probably the best argument in favour of homology modelling is that it works somewhat well. Of course, the accuracy of the models has a strong dependency on the sequence similarity, but for proteins with more than 40% identity, we can use this method in order to obtain good results.

This raises another issue: what if we can’t find a homolog with known structure? How can we model our templateless protein sequence then? Well, turns out that if we group proteins together into families based on their sequence similarity, more than half of the families would not have a member with known structure. [This data was obtained by looking at the representativeness of the Pfam (a protein family database) on the PDB (a protein structure database).]

Ergo, for a majority of cases we have to perform predictions from scratch (known as free modelling or de novo modelling).

Well, not necessarily from scratch. There is a specific approach to free modelling where we can build our models using existing knowledge. We can use chunks of protein, contiguous fragments extracted from known structures, to generate models. This is known as a fragment-based approach to de novo protein structure prediction. And that is one big name!

One can think of this as a small-scale homology modelling, where both the physics and evolutionary arguments should still hold true to some degree. And how do we do? Can we generate good models? We perform appallingly! Accuracies are too low to generate any useful knowledge in the majority of cases. The problem with the rare cases when we get it right is that we have no means of knowing whether we actually got the right answer.

The poor quality of the results can be justified by the two biggest limitations discussed above. Yet something else might be at play. In homology modelling, if you use a bad template, you will almost certainly get a bad model. Similarly, using a bad set of fragments will lead to a very poor final model.

Considering we already have the other two big issues (size of conformational space and accuracy of current potentials) to worry about, we should aim to use the best fragment library we possibly can. This has been the recent focus of my work. An attempt to make a small contribution to solve such a hard problem.

I would love to detail my work on finding better fragments here, but I believe this post is already far too long for anyone to actually endure it and read it until the end. So, congratulations if you made it through!

ISMB/ECCB Conference 2013 (Berlin)

It’s that time of the year again… when an intrepid group of OPIGlets trundle back tired but happy from another successful conference (this time it was ISMB/ECCB and its satellite conference 3Dsig in Berlin) armed with their favourite titbits from the presented work. This blog post is a mashup of some of our highlights as presented at the last group meeting.


Post-schnitzel and out and about in Berlin!

Definitely one of the best things for me was getting the chance to hear Sir Tom Blundell (our very own academic grandfather… Charlotte’s supervisor) give the keynote at 3Dsig, talking about everything from the structure of insulin to his deep, underlying love of jazz. Here are some more of our favourite things…

Empirical contact potentials derived from binding free energy changes upon mutation
(poster presentation by Iain H. Moal and Juan Fernández-Recio)

Chosen by Jinwoo Leem

I was impressed by Moal (et al.)’s poster on predicting protein-protein binding affinities (in fact, it won the poster prize at 3D-Sig!). The poster describes a statistical potential that considers the number of mutations in a protein and the type of interatomic contacts. Two variants of the potential were made: one considering all atoms (atomic potential), and one considering residue side chains represented as a centroid atom (residue potential). Ultimately, the energy change is represented as:

ΔG = NP + r

where N is the matrix of interatomic contacts between atoms i,j and P is a vector of contact types. Using weighted least-squares to minimise the residuals, r, the equation was used to predict affinity (ΔG) and affinity changes following mutations (ΔΔG).
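The fitting step is plain weighted least squares. A minimal sketch with made-up numbers (two contact types and three training complexes; the real potential has far more of both):

```python
# Solve min_P sum_k w_k * (N_k . P - dG_k)^2 via the 2x2 normal
# equations A^T W A P = A^T W y, written out in plain Python.

def weighted_lsq_2param(rows, y, w):
    """rows: list of (n1, n2) contact counts; y: observed dG; w: weights."""
    s11 = sum(wk * a * a for (a, b), wk in zip(rows, w))
    s12 = sum(wk * a * b for (a, b), wk in zip(rows, w))
    s22 = sum(wk * b * b for (a, b), wk in zip(rows, w))
    t1 = sum(wk * a * yk for (a, b), wk, yk in zip(rows, w, y))
    t2 = sum(wk * b * yk for (a, b), wk, yk in zip(rows, w, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det,
            (s11 * t2 - s12 * t1) / det)

# Data generated from P = (-0.5, 1.2) with no noise, so the fit recovers it
rows = [(4, 1), (2, 3), (5, 0)]
y = [-0.5 * a + 1.2 * b for a, b in rows]
p1, p2 = weighted_lsq_2param(rows, y, [1.0, 1.0, 1.0])
print(round(p1, 6), round(p2, 6))  # -0.5 1.2
```

The fitted P plays the role of the contact-type potential; predicting ΔG for a new complex is then just the dot product of its contact counts with P.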

[Figure: predicted vs. experimental ΔΔG (top two graphs) and ΔG (bottom two graphs)]

As we can see in the top two graphs, the model shows decent performance for predicting ΔΔG of enzyme-inhibitor interactions, i.e. the model can indicate how a mutation affects binding affinities. Having said this, the ΔΔG predictions for Ab-Ag interactions were poor (Pearson’s correlation = 0.5-0.6).

Moreover, when the same potential was used to predict ΔG (bottom two graphs), the correlations were even worse. In fact, for flexible protein pairs, i.e. receptor-ligand pairs whose interface RMSD > 1.0Å, the correlation dropped to as low as 0.02.

Although the results are disappointing with respect to ΔG prediction, the model raises two interesting points. First, this is one of the few scoring functions that are specifically designed to predict affinity, rather than giving an arbitrary score for low RMSD. In addition, this model re-iterates the challenges in predicting Ab-Ag interactions. The solution for the latter point is not yet clear, but it may be of interest to re-train the model specifically with Ab-Ag complexes, and see if the model’s performance improves!

Predicting protein contact map using evolutionary and physical constraints by integer programming
(paper presentation by Zhiyong Wang and Jinbo Xu)

Chosen by Saulo de Oliveira

Last week, I decided to present a quick overview of a Paper Presentation I attended during the ISMB 2013.

The title of the presentation was “Predicting protein contact map using evolutionary and physical constraints by integer programming”, based on a paper of the same name.

Contact prediction (or evolutionary constraint prediction, a term I am much more fond of) was a trendy topic both at the 3DSig (2013) and at the ISMB (2013), with several presentations and posters on the subject.

In this particular presentation, Zhiyong Wang and Jinbo Xu described a new method to identify evolutionary constraints. What set their work apart was approaching the problem from a different angle: they aimed to predict contacts when only a small number of sequences is available in the multiple sequence alignment (refer to previous posts on the blog for an introduction to contact prediction).

They proposed a combination of machine learning and integer programming (similar to linear programming, again a topic we discussed previously here) to perform their predictions.

The machine-learning features themselves were not novel; they are quite standard in the field, such as mutation rates from PSI-BLAST profiles and the Mutual Information (MI). The output of the Random Forest classifier was then used to formulate constraints in the integer programming problem. These constraints enforce physical properties of proteins, based mostly on our understanding of secondary structure.
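One of the standard features mentioned, the Mutual Information between two alignment columns, is easy to sketch. This is a minimal illustration (the four-sequence toy alignment is invented), not the authors' feature pipeline:

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """MI between two MSA columns: sum over residue pairs (a, b) of
    p(a,b) * log( p(a,b) / (p(a) * p(b)) )."""
    n = len(col_i)
    counts_i = Counter(col_i)          # marginal counts for column i
    counts_j = Counter(col_j)          # marginal counts for column j
    counts_ij = Counter(zip(col_i, col_j))  # joint counts
    mi = 0.0
    for (a, b), c in counts_ij.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) == c * n / (counts_i[a] * counts_j[b])
        mi += p_ab * math.log(c * n / (counts_i[a] * counts_j[b]))
    return mi

# Toy alignment: columns 0 and 1 co-vary perfectly; column 2 is independent.
msa = ["AKV", "AKL", "GRV", "GRL"]
cols = list(zip(*msa))
print(mutual_information(cols[0], cols[1]))  # high: columns covary
print(mutual_information(cols[0], cols[2]))  # zero: columns independent
```

High MI between two columns suggests the corresponding residues may co-evolve, which is the signal contact-prediction methods try to exploit.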

Results seemed positive both on a random test set (CASP10) and on two other test sets. By positive, I mean there was an improvement over the current state of the art, especially for proteins with 10-1000 sequences in the MSA. Still, their precision was only around 30-40% for the top L/10 predictions (where L is the protein length). Further improvements are still necessary before we can apply these evolutionary constraints to improve protein structure prediction.

Evolution of drug resistance: structural models
(presentation by Maria Safi)

Chosen by Hannah Edwards

I found this talk by Maria Safi (which won the prize for best non-keynote presentation at 3Dsig) to be a really interesting method, despite my complete lack of background knowledge in the area (what are conferences for but to expose you to new ideas, right?).

Their idea was to produce a computationally viable method for identifying drug resistance in a protein's mutant space. Drugs work by binding to their target protein in such a way as to inhibit its native function. If the protein mutates so as to maintain its native function but impair its binding to the drug, it acquires resistance. The problem is that, even within a reasonable distance of the native sequence, a protein's mutant space is huge, and it is by no means trivial to test for maintenance of function and binding energy.

The group's solution was to recognise that the vast majority of mutant space is not of interest. As such, they send candidate mutants through a two-pass search: the first pass, a quick and cheap algorithm, swiftly eliminates the dead-end mutants (those that either are not resistant to drug binding or do not maintain their native function); the second, a more biochemically accurate yet computationally expensive algorithm, is applied to the shortlist identified during the first pass.

The first algorithm is based on restricted dead-end elimination, which aims to minimise a simple energy potential based on the protein's structural stability and its binding energy to the drug. The algorithm keeps the backbone structure constant but varies the side-chain conformations, so different mutants yield different energy potentials. A mutation at residue r can then be eliminated if an alternative mutation at r will always result in a lower energy potential.
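The elimination criterion can be sketched as follows. This is the classic rotamer-level dead-end elimination inequality (a candidate is dropped if its best-case energy still exceeds an alternative's worst case); the talk's restricted variant over mutations follows the same logic. All energy tables here are invented toy values:

```python
# E_self[i][r]: self energy of choice r at position i.
# E_pair[(i, j)][r][s]: interaction energy between choice r at i and s at j.

def can_eliminate(i, r, t, E_self, E_pair, positions):
    """Choice r at position i is a dead end relative to t if r's best-case
    total energy is still higher than t's worst-case total energy."""
    best_r = E_self[i][r]
    worst_t = E_self[i][t]
    for j in positions:
        if j == i:
            continue
        pair = E_pair[(i, j)]
        best_r += min(pair[r])   # most favourable partner for r at j
        worst_t += max(pair[t])  # least favourable partner for t at j
    return best_r > worst_t

# Two positions, two choices each; choice 1 at position 0 is clearly bad.
E_self = {0: [0.0, 5.0], 1: [0.0, 0.0]}
E_pair = {(0, 1): [[0.0, 0.1], [2.0, 3.0]],
          (1, 0): [[0.0, 2.0], [0.1, 3.0]]}
positions = [0, 1]

print(can_eliminate(0, 1, 0, E_self, E_pair, positions))  # True: eliminated
```

Because the test is a simple best-case-vs-worst-case comparison, it prunes the search space cheaply before any expensive energy calculation is run.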

The second algorithm is based on the more sophisticated MM-PBSA methodology, combining molecular mechanics with Poisson-Boltzmann Surface Area calculations to estimate the free energy of the complex. This final run identifies the candidate mutants.

A significant strength of their method is that it requires only the crystal structures of the drug and target protein. As a purely structural model it eliminates the need for large amounts of training data, which, for newly emerging diseases and drugs, is often impossible to have access to.

The main focus of Maria’s talk however was using these energy potentials to predict evolutionary pathways from a wild-type protein to a resistant candidate. By treating evolution as a random walk through mutant space, weighted by the energy potentials, and assuming selection pressure of resistance, they were able to computationally simulate evolutionary scenarios.
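The random-walk idea can be sketched as below. This is a toy illustration, not the authors' implementation: the three-letter alphabet, the target sequence, and the mismatch-based energy function are all invented, and single-site mutations are accepted with a Metropolis-style probability as one plausible way to weight the walk by an energy potential.

```python
import math
import random

ALPHABET = "ARN"  # tiny toy alphabet; real sequences use all 20 amino acids

def energy(seq, target="RRN"):
    """Assumed toy potential: each mismatch to the resistant target costs 1."""
    return sum(a != b for a, b in zip(seq, target))

def simulate_walk(start, steps, beta=2.0, seed=1):
    """Energy-weighted random walk through mutant space via single mutations."""
    rng = random.Random(seed)
    seq, path = start, [start]
    for _ in range(steps):
        pos = rng.randrange(len(seq))
        cand = seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]
        d_e = energy(cand) - energy(seq)
        # Downhill moves always accepted; uphill with probability e^(-beta*dE).
        if d_e <= 0 or rng.random() < math.exp(-beta * d_e):
            seq = cand
        path.append(seq)
    return path

path = simulate_walk("AAN", steps=200)
```

Recording which resistant sequences the walk reaches, and along which intermediate states, is what lets the method ask whether a predicted candidate is actually accessible under the evolutionary model.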

For example, Maria focussed on the ritonavir-HIV protease complex to illustrate this method. The majority of the mutants with resistance to ritonavir which have been observed in nature were predicted by the method. For the candidates that were predicted but have not been seen, further elucidation could be found from the simulated evolutionary pathways: around 60% of these candidates were not accessible under the evolutionary model.

Sequence comes to the Structural Rescue: Identifying Relevant Protein Interfaces in Crystal Structures
(presentation by Jose M. Duarte)

Chosen by Henry Wilman

Jose Duarte presented a tool, EPPIC, which identifies and classifies protein interfaces from pdb structures. The talk was titled ‘Sequence comes to the Structural Rescue: Identifying Relevant Protein Interfaces in Crystal Structures’, and follows from their publication: Protein interface classification by evolutionary analysis, Duarte JM, Srebniak A, Schärer MA, Capitani G. BMC Bioinformatics. 2012 Dec 22.

As the title suggests, this uses both structural and sequence information to classify protein contacts as biological or crystal. There is a webserver, and a downloadable version. There are a number of methods that exist to classify interfaces, and this differs in a few ways.

The previous methods typically rely on the area of the interface. As you see in the paper, even the benchmark sets used to test the other methods are biased such that biological contacts have much greater areas than crystal contacts. When the authors constructed a set where the contact area was similar, they found the previous methods performed generally poorly. However, there are a number of ways that you can define the interface or contact area, and specifically what people call ‘core residues’ of the interface. They found one study performed much better on their training set than the others. This defined core residues as ones that lost the majority of their solvent accessible surface area on binding to the interface. A simple cut off of >= 6 core residues at an interface produced a good classification.

In addition to this, they used sequence information. We know that interfaces are important, and mutations at interface residues are often deleterious. So, for a biological interface, we would expect residues to be better conserved than non-interacting surface residues. The authors used sequence entropy as a measure of conservation. They calculated this by collecting homologous sequences with PSI-BLAST and aligning them using Clustal-Omega. For each position in the alignment, if x is the frequency of a given amino acid, the sequence entropy is the negative sum over all amino acids of x·log(x). (They actually use a reduced alphabet for this, to avoid penalising mutations to similar amino acids.) They then compare the entropy of the ‘core’ residues in the interface to that of residues on the surface of the protein, and to that of residues on the periphery of the interface. If the core residues have lower entropy, the contact is classed as biological. There are simple thresholds for both of these comparisons.
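The per-column entropy is straightforward to sketch. The toy columns below are invented, and this sketch uses the full amino-acid alphabet rather than EPPIC's reduced one:

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy of one alignment column: -sum_a x_a * log(x_a),
    where x_a is the frequency of amino acid a in the column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A perfectly conserved column has zero entropy; a fully variable
# column of k distinct residues has entropy log(k).
print(column_entropy("AAAAAAAA"))  # conserved column
print(column_entropy("ARNDCQEG"))  # variable column
```

Averaging this score over the core, rim, and surface residue sets gives the quantities the classifier compares.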

They have three metrics: one structural (the number of core residues) and two sequence-based (entropy of core residues vs. peripheral residues, and entropy of core residues vs. surface residues). They classify based on a majority vote of the three. If there are too few homologous sequences (fewer than 8), they ignore the sequence scores and classify using the structure only.

So why do we care about protein interfaces? Lots of us work with experimental protein structures. Many of these come from X-ray crystallography experiments. This means that when the structural information is captured, the protein is not isolated – instead it is packed against many other copies of itself. A bit like a brick in a wall – a wall many rows of bricks deep. So our protein is in contact with many others. Some of these contacts occur within the natural environment of the cell, others are a result of the crystal packing structure.
Now, protein interfaces are important. ‘Why?’, I hear you ask. Interfaces are how proteins interact with each other, and with other molecules. Some interfaces are large, some small, some are involved in transient interactions, others in permanent ones. Many diseases occur due to amino acid mutations at these interfaces. These change how the protein functions, which can cause bad things to happen. Similarly, if we can learn how to interact with an interface, we are (theoretically) able to alter the function with some sort of drug, and cause good things (whilst avoiding bad things).

So, this raises a few questions for people who study protein crystal structures. Things like, which bits of my protein interact? Do these interactions happen in a cell? Is my structure somehow distorted by the crystal packing? This EPPIC tool gives us one way to answer these.


Congratulations on reaching the end of this blog post… as a reward, the epic Brandenburg gate (taken by Jin)