Author Archives: Claire Marks

Making Pretty Pictures with PyMOL

There are few things I like more in our field than the opportunity to make a really nice image of a protein structure. Don’t judge me, but I’ve been known to spend the occasional evening in front of the TV with a cup of tea and PyMOL open in front of me! I’ve presented on the subject at a couple of our research group retreats, and have wanted to type it up into a blog post for a while – and this is the last opportunity I will have, since I will be leaving in just a few weeks’ time, after nearly eight years (!) as an OPIGlet. So, here goes – my tips and tricks for making pretty pictures with PyMOL!

Ray Tracing

set ray_trace_mode, number

I always ray trace my images to make them higher quality. It can take a while for large proteins, but it’s always worth it! My favourite setting is 1, but 3 can be fun to make things a bit more cartoon-ish.

You can also improve the quality of the image by increasing the ‘surface_quality’ and ‘cartoon_sampling’ settings.
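As a rough sketch of how these settings fit together, the small Python helper below writes out the PyMOL commands for one ray-traced image. The function name, the default values for ‘surface_quality’ and ‘cartoon_sampling’, the image size and the output filename are all illustrative assumptions rather than recommendations; the resulting lines can be pasted into the PyMOL command line or saved as a .pml script.

```python
def render_script(out_png, mode=1, surface_quality=1, cartoon_sampling=14,
                  width=2400, height=1800):
    """Return PyMOL commands (one per line) for a ray-traced image.

    All default values here are illustrative assumptions.
    """
    if mode not in (0, 1, 2, 3):
        raise ValueError("ray_trace_mode must be 0, 1, 2 or 3")
    lines = [
        f"set ray_trace_mode, {mode}",              # 1 is my favourite, 3 is cartoon-ish
        f"set surface_quality, {surface_quality}",  # higher values give smoother surfaces
        f"set cartoon_sampling, {cartoon_sampling}",  # higher values give smoother cartoons
        f"ray {width}, {height}",                   # ray trace at the chosen size
        f"png {out_png}, dpi=300",                  # save the ray-traced image
    ]
    return "\n".join(lines)


print(render_script("pretty_picture.png"))
```

Higher quality settings make ray tracing slower, so it can be worth setting up the view with the defaults and only raising them for the final image.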

Speaking about Sequence and Structure at a Summit

A couple of weeks ago I was lucky enough to be asked to speak at the 5th Computational Drug Discovery & Development for Biologics Summit. This was my first virtual conference – it was a shame I didn’t get to visit Boston, and presenting to my empty room was slightly bizarre, but it was great to hear what people have been working on, and there’s definitely something to be said for attending a conference in fluffy socks…

A, antibody structure. An antibody is made up of four chains: two light (orange) and two heavy (blue). Each chain is made up of a series of domains – the variable domains of the light and heavy chains together are known as the Fv region (shown on the right; PDB entry 12E8). The Fv features six loops known as complementarity determining regions or CDRs (shown in dark blue); these are mainly responsible for antigen binding. B, example sequences for the VH and VL, highlighting the CDR regions and the genetic composition. It is estimated that the human antibody repertoire contains up to 10¹³ unique sequences, enabling the immune system to respond to almost any antigen. This is possible through the recombination of V, D and J gene segments, junctional diversification, and somatic hypermutation.

SAbBox – the easy way to obtain our antibody tools

A significant part of the work we do here in OPIG revolves around antibodies, the proteins of the immune system that bind to and help remove any foreign entities that find their way into the body. Since antibodies can be developed that target basically anything, they have become extremely useful as therapeutics. In our research, we develop computational tools that can be incorporated into various points along the antibody discovery pipeline. These tools include our database of antibody structures, SAbDab, and a series of predictive tools (e.g. structural modelling algorithms like ABodyBuilder) which are known collectively as SAbPred.

More Fun With 3D Printing

Recently the students of the Systems Approaches to Biomedical Science Centre for Doctoral Training took a 2-week module on our favourite subject: structural biology! As part of this, they were given the option to create their very own 3D printed model of a protein.

This year we had some great models created, some of which are shown in the picture above. The proteins are (clockwise from top left):

  • Clathrin (PDB 1XI4) – a really interesting protein that forms cages around vesicles inside the cell. This one was mine; I wrote about clathrin as part of my undergraduate dissertation many years ago…
  • GTPase (PDB 1YZN) – a protein that can bind and hydrolyse guanosine triphosphate (GTP), involved in membrane trafficking
  • TAL effector (PDB 3UGM) – this bacterial protein binds to specific regions of DNA in a host plant to activate the expression of plant genes that aid bacterial infection. The DNA here is in blue, the orange wrapped around it is the protein.
  • Mechanotransduction ion channel (PDB 5VKQ) – converts mechanical stimuli into electrical signals in specialized sensory cells.
  • ATP synthase – this protein machine builds most of the energy storage molecule ATP, which powers our cellular processes.
  • DNA (PDB 5F9I) – a double-helix strand of DNA, 20 base pairs long.

Fun with Proteins and 3D Printing!

When I’m not postdoc-ing, as part of my job I’m also involved with teaching at the Doctoral Training Centre here in Oxford. I mainly teach the first-year students of the Systems Approaches to Biomedical Science CDT – many members of this group are doing (or have done) their DPhils through this program (including myself!). Recently, I and some other OPIGlets were responsible for two modules called Structural Biology and Structure-Based Drug Discovery, and as part of those modules we arranged a practical session on 3D printing.

Most of the time, the way we ‘see’ protein structures is through a computer screen, using visualisation software such as PyMOL. While useful, these virtual representations have their limitations – since the screen is flat, it’s difficult to get a proper feel for the structure¹, and seeing how your protein could interact and form assemblies with others is difficult. Physical, three-dimensional models, on the other hand, allow you to get hands-on with your structure, and understand aspects of your protein that couldn’t be gained from simply looking at images. Plus, they look pretty cool!

This year, I printed three proteins for myself (shown in the photo above). Since my most recent work has focused on transmembrane proteins, I felt it was only right to print one – these are proteins that cross membranes, usually to facilitate the transport of molecules in and out of the cell. I chose the structure of a porin (top of the photo), which (as the name suggests) forms a pore in the cell membrane to allow diffusion across it. This particular protein (1A0S) is a sucrose-specific porin from a type of bacteria called Salmonella typhimurium, and it has three chains (coloured blue, pink and purple in the printed model), each of which has a beta barrel structure. You can just about see in the photo that each chain has regions which are lighter in colour – these are the parts that sit in the cell membrane layer; the darker regions are therefore the parts that stick out from the membrane.

My second printed model was the infamous Zika virus (bottom right). Despite all the trouble it has caused in recent years, in my opinion the structure of the Zika virus is actually quite beautiful, with the envelope proteins forming star-like shapes in a highly symmetrical pattern. This sphere of proteins contains the viral RNA. The particular structure I used to create the model (5IRE) was solved using cryo-electron microscopy, and required aligning over 10,000 images of the virus.

Finally, I printed the structure of a six-residue peptide that’s probably only interesting to me… Can you tell why?!²

 

1 – However, look at this link for an example of looking at 3D structures using augmented reality!

2 – Hint: Cysteine, Leucine, Alanine, Isoleucine, Arginine, Glutamic Acid…

Journal Club: Statistical database analysis of the role of loop dynamics for protein-protein complex formation and allostery

As I’ve mentioned on this blog a few (ok, more than a few) times before, loops are often very important regions of a protein, allowing it to carry out its function effectively. In my own research, I develop methods for loop structure prediction (in particular for antibody CDR H3), and look at loop conformational changes and flexibility. So, when I came across a paper that has the words ‘loops’, ‘flexibility’ and ‘antibody’ in its abstract, it was the obvious choice to present at my most recent journal club!

In the paper, entitled “Statistical database analysis of the role of loop dynamics for protein-protein complex formation and allostery”, the authors focus on how loop dynamics change upon the formation of protein-protein complexes. To do this, they use an algorithm they previously published called ToeLoop – given a protein structure, this classifies the loop regions as static, slow, or fast, based on both sequential and structural features:

  • relative amino acid frequencies;
  • the frequency of loop secondary structure types as annotated by DSSP (bends, β-bridges etc.);
  • the average solvent accessible surface area;
  • the average hydrophobicity index for the loop residues;
  • loop length;
  • contacts between atoms of the loop and the rest of the protein.

Two scores are calculated using the properties listed above: one that distinguishes ‘static’ loops from ‘mobile’ loops (with a reported 81% accuracy), and another that further categorises the mobile loops into ‘slow’ and ‘fast’ (74% accuracy). Results from the original ToeLoop paper indicate that fast loops are shorter, have more negatively charged residues, larger solvent accessibilities, lower hydrophobicity, and fewer contacts.

Gu et al. use ToeLoop to investigate the dynamic behaviour of loops during protein-protein complex formation. For a set of 230 protein complexes, they classified the loops of the proteins in both their free and complexed forms (illustrated by the figure below).

The loops from 230 protein complexes, in both free and bound forms, were categorised as fast, slow, or static using the ToeLoop algorithm. The loops are coloured according to their predicted dynamics. Allosteric loops, defined as those whose mobility increases upon binding, are indicated using blue arrows.

In the uncomplexed form, the majority of loops were annotated as static (63.6%), followed by slow (26.2%) and finally fast (10.2%). This indicates that most loops are inflexible. After complex formation, the number of static loops increases and the number of mobile loops decreases (67.8%, 23.0%, and 9.2% for static, slow and fast respectively). Mobility, on the whole, is therefore reduced upon binding, which is as expected – the presence of a binding partner restricts the range of possible movement.
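To make the size of that shift concrete, here is a quick bit of arithmetic on the percentages quoted above (nothing beyond those six numbers is assumed): grouping slow and fast loops together as ‘mobile’, the mobile share falls from 36.4% to 32.2% upon binding.

```python
# Percentages of loops in each ToeLoop class, as quoted above.
free = {"static": 63.6, "slow": 26.2, "fast": 10.2}
bound = {"static": 67.8, "slow": 23.0, "fast": 9.2}


def mobile_percentage(classes):
    """Slow and fast loops together count as 'mobile'."""
    return round(classes["slow"] + classes["fast"], 1)


# Change upon binding, in percentage points, for each class.
shift = {name: round(bound[name] - free[name], 1) for name in free}

print(mobile_percentage(free), mobile_percentage(bound))
print(shift)
```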

The authors then divided the loops into two groups, interface and non-interface, according to the average minimum distance of each loop residue to the binding partner (cutoff values from 4 to 8 Å were tested and each gave broadly similar results). The dynamics of non-interface loops changed less upon binding than those of the interface loops (again, this was as expected). However, an interesting result is that slow loops are more common at the interface than elsewhere on the protein, with 37.2% of interface loops being annotated as slow compared to 24.8% of non-interface loops. The authors suggest that this is due to protein promiscuity; i.e. slow loops allow proteins to bind to different partners.
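The interface definition can be sketched in a few lines of Python. This is my reading of the description above, not the authors’ code: for each loop residue, take the smallest distance from any of its atoms to any atom of the binding partner, average those values over the loop, and compare the result with a cutoff. The function names, the toy coordinates and the 6 Å default are assumptions (6 Å simply sits in the middle of the 4 to 8 Å range that was tested).

```python
from math import dist
from statistics import mean


def average_min_distance(loop_residues, partner_atoms):
    """Mean, over loop residues, of each residue's closest approach to the partner.

    loop_residues: list of residues, each a list of (x, y, z) atom coordinates.
    partner_atoms: list of (x, y, z) coordinates for the binding partner.
    """
    return mean(
        min(dist(atom, other) for atom in residue for other in partner_atoms)
        for residue in loop_residues
    )


def is_interface(loop_residues, partner_atoms, cutoff=6.0):
    """Label a loop as 'interface' if its average minimum distance is within the cutoff."""
    return average_min_distance(loop_residues, partner_atoms) <= cutoff


# Toy example: a one-atom 'partner' at the origin and two two-residue loops.
partner = [(0.0, 0.0, 0.0)]
near_loop = [[(3.0, 0.0, 0.0), (5.0, 0.0, 0.0)], [(0.0, 4.0, 0.0)]]
far_loop = [[(20.0, 0.0, 0.0)], [(0.0, 30.0, 0.0)]]

print(is_interface(near_loop, partner), is_interface(far_loop, partner))
```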

The 4600 loops analysed in the study were split into two groups based on their proximity to the interface. As expected, interface loops are affected more by binding than non-interface loops. Slow loops are more prevalent at the interface than elsewhere on the protein.

Binding-induced dynamic changes were then investigated in more detail, by dividing the loops into 9 categories based on the transition (i.e. static-static, slow-static, slow-fast etc.). The dynamic behaviour of most loops (4120 out of 4600) does not change, and those loops whose mobility decreased upon binding were found close to the interface (average distance of ~12 Å). A small subset of the loops (termed allosteric by the authors) demonstrated an increase in flexibility upon complex formation (142 out of 4600); these tended to be located further away from the interface (average distance of ~30 Å).
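A minimal sketch of this bookkeeping is shown below, assuming each loop comes with a (free, bound) pair of class labels; the example pairs are made up. Incidentally, if unchanged, increased and decreased mobility account for every loop (my assumption, as the figure is not quoted above), then 4600 − 4120 − 142 = 338 loops became less mobile on binding.

```python
from collections import Counter

# Order the three classes from least to most mobile.
MOBILITY = {"static": 0, "slow": 1, "fast": 2}


def transition_kind(free_state, bound_state):
    """Summarise a free -> bound transition as a change in mobility."""
    delta = MOBILITY[bound_state] - MOBILITY[free_state]
    if delta > 0:
        return "increased"  # the 'allosteric' loops in the authors' terminology
    if delta < 0:
        return "decreased"
    return "unchanged"


# Made-up (free, bound) labels for five loops.
pairs = [("static", "static"), ("slow", "static"), ("static", "slow"),
         ("fast", "fast"), ("fast", "slow")]

# Tally the nine possible categories, and the coarser three-way summary.
categories = Counter(f"{free}-{bound}" for free, bound in pairs)
kinds = Counter(transition_kind(free, bound) for free, bound in pairs)

print(categories)
print(kinds)
```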

One of these allosteric loops was investigated further as part of a case study. The complex in question was an antibody-antigen complex, in which one loop distant from the binding site transitioned from static to slow upon binding. The loops directly involved in binding (the CDRs) either displayed reduced flexibility or remained static. The presence of an allosteric loop was supported by experimental data – the loop is shown to change conformation upon binding (RMSD of 3.6 Å between bound and unbound crystal structures from the PDB), and the average B-factor for the loop atoms increased on complex formation from around 26 Å² to approximately 140 Å². The authors also carried out MD simulations of the unbound antibody and antigen as well as the complex, and showed that the loop moved more in the complex than in the free antibody. The authors propose that the increased flexibility of the loop offsets the entropy loss that occurs due to binding, thereby increasing the strength of binding. ToeLoop could, therefore, be a useful tool in the development of antibody therapies (or other protein drugs) – it could be used in tandem with an antibody modelling protocol, allowing the dynamic behaviour of loop regions to be monitored and possibly designed to increase affinities.
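To get a feel for what that change in B-factor means, the standard relation for an isotropic B-factor, B = 8π²⟨u²⟩, can be inverted to give a root-mean-square atomic displacement. The sketch below does this for the two values quoted above, giving roughly 0.6 Å and 1.3 Å. This is only an interpretive aid: crystallographic B-factors also absorb effects such as static disorder, so the numbers should not be read as pure dynamics.

```python
from math import pi, sqrt


def rms_displacement(b_factor):
    """Root-mean-square displacement (in Å) implied by an isotropic B-factor (in Å^2).

    Uses B = 8 * pi^2 * <u^2>, so sqrt(<u^2>) = sqrt(B / (8 * pi^2)).
    """
    return sqrt(b_factor / (8 * pi ** 2))


for b in (26, 140):
    print(f"B = {b} Å^2 -> RMS displacement of about {rms_displacement(b):.2f} Å")
```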

Finally, the authors explored the link between loop dynamics and binding affinity. Again, they used ToeLoop to predict the flexibility of loops, but this time the complexes were from a set of 170 with known affinity. They demonstrated that affinity is correlated with the number of static loop residues present at the interface – ‘strong’ binders (those with picomolar affinity) tend to contain more static residues than more weakly binding pairs of proteins. This is in accordance with the theory that the rigidification of flexible loops upon binding leads to lower affinities, due to the loss of entropy.

Addressing the Role of Conformational Diversity in Protein Structure Prediction

For my journal club last week, I chose to look at a recent paper entitled “Addressing the Role of Conformational Diversity in Protein Structure Prediction”, by Palopoli et al. [1]. In the study of proteins, structures are incredibly useful tools, offering information about how they carry out their function, and allowing informed decisions to be made in many areas (e.g. drug design). Since experimental structure determination is difficult, however, the computational prediction of protein structures has become very important (and a number of us here at OPIG work on this!).

A problem, however, in both experimental structure determination and computational structure prediction, is that proteins are generally treated as static – the output of an X-ray crystallography experiment is a single structure, and in the majority of cases the goal of structure prediction is to produce one model that closely resembles the native structure. The accuracy of structure prediction algorithms is also normally measured by comparing the resulting model to a single, known experimentally-determined structure. The issue here is that proteins are not static – they are constantly moving and may adopt a number of different conformations; the structure observed experimentally is just a snapshot of that motion. The dynamics of a protein may even play an important role in its function; an example is haemoglobin, which after binding to oxygen changes conformation to increase affinity for further binding. It may be more appropriate, then, to represent a protein as an ensemble of structures, and not just one.

Conformational diversity helps the protein haemoglobin carry out its function (the transportation of oxygen in the blood). Haemoglobin has four subunits, each containing a haem group, shown in red. When oxygen binds to this group (blue), a histidine residue moves, shifting the position of an alpha helix. This movement is propagated throughout the entire structure, and increases the affinity for oxygen of the other subunits – binding therefore becomes increasingly easy (this is known as co-operative binding). Gif shown is from the PDB-101 Molecule of the Month series: S. Dutta and D. Goodsell, doi:10.2210/rcsb_pdb/mom_2003_5

How, though, could this be incorporated into protein structure prediction? This is the question being considered by the authors of this paper. They consider conformational diversity by looking at different conformers of the same protein – there are many proteins whose structures have been solved experimentally multiple times, and as such have a number of structures available in the PDB. Information about this is stored in a useful database called CoDNaS [2], which was developed by some of the authors of the paper under discussion. In some cases, there are model (or decoy) structures available for these proteins, generated by various structure prediction algorithms – for example, all models submitted for the CASP experiments [3], where the current accuracy of structure prediction is monitored through blind prediction, are freely available for download. The authors curated a collection of decoy sets for 91 different proteins for which multiple experimental structures are present in the PDB.

As mentioned previously, the accuracy of a model is normally evaluated by measuring its structural similarity to one known (or reference) structure – only one conformer of the protein is considered. The authors show that the model rankings achieved by this are highly dependent on the chosen reference structure. If the possible choices (i.e. the observed conformers) are quite similar the effect is small, but if there is a large difference, then two completely different decoys could be designated as the most accurate depending on which reference structure is used.
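The toy example below illustrates how the ranking can flip. It assumes the decoys have already been superimposed onto each reference structure, so the RMSD is calculated directly on the coordinates without any fitting step; the three-atom ‘structures’, the decoy names and the function names are all made up.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    No superposition is performed; the structures are assumed to be pre-aligned.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must contain the same number of atoms")
    squared = sum((x - y) ** 2
                  for p, q in zip(coords_a, coords_b)
                  for x, y in zip(p, q))
    return sqrt(squared / len(coords_a))


def rank_decoys(decoys, reference):
    """Return decoy names ordered from most to least similar to the reference."""
    return sorted(decoys, key=lambda name: rmsd(decoys[name], reference))


# Two conformers that differ in the position of their final atom.
conformer_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
conformer_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 2.0, 0.0)]

decoys = {
    "decoy_A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.2, 0.0)],
    "decoy_B": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.8, 0.0)],
}

print(rank_decoys(decoys, conformer_1))
print(rank_decoys(decoys, conformer_2))
```

Which decoy comes out on top depends entirely on which conformer is used as the reference.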

The key figure from this paper, in my opinion, is the one shown below. For the two most dissimilar experimentally-observed conformers for each protein in the set, the RMSD of the best decoy in relation to one conformer is plotted against the RMSD of the best decoy when measured against the other:

The straight line on this graph indicates what would be observed if there are decoys in the set that equally represent the two conformers; for example, if the best decoy with reference to conformer 1 has an RMSD of 1 Å, then there is also a decoy that is 1 Å away from conformer 2. Most points are on or near this line – this means that the sets of decoy structures are not biased towards one of the conformers. Therefore, structure prediction algorithms seem to be able to generate models for multiple conformations of proteins, and so the production of an ensemble of models is not an impossible dream. Several obstacles remain, however – although of equal distance to both conformers, the decoys could still be of poor quality; and decoy selection is often inaccurate, and so finding these multiple conformations amongst all others is a challenge.
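Each point on a plot like this needs only two numbers per protein: the lowest RMSD in the decoy set with respect to one conformer, and the lowest with respect to the other. A minimal sketch, with made-up RMSD values and an arbitrary 0.5 Å tolerance for deciding what counts as ‘near the line’ (both are my assumptions, not values from the paper):

```python
def best_rmsd_pair(rmsds_to_conformer_1, rmsds_to_conformer_2):
    """RMSD of the best decoy measured against each of the two conformers.

    The two minima may come from different decoys in the set.
    """
    return min(rmsds_to_conformer_1), min(rmsds_to_conformer_2)


def is_balanced(pair, tolerance=0.5):
    """True if the point lies within `tolerance` Å of the diagonal (an arbitrary choice)."""
    return abs(pair[0] - pair[1]) <= tolerance


# Made-up decoy sets: per protein, the RMSD of every decoy to each conformer.
decoy_sets = {
    "protein_1": ([1.0, 2.5, 3.1], [2.8, 1.1, 3.0]),
    "protein_2": ([0.8, 1.5], [3.9, 4.2]),
}

for name, (to_conf_1, to_conf_2) in decoy_sets.items():
    pair = best_rmsd_pair(to_conf_1, to_conf_2)
    print(name, pair, "balanced" if is_balanced(pair) else "biased")
```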

[1] – Palopoli, N., Monzon, A. M., Parisi, G., and Fornasari, M. S. (2016). Addressing the Role of Conformational Diversity in Protein Structure Prediction. PLoS One, 11, e0154923.

[2] – Monzon, A. M., Juritz, E., Fornasari, S., and Parisi, G. (2013). CoDNaS: a database of conformational diversity in the native state of proteins. Bioinformatics, 29, 2512–2514.

[3] – Moult, J., Pedersen, J. T., Judson, R., and Fidelis, K. (1995). A Large-Scale Experiment to Assess Protein Structure Prediction Methods. Proteins, 23, ii–iv.

Conformational Variation of Protein Loops

Something many structural biologists (including us here in OPIG!) are guilty of is treating proteins as static, rigid structures. In reality, proteins are dynamic molecules, transitioning constantly from conformation to conformation. Proteins are therefore more accurately represented by an ensemble of structures instead of just one.

In my research, I focus on loop structures, the regions of a protein that connect elements of secondary structure (α-helices and β-sheets). There are many examples in the PDB of proteins with identical sequences, but whose loops have different structures. In many cases, a protein’s function depends on the ability of its loops to adopt different conformations. For example, triosephosphate isomerase, which is an important enzyme in the glycolysis pathway, changes conformation upon ligand binding, shielding the active site from solvent and stabilising the intermediate compound so that catalysis can occur efficiently. Conformational variability helps triosephosphate isomerase to be what is known as a ‘perfect enzyme’; catalysis is limited only by the diffusion rate of the substrate.

Structure of the triosephosphate isomerase enzyme. When the substrate binds, a loop changes from an ‘open’ conformation (pink, PDB entry 1TPD) to a ‘closed’ one (green, 1TRD), which prevents solvent access to the active site and stabilises the intermediate compound of the reaction.

An interesting example, especially for some of us here at OPIG, is the antibody SPE7. SPE7 is multispecific, meaning it is able to bind to multiple unrelated antigens. It achieves this through conformational diversity. Four binding site conformations have been found, two of which can be observed in its unbound state in equilibrium – one with a flat binding site, and another with a deep, narrow binding site [1].

SPE7: an antibody that exists as two different structures in equilibrium – one with a shallow binding site (left, blue, PDB code 1OAQ) and one with a deep, narrow cleft (right, green, PDB 1OCW). Complementarity determining regions are coloured in each case.

So when you’re dealing with crystal structures, beware! X-ray structures are averages – each atom position is an average of its position across all unit cells. In addition, thanks to factors such as crystal packing, the conformation that we see may not be representative of the protein in solution. The examples above demonstrate that the sequence/structure relationship is not as clear cut as we are often led to believe. It is important to consider dynamics and conformational diversity, which may be required for protein function. Always bear in mind that the static appearance of an X-ray structure is not the reality!

[1] James, L C, Roversi, P and Tawfik, D S. Antibody Multispecificity Mediated by Conformational Diversity. Science (2003), 299, 1362-1367.

Loop Model Selection

As I have talked about in previous blog posts (here and here, if interested!), the majority of my research so far has focussed on improving our ability to generate loop decoys, with a particular focus on the H3 loop of antibodies. The loop modelling software that I have been developing, Sphinx, is a hybrid of two other methods – FREAD, a knowledge-based method, and our own ab initio method. By using this hybrid approach we are able to produce a decoy set that is enriched with near-native structures. However, while the ability to produce accurate loop conformations is a major advantage, it is by no means the full story – how do we know which of our candidate loop models to choose?

In order to choose which model is the best, a method is required that scores each decoy, thereby producing a ranked list with the conformation predicted to be best at the top. There are two main approaches to this problem – physics-based force fields and statistical potentials.

Force fields are functions used to calculate the potential energy of a structure. They normally include terms for bonded interactions, such as bond lengths, bond angles and dihedral angles; and non-bonded interactions, such as electrostatics and van der Waals forces. In principle, they can be very accurate; however, they have certain drawbacks. Since some terms have a very steep dependency on interatomic distance (in particular the non-bonded terms), very slight conformational differences can have a huge effect on the score. A loop conformation that is very close to the native could therefore be ranked poorly. In addition, solvation terms have to be used – this is especially important in loop modelling applications since loop regions are generally found on the surface of proteins, where they are exposed to solvent molecules.
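The steepness is easy to see with a 12-6 Lennard-Jones term, a common functional form for van der Waals interactions. The parameters below (ε = 1 in arbitrary energy units, σ = 3.4 Å) are illustrative assumptions and are not taken from any particular force field.

```python
def lennard_jones(r, epsilon=1.0, sigma=3.4):
    """12-6 Lennard-Jones energy for two atoms separated by r (in Å).

    epsilon (well depth) and sigma (zero-crossing distance) are illustrative values.
    """
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)


# Moving two atoms less than 1 Å closer turns a favourable contact into a clash.
for r in (3.8, 3.4, 3.0):
    print(f"r = {r:.1f} Å  ->  E = {lennard_jones(r):+.2f}")
```

With these parameters the energy is close to its minimum of −ε at 3.8 Å, zero at 3.4 Å, and more than nine times ε above zero at 3.0 Å, so a near-native loop with one slightly misplaced atom can receive a very poor score.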

The alternatives to physics-based force fields are statistical potentials. In this case, a score is obtained by comparing the model structure (i.e. its interatomic distances and contacts) to experimentally-derived structures. As a very simple example, if the distance between the backbone N and Cα of a residue in a loop model is 2 Å, but this distance has not been observed in known structures, we can assume that a distance of 2 Å is energetically unfavourable, and so we can tell that this model is unlikely to be close to the native structure. Advantages of statistical potentials over force fields are their relative ‘smoothness’ (i.e. small variations in structure do not affect the score as much), and the fact that the interactions do not need to be completely understood – if examples of these interactions have been observed before, they will automatically be taken into account.
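The N to Cα example can be turned into a deliberately simplified sketch: bin the distances seen in known structures, then score a model distance as the negative log of how often its bin was observed. The ‘observed’ distances, bin width and pseudocount below are made up for illustration, and published potentials are considerably more elaborate than this.

```python
from collections import Counter
from math import log

BIN_WIDTH = 0.1  # Å; an arbitrary choice for this sketch


def bin_index(distance):
    """Assign a distance to the nearest 0.1 Å bin."""
    return round(distance / BIN_WIDTH)


def build_potential(observed_distances, pseudocount=0.5, n_bins=50):
    """Return a scoring function: lower scores mean more frequently observed distances.

    The pseudocount keeps never-observed bins finite (but heavily penalised).
    """
    counts = Counter(bin_index(d) for d in observed_distances)
    total = len(observed_distances) + pseudocount * n_bins

    def score(distance):
        return -log((counts[bin_index(distance)] + pseudocount) / total)

    return score


# Made-up N-CA distances (in Å) standing in for those seen in known structures.
observed = [1.44, 1.45, 1.46, 1.46, 1.47, 1.48, 1.46, 1.47]
score = build_potential(observed)

print(f"score(1.46 Å) = {score(1.46):.2f}")
print(f"score(2.00 Å) = {score(2.00):.2f}")
```

The 2 Å distance falls in a bin that was never observed, so it receives a much worse (higher) score than a typical distance of 1.46 Å.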

I have tested several statistical potentials (including calRW, DFIRE, DOPE and SoapLoop) by using them to rank the loop decoys generated by our hybrid method, Sphinx. Unfortunately, none of them were consistently able to choose the best decoy out of the set. The average RMSD (across 70 general loop targets) of the top-ranked decoy ranged between 2.7 Å and 4.74 Å for the different methods – the average RMSD of the actual best decoy was much lower at 1.32 Å. Other researchers have also found loop ranking challenging – for example, in the latest Antibody Modelling Assessment (AMA-II), ranking was seen as an area for significant improvement. In fact, model selection is seen as such an issue that protein structure prediction competitions like AMA-II and CASP allow the participants to submit more than one model. Loop model selection is therefore an unsolved problem, which must be investigated further to enable reliable predictions to be made.
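The comparison described above boils down to two averages per scoring method, which can be sketched as follows. The decoy sets here are made up, and the code assumes that a lower score means a better predicted model; that convention is my assumption and may need flipping for a given potential.

```python
from statistics import mean


def evaluate_ranking(decoy_sets):
    """Compare a scoring method's top choice with the best decoy actually available.

    decoy_sets: one list per target, each containing (score, rmsd) tuples.
    Returns (average RMSD of the top-ranked decoy, average RMSD of the best decoy).
    """
    top_ranked = [min(decoys, key=lambda d: d[0])[1] for decoys in decoy_sets]
    best_available = [min(rmsd for _, rmsd in decoys) for decoys in decoy_sets]
    return mean(top_ranked), mean(best_available)


# Two made-up targets: the first is ranked badly, the second is ranked correctly.
toy_sets = [
    [(-10.0, 3.0), (-8.0, 1.0)],
    [(-5.0, 1.5), (-4.0, 2.5)],
]

top, best = evaluate_ranking(toy_sets)
print(f"top-ranked: {top:.2f} Å, best available: {best:.2f} Å")
```

The gap between the two averages is a simple measure of how much accuracy is being lost at the model selection stage.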