
Inserting functional proteins in an antibody

At the group meeting on the 3rd of February I presented the results of the paper “A General Method for Insertion of Functional Proteins within Proteins via Combinatorial Selection of Permissive Junctions” by Peng et al. This is interesting to our group, and especially to me, because it is a novel way of designing an antibody, although I suspect that the scope of their research is much more general, their use of antibodies being a proof of concept.

Their premise is that the structure of a protein is essentially secondary structures and tertiary structure interconnected through junctions. As such it should be possible to interconnect regions from different proteins through junctions, and these regions should take up their native secondary and tertiary structures, thus preserving their functionality. The question is: what is a suitable junction? This is important because these junctions should be flexible enough to allow the proper folding of the different regions, but not so flexible as to have a negative impact on stability. There has been previous work on trying to design suitable junctions. The workflow presented in this paper, however, is based on trying a vast number of junctions and then identifying which of them work.

As I said above, their proof of concept is antibodies. They used an antibody scaffold (the host), out of which they removed the H3 loop and then fused to it, using junctions, two different proteins: Leptin and FSH (the guests). To identify the correct junctions they generated a library of antibodies with random three-residue sequences on either side of the inserted protein plus a generic linker (GGGGS) that can be repeated up to three times.

They say that the theoretical size of the library is 10^9 (however I would say it is 9*20^6), and the diversity actually achieved was 2.88*10^7 for Leptin and 1.09*10^7 for FSH. The next step is to identify which junctions have allowed the guest protein to fold properly. For this they devised an autocrine-based selection method using engineered cells that have beta lactamase receptors which have either Leptin or FSH as agonists. A fluoroprobe in the cell responds to the presence of beta lactamase by producing a blue color instead of green, which allows the cells with the active antibody-guest designed protein (clone) to be identified using FRET-based fluorescence-activated cell sorting.
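To make the disagreement over the theoretical library size concrete, here is a minimal sketch of the counting behind 9*20^6. The counting model is my own reading of the library design (six randomised positions, and one to three GGGGS repeats chosen independently on each side), not something spelled out in the paper, and I assume the second reported diversity belongs to FSH.

```python
# Sketch of the library-size arithmetic; the counting model is an assumption.
AMINO_ACIDS = 20          # possible residues at each randomised position
RANDOM_POSITIONS = 3 + 3  # three random residues on each side of the guest
LINKER_CHOICES = 3 * 3    # assumed: 1-3 GGGGS repeats, chosen independently per side


def library_size(amino_acids=AMINO_ACIDS, positions=RANDOM_POSITIONS,
                 linkers=LINKER_CHOICES):
    """Number of distinct junction combinations under the assumptions above."""
    return linkers * amino_acids ** positions


theoretical = library_size()
print(f"{theoretical:.2e}")  # 5.76e+08

# Fraction of the theoretical space covered by the reported library diversities
for guest, achieved in {"Leptin": 2.88e7, "FSH": 1.09e7}.items():
    print(guest, f"{achieved / theoretical:.1%}")
```

Under this model the theoretical size is a little over half of the quoted 10^9, and both libraries sample only a few percent of it.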

They managed to identify 6 clones that worked for Leptin and 3 that worked for FSH, with the linkers listed in the table below:

There does not seem to be a pattern emerging from those linker sequences, although one of them repeats itself. For my research it would have been interesting if a pattern did emerge, as it could then be used as a generic linker by future designers. However, this is still another prime example of how well an antibody scaffold can be used as a starting point for protein engineering.

As a bonus they also tested how their designs work in vivo and discovered that the antibody-leptin design (IgG-Leptin) has a longer lifetime. This is probably because, being a larger protein, it is not filtered out by the kidneys.

Designing antibodies targeting disordered epitopes

At the meeting on February 10 I covered the article by Sormanni et al. describing a methodology for computationally designing antibodies against intrinsically disordered regions of proteins.

Antibodies are proteins that are a natural part of our immune system. For over 50 years lab-made antibodies have been used in a wide variety of therapeutic and diagnostic applications. Nowadays, we can design antibodies with high specificity and affinity for almost any target. Nevertheless, engineering antibodies against intrinsically disordered proteins remains costly and unreliable. Since as many as about 33% of all eukaryotic proteins could be intrinsically disordered, and disordered proteins are often implicated in various ailments and diseases, such a methodology could prove invaluable.

Cascade design

The initial step in the protocol involves searching the PDB for protein sequences that interact in a beta strand with segments of the target sequence. Next, such peptides are joined together using a so-called “cascade method”. The cascade method starts with the longest peptide found and grows it to the length of the target sequence by joining it with other, partially overlapping peptides coming from beta strands of the same type (parallel or antiparallel). In the cascade method, all fragments used must form the same hydrogen bond pattern. The resulting complementary peptide is expected to “freeze” part of the disordered protein by forcing it to locally form a beta sheet. After the complementary peptide is designed, it is grafted on a single-domain antibody scaffold. This decision was made because antibodies have a longer half-life and lower immunogenicity.
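The growing step can be sketched roughly as follows. This is only my illustration of the idea, not the authors' implementation: the `Fragment` type, the rule that overlapping fragments must agree residue by residue, and the toy peptide sequences are all assumptions, and the hydrogen bond pattern is reduced here to the strand type.

```python
from typing import List, NamedTuple, Optional


class Fragment(NamedTuple):
    start: int        # target position paired with the first peptide residue
    peptide: str      # complementary residues, one per target position
    strand_type: str  # "parallel" or "antiparallel"

    @property
    def end(self) -> int:
        return self.start + len(self.peptide)


def merge(a: Fragment, b: Fragment) -> Optional[Fragment]:
    """Join two fragments if they overlap and agree on the overlap (assumed rule)."""
    lo, hi = max(a.start, b.start), min(a.end, b.end)
    if lo >= hi:
        return None  # the fragments do not overlap
    if a.peptide[lo - a.start:hi - a.start] != b.peptide[lo - b.start:hi - b.start]:
        return None  # the fragments disagree where they overlap
    start, end = min(a.start, b.start), max(a.end, b.end)
    residues = [
        a.peptide[pos - a.start] if a.start <= pos < a.end else b.peptide[pos - b.start]
        for pos in range(start, end)
    ]
    return Fragment(start, "".join(residues), a.strand_type)


def cascade(fragments: List[Fragment], target_length: int) -> Fragment:
    """Greedily grow the longest fragment with overlapping fragments of its type."""
    current = max(fragments, key=lambda f: len(f.peptide))
    pool = [f for f in fragments if f.strand_type == current.strand_type]
    grown = True
    while grown and len(current.peptide) < target_length:
        grown = False
        for frag in pool:
            merged = merge(current, frag)
            if merged is not None and len(merged.peptide) > len(current.peptide):
                current, grown = merged, True
                break
    return current


# Toy fragments with made-up sequences, purely for illustration
fragments = [
    Fragment(2, "VTVK", "antiparallel"),
    Fragment(0, "KEVT", "antiparallel"),
    Fragment(5, "KLY", "antiparallel"),
    Fragment(4, "QQQ", "parallel"),
]
designed = cascade(fragments, target_length=8)
print(designed.peptide)
```

The parallel fragment is never used because the seed fragment is antiparallel, which mirrors the requirement that all joined pieces come from the same type of strand.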

To test their method the authors initially assessed the robustness of their design protocol. First, they ran the cascade method on three targets – α-synuclein, Aβ42 and IAPP. They found that more than 95% of the residue positions in the three proteins could be targeted by their method. In addition, the mean number of available fragments per position was 570. They also estimated their coverage on a larger scale, using 1690 disordered protein sequences obtained from the DisProt database and from measured NMR chemical shifts. About 90% of residue positions from DisProt and 85% of positions from the chemical shift set could be covered by at least one designed peptide. The positions that were hard to target usually contained proline, in agreement with the known result that prolines tend to disrupt secondary structure formation.

To test the quality of their designs the authors created complementary peptides for α-synuclein, Aβ42 and IAPP and grafted them on the CDR3 region of a human single-domain antibody scaffold. All designs were highly stable and bound their targets with high specificity. Following this encouraging result the authors measured the affinity of one of their designs (one of the anti-α-synuclein antibodies). The K_d was found to lie in the range 11-27 μM. Such affinity is too low for pharmaceutical purposes, but it is enough to prevent aggregation of the target protein.

As the last step in the project the authors attempted a two-peptide design, where a second peptide was grafted in the CDR2 region of the single-domain scaffold. Both peptides were designed to bind the same epitope. The two-peptide design managed to reach the affinity required for pharmaceutical viability (K_d below 185 nM with 95% confidence). Nevertheless, the two-loop design became very unstable, rendering it not viable for pharmaceutical purposes.

Overall, this study presents a very exciting step towards computationally designed antibodies targeting disordered epitopes and deepens our understanding of antibody functionality.

Network Hubs

Sometimes real networks contain a few nodes that are connected to a large portion of the nodes in the network. These nodes, often called ‘hubs’ (or global hubs), can change global properties of the network drastically. For example, the length of the shortest path between two nodes can be significantly reduced by their presence.

The presence of hubs in real networks can be easily observed. For example, in flight networks, airports such as Heathrow (UK) or Beijing Capital International Airport (China) have a very large number of incoming and outgoing flights in comparison to all other airports in the world. Now, if in addition to the network there is a partition of the nodes into different groups, ‘local hubs’ can appear. For example, assume that the political division is a partition of the nodes (airports) into different countries. Then, some capital city airports can be local hubs as they have incoming and outgoing flights to most other airports in that same country. Note that a local hub might not be a global hub.

There are several ways to classify nodes based on different network properties. Take, for example, hub nodes and non-hub nodes. One way to classify nodes as hub or non-hub uses the participation coefficient and the standardised within-module degree (Guimerà & Amaral, 2005).

Consider a partition of the nodes into $N_M$ groups. Let $k_i$ be the degree of node $i$ and $k_{is}$ the number of links or edges from node $i$ to nodes in group $s$. Then, the participation coefficient of node $i$ is:

$P_i = 1 - \sum_{s=1}^{N_M} k_{is}^2 / k_i^2$ .

Note that if node $i$ is connected only to nodes within its group then the participation coefficient of node $i$ is 0. If instead its links are uniformly distributed across all groups, the participation coefficient is close to 1 (Guimerà & Amaral, 2005).

Now, the standardised within module degree:

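$z_i= (k_{i s_i} - \bar{k}_{s_i}) / \sigma_{k_{s_i}}$ ,

where $s_i$ is the group node $i$ belongs to, $k_{i s_i}$ is the number of links from node $i$ to other nodes in that group, and $\bar{k}_{s_i}$ and $\sigma_{k_{s_i}}$ are the mean and standard deviation of this within-group degree over all nodes in the group.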
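Both statistics are easy to compute from an edge list and a partition. The sketch below is my own minimal version for an undirected, unweighted network. The function name and input layout are made up for illustration, and I use the population standard deviation, which is an assumption rather than something fixed by the definitions above.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cartography_stats(edges, group_of):
    """Return {node: (P_i, z_i)} for an undirected, unweighted graph.

    edges    -- iterable of (u, v) pairs, each undirected edge listed once
    group_of -- dict mapping every node to its group label
    """
    # k_is[i][s] = number of links from node i to nodes in group s
    k_is = {node: defaultdict(int) for node in group_of}
    for u, v in edges:
        k_is[u][group_of[v]] += 1
        k_is[v][group_of[u]] += 1

    # Within-group degree of every node, also collected per group
    within = {i: k_is[i][group_of[i]] for i in group_of}
    by_group = defaultdict(list)
    for i, s in group_of.items():
        by_group[s].append(within[i])

    stats = {}
    for i, s in group_of.items():
        k_i = sum(k_is[i].values())
        # Participation coefficient; isolated nodes get 0 by convention here
        p = 1 - sum(k ** 2 for k in k_is[i].values()) / k_i ** 2 if k_i else 0.0
        # Standardised within-group degree; 0 if the group has no spread
        sigma = pstdev(by_group[s])
        z = (within[i] - mean(by_group[s])) / sigma if sigma else 0.0
        stats[i] = (p, z)
    return stats


# Toy network: a triangle (group A) linked to a pair (group B) through a1
group_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
edges = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
         ("a1", "b1"), ("a1", "b2"), ("b1", "b2")]
stats = cartography_stats(edges, group_of)
print(stats["a1"], stats["a2"])
```

Here a1 splits its four links evenly between the two groups, so its participation coefficient is 0.5, while a2 only has links inside its own group and gets 0.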

Guimerà & Amaral (2005) proposed a classification of the nodes of the network based on their corresponding values of the previous statistics. In particular, they proposed a heuristic classification of the nodes, depicted in the following plot.

Image taken from the paper “Functional cartography of complex metabolic networks” by Guimerà and Amaral, 2005.

Guimerà and Amaral (2005) named regions R1-R4 as non-hub regions and R5-R7 as hub regions. Nodes belonging to R1 are labelled as ultra-peripheral nodes, R2 as peripheral nodes, R3 as non-hub connector nodes, R4 as non-hub kinless nodes, R5 as provincial hubs, R6 as connector hubs and R7 as kinless hubs. For more details on this categorisation please see Guimerà and Amaral (2005).
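For illustration, the region lookup can be written as a small function of $z_i$ and $P_i$. The numerical boundaries below are not given in this post. They are the values I recall from Guimerà and Amaral (2005), so please treat them as an assumption and check them against the paper before relying on them.

```python
# Region boundaries as recalled from Guimerà & Amaral (2005): an assumption,
# not taken from this post. Each entry is (upper bound on P, region label).
NON_HUB_BOUNDS = [
    (0.05, "R1: ultra-peripheral node"),
    (0.62, "R2: peripheral node"),
    (0.80, "R3: non-hub connector node"),
    (float("inf"), "R4: non-hub kinless node"),
]
HUB_BOUNDS = [
    (0.30, "R5: provincial hub"),
    (0.75, "R6: connector hub"),
    (float("inf"), "R7: kinless hub"),
]


def cartography_region(z, p, hub_z=2.5):
    """Map a (z, P) pair to one of the seven regions R1-R7."""
    bounds = HUB_BOUNDS if z >= hub_z else NON_HUB_BOUNDS
    for upper, label in bounds:
        if p <= upper:
            return label


print(cartography_region(z=3.0, p=0.5))
print(cartography_region(z=0.1, p=0.0))
```

Changing the partition changes both $z_i$ and $P_i$, so the same node can land in a different region under a different grouping.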

The previous regions give an intuitive classification of network nodes according to their connectivity under a given partition of the nodes. In particular, they give an easy way to differentiate hub nodes from non-hub nodes. However, the classification of the nodes into these seven regions (R1-R7) depends on the initial partition of the nodes.

R. Guimerà, L.A.N. Amaral, Functional cartography of complex metabolic networks, Nature 433 (2005) 895–900

We can model everything, right…?

First, happy new year to all our Blopig fans, and we all hope 2016 will be awesome!

A couple of months ago, I was covering this article by Shalom Rackovsky. The big question that jumps out of the paper is, has modelling reached its limits? Or, in other words, can bioinformatics techniques be used to model every protein? The author argues that protein structures have an inherent level of variability that cannot be fully captured by computational methods; thus, he raises some scepticism on what modelling can achieve. This isn’t entirely news; competitions such as CASP show that there’s still lots to work on in this field. This article takes a very interesting spin when Rackovsky uses a theoretical basis to justify his claim.

For a pair of proteins P and Q, Rackovsky defines their relationship depending on their sequence and structural identity. If P and Q share a high level of sequence identity but have little structural resemblance, P and Q are considered to be a conformational switch. Conversely, if P and Q share a low level of sequence identity but have high structural resemblance, they are considered to be remote homologues.

Case of a conformational switch – two DNAPs with 100% sequence identity but 5.3 Å RMSD.

Haemoglobins are ‘remote homologues’ – despite 19% sequence identity, these two proteins have 1.9 Å RMSD.

From here on comes the complex maths. Rackovsky’s work here (and in prior papers) assumes that there are periodicities in properties of proteins, and thus applies Fourier transforms to compare protein sequences and structures.

In the case of comparing protein sequences, instead of treating sequences as a string of letters, protein sequences are characterised by an N x 10 matrix. N represents the number of amino acids in protein P (or Q), and each amino acid has 10 biophysical properties. The matrix then undergoes Fourier Transformation (FT), and the resulting sine and cosine coefficients for proteins P and Q are used to calculate the Euclidean distance between each other.
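To give a feel for the sequence comparison, here is a minimal sketch using a plain discrete Fourier transform. It is not Rackovsky's actual procedure. Keeping only the first few coefficients and dividing by the sequence length (so that proteins of different length can be compared) are my assumptions, and the toy matrices have two property columns rather than ten.

```python
import cmath
import math


def property_spectrum(profile, n_coeffs):
    """First n_coeffs DFT coefficients of one property profile along the sequence.

    Each coefficient is returned as its real (cosine) and imaginary (sine) part.
    """
    n = len(profile)
    parts = []
    for k in range(n_coeffs):
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(profile)) / n
        parts.extend((c.real, c.imag))
    return parts


def sequence_vector(matrix, n_coeffs=4):
    """Concatenate the spectra of every property column of an N x m matrix."""
    vector = []
    for column in zip(*matrix):
        vector.extend(property_spectrum(column, n_coeffs))
    return vector


def fourier_distance(p_matrix, q_matrix, n_coeffs=4):
    """Euclidean distance between the Fourier coefficients of two proteins."""
    return math.dist(sequence_vector(p_matrix, n_coeffs),
                     sequence_vector(q_matrix, n_coeffs))


# Toy "proteins" with two made-up property columns and different lengths
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
Q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
print(fourier_distance(P, P), fourier_distance(P, Q))
```

The point of the representation is that no alignment is needed: each protein becomes a fixed-length vector regardless of N.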

When comparing structures, proteins are initially truncated into length-L fragments, and the dihedral angle, bond length and bond angle for each fragment are collected into a matrix. The distribution of matrices allows us to project proteins onto a pre-parameterised principal components space. The Euclidean distance between the newly-projected proteins is then used to quantify protein structural similarity.

In both sequence and structure distances, the distances are normalised and centred around 0,0 by calculating the average distance between P and its M-nearest neighbours, and then adjusted by the global average. Effectively, if a protein has an average structural distance, it will tend toward 0,0.

The author uses a dataset of 12000 proteins from the CATH set to generate the following diagram; the Y-axis represents sequence similarity and the X-axis is the structural similarity. Since these axes are scaled to the mean, the closer a point is to 0, the closer it is to the global average sequence or structure distance.

The four quadrants: along the diagonal is a typical linear relationship (greater sequence identity = more structural similarity). The lower-right quadrant represents proteins with LOW sequence similarity yet HIGH structural similarity. In the upper-left quadrant, proteins have LOW structural similarity but HIGH sequence similarity.

Rackovsky argues that, while the remote homologue and conformational switch seem like rare phenomena, they account for approximately 50% of his dataset. Although he does account for the high density of proteins around 0,0, the paper does not clearly address the meaning of these new metrics. In other words, the author does not translate these values to something we’re more familiar with (e.g. RMSD and sequence identity % for structural and sequence distance). Although the whole idea is that his method is supposed to be alignment-free, it’s still difficult to draw relationships to what we already use as the gold standard in traditional protein structure prediction problems. Also, note that the structure distance spans between -0.1 and 0.1 units whereas the sequence distance spans between -0.3 and 0.5. The differences in scale are also not covered – i.e., is a difference of 0.01 units an expected value for protein structure distance, and why are the jumps in protein structure distance so much smaller than jumps in sequence space?

The author makes more interesting observations in the dataset (e.g. α/β mixed proteins are more tolerant to mutations in comparison to α- or β-only proteins) but the observations are not discussed in depth. If α/β-mixed proteins are indeed more resilient to mutations, why is this the case? Conversely, if small mutations change α- or β-only proteins’ structures to make new folds, having any speculation on the underlying mechanism (e.g. maybe α-only proteins are only sensitive to radically different amino acid substitutions, such as ALA->ARG) will only help our prediction methods. Overall I had the impression that the author was a bit too pessimistic about what modelling can achieve. Though we definitely cannot model all proteins that are out there at present, I believe the surge of new sources of data (e.g. cryo-EM structures) will provide an alternative inference route for better prediction methods in the future.

Isoform-resolved protein interaction networks and poker

Every time I talk about protein interaction networks, I put up the nice little figure below. This figure suggests that experiments are done to detect which proteins interact with each other, and lines are drawn between them to create a network. Sounds pretty easy, right? It does, and it probably should, because otherwise you wouldn’t listen past the words “Yeast Two-Hybrid”. However, sometimes it’s important to hear about how it’s not at all easy, how there are compromises made, and how those compromises mean there are some inherent limitations. This post is one of those things it is important to listen to if you care about protein interaction networks (and if you don’t… they’re pretty cool, so don’t be a downer!).

Classical image showing the design of a protein interaction network

Schematic image showing the assembly of a human protein interaction network with ~25 000 interactions

So what’s wrong with that nice figure? Just to deal with the obvious: the colour scheme isn’t great… there’s a red protein and the interactions are also red… and then a yellow protein and red interactions aren’t that easy on the eyes. But you’re not here to judge my choice of colours (I hope… otherwise you’d be better off criticizing the graphs in this beast). You’re here to hear me rant about networks… much more fun ;). So here goes:

1. Not all interactions come from the same experiments; different experiments investigate different “types” of binding.
2. Each experiment has experimental errors associated with it, so the interactions are not all correct.
3. The network is not complete (estimates put the total somewhere between 150k and 600k interactions, depending on what paper you read).
4. People tend to investigate proteins that they know are associated with a disease, so more interactions are known for certain proteins and their neighbours, resulting in an “inspection bias”.
5. That’s a stupid depiction of a network: it’s just a blue blob and you can’t see anything (hence termed “ridiculogram” by Mark Newman).
6. The arrow is wrong: you are not reporting interactions between proteins!

Points 1-5 should more or less make sense as they’re written. Point 6, however, sounds a little cryptic. They’re called PROTEIN interaction networks, so why would they not be interactions between proteins, you might ask… and you’d be right in asking, because it’s really quite annoying. The problem is one of isoforms. The relation of gene to protein is not 1-to-1. After a gene is transcribed into RNA, that piece of RNA is cut up and rearranged into mRNA, which is then translated into a protein. This rearranging process can occur in different ways, so that a single gene may encode different proteins, or as they are called, different protein isoforms. Testing for interactions between isoforms is difficult, so what tends to be done is that people test for interactions for THE ONE isoform of a protein to rule them all (the “reference isoform”) and then report these interactions as interactions for the gene. Sneaky! What you end up seeing are interactions mostly tested between reference isoforms (or any that happened to be in the soup) and reported as interactions for the gene product.

So how much does it matter if we don’t know isoform interaction information? Are there even different interacting partners for different isoforms? Do they have different functions? Well… yes, they can have different interactions and different functions. As to whether they matter… according to Corominas et al that answer is also a resounding yes… or at least in Autism Spectrum Disorder (ASD) it is.

The paper is the result of a 5-year study of isoform interactions and of the effect of knowing them vs not knowing them on predicting candidate ASD genes of interest. And seeing as a bunch of people spent a lot of time on this stuff, it was definitely worth a read. Corominas et al found that in an ASD-related protein interaction network, there is a significant number of interactions that would not be found if only the reference isoform interactions were used. Furthermore, compared to a “high-quality” literature curated network, the generated isoform-resolved ASD network added a lot of interactions. They then went on to show that these additional interactions played an important role in the decision of which genes to prioritize as important “players in Autism”.

Should these results make us renounce the use of currently available non-isoform-resolved protein interaction networks lest we burn in the depths of bioinformatics hell? Well… probably not. While the paper is very interesting and shows the importance of isoforms, it does so in the context of Autism only. The paper itself states that ASD is a brain-related disease, and the brain is an environment known for many isoforms. In many cases, the “dominant isoform” will likely be just that… dominant. Moreover, the results may sound a little stronger than they are. The literature curated network used as the benchmark for claiming that the isoform-resolved network is really important was quoted as being “high-quality”. It is likely that many of the isoform interactions would be included in lower quality networks, but they have simply not been as well-studied as dominant isoforms. Thus, their isoform-resolved network would just confirm lower quality interactions as high-quality ones. That being said, if you want to look at the specific mechanisms causing a phenotype, it is likely that isoform information will be necessary. It really depends on what you want to achieve.

Let’s say you’re playing Texas Hold’em poker and you have two pairs. You’d like to have that full house, but it’s elusive and you’re stuck with the hand you have when your opponent bets high. That’s the situation we were in with protein interaction networks: you know you’re missing something, but you don’t know how bad it is that you’re missing it. This paper addresses part of that problem. We now know that your opponent could have the flush, but possibly only if you’re in Vegas. If you only want to play in a local casino, you’ll likely be fine.

Molecular Diversity and Drug Discovery

For my second short project I have developed Theox, molecular diversity software, to aid the selection of synthetically viable molecules from a subset of diverse molecules. The selection of molecules for synthesis is currently based on synthetic intuition. The developed software indicates whether the selection is an adequate representation of the initial dataset, or whether molecular diversity has been compromised. Theox plots the distribution of diversity indices for 10,000 randomly generated subsets of the same size as the chosen subset. The diversity index of the chosen subset can then be compared to the distributions, to determine whether the molecular diversity of the chosen subset is sufficient. The figure shows the distribution of the Tanimoto diversity indices with the diversity index of the subset of molecules shown in green.
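The comparison against random subsets can be sketched in a few lines. This is not the Theox code. I assume here that the diversity index is the mean pairwise Tanimoto distance, that fingerprints are given as sets of on-bits, and the toy dataset is randomly generated rather than real molecules.

```python
import random
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def diversity_index(fingerprints):
    """Mean pairwise Tanimoto distance (1 - similarity); assumed definition."""
    pairs = list(combinations(fingerprints, 2))
    return sum(1 - tanimoto(a, b) for a, b in pairs) / len(pairs)


def random_subset_distribution(dataset, subset_size, n_subsets=10_000, seed=0):
    """Diversity indices of randomly drawn subsets of the given size."""
    rng = random.Random(seed)
    return [diversity_index(rng.sample(dataset, subset_size))
            for _ in range(n_subsets)]


def fraction_at_or_below(value, distribution):
    """Fraction of random subsets that are no more diverse than the chosen one."""
    return sum(d <= value for d in distribution) / len(distribution)


# Toy dataset: 50 random 64-bit fingerprints with 8 bits set each
toy_rng = random.Random(1)
dataset = [frozenset(toy_rng.sample(range(64), 8)) for _ in range(50)]
chosen = dataset[:5]  # stand-in for the chemist's hand-picked subset

distribution = random_subset_distribution(dataset, len(chosen), n_subsets=1000)
print(fraction_at_or_below(diversity_index(chosen), distribution))
```

A chosen subset that sits far in the lower tail of the random distribution would suggest that diversity has been compromised by the selection.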

A designed conformational shift to control protein binding specificity

Proteins and their binding partners with complementary shapes form complexes. Fischer was onto something when he introduced the “lock and key mechanism” in 1894. For the first time the shape of molecules was considered crucial for molecular recognition. Since then there have been various attempts to improve upon this theory in order to incorporate the structural diversity of proteins and their binding partners.

Koshland proposed the “induced fit” mechanism in 1958, which states that interacting partners undergo local conformational changes upon complex formation to strengthen binding. An additional mechanism, “conformational selection”, was introduced by Monod, Wyman and Changeux, who argued that the conformational change occurs before binding, driven by the inherent conformational heterogeneity of proteins. Given a protein that fluctuates between two states A and B and a substrate C that only interacts with one of these states, the chance of complex formation depends on the probability of our protein being in state A or B. Furthermore, one could imagine a scenario where a protein has multiple binding partners, each binding to a different conformational state. This means that some proteins exist in an equilibrium of different structural states, which determines the prevalence of interactions with different binding partners.
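For a simple two-state picture, the probability of finding the protein in one state follows from the free energy difference between the states via the Boltzmann distribution. The sketch below is a textbook illustration rather than anything taken from the paper. The function name, the kJ/mol units and the 300 K default are my own choices.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol*K)


def fraction_in_state_a(dg_a_minus_b, temperature=300.0):
    """Equilibrium population of state A in a two-state (A/B) system.

    dg_a_minus_b -- free energy of A minus free energy of B, in kJ/mol
    """
    return 1.0 / (1.0 + math.exp(dg_a_minus_b / (R * temperature)))


# Equal free energies give equally visited states
print(fraction_in_state_a(0.0))

# Destabilising state A by a few kJ/mol shifts the equilibrium towards B
for dg in (2.0, 5.0, 10.0):
    print(dg, round(fraction_in_state_a(dg), 3))
```

Because the dependence is exponential, a mutation that changes the free energy difference by only a few kJ/mol is enough to shift the populations substantially.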

Figure 1. The “pincer mode”.

Based on this observation Michielssens et al. used various in-silico methods to manipulate the populations of binding-competent states of ubiquitin in order to change its protein binding behaviour. Ubiquitin is known to take on two equally visited states, closed and open, along the “pincer mode” (the collective motion described by the first PCA eigenvector).

Figure 2. A schematic of the conformational equilibrium of ubiquitin, which can take on a closed or open state. Depending on its conformation it can bind different substrates.

Different binding partners prefer either the closed, open or both states. By introducing point mutations in the core of ubiquitin, away from the binding interface, Michielssens et al. managed to shift the conformational equilibrium between open and closed states, thereby changing binding specificity.

Point mutations were selected according to the following criteria:

⁃ residues must be located in the hydrophobic core
⁃ binding interface must be unchanged by the mutation
⁃ only hydrophobic residues may be introduced (as well as serine/threonine)
⁃ glycine and tryptophan were excluded because of their size

Figure 3. Conformational preference of ubiquitin mutants. ddG_mut = dG_open – dG_closed.

Fast growth thermodynamic integration (FGTI), a method based on molecular dynamics, was used to calculate the relative de-/stabilisation of the open/closed state caused by each mutation. Interestingly, most mutations that caused stabilisation of the open state were concentrated on one residue, Isoleucine 36 (Slide 7).
For the 15 most significant mutations a complete free energy profile was calculated using umbrella sampling.

Figure 4. Free energy profiles for six different ubiquitin mutants, calculated using umbrella sampling simulations. Mutants preferring the closed substate are shown in red, open substate stabilizing mutants are depicted in blue, those without a preference are shown in gray.

Figure 5. Prediction of binding free energy differences between wild-type ubiquitin and different point mutations (ddG_binding = dG_binding,mutant – dG_binding,wild-type).

To further validate that they correctly categorised their mutants into stabilising the open or closed state, six X-ray structures of ubiquitin in complex with a binding partner that prefers either the open or closed state were simulated with each of their mutations. Figure 5 shows the change in binding free energy that is caused by the mutation in compatible, neutral and incompatible complexes (compatible may refer to an “open favouring mutation” (blue) in an open complex (blue) and vice versa).

Figure 6. Comparison of change in binding free energy predicted from the calculated results for ubiquitin and the experimental result.

In their last step a selection of open and closed mutations was introduced into an open complex and the change in binding free energy was compared between experiment (NMR) and their simulations. For this example, their mutants behaved as expected: an increase in binding free energy was observed when the closed mutations were introduced into the open complex, while only subtle changes were seen when the “compatible” open mutations were introduced.

The authors suggest that in the future this computational protocol may be a cornerstone of designing allosteric switches. However, given that this approach requires pre-existing knowledge and is tailored to proteins with well-defined conformational states, it may take some time until we discover its full potential.

Molecular Dynamics of Antibody CDRs

Efficient design of antibodies as therapeutic agents requires understanding of their structure and behavior in solution. I have recently performed molecular dynamics simulations to investigate the flexibility and solution dynamics of complementarity determining regions (CDRs). Eight structures of the Fv region of antibody SPE7 were found in the Protein Data Bank with identical sequences. Twenty-five replicas of 100 ns simulations were performed on the Fv region of one of these structures to investigate whether the CDRs adopted the conformation of one of the other X-ray structures. The simulations showed that the H3 and L3 loops start from one conformation and adopt another experimentally determined conformation.

This confirms the potential of molecular dynamics to be used to investigate antibody binding and flexibility. Further investigation would involve simulating different systems, for example using solution NMR resolved structures, and comparing the conformations deduced here to the canonical forms of CDR loops. Looking forward it is hoped molecular dynamics could be used to predict the bound conformation of an antibody from the unbound structure.

Click here for simulation videos.

Using B factors to assess flexibility

In my work on analysing antibody loops I have reached the point where I became interested in flexibility, more specifically in challenging the somewhat popular belief that these loops, especially the H3 loop, are highly flexible. For this I wanted to use the B/Temperature/Debye-Waller factor, which can be interpreted as a measure of the temperature-dependent vibration of the atoms in the crystal, or in gentler terms the flexibility at a certain position. I was keen to use the backbone atoms, and possibly the Cβ, but unfortunately the B factor shows some biases, as it is used to mask other uncertainties due to poor resolution, low electron density and, as a result, poor modelling. If we take a non-redundant set of loops and split them into resolution shells of 0.2 Å we see how pronounced this bias is (Fig. 1 (a)).

Fig. 1(a) Comparison of average backbone B factors for loops found in structures at increasing resolution. A clear bias can be observed that correlates with the increase in resolution.

Fig. 1(b) Normalization using the average Z-score of the B factor of backbone atoms shows no bias at different resolution shells.

Comparing loops in neighbouring shells is virtually uninformative, and can lead to quite interesting results. In one analysis it came up that loops that are directly present in the binding site of antibodies have a higher average B factor than loops in structures without antigen where the movement is less constrained.

The issue here is that a complex structure (antibody-antigen) is larger, has a poorer resolution, and therefore has more biased B factors. To solve this issue I decided to normalize the B factors using the Z-score within each PDB file, where the mean and the standard deviation are computed from all the backbone atoms of amino acids inside the PDB file. To my knowledge this method was first described by Parthasarathy and Murthy (1997) [1], although I came to the result without reading their paper, the normalization being quite intuitive. Using this measure we can finally compare loops from different structures at different resolutions (Fig. 1 (b)) with each other and we see what is expected: loops found in bound structures are less flexible than loops in unbound structures (Fig. 2). We can also answer our original question: does the H3 loop present an increased flexibility? The answer from Fig. 2 is no, if we compare non-redundant sets of loops from antibodies to general proteins.
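The normalization itself takes only a few lines. The sketch below is a minimal stand-alone version, not the code used for the figures. It reads the fixed-width ATOM records of a PDB file directly, ignores alternate locations and insertion codes, and uses the population standard deviation, all of which are simplifying assumptions.

```python
from statistics import mean, pstdev

BACKBONE = {"N", "CA", "C", "O"}


def backbone_b_factors(pdb_lines):
    """Yield (chain, residue number, B factor) for backbone ATOM records."""
    for line in pdb_lines:
        # Fixed PDB columns: atom name 13-16, chain 22, resSeq 23-26, B factor 61-66
        if line.startswith("ATOM") and line[12:16].strip() in BACKBONE:
            yield line[21], int(line[22:26]), float(line[60:66])


def normalised_b_factors(pdb_lines):
    """Z-score each backbone B factor against all backbone atoms in the file."""
    atoms = list(backbone_b_factors(pdb_lines))
    values = [b for _, _, b in atoms]
    mu, sigma = mean(values), pstdev(values)
    return [(chain, resnum, (b - mu) / sigma) for chain, resnum, b in atoms]


def loop_flexibility(z_scores, chain, first, last):
    """Average normalised B factor over residues first..last of one chain."""
    return mean(z for c, r, z in z_scores if c == chain and first <= r <= last)
```

With a file on disk this could be used as `loop_flexibility(normalised_b_factors(open("antibody.pdb")), "H", 95, 102)`, where the file name, chain and residue range are placeholders rather than values from my analysis.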

Fig. 2 Flexibility comparison using the normalized B factor between a non-redundant set of non-IG like protein loops and different sets of H3 loops: bound to antigen (H3 bound), unbound (H3 unbound), both (H3). For each comparison ten samples with the same number of examples and similar length distribution have been generated and amassed (LMS) to correct for the possibility of length bias induced by the H3 loop, which is known to have a propensity for longer loops than average.

References

[1] Parthasarathy, S. ; Murthy, M. R. N. (1997) Analysis of temperature factor distribution in high-resolution protein structures Protein Science, 6 (12). pp. 2561-2567. ISSN 0961-8368

Protein loops – why do we care?

In my DPhil research, I work on the development of new methods for predicting protein loop structures. But what exactly are loops, and why should we care about their structures?

Many residues in a given protein will form regions of regular structure, in α-helices and β-sheets. The segments of the protein that join these secondary structure elements together, and that do not have easily observable regular patterns in their structure, are referred to as loops. This does not mean, though, that loops are only a minor component of a protein structure – on average, half of the residues in a protein are found in loops [1], and they are typically found on the surface of the protein, which is largely responsible for its shape, dynamics and physicochemical properties [2].

Connecting different secondary structures together is often not the only purpose of a loop – loops can be vitally important to a protein’s function. For example, they are known to play a role in protein-protein interactions, recognition sites, signalling cascades, ligand binding, DNA binding, and enzyme catalysis [3].

As regular readers of the blog are probably aware by now, one of the main areas of research for our group is antibodies. Loops are vital for an antibody’s function, since its ability to bind to an antigen is mainly determined by six hypervariable loops (the complementarity determining regions). The huge diversity in structure displayed by these loops is the key to how antibodies can bind to such different substances. Knowledge of loop structures is therefore extremely useful, enabling predictions to be made about the protein.

Loops involved in protein function: a methyltransferase binding to DNA (top left, PDB 1MHT); the active site of a triosephosphate isomerase enzyme (bottom left, PDB 1NEY); an antibody binding to its antigen (blue, surface representation) via its complementarity determining regions, shown as the coloured loops (centre, PDB 3NPS); the activation loop of a tyrosine kinase has a different conformation in the active (pink) and inactive (blue) forms (top right, PDBs 1IRK and 1IR3); a zinc finger, where the zinc ion is coordinated by the sidechain atoms of a loop (bottom right, PDB 4YH8).

More insertions, deletions and substitutions occur in loops than in the more conserved α-helices and β-sheets [4]. This means that, for a homologous set of proteins, the loop regions are the parts that vary the most between structures. While this often makes the protein’s function possible, as in the case of antibodies, it leads to unaligned regions in a sequence alignment, so standard homology modelling techniques cannot be used. This makes prediction of their structure difficult – it is frequently the loop regions that are the least accurate parts of a protein model.

There are two types of loop modelling algorithm: knowledge-based and ab initio. Knowledge-based methods look for appropriate loop structures in a database of previously observed fragments, while ab initio methods generate possible loop structures without prior knowledge. There is some debate about which approach is best. Knowledge-based methods can be very accurate when the target loop is close in structure to one seen before, but perform poorly when this is not the case; ab initio methods are able to access regions of the conformational space that have not been seen before, but fail to take advantage of any structural data that is available. For this reason, we are currently working on developing a new method that combines aspects of the two approaches, taking advantage of the available structural data whilst still being able to predict novel structures.

[1] L. Regad, J. Martin, G. Nuel and A. Camproux, Mining protein loops using a structural alphabet and statistical exceptionality. BMC Bioinformatics, 2010, 11, 75.

[2] A. Fiser and A. Sali, ModLoop: automated modeling of loops in protein structures. Bioinformatics, 2003, 19, 2500-2501.

[3] J. Espadaler, E. Querol, F. X. Aviles and B. Oliva, Identification of function-associated loop motifs and application to protein function prediction. Bioinformatics, 2006, 22, 2237-2243.

[4] A. R. Panchenko and T. Madej, Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evolutionary Biology, 2005, 5, 10.

Oxford Protein Informatics Group

or "OPIG" to friends