Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

The Ten Commandments of OPIG

In OPIG one must learn, and one must learn FAST! However, sometimes stupidity in OPIG knows no limits (*cough* James *cough* Anne *cough*), so for the newer (and prospective) members of the group, I thought it wise to share some ground rules, a.k.a. The Ten Commandments of OPIG.

Vaguely adhering to these will drastically improve your time in OPIG (see Exhibit A), and let’s face it, none of them are particularly challenging.

  1. No touchy the supervisor.
  2. No touchy other students.
  3. You’re not late unless you’re after Charlotte. Don’t be late.
  4. All prizes are subject to approval by The Party.
  5. Thou shalt not tomate.
  6. Any and all unattended food is fair game.
  7. Meetings (especially the one before yours) will go on as long as they have to.
  8. Finish your DPhil or die.
  9. This is not a democracy.
  10. NO TOUCHY THE SUPERVISOR!

Bonus (and final) rule. If this is your first time at Group Meeting, you have to present (well, at least introduce yourself).

P.s. we’re not that bad, I promise!

Disclaimer: while I’ve categorised this post as “humour”, I take no responsibility for your enjoyment.

My experience with (semi-)automating organic synthesis in the lab

After three years of not touching a single bit of glassware, I have recently donned the white coat and stepped back into the Chemistry lab. I am doing this for my PhD project, to make some of the follow-up compounds that my pipeline suggests. However, this time there is a slight difference – I am doing reactions with the aid of a liquid-handling robot, the Opentrons. This is my first encounter with (semi-)automated synthesis and definitely a very exciting opportunity! (Thanks to my industrial sponsor, Diamond Light Source!)

A picture of the Opentrons machine I have been using to do some organic reactions. Picture taken from https://opentrons.com/robots.

Opentrons is primarily used by biologists, and the company's goal is to make a platform for easily sharing protocols and reproducing each other's work (I think we can all agree how nice this would be!). They provide a very easy-to-use API, which they intend to be accessible to any bench scientist with basic computer skills. From my experience so far, this has been the case, as I found it extremely easy to pick up and write my own protocols for chemical reactions. Here is the command that will: (1) pick up a new pipette tip; (2) transfer a volume from source1 to destination1; (3) drop the pipette tip in the trash; (4) pick up a new pipette tip; (5) transfer a volume from source2 to destination2; (6) drop the pipette tip in the trash.

pipette.transfer(volume, [source1, source2], [destination1, destination2], new_tip='always')
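To make the tip-handling behaviour explicit, here is a pure-Python sketch of the sequence of actions that `new_tip='always'` expands to. This mimics the logic only – it is not the Opentrons API itself, which performs these steps internally on real hardware:

```python
# Sketch of the tip-handling loop behind transfer(..., new_tip='always').
# Illustration only; the real Opentrons API executes these steps on the robot.

def transfer(volume, sources, destinations, new_tip='always'):
    """Return the list of robot actions for a multi-well transfer."""
    actions = []
    if new_tip == 'once':
        actions.append('pick up tip')
    for src, dst in zip(sources, destinations):
        if new_tip == 'always':
            actions.append('pick up tip')
        actions.append(f'aspirate {volume} uL from {src}')
        actions.append(f'dispense {volume} uL to {dst}')
        if new_tip == 'always':
            actions.append('drop tip')
    if new_tip == 'once':
        actions.append('drop tip')
    return actions

for step in transfer(50, ['A1', 'A2'], ['B1', 'B2']):
    print(step)
```

Passing `new_tip='once'` instead would reuse a single tip for both transfers – fine for distributing one stock solution, but a cross-contamination risk when the sources differ.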

But of course not everything is plain sailing – there are many challenges you will encounter when using an automated pipette. The robot is a liquid handler – it cannot handle solids, so any solids need to be pre-weighed and/or made into solution beforehand. Further difficulties lie in the properties of the solvent being handled, for example:

  • Dripping – low boiling point solvents tend to drip more.
  • Viscosity – viscous liquids may not be drawn up accurately: they require longer aspiration times, and if aspiration is too quick then air pockets may be drawn up.

Here is a GIF I made of a dry run I was doing with the robot (sorry for the slight shake, this was recorded on my phone in the lab… See their website for professional footage of the robot!)

My (shaky) footage of a dry run I was performing with the Opentrons.

Measuring correlation

Correlation measures how close two variables are to having a dependence relationship with each other. At first sight it looks rather simple, but there are two main problems:

  1. Apart from the obvious situations (i.e. correlation = 1), it is difficult to say whether two variables are correlated or not (e.g. correlation = 0.7). For instance, would you be able to say whether the variables X and Y in the following two plots are correlated?
  2. There are different measures of correlation that may not agree when comparing different distributions. As an example, which plot shows a higher correlation? The answer depends on how you measure correlation: if you use the Pearson correlation you would pick A, whereas if you choose the Spearman correlation you would pick B.

Here, I will explain some of the different correlation measures you can use:

Pearson product-moment correlation coefficient

  • What does it measure? Only linear dependencies between the variables.
  • How is it obtained? By dividing the covariance of the two variables by the product of their standard deviations (it is defined only if both standard deviations are finite and nonzero): \rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}
  • Properties:
  1. ρ (X,Y) = +1 : perfect direct (increasing) linear relationship (correlation).
  2. ρ (X,Y) = -1 : perfect decreasing (inverse) linear relationship (anticorrelation).
  3. In all other cases, ρ (X,Y) indicates the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated).
  4. Only gives a perfect value when X and Y are related by a linear function.
  • When is it useful? For a linear model with a single independent variable, the coefficient of determination (R squared) is the square of r, Pearson's product-moment coefficient.
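As a quick sanity check of the definition above, here is a minimal pure-Python implementation of Pearson's r (in practice you would normally just call `scipy.stats.pearsonr`):

```python
import math

def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear: ≈ 1.0
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly anti-linear: ≈ -1.0
```

Note that only exactly linear relationships reach ±1; a perfect but non-linear monotone relationship (e.g. y = x³) scores strictly below 1.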


Spearman’s rank correlation coefficient

  • What does it measure? How well the relationship between two variables can be described using a monotonic function (a function that only goes up or only goes down).
  • How is it obtained? As the Pearson correlation between the rank values of the two variables.

r_s = \rho_{\operatorname{rg}_X,\operatorname{rg}_Y} = \frac{\operatorname{cov}(\operatorname{rg}_X,\operatorname{rg}_Y)}{\sigma_{\operatorname{rg}_X}\,\sigma_{\operatorname{rg}_Y}}

If all n ranks are distinct integers, it can be computed using the popular formula:

r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}

where d_i is the difference between the two ranks of each observation.

  • Properties:
  1. rs (X,Y) = +1:  X and Y are related by any increasing monotonic function.
  2. rs (X,Y) = -1:  X and Y are related by any decreasing monotonic function.
  3. The Spearman correlation increases in magnitude as X and Y become closer to being perfect monotone functions of each other.
  • When is it useful? It is appropriate for both continuous and discrete ordinal variables. It can be used to look for non-linear (but monotonic) dependence relationships.
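The rank-difference formula translates directly into code. Here is a small sketch that assumes no tied ranks, which is exactly the condition under which the formula holds (`scipy.stats.spearmanr` handles ties properly):

```python
def spearman(xs, ys):
    """Spearman's r_s via the rank-difference formula (assumes no ties)."""
    n = len(xs)

    def rank(v):
        # 1-based rank of each value; valid only when all values are distinct
        order = sorted(v)
        return [order.index(x) + 1 for x in v]

    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Any increasing monotone relationship scores exactly 1,
# even when Pearson's r would not:
print(spearman([1, 2, 3, 4], [1, 8, 27, 64]))   # 1.0
```

Because only the ranks matter, y = x³ is indistinguishable from y = x here, which is precisely what makes Spearman useful for non-linear monotone dependence.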

Kendall’s tau coefficient

  • What does it measure? The ordinal association between two measured quantities.
  • How is it obtained?

\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2}

A pair of observations (x_i, y_i) and (x_j, y_j) is said to be concordant if the ranks of both elements agree, which happens if x_i - x_j and y_i - y_j have the same sign. If their signs differ, the pair is discordant.

  • Properties:
  1. τ (X,Y) = +1: The agreement between the two rankings is perfect (i.e., the two rankings are the same)
  2. τ (X,Y) = -1: The disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other)
  3. If X and Y are independent, then we would expect the coefficient to be approximately zero.
  • When is it useful? It is appropriate for both continuous and discrete ordinal variables. It can be used to look for non-linear dependence relationships.
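Counting concordant and discordant pairs is straightforward to sketch in pure Python. This is the simple tau-a variant, without corrections for ties (`scipy.stats.kendalltau` implements the tie-corrected tau-b):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / (n(n-1)/2)."""
    n = len(xs)
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2):
        s = (xi - xj) * (yi - yj)   # same sign => positive product
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 8, 27, 64]))   # identical ordering: 1.0
```

The pairwise loop makes the O(n²) cost of the naive computation obvious; production implementations use an O(n log n) merge-sort-based count instead.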

Distance correlation

  • What does it measure? Both linear and nonlinear association between two random variables or random vectors.
  • How is it obtained? By dividing the variables’ distance covariance by the product of their distance standard deviations:

\operatorname{dCor}(X,Y) = \frac{\operatorname{dCov}(X,Y)}{\sqrt{\operatorname{dVar}(X)\,\operatorname{dVar}(Y)}}

The distance covariance is defined as:

\operatorname{dCov}_n^2(X,Y) := \frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n}A_{j,k}\,B_{j,k}

Where:

A_{j,k} := a_{j,k} - \overline{a}_{j\cdot} - \overline{a}_{\cdot k} + \overline{a}_{\cdot\cdot},\qquad B_{j,k} := b_{j,k} - \overline{b}_{j\cdot} - \overline{b}_{\cdot k} + \overline{b}_{\cdot\cdot}

a_{j,k} = \|X_j - X_k\|,\qquad b_{j,k} = \|Y_j - Y_k\|,\qquad j,k = 1,2,\ldots,n

where || ⋅ || denotes Euclidean norm.

  • Properties:
  1. dCor (X,Y) = 0 if and only if the random vectors are independent.
  2. dCor (X,Y) = 1: Perfect dependence between the two distributions.
  3. dCor (X,Y) is defined for X and Y in arbitrary dimension.
  • When is it useful? It is appropriate for finding any kind of dependence relationship between the two variables, even when X and Y have different dimensions.
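The definitions above can be coded up directly for one-dimensional samples. This is a sketch of the (biased) sample estimator built from the doubly-centred distance matrices; the `dcor` Python package provides a fuller, vectorised implementation:

```python
import math

def distance_correlation(xs, ys):
    """Sample distance correlation of two 1-D variables."""
    n = len(xs)

    def centred(v):
        # Pairwise distance matrix, then double-centre it (A_{j,k} above)
        d = [[abs(v[j] - v[k]) for k in range(n)] for j in range(n)]
        row = [sum(d[j]) / n for j in range(n)]
        col = [sum(d[j][k] for j in range(n)) / n for k in range(n)]
        grand = sum(row) / n
        return [[d[j][k] - row[j] - col[k] + grand for k in range(n)]
                for j in range(n)]

    A, B = centred(xs), centred(ys)
    dcov2 = sum(A[j][k] * B[j][k] for j in range(n) for k in range(n)) / n ** 2
    dcov2 = max(dcov2, 0.0)          # guard against tiny negative rounding error
    dvarx = sum(a * a for r in A for a in r) / n ** 2
    dvary = sum(b * b for r in B for b in r) / n ** 2
    return math.sqrt(dcov2 / math.sqrt(dvarx * dvary))

# Picks up the non-linear dependence y = x**2, for which Pearson's r is 0:
xs = [-2, -1, 0, 1, 2]
print(distance_correlation(xs, [x * x for x in xs]))
```

A symmetric quadratic like this is exactly the kind of dependence that Pearson, Spearman, and Kendall all miss (each gives roughly zero), while dCor stays clearly above zero.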

New avenues in antibody engineering

Hi everyone,

In this blog post I would like to review an unusual antibody scaffold that can potentially open a new avenue in antibody engineering. Here, I will discuss a couple of papers that complement each other’s research.

My DPhil is centred on antibody NGS (Ig-seq) data analysis. I always map an antibody sequence to its structure, as the three-dimensional antibody configuration dictates its function – information that cannot be obtained from the nucleotide or amino acid sequence alone. When I work with human Ig-seq data, I bear in mind that antibodies are composed of two pairs of light and heavy chains that tune the antibody towards its cognate antigen. In a recent discovery, Tan et al. found that the antibody repertoires of people who live in malaria-endemic regions have adopted an unusual property to defend the body from the pathogen (1). Several studies followed up on this discovery to further dissect this previously uncharacterised property of antibodies.

Malaria parasites in the erythrocytic stage produce RIFIN proteins that are displayed on the surface of the erythrocytes. The main function of RIFINs is to bind to LAIR1 receptors found on the surface of immune cells. The LAIR1 receptor is inhibitory, so engaging it dampens the immune response. The endogenous ligand of the LAIR1 receptor is collagen, which is found on the surface of the body’s own cells; this ensures that immune cells are not activated against the body itself. Activating LAIR1 receptors is therefore one of the escape mechanisms that the malaria parasite has evolved.

Tan et al. (1) showed that in the evolutionary arms race between human and malaria, our immune system has harnessed the ability of RIFINs to bind LAIR1 against the parasite itself. By single B cell isolation and sequencing, it was discovered that antibodies, the effector molecules of our immune system, can incorporate the LAIR1 protein into their structure. Given our knowledge of antibody engineering, the idea of incorporating a 100 amino acid long protein into an antibody structure is very hard to comprehend. Sequences of these antibodies showed that the LAIR1 insertion was introduced into CDR-H3. Recently, the crystal structure of this construct became available (2), and it revealed that the LAIR1 insertion is indeed structurally functional. All five of the antibody’s canonical CDRs interact with the LAIR1 protein and its linkers to accommodate the insertion. The CDR-L3 forms two disulfide bonds with the linker to orient the LAIR1 protein so that it can interact with RIFINs. It is worth stressing that the LAIR1 sequence differs from the wild type, but the structure is very similar (<0.5 RMSD). The change in sequence and structure is crucial: it prevents the LAIR1-containing antibody from interacting with collagen, so that it interacts only with RIFINs.

Pieper et al. (3) interrogated the modalities of LAIR1 insertion into antibody structures, using single-cell sequencing as well as NGS of the antibody switch region. It turns out that human antibodies can accommodate two types of insertion modality and can form camelid-like antibodies. The insertion of LAIR1 can happen in CDR-H3, leading to the loss of antibody binding to its cognate antigen. Another modality is the incorporation of the LAIR1 protein into the switch region of the antibody. This kind of insertion does not interfere with the binding properties of the Fv domain, which leads to the creation of bi-specific antibodies. The last finding was the insertion of LAIR1 into an antibody structure in which the D and J genes, most of the V gene, and the light chain were deleted. The resultant scaffold is structurally viable and possesses only the heavy chain – evidence that human antibodies can also form camelid-like antibodies. Interestingly, these insertions into the switch region are not exclusive to people who live in malaria-endemic regions: in NGS of the switch region from European donors, around 1 in 1000 antibody sequences had an insertion of varying length. These insertions are introduced from different chromosomes, from both intergenic and genic regions.

To sum up, it is very intriguing that our immune system has evolved to create camelid-like and bi-specific antibodies. It would be very informative to crystallise these structures to see how they accommodate the insertion of LAIR1. Current antibody NGS data analysis concentrates primarily on the heavy chain due to sequencing technology limitations. It would be invaluable if we could sequence the entire heavy chain as well as the adjacent switch region, to see how our immune system matures and activates against pathogens.


  1. Tan J, Pieper K, Piccoli L, Abdi A, Foglierini M, Geiger R, Maria Tully C, Jarrossay D, Maina Ndungu F, Wambua J, et al. A LAIR1 insertion generates broadly reactive antibodies against malaria variant antigens. Nature (2016) 529:105–109. doi:10.1038/nature16450
  2. Hsieh FL, Higgins MK. The structure of a LAIR1-containing human antibody reveals a novel mechanism of antigen recognition. Elife (2017) 6: doi:10.7554/eLife.27311
  3. Pieper K, Tan J, Piccoli L, Foglierini M, Barbieri S, Chen Y, Silacci-Fregni C, Wolf T, Jarrossay D, Anderle M, et al. Public antibodies to malaria antigens generated by two LAIR1 insertion modalities. Nature (2017) 548:597–601. doi:10.1038/nature23670


Helpful resources for people studying therapeutic antibodies

My work within OPIG involves studying therapeutic antibodies. It can be tough to find information about these commercial molecules, often known by unintelligible developmental names until the later stages of clinical trials. Their structures are frequently absent, as one might expect, but even their sequences are sometimes a nightmare to get hold of! Below is a list of resources that I have found particularly helpful.

IDENTITIES OF RELEVANT ANTIBODIES

1. Wikipedia (don’t judge!) is an extremely helpful resource to get started. It has the following lists:

(a) A list of FDA-approved therapeutic monoclonal antibody therapies
(b) A more general list of therapeutic, diagnostic and preventive monoclonal antibodies (includes some things that have been withdrawn)

2. The Antibody Society has lists of FDA/EU-approved antibodies and antibodies to watch on their website. NB: this is only available to members of the society (free for students and other concessions; standard membership is $100pa).

3. The journal ‘mAbs’ also has a series of ‘Antibodies to Watch in [Year]’ papers. Here are the ones for 2016, 2017 and 2018.

SEQUENCES

4. 137 clinical-stage (post-phase I) mAb sequences can be found in the SI of this paper by Jain et al.

5. A slightly outdated (last updated Nov 2016), but still extremely useful, resource of antibody sequences is this FASTA list, compiled by Dr Martin’s group at UCL.

SEQUENCES & STRUCTURES

6. The IMGT monoclonal antibody database (mAb-DB) has been possibly the most helpful resource. This includes 798 entries of both therapeutics and non-therapeutics, so it’s helpful to get a list of the antibodies you are interested in first. You can search it with a wide range of parameters, including antibody name. A typical antibody result will include its mAb-DB ID, INN details, common & developmental names, species, receptor type and isotype, sequence (via the “IMGT/2Dstructure-DB” link), target, clinical trials details and – if available – the 3D structure (via the “IMGT/3Dstructure-DB” link).

7. SAbDab has a continually-updated section for all therapeutic antibody structures deposited in the PDB.

CURRENT STATUS OF THE THERAPEUTIC

8. Search the therapeutic name on AdisInsight or Pharmacodia to see its current clinical trial status, and whether or not it has been withdrawn.

Biophysical Society 62nd Annual Meeting

In February I was very fortunate to attend the Biophysical Society 62nd Annual Meeting, which was held in San Francisco – my first real conference and my first trip to North America. Despite arriving with the flu, I had a great time! The conference took place over five days, during which there were manageable 15-minute talks covering a huge range of Biophysics-related topics, and a few thousand more posters on display (including mine). With almost 6,500 attendees, it was also large enough to slip across the road to the excellent SF Museum of Modern Art without anyone noticing.

The best presentation of the conference was, of course, Saulo’s talk on integrating biological folding features into protein structure prediction [1]. Aside from that, here are a few more of my favourites:

Folding proteins from one end to the other
Micayla A. Bowman, Patricia L. Clark [2]

Here in the COFFEE (COtranslational Folding Family of Expert Enthusiasts) office, we love to talk about the vectorial nature of cotranslational folding and how it contributes to the efficiency of protein folding in vivo. Micayla Bowman and Patricia Clark have created a novel technique that will allow the effects of this vectorial folding to be investigated specifically in vitro.

The Clp complex grabs, unfolds and degrades proteins (diagram from [3]). ClpX, the translocase unit of this complex, was used to recapitulate vectorial protein refolding in vitro for the first time.

ClpX is a AAA+ molecular motor that grabs proteins and translocates them through its pore. In vivo, its role is to denature substrates and feed them to an associated protease (ClpP) [3]. Bowman & Clark have used protein tags to initiate translocation of the target protein through ClpX, resulting in either N-to-C or C-to-N vectorial refolding.

The YKB construct used to demonstrate the vectorial folding mediated by ClpX (diagram from [4]).

They demonstrate the effect using YKB, a construct with two mutually exclusive native states: YK-B (fluoresces yellow) and Y-KB (fluoresces blue) [4]. In vitro refolding results in an equal proportion of yellow and blue states. Cotranslational folding, which proceeds in the N-C direction, biases towards the yellow (YK-B) state. C-N refolding in the presence of ClpX and ATP biases towards the blue (Y-KB) state. With this neat assay, they demonstrate that ClpX can mediate vectorial folding in vitro, and they plan to use the assay to investigate its effect on protein folding pathways and yields.

An ambiguous view of protein architecture
Guillaume Postic, Charlotte Perin, Yassine Ghouzam, Jean-Christope Gelly [Poster abstract: 5, Paper: 6]

This work addresses the ambiguity of domain definition by assigning multiple possible domain boundaries to protein structures. Their automated method, SWORD (Swift and Optimised Recognition of Domains), performs protein partitioning via the hierarchical clustering of protein units (PUs) [7], which are smaller than domains and larger than secondary structures. The structure is first decomposed into protein units, which are then merged depending on the resulting “separation criterion” (relative contact probabilities) and “compactness” (contact density).

Their method is able to reproduce the multiple conflicting definitions that often exist between domain databases such as SCOP and CATH. Additionally, they present a number of cases for which the alternative domain definitions have interesting implications, such as highlighting early folding regions or functional subdomains within “single-domain” structures.

Alternative SWORD domain delineations identify (R) an ultrafast folding domain and (S,T) stable autonomous folding regions within proteins designated single-domain by other methods [6]

Dual function of the trigger factor chaperone in nascent protein folding
Kaixian Liu, Kevin Maciuba, Christian M. Kaiser [8]

The authors of this work used optical tweezers to study the cotranslational folding of the first two domains of the five-domain protein elongation factor G.

In agreement with a number of other presentations at the conference, they report that interactions with the ribosome surface during the early stages of translation slow folding by stabilising disordered states, preventing both native and misfolded conformations. They found that the N-terminal domain (G domain) folds independently, while the subsequent folding of the second domain (Domain II) requires the presence of the folded G domain. Furthermore, while partially extruded, unfolded Domain II destabilises the native G domain conformation and leads to misfolding. This is prevented in the presence of the chaperone Trigger factor, which protects the G domain from unproductive interactions and unfolding by stabilising the native conformation. This work demonstrates interesting mechanisms by which Trigger factor and the ribosome can influence the cotranslational folding pathway.

Optical tweezers are used to interrogate the folding pathway of a protein during stalled cotranslational folding. Mechanical force applied to the ribosome and the N-terminal of the nascent chain causes unfolding events, which can be identified as sudden increases in the extension of the chain. (Figure from [9])

Predicting protein contact maps directly from primary sequence without the need for homologs
Thrasyvoulos Karydis, Joseph M. Jacobson [10]

The prediction of protein contacts from primary sequence is an enormously powerful tool, particularly for predicting protein structures. A major limitation is that current methods using coevolution inference require a large multiple sequence alignment, which is not possible for targets without many known homologous sequences.

In this talk, Thrasyvoulos Karydis presented CoMET (Convolutional Motif Embeddings Tool), a tool to predict protein contact maps without a multiple sequence alignment or coevolution data. They extract structural and sequence motifs from known sequence-structure pairs, and use a Deep Convolutional Neural Network to associate sequence and structure motif embeddings. The method was trained on 137,000 sequence-structure pairs with a maximum of 256 residues, and is able to recreate contact map patterns with low resolution from primary sequence alone. There is no paper on this yet, but we’ll be looking out for it!


1. de Oliveira, S.H. and Deane, C.M., 2018. Exploring Folding Features in Protein Structure Prediction. Biophysical Journal, 114(3), p.36a.
2. Bowman, M.A. and Clark, P.L., 2018. Folding Proteins From One End to the Other. Biophysical Journal, 114(3), p.200a.
3. Baker, T.A. and Sauer, R.T., 2012. ClpXP, an ATP-powered unfolding and protein-degradation machine. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research, 1823(1), pp.15-28.
4. Sander, I.M., Chaney, J.L. and Clark, P.L., 2014. Expanding Anfinsen’s principle: contributions of synonymous codon selection to rational protein design. Journal of the American Chemical Society, 136(3), pp.858-861.
5. Postic, G., Périn, C., Ghouzam, Y. and Gelly, J.C., 2018. An Ambiguous View of Protein Architecture. Biophysical Journal, 114(3), p.46a.
6. Postic, G., Ghouzam, Y., Chebrek, R. and Gelly, J.C., 2017. An ambiguity principle for assigning protein structural domains. Science advances, 3(1), p.e1600552.
7. Gelly, J.C. and de Brevern, A.G., 2010. Protein Peeling 3D: new tools for analyzing protein structures. Bioinformatics, 27(1), pp.132-133.
8. Liu, K., Maciuba, K. and Kaiser, C.M., 2018. Dual Function of the Trigger Factor Chaperone in Nascent Protein Folding. Biophysical Journal, 114(3), p.552a.
9. Liu, K., Rehfus, J.E., Mattson, E. and Kaiser, C., 2017. The ribosome destabilizes native and non‐native structures in a nascent multi‐domain protein. Protein Science.
10. Karydis, T. and Jacobson, J.M., 2018. Predicting Protein Contact Maps Directly from Primary Sequence without the Need for Homologs. Biophysical Journal, 114(3), p.36a.

Dealing with indexes when processing co-evolution signals (or how to navigate through “sequence hell”)

Co-evolution techniques provide a powerful way to extract structural information from the wealth of protein sequence data that we now have available. These techniques are predicated upon the notion that residues that share spatial proximity in a protein structure will mutate in a correlated fashion (co-evolve). This co-evolution signal can be inferred from a multiple sequence alignment, which tells us a bit about the evolutionary history of a particular protein family. If you want to have a better gauge at the power of co-evolution, you can refer to some of our previous posts (post1, post2).

This is more of a practical post, where I hope to illustrate an indexing problem (and how to circumvent it) that one commonly encounters when dealing with co-evolution signals.

Most of the co-evolution tools available today output pairs of residues (i,j) that were predicted to be co-evolving from a multiple sequence alignment. One of the main applications of these techniques is to predict protein contacts, that is, pairs of residues that are within a predetermined distance (quite often 8Å). Say you want to compare the precision of different co-evolution methods for a particular test set. Your test set would consist of a number of proteins for which the structure is known and for which sufficient sequence information is available for the contact prediction to be carried out. Great!

So you start with your target sequences, generate a number of contact predictions of the form (i,j) for each sequence and, for each pair, you check if the ith and jth residues are in contact (say, less than 8Å apart) on the corresponding known protein structure. If you actually carry out this test, you will obtain appalling precision for a large number of test cases. This is due to an index disparity that a friend of mine quite aptly described as “sequence hell”.

This indexing disparity occurs because there is a mismatch between the protein sequence that was used to produce the contact predictions and the sequence of residues that are represented in the protein structure. Ask a crystallographer friend if you have one, and you will find that in the process of resolving a protein’s structure experimentally, there are many reasons why residues may be missing from the final structure. What’s more, there are even cases where residues had to be added to enable protein expression and/or crystallisation. This means that the protein sequence (represented by a FASTA file) frequently has more (and sometimes fewer) residues than the protein structure (represented by a PDB file). So if the ith and jth residues in your sequence were predicted to be in contact, they need not correspond to the ith and jth residues in order of appearance in your protein structure. What do we do now?

A true believer in the purity and innocence of the world would assume that the SEQRES entries in your PDB file would come to the rescue. SEQRES describes the sequence of residues of the construct used to solve a particular structure. This would be a great way of mitigating the effects of added or altered residues, and would be a good candidate for validating whether your predicted contacts are present in the structure. It does, however, have one limitation: SEQRES also describes residues whose coordinates are missing from the PDB. This means that if you process the PDB sequentially and some residues could not be resolved, the ith residue to appear in the atom coordinates could differ from the ith residue in the SEQRES.

An even more innocent person, shielded from all the ugliness of the universe, would simply hope that the indexing in the PDB is correct, i.e. that one can use the residue indices presented in the “6th column” of the ATOM entries and that these would match perfectly the (i,j) pairs you obtained using your protein sequence. While, in theory, I believe this should be the case, in my experience this indexing is often incorrect and, more often than not, will lead to errors when validating protein contacts.

My solution to the indexing problem is to parse the PDB sequentially and extract the sequence of all the residues for which coordinates are actually present. To my knowledge, this is the only tried-and-tested way of obtaining this information. If you do that, you will be armed with a sequence and indexing that correctly represent the indexing of your PDB. From now on, I will refer to these as the PDB-sequence and PDB-sequence indexing.
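To make this concrete, here is a minimal sketch of extracting the PDB-sequence by reading ATOM records sequentially. The column positions follow the fixed-width PDB format; the example records are made up for illustration, and a real parser would also handle altlocs, insertion codes, multiple models, and non-standard residues:

```python
# Sketch: extract the "PDB-sequence" (residues with coordinates) from ATOM
# records. Illustration only: assumes standard residues, a single model,
# and ignores altlocs and insertion codes.

THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}

def pdb_sequence(pdb_lines, chain='A'):
    """Return (sequence, author_numbering) for residues that have coordinates."""
    seq, numbering, seen = [], [], set()
    for line in pdb_lines:
        if not line.startswith('ATOM') or line[21] != chain:
            continue
        resnum = int(line[22:26])          # author residue number
        if resnum in seen:                 # one entry per residue
            continue
        seen.add(resnum)
        seq.append(THREE_TO_ONE.get(line[17:20].strip(), 'X'))
        numbering.append(resnum)
    return ''.join(seq), numbering

# Toy records: residue 3 has no coordinates, so it never appears.
atoms = [
    "ATOM      1  CA  MET A   1      0.000   0.000   0.000  1.00  0.00",
    "ATOM      2  CA  GLY A   2      1.000   0.000   0.000  1.00  0.00",
    "ATOM      3  CA  ALA A   4      2.000   0.000   0.000  1.00  0.00",
]
print(pdb_sequence(atoms))
```

The author numbering returned alongside the sequence is exactly what you need later, when you want to report a validated contact in terms of the numbering your crystallographer colleagues use.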

All that is left is to find a correspondence (a mapping) between the sequence you used for the contact prediction and the PDB-sequence. I do that by performing a standard (global) sequence alignment using the Needleman–Wunsch algorithm. Once in possession of such an alignment, the indexes (i,j) of your original sequence can be matched to adjusted indexes (i',j') on your PDB-sequence indexing. In short, you extracted a sequential list of residues as they appeared on the PDB, aligned these to the original protein sequence, and created a new set of residue pairings of the form (i',j') which are representative of the indexing in PDB-sequence space. That means that the i’th residue to appear on the PDB was predicted to be in contact with the j’th residue to appear.
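The mapping step can be sketched in a self-contained way with a basic Needleman–Wunsch implementation. The scoring parameters below are arbitrary; a real pipeline would use a proper substitution matrix, and the toy sequences are invented for illustration:

```python
# Sketch: map indices from the full (prediction) sequence onto the
# PDB-sequence via a simple Needleman-Wunsch global alignment.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns the two aligned strings (with '-' gaps)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:          # trace back one optimal path
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

def map_indices(full_seq, pdb_seq):
    """0-based index in full_seq -> index in pdb_seq (absent if unresolved)."""
    aln_full, aln_pdb = needleman_wunsch(full_seq, pdb_seq)
    mapping, i, j = {}, 0, 0
    for ca, cb in zip(aln_full, aln_pdb):
        if ca != '-' and cb != '-':
            mapping[i] = j
        i += ca != '-'
        j += cb != '-'
    return mapping

# Toy example: two residues (DE) are missing from the structure.
mapping = map_indices('MKTAYDEFGH', 'MKTAYFGH')
```

With this mapping in hand, a predicted pair (i,j) is validated only if both i and j map to resolved residues; pairs that fall in the gap are exactly the disregarded cases discussed below.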

The problem becomes a little more interesting when you hope to validate the contact predictions for other proteins with known structure in the same protein family. A more robust approach is to use the sequence alignment that is created as part of the co-evolution prediction as your basis. You then identify the sequence that best represents the PDB-sequence of your homologous protein by performing N global sequence alignments (where N is the number of sequences in your MSA), one per entry of the MSA. The highest scoring alignment can then be used to map the indexing. This approach is robust enough that if your homologous PDB-sequence of interest was not present in the original MSA for whatever reason, you should still get a sensible mapping at the end (all limitations of sequence alignment considered).

One final consideration should be brought to the reader’s attention. What happens if, using the sequence as a starting point, one obtains one or more (i,j) pairs where either i or j is not resolved/present in the protein structure? For validation purposes, these pairs are often disregarded. Yet what does this co-evolutionary signal tell us about the missing residues in the structure? Are they disordered/flexible? Could co-evolution help us identify low-occupancy conformations?

I’ll leave the reader with these questions to digest. I hope this post proves helpful to those braving the seas of “sequence hell” in the near future.

Teaching Network Science to High School Students

In recent years, a lot of effort has gone into outreach events, in particular for science and mathematics. Here, I am going to describe a summer course on network science which I organised and taught together with Benjamin F. Maier from the Humboldt University of Berlin.

The course was part of an established German summer school called Deutsche Schülerakademie (German Pupils Academy), an extracurricular event for highly motivated pupils. It lasts sixteen days, and the participants join one of six courses covering the whole range of academic disciplines, from philosophy through music to science.

Our course was titled Netzwerke und Komplexe Systeme (Networks and Complex Systems). Rather than going into depth in one particular area, we covered a broad selection of topics, as we wanted to give the students an overview and an idea of how different disciplines approach complex phenomena. We discussed pure mathematics topics such as the colouring of graphs, algorithmic problems such as the travelling salesman problem, social network analysis, computational neuroscience, dynamical systems, and fractals.

A network of the former monastery in Rossleben, where the summer school was held. The students created the network themselves. To parallelise the task they split up into four groups, each covering one level of the building. They then used this network to simulate the spread of a contagious disease, starting at the biological lab (A35, in red).

A couple of thoughts on what went well and which parts might need improvement for future events of this kind:

  • We sent out a questionnaire beforehand, asking the pupils questions such as “Do you know what a vector is?” as well as about their motivation for joining the course. This was very helpful for getting a rough idea of their knowledge level.
  • We gave them some material to read before the course. In retrospect, it would probably have been better to also give them some problems to solve, so that the learning outcome would be clearer and more effective.
  • The students gave presentations on topics we chose for them based on their answers to the questionnaire. The presentations were good, but many students overran the allocated time because they were very enthusiastic about their topics.
  • The students were also enthusiastic about the programming exercises, for which we used Python and the NetworkX library. One challenge was the heterogeneity in programming experience, which made it necessary to split the students into two groups, beginners and advanced.
  • In contrast to courses covering similar topics at university level, the students did not have the mathematical background necessary for the more complicated aspects of network science. Accordingly, it is better to choose fewer of these and to allocate time beforehand to introduce the mathematical methods, for example eigenvectors or differential equations.
  • The students very much liked hands-on exercises, for example the creation of random networks with different connection probabilities with the help of dice, or the creation of a network from the floor plan of the building in which the summer school was held, as shown in the figure.

It was great fun to introduce the students to the topic of network science, and I can strongly recommend that others organise similar outreach events! You can find some of our teaching materials, including the worksheets and programming exercises, in the original German and a translated English version online. A paper describing our endeavours is under review.

Four Erdős–Rényi random graphs as generated by the participants by rolling dice. A twenty-sided die was used for the probabilities p = 1/20 and p = 1/10, and a six-sided die for p = 1/6 and p = 1/3. This fun exercise allows the discussion of degree distributions, the size of the largest connected component, and similar topics for ER random graphs.
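The dice exercise maps directly onto the G(n, p) model: each pair of nodes gets an edge independently with probability p. A small self-contained Python sketch (the function names are mine, not from the course materials) that generates such graphs and measures the largest connected component:

```python
import random
from itertools import combinations

def er_random_graph(n, p, rng):
    """G(n, p): include each of the n*(n-1)/2 possible edges
    independently with probability p -- exactly what rolling a
    die for every pair of nodes does."""
    return [(u, v) for u, v in combinations(range(n), 2)
            if rng.random() < p]

def largest_component(n, edges):
    """Size of the largest connected component (simple BFS/DFS)."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], {start}
        while stack:
            node = stack.pop()
            for nb in adj[node] - comp:
                comp.add(nb)
                stack.append(nb)
        seen |= comp
        best = max(best, len(comp))
    return best

rng = random.Random(42)
for p in (1/20, 1/10, 1/6, 1/3):
    edges = er_random_graph(20, p, rng)
    print(f"p={p:.3f}: {len(edges)} edges, "
          f"largest component {largest_component(20, edges)}")
```

In the course itself we used NetworkX (`nx.gnp_random_graph` does the same job); the hand-rolled version above just makes the connection to the dice explicit.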

Journal Club: Large-scale structure prediction by improved contact predictions and model quality assessment.

With the advent of statistical techniques to infer protein contacts from multiple sequence alignments (which you can read more about here), accurate protein structure prediction in the absence of a template has become possible. Taking advantage of this fact, there have been efforts to brave the sea of protein families for which no structure is known (about 8,500 – over 50% of known protein families) in an attempt to predict their topology. This is particularly exciting given that protein structure prediction has been an open problem in biology for over 50 years and, for the first time, the community is able to perform large-scale predictions and have confidence that at least some of those predictions are correct.

Based on these trends, last group meeting I presented a paper entitled “Large-scale structure prediction by improved contact predictions and model quality assessment”. This paper is the culmination of years of work, making use of a large number of computational tools developed by the Elofsson Lab at Stockholm University. With this blog post, I hope to offer some insights as to the innovative findings reported in their paper.

Let me begin by describing their structure prediction pipeline, PconsFold2. Their method for large-scale structure prediction can be broken down into three components: contact prediction, model generation and model quality assessment. As the very name of their article suggests, most of the innovation of the paper stems from improvements in contact prediction and the quality assessment protocols used, whereas for their model generation routine, they opted to sacrifice some quality in favour of speed. I will try and dissect each of these components over the next paragraphs.

Contact prediction is the process by which residues that share spatial proximity in a protein’s structure are inferred from multiple sequence alignments via co-evolution. I will not go into the details of how these protocols work, as they have been discussed in more detail here and here. The contact predictor used in PconsFold2 is PconsC3, another product of the Elofsson Lab. There was some weirdness with the referencing of PconsC3 in the PconsFold2 article, but after a quick Google search I was able to retrieve the article describing PconsC3, and it was worth a read. Other than showcasing PconsC3’s state-of-the-art contact prediction capabilities, the original PconsC3 paper also provides figures for the number of protein families for which accurate contact prediction is possible (over 5,000 of the ~8,500 protein families in Pfam without a member of known structure). I found that the PconsC3 article reads like a prequel to the paper I presented. The bottom line here is that PconsC3 is a reliable tool for predicting contacts from multiple sequence alignments and a sensible choice for the PconsFold2 pipeline.

Another aspect of contact prediction that the authors explore is the idea that the precision of contact prediction depends on the quality of the underlying multiple sequence alignment (MSA). They provide a comparison of the positive predictive value (PPV) of PconsC3 using different MSAs on a test set of 626 protein domains from Pfam. This is the first time I have encountered such a comparison, and it highlights the impact the MSA has on the quality of the resulting contact predictions. In the PconsFold2 pipeline, the authors use a consensus approach: they identify the consensus of four predicted contact maps, each based on a different alignment. Alignments were generated using Jackhmmer and HHblits at E-value cutoffs of 1 and 10^-4.
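The paper does not spell out the exact consensus rule in the passage above, but the idea can be sketched as follows. This is a hypothetical illustration, not the authors’ code: scores are averaged over the four maps, and a pair must be supported by at least two of them to survive.

```python
from collections import defaultdict

def consensus_contacts(maps, min_support=2):
    """Combine several predicted contact maps (each a dict mapping
    an (i, j) residue pair to a confidence score) by averaging the
    scores and keeping only pairs supported by at least
    `min_support` of the maps."""
    scores = defaultdict(float)
    support = defaultdict(int)
    for cmap in maps:
        for pair, score in cmap.items():
            scores[pair] += score
            support[pair] += 1
    return {pair: scores[pair] / len(maps)
            for pair in scores if support[pair] >= min_support}

# Toy example with four single-alignment predictions.
m1 = {(1, 8): 0.9, (2, 9): 0.4}
m2 = {(1, 8): 0.8}
m3 = {(1, 8): 0.7, (3, 10): 0.6}
m4 = {(1, 8): 0.6}
print(consensus_contacts([m1, m2, m3, m4]))
```

The appeal of a consensus is that contacts supported by several independently built alignments are less likely to be artefacts of any single MSA.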

Now, moving on to the model generation routine. PconsFold2 makes use of CONFOLD to perform model generation. CONFOLD, in turn, uses the simulated annealing routine of the Crystallography and NMR System (CNS) to produce models based on spatial and geometric constraints. To derive those constraints, the predicted secondary structure and the top 2.5L predicted contacts are given as input. The authors note that the refinement stage of CONFOLD is omitted, a simplification I assume was adopted to save computational time. The article also acknowledges that models generated by CONFOLD are likely to be less accurate than those produced by Rosetta, yet this compromise was made to keep the large-scale comparison feasible in terms of resources.
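Selecting the top 2.5L contacts is straightforward to sketch. The snippet below assumes contact predictions come as a dict mapping (i, j) pairs to scores; the helper name is made up for illustration:

```python
def top_xl_contacts(contacts, seq_len, x=2.5):
    """Keep the x*L highest-scoring predicted contacts (here the
    top 2.5L, as used as CONFOLD input), where L is the sequence
    length and `contacts` maps (i, j) -> predicted score."""
    ranked = sorted(contacts.items(), key=lambda kv: kv[1], reverse=True)
    return [pair for pair, _ in ranked[:int(x * seq_len)]]
```

The multiplier x trades off constraint coverage against the risk of feeding false-positive contacts into the folding stage.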

One particular issue that we often discuss when performing structure prediction is the number of models that should be produced for a particular target. The authors performed a test to assess how many decoys should be produced and, albeit simplistic in their formulation, their results suggest that 50 models per target should be sufficient. Increasing this number further did not lead to improvements in the average quality of the best models produced for their test set of 626 proteins.

After producing 50 models using CONFOLD, the final step in the PconsFold2 protocol is to select the best possible model from this ensemble. Here, the authors present a novel method, PcombC, for ranking models. PcombC combines the clustering-based method Pcons, the single-model deep learning method ProQ3D, and the proportion of predicted contacts that are present in the model. These three scores are combined linearly, with weights optimised via a parameter sweep. One of my reservations about this paper is that little detail is given regarding the data set used to perform this training. It is unclear from the methods section whether the parameter sweep was trained on the same 626-protein test set used throughout the manuscript. Given that no other data set (with known structures) is ever introduced, this scenario seems likely. Therefore, all the classification results obtained by PcombC, and all the reported TM-scores of the top-ranked models, should be interpreted with care, since performance on a training set tends to overestimate performance on unseen data.

Recapitulating the PconsFold2 pipeline:

  • Step 1: Generate four multiple sequence alignments using HHblits and Jackhmmer.
  • Step 2: Generate four predicted contact maps using PconsC3.
  • Step 3: Use CONFOLD to produce 50 models using a consensus of the contact maps from step 2.
  • Step 4: Use PcombC to rank the models based on a linear combination of the Pcons and ProQ3D scores and the proportion of predicted contacts that are present in the model.
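The ranking step (step 4) can be illustrated with a toy linear combination. The weights below are placeholders of my own choosing, whereas the paper optimises its weights via a parameter sweep:

```python
def pcombc_score(pcons, proq3d, contact_frac, weights=(0.4, 0.4, 0.2)):
    """Toy PcombC-style score: a weighted linear combination of
    the Pcons score, the ProQ3D score, and the fraction of
    predicted contacts satisfied by the model. The weights here
    are illustrative placeholders, not the fitted values."""
    w1, w2, w3 = weights
    return w1 * pcons + w2 * proq3d + w3 * contact_frac

def rank_models(models):
    """Rank (name, pcons, proq3d, contact_frac) tuples from best
    to worst by the combined score."""
    return sorted(models,
                  key=lambda m: pcombc_score(m[1], m[2], m[3]),
                  reverse=True)
```

A linear combination like this is trivially cheap to evaluate over 50 decoys per target, which matters at the scale of thousands of protein families.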

So, how well does PconsFold2 perform? The conclusion is that it depends on the quality of the contact predictions. For the protein families where abundant sequence information is available, PconsFold2 produces a correct model (TM-score > 0.5) in 51% of cases. This is great news. First, because we know beforehand which cases have abundant sequence information. Second, because this comprises a large number of protein families of unknown structure. As the number of effective sequences (a common way to assess the amount of information available in an MSA) decreases, the proportion of families for which a correct model is generated also decreases, which restricts the applicability of the method to protein families with abundant sequence information. Nonetheless, given that protein sequence databases are growing exponentially, the number of cases where protein structure prediction succeeds is likely to increase over the next few years.

One interesting detail that I was curious about was the length distribution of the cases where modelling was successful. Can we detect the cases for which good models were produced simply by looking at a combination of length and number of effective sequences? The authors never address this question, and I think it would provide some nice insights as to which protein features are correlated to modelling success.

We are still left with one final problem to solve: how do we separate the cases for which we have a correct model from the ones where modelling has failed? This is what the authors address with the last two subsections of their Results. In the first of these sections, the authors compare four ways of ranking decoys: PcombC, Pcons, ProQ3D, and the CNS contact score. They report that, for the test set of 626 proteins, PcombC obtains the highest Pearson’s Correlation Coefficient (PCC) between the predicted and observed TM-Score of the highest ranking models. As mentioned before, this measure could be overestimated if PcombC was, indeed, trained on this test set. Reported PCCs are as follows: PcombC = 0.79, Pcons = 0.73, ProQ3D = 0.67, and CNS-contact = -0.56.
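For reference, the Pearson correlation coefficient used in this comparison can be computed with a plain-Python sketch like the following (here between predicted quality scores and observed TM-scores):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    lists, e.g. predicted QA scores vs observed TM-scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In practice one would reach for `scipy.stats.pearsonr`, but the explicit formula makes clear what a PCC of 0.79 versus -0.56 is actually measuring.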

In their final analysis, the authors compare the ability of the different quality assessment (QA) scores to discern between correct and incorrect models. To do this, they consider only the top-ranked model for each target according to each QA score. They vary the false positive rate and note the number of true positives they are able to recall. At a 10% false positive rate, PcombC is able to recall about 50% of the correct models produced for the test set. This is another piece of good news. The bottom line is: if sufficient sequence information is available, PconsFold2 can generate a correct model 51% of the time. Furthermore, it can detect 50% of those cases, meaning that for ~25% of cases it produces something good and knows the model is good. This opens the door to looking at protein families with no known structure and accurately predicting their topology.
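This recall-at-fixed-FPR analysis can be sketched as follows. This is a simplified illustration, not the authors’ code: sweep a score threshold over the ranked models and report the best true-positive rate achievable while the false-positive rate stays within the limit.

```python
def recall_at_fpr(scores, is_correct, max_fpr=0.10):
    """Highest true-positive rate achievable while keeping the
    false-positive rate at or below `max_fpr`. `scores` are QA
    scores, `is_correct` flags whether each model is correct."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=True)
    pos = sum(is_correct)
    neg = len(is_correct) - pos
    tp = fp = 0
    best_tpr = 0.0
    # Accept models from highest to lowest score; each prefix of
    # this ranking corresponds to one choice of threshold.
    for i in order:
        if is_correct[i]:
            tp += 1
        else:
            fp += 1
        if neg == 0 or fp / neg <= max_fpr:
            best_tpr = max(best_tpr, tp / pos)
    return best_tpr
```

A QA score is useful in this setting exactly when correct models cluster at the top of the ranking, so that high recall is possible before false positives start slipping through.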

That is exactly what the authors did! In what is, in my opinion, the most interesting section of the paper, the authors predict the topology of 114 protein families (at an FPR of 1%) and 558 protein families (at an FPR of 10%). Furthermore, they compare the overlap of their results with those reported by a similar study from the Baker group (previously presented at group meeting here) and find that, at least for some cases, the predictions agree. These large-scale efforts force us to revisit the way we see template-free structure prediction, which can no longer be dismissed as a viable way of obtaining structural models when sufficient sequences are available. This is a remarkable achievement for the protein structure prediction community, with the potential to change the way we conduct structural biology research.

Journal Club post: Interface between Ig-seq and antibody modelling

Hi everyone! In this blog post, I would like to review a couple of relatively recent papers about antibody modelling and next-generation sequencing of immunoglobulin gene repertoires, also known as Ig-seq. I previously worked as a phage display scientist, and I initially struggled to understand all the new terminology about computational modelling when I joined Charlotte’s group last January. Hence, the paramount aim of this blog post is to translate commonly used computational jargon into less complicated text.

The three-dimensional structure of an antibody dictates its function. Antibody sequences obtained from Ig-seq cannot be directly translated into antibody folding, aggregation and function. Several ways exist to interrogate antibody structure, including X-ray crystallography and NMR spectroscopy, expression, and computational modelling. These methods vary in throughput as well as precision. Here, I will concentrate on computational modelling. First of all, the most commonly confused term is a decoy. In antibody structure prediction, a decoy is a candidate modelled structure; a tool generates a number of decoys and ranks them to select the one closest to the native antibody structure. A number of antibody modelling tools exist, each employing a different methodology and generating a different number of decoys. Good reviews on antibody structure prediction can be found here (1,2). I will try to give a very rough summary of how these modelling tools work. To do so, I assume that readers are familiar with the antibody sequence/structure relationship; if not, please check (3). Antibody framework regions are largely sequence invariant, so their structure can be deduced from sequence identity with high confidence. The PDB (4) acts as the source of structures for antibody modelling. Canonical CDRs (all CDRs except CDR-H3) can be assigned to a limited number of structural classes. Several definitions of canonical classes exist (5,6), but, in essence, a canonical CDR must contain the residues that define a particular class. Next, the orientation between the heavy and light chains is calculated or copied from the PDB. CDR-H3 modelling is very challenging, and different approaches have been devised (7–9). The structure space of CDR-H3 is vast (10), so this loop cannot be assigned to a canonical class. Once CDR-H3 is modelled, the resultant decoy is checked for clashes (such as impossible orientations of side chains).

Here, I would like to mention several examples of how antibody modelling can help to accelerate drug discovery. DeKosky et al. (11) mapped two Ig-seq datasets to antibody structures to interrogate how an antibody paratope changes in response to antigenic stimulation. Knowledge of the paired full-length VH–VL sequence is crucial for the best antibody structure prediction. In this study, they employed paired-chain Ig-seq (12). However, this technique cannot sequence the full-length VH/VL, so the V gene sequence had to be approximated. Computational paratope identification was employed to examine paratope convergence. There were several drawbacks to this paper: only 2,000 models (~1% of the Ig-seq data) were built, requiring 570,000 CPU hours, and antibodies with CDR-H3 loops longer than 16 residues were excluded from the analysis. Generating a reliable conformation for a long CDR-H3 is still considered a hard task. Recently, Laffy et al. (13) investigated antibody promiscuity by mapping sequence to structure and validating the results with ELISA. A cohort of 10 antibodies, all with CDR-H3 loops of at least 15 residues, was interrogated. They used a homology modelling tool to build the CDR-H3 structures. However, the availability of appropriate structural templates is questionable, since CDR-H3 loops deposited in the PDB are predominantly shorter due to crystallographic constraints. As mentioned before, paired VH/VL data are crucial for structure determination; here, they used the DeKosky et al. (11) data to derive the pairing. The approach can be streamlined once more paired data become available.

In conclusion, antibody modelling enables researchers to circumvent the cost and time associated with experimental approaches to antibody characterization. The field still needs faster and more accurate structure prediction to tackle tasks such as modelling entire Ig-seq datasets or long CDR-H3 loops. Currently, the fastest antibody modelling tool is ABodyBuilder (8). It generates a model in about 30 seconds and is available online (http://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/Modelling.php). The availability of more structural information, as well as algorithmic improvements, will enable more confident antibody modelling.

 

  1. Kuroda D, Shirai H, Jacobson MP, Nakamura H. Computer-aided antibody design. Protein Eng Des Sel (2012) 25:507–521. doi:10.1093/protein/gzs024
  2. Krawczyk K, Dunbar J, Deane CM. “Computational Tools for Aiding Rational Antibody Design,” in Methods in molecular biology (Clifton, N.J.), 399–416. doi:10.1007/978-1-4939-6637-0_21
  3. Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotech (2014) 32:158–168. doi:10.1038/nbt.2782
  4. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res (2007) 35: doi:10.1093/nar/gkl971
  5. Nowak J, Baker T, Georges G, Kelm S, Klostermann S, Shi J, Sridharan S, Deane CM. Length-independent structural similarities enrich the antibody CDR canonical class model. MAbs (2016) 8:751–760. doi:10.1080/19420862.2016.1158370
  6. North B, Lehmann A, Dunbrack RL. A new clustering of antibody CDR loop conformations. J Mol Biol (2011) 406:228–256. doi:10.1016/j.jmb.2010.10.030
  7. Weitzner BD, Jeliazkov JR, Lyskov S, Marze N, Kuroda D, Frick R, Adolf-Bryfogle J, Biswas N, Dunbrack RL, Gray JJ. Modeling and docking of antibody structures with Rosetta. Nat Protoc (2017) 12:401–416. doi:10.1038/nprot.2016.180
  8. Leem J, Dunbar J, Georges G, Shi J, Deane CM. ABodyBuilder: Automated antibody structure prediction with data–driven accuracy estimation. MAbs (2016) 8:1259–1268. doi:10.1080/19420862.2016.1205773
  9. Marks C, Nowak J, Klostermann S, Georges G, Dunbar J, Shi J, Kelm S, Deane CM. Sphinx: merging knowledge-based and ab initio approaches to improve protein loop prediction. Bioinformatics (2017) 33:1346–1353. doi:10.1093/bioinformatics/btw823
  10. Regep C, Georges G, Shi J, Popovic B, Deane CM. The H3 loop of antibodies shows unique structural characteristics. Proteins Struct Funct Bioinforma (2017) 85:1311–1318. doi:10.1002/prot.25291
  11. DeKosky BJ, Lungu OI, Park D, Johnson EL, Charab W, Chrysostomou C, Kuroda D, Ellington AD, Ippolito GC, Gray JJ, et al. Large-scale sequence and structural comparisons of human naive and antigen-experienced antibody repertoires. Proc Natl Acad Sci U S A (2016) doi:10.1073/pnas.1525510113
  12. DeKosky BJ, Kojima T, Rodin A, Charab W, Ippolito GC, Ellington AD, Georgiou G. In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire. Nat Med (2014) 21:1–8. doi:10.1038/nm.3743
  13. Laffy JMJ, Dodev T, Macpherson JA, Townsend C, Lu HC, Dunn-Walters D, Fraternali F. Promiscuous antibodies characterised by their physico-chemical properties: From sequence to structure and back. Prog Biophys Mol Biol (2016) doi:10.1016/j.pbiomolbio.2016.09.002