Prague Protein Spring 2018

We, Constantin and Dominik, the newest members of OPIG (SABS rotation students, as usual) were lucky to have a conference suitable to our research within our rotation period and, granted an allowance from the powers that be, were able to visit this year’s Prague Protein Spring with the topic ‘Proteins at Work’. There, we spent four busy but very inspirational days with about 50 participants in a little palace, the Vila Lanna.

The general topic of this meeting led to a broad variety of talks representing a multitude of fields of protein research: from origins of life, via fuzzy intrinsically disordered proteins and crowded cells, to metagenomics and functional sequence alignment annotation.

We picked four thought-provoking talks to present at the group meeting on 08/05/2018; here are their summaries:

Protein engineering and in vitro evolution studies for the origins of life

Kosuke Fujishima from Tokyo Institute of Technology presented several examples of the research he conducts in the area of origins of life. Research on the origins of life is generally based around the questions of how prebiotic monomers were created, how they condensed into polymers and how functionality emerged within these polymers.

The first example of his research deals with the condensation of prebiotic monomers at the interface between the ocean and the earth’s crust. Water cycling between the ocean and the outer layers of the earth’s crust provided an environment of high pressure and high temperatures (80 – 200 °C), which is necessary for amino acid polymerisation. The mineral olivine was found to attract amino acids to its surface, and the serpentinisation reaction that olivine undergoes might provide the necessary wet/dry cycle. Therefore, the researchers built a reactor to investigate this potential polymerisation mechanism. They found that when six prebiotic amino acids were provided, 28 out of 36 possible dipeptides could be found in the reactor. Furthermore, linear polypeptides of up to 10 residues could be detected as well, providing evidence for a mechanism by which the early earth could have generated polypeptides [unpublished].

The second project showed that both enzymes responsible for the current production of cysteine from serine, CysE and CysK, could be re-engineered to contain no cysteine in their sequence. Interestingly, cysteine-free CysE showed higher reaction rates than the wild type. Additional reduction to cysteine- and methionine-free enzyme sequences only worked for CysE but not for CysK [Fujishima et al. (2018)]. Still, the experiments indicate that an enzyme world could have existed with a reduced number of amino acids compared to the 20(+) amino acids that we know today.

The third project we wanted to point out used a type of mRNA display that not only links the genotype (mRNA) with its corresponding phenotype (translated protein) but also allows the translated protein to interact with a randomised, non-translated part of the mRNA. This provided a framework for investigating the evolution of ribonucleotide-binding (RNP) proteins. When selecting for ATP-binding, it was observed that protein together with RNA had the most favourable fitness landscape compared to protein selection or RNA selection alone. Further analysis revealed that most of the binding affinity of the ribonucleotide protein stemmed from its RNA part [unpublished]. These results suggest that RNA and proteins co-evolved, opposing the idea of a pure RNA world.

RNA-protein interactions and the structure of the genetic code

The next speaker added more to the research area of RNA-protein interaction and evolution. Bojan Zagrovic from the University of Vienna presented his research around the finding that pyrimidine (PYR) density of RNA regions is correlated with the corresponding protein region’s affinity to pyrimidine-containing bases (running means of 21 amino acids or 63 bases were used), with the highest correlation between mRNA PYR density and guanine affinity, having an average ‘typical’ Pearson correlation coefficient of 0.80.[Polyansky & Zagrovic (2013)]

This correlation is specific to the current genetic code, as shown by random generation of genetic codes, which could not reproduce such a correlated behaviour, and by looking into three organisms with very different codon usage bias (Homo sapiens, E. coli, M. jannaschii). Even though the three averages of codon usage were very different, the highest correlating pairs of mRNA and cognate proteins clustered together, having very similar codon usage. This was also true for the worst correlating pairs [Hlevnjak & Zagrovic (2015)].

But the big question is: what does this correlation imply functionally?

Annotation analysis revealed that the highest correlating pairs were enriched in nucleotide-binding functions and intrinsically disordered proteins. Without claiming generalisability, Professor Zagrovic pointed out a case study done on RNA polymerase II, which has a long disordered C-terminus built up of 26 repeats of a 7 amino acid motif. 248 RNAs were found to interact with RNA polymerase II, and in all three reading frames of the interacting RNAs, amino acid codons of the polymerase’s C-terminus were enriched [unpublished].

This indicates some regulation over gene expression but also several other hypotheses were made: the correlation between the protein regions’ affinity for their cognate mRNA regions might be relevant in virus assembly, since coding RNA and translated proteins have to be in close proximity with each other. The same could be true for some non-membrane-bound compartments, e.g. P-bodies. Or is this correlation characteristic a hint to mRNAs acting as chaperones for their respective proteins? The functional implications of this correlation, while highly speculative, nevertheless suggest exciting research to come in the future.

Fuzziness in protein assemblies

Research from a different, but equally thought-provoking field was presented by Mónika Fuxreiter from the University of Debrecen. Her talk on the concept of fuzziness in protein complexes, which she introduced 10 years ago [Tompa & Fuxreiter (2008)], shed light on some more recent developments in the field and explained the underlying concept for those (ourselves included) who had not encountered it before.

Fuzziness in the context of protein complexes describes a phenomenon in which intrinsically disordered proteins, instead of folding upon binding as one would usually observe, can sample several conformational states with different propensities, leading to the sampled states contributing with different strengths to the function of the protein complex and further leading to varying degrees of disorder in the bound state.

This observation has several implications for the understanding of the functionality of disordered proteins, since the relative propensity for different ensemble states in the bound form is thought to be highly susceptible to milieu influences, such as tissue specific splicing and post-translational modifications. Fuzziness (a term that was borrowed from the mathematical theory of fuzzy sets) could thus be a driver of functional adaptability of disordered proteins to cell-cycle stage, environmental influences or tissue type.

Evidence for fuzziness has been curated by the Fuxreiter group since 2015 [Miskei et al. 2017] in the FuzDB database and has recently been used to develop a prediction algorithm [unpublished] that, according to Professor Fuxreiter, achieves highly accurate predictions of fuzziness on a comprehensive validation dataset.

The implications of fuzziness for the understanding of the mode of action of disordered proteins (and disordered regions in otherwise ordered proteins) certainly piqued our interest, not least due to the potential importance of a clear understanding of these modes of action for drug development.

Investigation of mutually exclusive splicing events using the CATH FunFam framework

The last of the 4 talks we would like to single out in this blogpost highlighted recent progress in using structure-based databases for the investigation of complex cellular events.

Christine Orengo from UCL presented her group’s work on mutually exclusive splicing, which employed the FunFam framework of the CATH database to probe the structural and functional implications of these splicing events [Lam et al. (2018), under review].

The FunFams are a subcategory of CATH’s homologous superfamilies, which further divides the superfamilies based on clusters of residue conservation within each family, thus creating groupings of functionally related proteins [Rentzsch & Orengo (2013)].

The mutually exclusive splicing events that were investigated using this framework are a group of splicing events in which only one of several specific exons is present in the spliced mRNA. These exons usually show a high level of sequence similarity, leading to a low disruption of the protein structure by the splicing event. It is thought that this feature is a reason for the relative enrichment of mutually exclusive exons amongst alternative splicing events in the proteome.

This high degree of sequence similarity further enabled the mapping of the mutually exclusive exons to FunFams in the CATH database and thus further onto protein structures. This allowed the Orengo group to conduct a ‘large scale systematic study of the structural/functional effects of MXE splicing’.

Their analysis found that variable residues between the exons are significantly enriched at the protein surface, both compared to other stretches of the protein sequence and compared to non-variable residues in the exons, and in close proximity (< 6 Angstroms) to functional sites of the protein.

The main conclusion drawn from these findings was that, as previously hypothesised, mutually exclusive exons are likely functional switches, since changes in the surface exposed area close to functional sites are likely to affect the protein function without strongly disrupting its structure.

In the eyes of the Orengo group, this makes these splicing events good candidates for drug targeting, particularly in cases where a tissue specific isoform can be drugged, since in that case off-target effects could potentially be significantly reduced.

Sources:

Fujishima et al. (2018). Reconstruction of cysteine biosynthesis using engineered cysteine-free enzymes. Scientific Reports

Hlevnjak & Zagrovic (2015). Malleable nature of mRNA-protein compositional complementarity and its functional significance. Nucleic Acids Res

Lam, S. D., Orengo, C., & Lees, J. (2018). Protein structure and function analyses to understand the implication of mutually exclusive splicing. BioRxiv

Miskei, M. et al (2017). FuzDB: Database of fuzzy complexes, a tool to develop stochastic structure-function relationships for protein complexes and higher-order assemblies. Nucleic Acids Research

Polyansky & Zagrovic (2013). Evidence of direct complementary interactions between messenger RNAs and their cognate proteins. Nucleic Acids Res

Rentzsch, R., & Orengo, C. A. (2013). Protein function prediction using domain families. BMC Bioinformatics

Tompa, P., & Fuxreiter, M. (2008). Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends in Biochemical Sciences

 

Neuronal Complexity: A Little Goes a Long Way…To Clean My Apartment

The classical model of signal integration in both biological and Artificial Neural Networks looks something like this,

f(\mathbf{s})=g\left(\sum\limits_i\alpha_is_i\right)

where g is some linear or non-linear output function whose parameters \alpha_i adapt to feedback from the outside world through changes to protein dense structures near to the point of signal input, namely the Post Synaptic Density (PSD). In this simple model integration is implied to occur at the soma (cell body) where the input signals s_i are combined and broadcast to other neurons through downstream synapses via the axon. Generally speaking neurons (both artificial and otherwise) exist in multilayer networks composing the inputs of one neuron with the outputs of the others creating cross-linked chains of computation that have been shown to be universal in their ability to approximate any desired input-output behaviour.
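As a minimal sketch of the formula above, the weighted sum and output function can be written in a few lines of Python. The function names, the example weights and the choice of tanh as g are illustrative assumptions, not something taken from a particular model.

```python
import math


def identity(x):
    """A linear output function g(x) = x (illustrative choice)."""
    return x


def neuron_output(signals, weights, g=math.tanh):
    """Classical point-neuron model: f(s) = g(sum_i alpha_i * s_i)."""
    if len(signals) != len(weights):
        raise ValueError("signals and weights must have the same length")
    # Integration "at the soma": one weighted sum over all inputs.
    weighted_sum = sum(alpha * s for alpha, s in zip(weights, signals))
    return g(weighted_sum)


signals = [1.0, 0.5, -2.0]   # s_i, hypothetical input signals
weights = [0.2, 0.4, 0.1]    # alpha_i, hypothetical synaptic weights

linear = neuron_output(signals, weights, g=identity)  # plain weighted sum
squashed = neuron_output(signals, weights)            # same sum passed through tanh
```

Learning in this picture means changing the values in weights; the question raised in this post is whether that is the only place where adaptation happens.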

See more at Khan Academy

Models of learning and memory have relied heavily on modifications to the PSD to explain modifications in behaviour. Physically these changes result from alterations in the concentration and density of the neurotransmitter receptors and ion channels that occur in abundance at the PSD, but in actuality these channels occur all along the cell membrane of the dendrite on which the PSD is located. Dendrites are something like a multi-branched mass of input connections belonging to each neuron. This raises the question as to whether learning might in fact occur all along the length of each densely branched dendritic tree.

The Ten Commandments of OPIG

In OPIG one must learn, and one must learn FAST! However, sometimes stupidity in OPIG knows no limits (*cough* James *cough* Anne *cough*), so for the newer (and prospective) members of the group, I thought it wise to share some ground rules, a.k.a. The Ten Commandments of OPIG.

Vaguely adhering to these will drastically improve your time in OPIG (see Exhibit A), and let’s face it, none of them are particularly challenging.

  1. No touchy the supervisor.
  2. No touchy other students.
  3. You’re not late unless you’re after Charlotte. Don’t be late.
  4. All prizes are subject to approval by The Party.
  5. Thou shalt not tomate.
  6. Any and all unattended food is fair game.
  7. Meetings (especially the one before yours) will go on as long as they have to.
  8. Finish your DPhil or die.
  9. This is not a democracy.
  10. NO TOUCHY THE SUPERVISOR!

Bonus (and final rule). If this is your first time at Group Meeting, you have to present (well at least introduce yourself).

P.s. we’re not that bad, I promise!

Disclaimer: while I’ve categorised this post as “humour”, I take no responsibility for your enjoyment.

My experience with (semi-)automating organic synthesis in the lab

After three years of not touching a single bit of glassware, I have recently donned the white coat and stepped back into the Chemistry lab. I am doing this for my PhD project to make some of the follow-up compounds that my pipeline suggests. However, this time there is a slight difference – I am doing reactions with the aid of a liquid handler robot, the Opentrons. This is my first encounter with (semi-)automated synthesis and definitely a very exciting opportunity! (Thanks to my industrial sponsor, Diamond Light Source!)

A picture of the Opentrons machine I have been using to do some organic reactions. Picture taken from https://opentrons.com/robots.

Opentrons is primarily used by biologists and their goal is to make a platform to easily share protocols and reproduce each other’s work (I think we can all agree how nice this would be!). They provide a very easy to use API, intended to be accessible to any bench scientist with basic computer skills. From my experience so far, this has been the case as I found it extremely easy to pick up and write my own protocols for chemical reactions. Here is the command that will: (1) pick up a new pipette tip; (2) transfer a volume from source1 to destination1; (3) drop the pipette tip in the trash; (4) pick up a new pipette tip; (5) transfer a volume from source2 to destination2; (6) drop the pipette tip in the trash.

pipette.transfer(volume, [source1, source2], [destination1, destination2], new_tip='always')

But of course not everything is plain sailing – there are many challenges you will encounter when using an automated pipette. The robot is a liquid handler – it cannot handle solids, so solids need to be pre-weighed and/or made into solution beforehand. Further difficulties lie in the properties of the solvent it is handling, for example:
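To make the six steps concrete without needing a robot, here is a small stand-in written in plain Python. RecordingPipette is a hypothetical class made up for this illustration (it is not part of the Opentrons API); it only logs what a transfer with a fresh tip for every source/destination pair would do.

```python
class RecordingPipette:
    """Hypothetical stand-in for a pipette: it moves no liquid, it only logs steps."""

    def __init__(self):
        self.log = []

    def transfer(self, volume, sources, destinations, new_tip="always"):
        if len(sources) != len(destinations):
            raise ValueError("need one destination per source")
        if new_tip != "always":
            raise NotImplementedError("this sketch only models new_tip='always'")
        for source, destination in zip(sources, destinations):
            # A fresh tip for every single transfer avoids cross-contamination.
            self.log.append("pick up tip")
            self.log.append(f"transfer {volume}: {source} -> {destination}")
            self.log.append("drop tip in trash")
        return self.log


pipette = RecordingPipette()
steps = pipette.transfer(
    50,  # volume, arbitrary value for the example
    ["source1", "source2"],
    ["destination1", "destination2"],
    new_tip="always",
)
for number, step in enumerate(steps, start=1):
    print(number, step)
```

The printed list reproduces the six steps described above, in the same order.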

  • Dripping – low boiling point solvents tend to drip more.
  • Viscosity of liquids causes issues with not drawing up the correct amount of liquid – more viscous liquids require longer times to aspirate and if aspiration is too quick then air pockets may be drawn up.

Here is a GIF I made of a dry run I was doing with the robot (sorry for the slight shake, this was recorded on my phone in the lab… See their website for professional footage of the robot!)

My (shaky) footage of a dry run I was performing with the Opentrons.

The Curious Case Of A Human Chimera

In my role as a PhD student in the OPIG group, I integrate and analyse data from various biological, chemical and data sources. As I am interested in the intersection between chemistry, biology and daily life, it seems suitable that my next BLOPIG posts will discuss and highlight how biological phenomena have either influenced law or history.

Connection between Law and Biology – The Curious Case Of A Human Chimera
Our scene opens in a dark lab, where a scientist injects himself with an unknown substance. The voice over notes that they created a monster named “Chimera” while searching for their hero “Bellerophon”.  This scene is the famous opening scene of the movie “Mission Impossible II” , where we are introduced to the dangerous bioweapon “Chimera”, a combination of multiple diseases. As “Chimera” is a mythological beast from Ancient Greek mythology, with a lion’s head, a goat’s body, and a serpent’s tail, the naming of this bioweapon seems appropriate.

What does this dangerous mixture of multiple diseases, an ancient mythological monster and the promised connection between law and biology have in common?

Apart from a really bad joke, the term “Chimera” is an actual term in biology to describe a biological entity of multiple diverse components, e.g. a human organism, whose cells are composed of distinct genotypes.
In the case of tetragametic chimerism, human chimeras thus carry two genetically distinct sets of twenty-three chromosome pairs instead of the “usual” single set, and as such, each of their organs and tissues is built according to the DNA of the cell line that makes up that organ or tissue.
Tetragametic chimerism occurs by the fertilization of two ova by two spermatozoa, which develop into zygotes. These zygotes then subsequently fuse into one organism, which continues to develop into an organism with two sets of DNA.1-2

But how did such a biological phenomenon like a chimera enter the court of law?

The Romans famously defined that the mother of a child is the one who gives birth to it (Mater semper certa est, which can be translated as “The mother is always certain”).  I would like to point out that in the times of in-vitro fertilization, this principle is no longer viable, since a child can now have both a genetic mother and a birth mother.3
This principle was challenged in 2002, when Lydia Fairchild applied to receive welfare from the state for her two children and her third, unborn child. Paternity tests were conducted on all children to prove her ex-partner’s paternity. While the tests proved the paternity of the father without a doubt, Lydia was shown to be no genetic match to her children.

Lydia Fairchild was accused of “welfare fraud” or of being a surrogate, and the judge ordered that she had to give birth to her third child in front of witnesses. Blood samples were taken immediately, which revealed that Lydia Fairchild did not share DNA with this child either, despite giving birth to it. Now accused of being a surrogate, Lydia’s case looked dire.
Fortunately, Lydia’s lawyer read a journal article about a similar case involving a woman named Karen Keegan.2, 4-5 Karen, a 52-year old woman, had renal failure. As she needed a kidney transplant, Karen’s sons underwent histocompatibility testing as potential donors. Yet the genetic tests showed that only one of her three sons was related to her.1 Material from her entire body was tested for genetic matches to her sons’ DNA, but only genetic material from her thyroid matched her sons.2
Ultimately, the researchers concluded that Karen was a tetragametic chimera, born of the fusion of her zygote and her twin sibling in her mother’s womb. As Dr. Lynne Uhl, a pathologist and doctor of transfusion medicine at Beth Israel Deaconess Medical Center in Boston, said:
“In her blood, she was one person, but in other tissues, she had evidence of being a fusion of two individuals.”6

Subsequently, scientists collected Lydia’s cell material from various body parts and tested for a genetic match with her children. The DNA from her cervical smear was found to be a match, while the DNA collected from her skin and hair was not. Additionally, DNA samples from Lydia’s mother matched her children’s DNA. 4-5

Interestingly, while both Lydia and Karen were carrying two sets of DNA as a result of prenatal fusions with their twins, they didn’t show any phenotypic sign of being a chimera, e.g. different skin types or the so-called Blaschko lines.7-8

 

  1. https://www.scientificamerican.com/article/3-human-chimeras-that-already-exist/
  2. Yu, N. et al. Disputed maternity leading to identification of tetragametic chimerism. N. Engl. J. Med. 346, (2002).
  3. https://en.wikipedia.org/wiki/Mater_semper_certa_est
  4. https://pictorial.jezebel.com/one-person-two-sets-of-dna-the-strange-case-of-the-hu-1689290862
  5. https://web.archive.org/web/20140301211020/http://www.essentialbaby.com.au/life-style/nutrition-and-wellbeing/when-your-unborn-twin-is-your-childrens-mother-20140203-31woi.html
  6. http://abcnews.go.com/Primetime/shes-twin/story?id=2315693
  7. https://jamanetwork.com/journals/jamadermatology/fullarticle/419529
  8. http://biologicalexceptions.blogspot.co.uk/2015/09/when-youre-not-just-yourself.html

All links were last viewed on the 24.04.2018.

My next blog post: Can a mismatch in maternal DNA threaten a government? How Biology can Influence History.

Measuring correlation

Correlation is defined as how close two variables are to having a dependence relationship with each other. At first sight, it looks kind of simple, but there are two main problems:

  1. Apart from the obvious situations (e.g. correlation = 1), it is difficult to say whether two variables are correlated or not (e.g. correlation = 0.7). For instance, would you be able to say if the variables X and Y from the following two plots are correlated?
  2. There are different ways of measuring correlation, and they may not agree when comparing different distributions. As an example, which plot shows a higher correlation? The answer will depend on how you measure the correlation: if you use Pearson correlation, you would pick A, whereas if you choose Spearman correlation you would pick B.

Here, I will explain some of the different correlation measures you can use:

Pearson product-moment correlation coefficient

  • What does it measure? Only linear dependencies between the variables.
  • How is it obtained? By dividing the covariance of the two variables by the product of their standard deviations. (It is defined only if both of the standard deviations are finite and nonzero). \rho _{X,Y}={\frac {\operatorname {cov} (X,Y)}{\sigma _{X}\sigma _{Y}}}
  • Properties:
  1. ρ (X,Y) = +1 : perfect direct (increasing) linear relationship (correlation).
  2. ρ (X,Y) = -1 : perfect decreasing (inverse) linear relationship (anticorrelation).
  3. In all other cases, ρ (X,Y) indicates the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated).
  4. Only gives a perfect value when X and Y are related by a linear function.
  • When is it useful? For the case of a linear model with a single independent variable, the coefficient of determination (R squared) is the square of r, Pearson’s product-moment coefficient.
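As a rough illustration of the definition above, the coefficient can be computed by hand in plain Python. The function name and the toy data are made up for this example; in practice a library routine would normally be used.

```python
import math


def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of products and squares are enough.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("undefined when a variable has zero variance")
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
r_linear = pearson(x, [2 * v + 1 for v in x])  # y is an exact linear function of x
r_cubic = pearson(x, [v ** 3 for v in x])      # y is monotonic in x but not linear
```

The linear case gives the perfect value of 1, while the cubic case stays below 1 even though y is completely determined by x, which is exactly the limitation listed in the properties above.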

 

Spearman’s rank correlation coefficient:

  • What does it measure? How well the relationship between two variables can be described using a monotonic function (a function that only goes up or only goes down).
  • How is it obtained? Pearson correlation between the rank values of the two variables.

{\displaystyle r_{s}=\rho _{\operatorname {rg} _{X},\operatorname {rg} _{Y}}={\frac {\operatorname {cov} (\operatorname {rg} _{X},\operatorname {rg} _{Y})}{\sigma _{\operatorname {rg} _{X}}\sigma _{\operatorname {rg} _{Y}}}}}

If, and only if, all n ranks are distinct integers (i.e. there are no ties), it can be computed using the popular formula:

{\displaystyle r_{s}={1-{\frac {6\sum d_{i}^{2}}{n(n^{2}-1)}}}.}

where d_i is the difference between the two ranks of each observation.

  • Properties:
  1. rs (X,Y) = +1:  X and Y are related by any increasing monotonic function.
  2. rs (X,Y) = -1:  X and Y are related by any decreasing monotonic function.
  3. The Spearman correlation increases in magnitude as X and Y become closer to being perfect monotone functions of each other.
  • When is it useful? It is appropriate for both continuous and discrete ordinal variables. It can be used to look for non-linear (monotonic) dependence relationships.
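The two routes to Spearman’s coefficient described above (Pearson on the ranks, and the shortcut formula for tie-free data) can be compared in a short sketch. All function names and the toy data are illustrative assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """General definition: Pearson correlation of the rank values."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_shortcut(x, y):
    """1 - 6 * sum(d_i^2) / (n (n^2 - 1)); only valid when there are no ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]                             # no ties, two neighbouring swaps
rs_general = spearman(x, y)
rs_shortcut = spearman_shortcut(x, y)
rs_monotonic = spearman(x, [v ** 3 for v in x])  # any increasing function gives +1
```

On tie-free data both routes agree; with ties only the rank-based Pearson version should be trusted.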

Kendall’s tau coefficient

  • What does it measure? The ordinal association between two measured quantities.
  • How is it obtained?

{\displaystyle \tau ={\frac {({\text{number of concordant pairs}})-({\text{number of discordant pairs}})}{n(n-1)/2}}.}

Any pair of observations (xi, yi) and (xj, yj) is said to be concordant if the ranks for both elements agree. That happens if xi-xj and yi-yj have the same sign. If their signs are different, the pair is considered discordant.

  • Properties:
  1. τ (X,Y) = +1: The agreement between the two rankings is perfect (i.e., the two rankings are the same)
  2. τ (X,Y) = -1: The disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other)
  3. If X and Y are independent, then we would expect the coefficient to be approximately zero.
  • When is it useful? It is appropriate for both continuous and discrete ordinal variables. It can be used to look for non-linear (monotonic) dependence relationships.
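Counting concordant and discordant pairs directly gives a compact, if slow (quadratic in n), sketch of the formula above. The function name and data are illustrative; pairs tied in x or y are counted as neither concordant nor discordant here, which is an assumption of this sketch rather than something stated in the definition.

```python
from itertools import combinations


def kendall_tau(x, y):
    """(concordant - discordant) / (n (n - 1) / 2), as in the formula above."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        # Same sign of (xi - xj) and (yi - yj) means the pair is concordant.
        product = (xi - xj) * (yi - yj)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


x = [1, 2, 3, 4, 5]
tau_swapped = kendall_tau(x, [1, 3, 2, 5, 4])   # 8 concordant, 2 discordant pairs
tau_reversed = kendall_tau(x, [5, 4, 3, 2, 1])  # one ranking is the reverse of the other
```

With 8 concordant and 2 discordant pairs out of 10, the first example gives 0.6, and the fully reversed ranking gives -1.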

Distance correlation:

  • What does it measure? Both linear and nonlinear association between two random variables or random vectors.
  • How is it obtained? By dividing the variables’ distance covariance by the product of their distance standard deviations:

\operatorname {dCor}(X,Y)={\frac {\operatorname {dCov}(X,Y)}{{\sqrt {\operatorname {dVar}(X)\,\operatorname {dVar}(Y)}}}},

The distance covariance is defined as:

{\displaystyle \operatorname {dCov} _{n}^{2}(X,Y):={\frac {1}{n^{2}}}\sum _{j=1}^{n}\sum _{k=1}^{n}A_{j,k}\,B_{j,k}.}

Where:

{\displaystyle A_{j,k}:=a_{j,k}-{\overline {a}}_{j\cdot }-{\overline {a}}_{\cdot k}+{\overline {a}}_{\cdot \cdot },\qquad B_{j,k}:=b_{j,k}-{\overline {b}}_{j\cdot }-{\overline {b}}_{\cdot k}+{\overline {b}}_{\cdot \cdot },}

{\begin{aligned}a_{{j,k}}&=\|X_{j}-X_{k}\|,\qquad j,k=1,2,\ldots ,n,\\b_{{j,k}}&=\|Y_{j}-Y_{k}\|,\qquad j,k=1,2,\ldots ,n,\end{aligned}}

where || ⋅ || denotes Euclidean norm.

  • Properties:
  1. dCor (X,Y) = 0 if and only if the random vectors are independent.
  2. dCor (X,Y) = 1: Perfect dependence between the two distributions.
  3. dCor (X,Y) is defined for X and Y in arbitrary dimension.
  • When is it useful? It is appropriate for finding any kind of dependence relationship between the two variables, including when X and Y have different dimensions.
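The double-centring recipe above translates almost line by line into code. The following is a minimal sketch for one-dimensional samples (so the Euclidean norm reduces to an absolute difference); the function names and the toy data are assumptions made for this example.

```python
import math


def centred_distances(values):
    """A_{j,k} = a_{j,k} - rowmean_j - colmean_k + grandmean, with a_{j,k} = |v_j - v_k|."""
    n = len(values)
    dist = [[abs(values[j] - values[k]) for k in range(n)] for j in range(n)]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[j][k] - row_means[j] - row_means[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]


def dcov_squared(a, b):
    """Squared sample distance covariance: (1 / n^2) * sum_{j,k} A_{j,k} B_{j,k}."""
    n = len(a)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n ** 2


def distance_correlation(x, y):
    a, b = centred_distances(x), centred_distances(y)
    dvar_product = dcov_squared(a, a) * dcov_squared(b, b)
    if dvar_product == 0:
        return 0.0  # convention used here when one sample is constant
    return math.sqrt(max(dcov_squared(a, b), 0.0) / math.sqrt(dvar_product))


x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]                # symmetric parabola: Pearson correlation is 0
dcor_parabola = distance_correlation(x, y)
dcor_identity = distance_correlation(x, x)
```

For the parabola the Pearson coefficient is exactly zero, yet the distance correlation comes out at roughly 0.5, reflecting the non-linear dependence.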

OPIGTREAT

On the 19th of March OPIG set off on our group retreat – henceforth referred to as the OPIGTREAT.

We kicked off a little late as apparently Saulo and check in times are not a good combination (though he is an expert at reversing on an icy road).

Jin and Flo gave the first talk on web programming, specifically Flask and D3. If I understood correctly, Flask is a web development framework for Python that runs everything on the server side, whereas D3 stands for Data-Driven Documents, which appears to be a way of making very pretty things.

Garrett then gave us an impressive overview on the area of docking, thinking about whether docking had improved in the last 10 years. He discussed how docking can be used to both predict the binding mode (the orientation and conformation) as well as the binding affinity. The state of the art appears to be if we are docking a small molecule into approximately the correct binding site a native like pose can be identified but binding affinity prediction in all cases remains challenging.

Mark then attempted the impossible: he tried to give a talk explaining how to give a good talk, in this case in the context of public engagement and taking our work out to schools. I am now versed in the 4 Ms: Manageable, Measurable, Made first and Most Important. I am also weirdly aware that my head shouldn’t move when I am teaching.

Elliot then took us through how we should judge a PDB structure, a really useful skill for everyone in the group. He described measures such as resolution, B factors, Rfree, clash score, Ramachandran outliers, sidechain outliers and RSRZ outliers. Interesting facts that I collected: the average resolution of an X-ray structure in the PDB is ~2 Å and the average Rfree is 0.25. I also learnt of the existence of PDBredo, a service that re-refines datasets in the PDB.

Saulo and the Fergi were up next and they treated us all to a short talk and then a Jupyter notebook practical on machine learning. They discussed supervised, unsupervised and reinforcement learning, giving examples of each and how and when they should/could be used. Claire and I then learnt a great deal about Jupyter notebooks, the most important thing being to press shift enter. Useful fact: “out-of-bag” error is a method for measuring the error of random forests, in which each tree is scored using all data points apart from those used to make that tree.

The evening finished with a film about the evil iniquities of smoking (very high brow stuff!?!).

The second day began with Bernhard (a visitor from the far-off land of Barcelona these days) talking to us about his latest research project. As this is his story – no details in the blog.

Claire then gave an update of the talk she gave at the last OPIGTREAT – how to make “stuff” pretty. Obviously a popular topic as we all wish to display our data and findings in a way that is easily interpretable as well as visually appealing. Claire took us through some of the tools to use like ggplot and Pymol – showed us where to find the lists of useful commands and then showed us the types of images you could make if you really put some thought into it.

Anne was up next, she discussed the challenges and opportunities of integrating heterogeneous data sources and she came up with a lot of data sources to think about, running from protein structures, protein interactions, small molecule structures, drug safety, drug targets, functional annotation and pathways. One thing to remember probably don’t tell your boss when she should or shouldn’t be taking notes……

It was then the turn of team networks Javi, James and Lyuba who walked us through the basics of networks and expanded on their uses across multiple data types in biology. They mentioned areas from simple motifs to protein structure, MD simulations, ontologies, disease prediction, drug target identification…. We then had a practical to check we had understood the power of networks! The networks under consideration were dolphins, Myoglobin structure, Facebook data and the mystery voter network (where we discovered that Fergus the first in no way tried to rig the vote for what film to watch).

That afternoon I visited the bird sanctuary just down the road, others went to a gin distillery or on a walk. Top quote of the afternoon was from James “I want the birds to eat from my pants”. I believe he is from one of those countries that has the misguided belief that pants means trousers. Actually I could have a different top quote from Alex about somebody being a cheap ride in his dreams but I think I should pass over that one.

That evening we were treated to a fragment-based drug discovery extravaganza headed up by Hannah, Susan and Joe. They took us through the use of fragments for drug discovery and then we attempted a practical. I seem to remember that Claire and I once again excelled at shift-enter on the Jupyter notebook.

That evening we had a pub quiz, which apparently ended in a draw between all the teams playing. I feel that Claire and Flo as quizmasters might have made a minor miscalculation. I was happy though as I ended up with the minions bowl and cup. I also managed to persuade several grown men to jump and smash chocolate eggs on their heads on the ceiling.

Next morning Alex and Matt were up first. In their talk they demonstrated not only their knowledge of the future immunotherapy repertoire but also their ability to finish each other’s sentences. They gave a really excellent overview of current immunotherapies, where the field is moving and what the future might hold. Facts to store in the head: the first ever approved antibody (AB) therapeutic was Muromonab (1986). Currently the most successful is Humira (Adalimumab) from AbbVie, worth 18.4b dollars in 2017; this is a fully human AB for autoimmune diseases that binds to a mediator of inflammation (TNF-alpha).

Next up were Catherine and Lucian, who discussed distributed computing in PySpark. They started by explaining why distributed computing is going to become so important. Basic info: by 2025, 100 million to 2 billion human genomes will have been sequenced, which amounts to 2–40 exabytes of data. They discussed distributed vs centralised computing, and PySpark compared to Hadoop. There was a practical, but Mark had to perform solo for the audience, leading to one of the top photos of the whole OPIGTREAT.
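
As a quick sanity check on those figures, both ends of the quoted range imply the same storage per genome. Here is a minimal sketch of the arithmetic, assuming decimal units (1 exabyte = 10^18 bytes, 1 gigabyte = 10^9 bytes); the function name is made up for illustration and was not part of the talk:

```python
EXABYTE = 10**18   # bytes, decimal definition (assumption)
GIGABYTE = 10**9   # bytes, decimal definition (assumption)


def gigabytes_per_genome(total_exabytes, n_genomes):
    """Average storage per genome implied by a total data volume."""
    return total_exabytes * EXABYTE / n_genomes / GIGABYTE


# Lower bound: 2 EB spread over 100 million genomes
low = gigabytes_per_genome(2, 100_000_000)
# Upper bound: 40 EB spread over 2 billion genomes
high = gigabytes_per_genome(40, 2_000_000_000)

print(low, high)
```

Both ends work out to roughly 20 GB per genome, so the spread in the estimate comes entirely from how many genomes get sequenced, not from how much data each one takes.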

As a punishment for being in charge, I gave the final talk, where I discussed future research directions and how you decide what those might be.

So, with thanks to all of the group, that concludes the OPIGTREAT report.

New avenues in antibody engineering

Hi everyone,

In this blog post I would like to review an unusual antibody scaffold that could potentially open up a new avenue in antibody engineering. Here, I will discuss a couple of papers that complement each other’s research.

My DPhil is centered on antibody NGS (Ig-seq) data analysis. I always map an antibody sequence to its structure, as the three-dimensional antibody configuration dictates its function, a piece of information that cannot be obtained from the nucleotide or amino acid sequence alone. When I work with human Ig-seq data, I bear in mind that antibodies are composed of two pairs of light and heavy chains that tune the antibody towards its cognate antigen. Recently, Tan et al. found that the antibody repertoires of people who live in malaria-endemic regions have adopted an unusual property to defend the body from the pathogen (1). Several studies followed up on this discovery to further dissect this as yet uncharacterized property of antibodies.

Malaria parasites in the erythrocytic stage produce RIFIN proteins that are displayed on the surface of the erythrocytes. The main function of RIFINs is to bind to the LAIR1 receptors found on the surface of immune cells. The LAIR1 receptor is inhibitory, so its activation suppresses the immune response. The endogenous ligand of the LAIR1 receptor is collagen, which is found on the surface of body cells. This ensures that immune cells are not activated against the body’s own tissue. Activating the LAIR1 receptors is one of the escape mechanisms that the malaria parasite has evolved.

Tan et al. (1) showed that, in an evolutionary arms race between humans and malaria, our immune system has turned the ability of RIFINs to bind LAIR1 against the parasite itself. By isolating and sequencing single B cells, they discovered that antibodies, the effector molecules of our immune system, can incorporate the LAIR1 protein into their structure. Given what we know about antibody engineering, the idea of incorporating a 100 amino acid long protein into an antibody structure is very hard to comprehend. Sequences of these antibodies showed that the LAIR1 insertion was introduced into CDR-H3. Recently, the crystal structure of this construct became available (2). It revealed that the LAIR1 insertion is indeed structurally functional. All five canonical CDRs of the antibody interact with the LAIR1 protein and its linkers to accommodate the insertion. The CDR-L3 forms two disulfide bonds with the linker to orientate the LAIR1 protein so that it can interact with RIFINs. It is worth stressing that the LAIR1 sequence differs from the wild type, although the structure is very similar (<0.5 RMSD). These changes are crucial: they prevent the LAIR1-containing antibody from interacting with collagen, so that it binds only RIFINs.

Pieper et al. (3) investigated the modalities of LAIR1 insertion into antibody structures, using single-cell sequencing as well as NGS of the antibody switch region. It turns out that human antibodies can accommodate two types of insertion modality and can also form camelid-like antibodies. The insertion of LAIR1 can happen in CDR-H3, leading to the loss of antibody binding to its cognate antigen. The other modality is the incorporation of the LAIR1 protein into the switch region of the antibody. This kind of insertion does not interfere with the binding properties of the Fv domain, which leads to the creation of bi-specific antibodies. The last finding was an insertion of LAIR1 into an antibody in which the D, J and most of the V genes, as well as the light chain, were deleted. The resultant scaffold is structurally viable and possesses only the heavy chain. Hence, this is evidence that human antibodies can also form camelid-like antibodies. Interestingly, insertions into the switch region are not exclusive to people who live in malaria-endemic regions. NGS of the switch region from European donors showed that around 1 in 1000 antibody sequences had an insertion of varying length. These insertions originate from both intergenic and genic regions on different chromosomes.

To sum up, it is very intriguing that our immune system has evolved to create camelid-like and bi-specific antibodies. It will be very informative to try to crystallize these structures to see how these antibodies accommodate the insertion of LAIR1. Current antibody NGS data analysis primarily concentrates on the heavy chain due to sequencing technology limitations. It would be invaluable if we could sequence the entire heavy chain as well as the adjacent switch region to see how our immune system matures and activates against pathogens.

 

  1. Tan J, Pieper K, Piccoli L, Abdi A, Foglierini M, Geiger R, Maria Tully C, Jarrossay D, Maina Ndungu F, Wambua J, et al. A LAIR1 insertion generates broadly reactive antibodies against malaria variant antigens. Nature (2016) 529:105–109. doi:10.1038/nature16450
  2. Hsieh FL, Higgins MK. The structure of a LAIR1-containing human antibody reveals a novel mechanism of antigen recognition. eLife (2017) 6:e27311. doi:10.7554/eLife.27311
  3. Pieper K, Tan J, Piccoli L, Foglierini M, Barbieri S, Chen Y, Silacci-Fregni C, Wolf T, Jarrossay D, Anderle M, et al. Public antibodies to malaria antigens generated by two LAIR1 insertion modalities. Nature (2017) 548:597–601. doi:10.1038/nature23670

 

Helpful resources for people studying therapeutic antibodies

My work within OPIG involves studying therapeutic antibodies. It can be tough to find information about these commercial molecules, often known by unintelligible developmental names until the later stages of clinical trials. Their structures are frequently absent, as one might expect, but even their sequences are sometimes a nightmare to get hold of! Below is a list of resources that I have found particularly helpful.

IDENTITIES OF RELEVANT ANTIBODIES

1. Wikipedia (don’t judge!) is an extremely helpful resource to get started. It has the following lists:

(a) A list of FDA-approved therapeutic monoclonal antibody therapies
(b) A more general list of therapeutic, diagnostic and preventive monoclonal antibodies (includes some things that have been withdrawn)

2. The Antibody Society has a list of FDA/EU-approved antibodies and antibodies to watch on their website. NB: This is only available to members of the society (free for students and other concessions; standard membership is $100pa).

3. The journal ‘mAbs’ also has a series of ‘Antibodies to Watch in [Year]’ papers. Here are the ones for 2016, 2017 and 2018.

SEQUENCES

4. 137 clinical-stage (post-phase I) mAb sequences can be found in the SI of this paper by Jain et al.

5. A slightly outdated (last updated Nov 2016), but still extremely useful, resource of antibody sequences is this FASTA list, written by Dr Martin’s Group at UCL.

SEQUENCES & STRUCTURES

6. The IMGT monoclonal antibody database (mAb-DB) has been possibly the most helpful resource. This includes 798 entries of both therapeutics and non-therapeutics, so it’s helpful to get a list of the antibodies you are interested in first. You can search it with a wide range of parameters, including antibody name. A typical antibody result will include its mAb-DB ID, INN details, common & developmental names, species, receptor type and isotype, sequence (via the “IMGT/2Dstructure-DB” link), target, clinical trials details and – if available – the 3D structure (via the “IMGT/3Dstructure-DB” link).

7. SAbDab has a continually-updated section for all therapeutic antibody structures deposited in the PDB.

CURRENT STATUS OF THE THERAPEUTIC

8. Search for the therapeutic name on AdisInsight or Pharmacodia to see its current clinical trial status, and whether or not it has been withdrawn.

I just wanted TensorFlow

Finally got TensorFlow to install on my Mac. You’d be tempted to think, “Jin, it’s just a pip install, surely?”

No, macOS begs to differ! You see, if you’re on a slightly older macOS version like I was (10.12), then you’d still be using TLS 1.0 – long story short, when querying PyPI via pip to get any packages over TLS 1.0, your requests will get rejected. And this cutoff came into force something like a week ago – SAD! If you have macOS 10.13 or later, TLS should be set to 1.2, so you need not worry.
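
If you want to check whether your Python is affected before fighting with pip, here is a minimal sketch using only the standard library `ssl` module. The helper name `supports_tls12` is my own, and the version cutoff is a heuristic based on TLS 1.2 support having arrived in OpenSSL 1.0.1; it is an assumption, not something pip itself checks this way:

```python
import ssl


def supports_tls12(openssl_version_info):
    """Heuristic: TLS 1.2 support arrived in OpenSSL 1.0.1.

    Takes a version tuple shaped like ssl.OPENSSL_VERSION_INFO and
    compares only the (major, minor, fix) part.
    """
    return tuple(openssl_version_info[:3]) >= (1, 0, 1)


if __name__ == "__main__":
    # Which SSL library this Python build is linked against
    print("Linked SSL library:", ssl.OPENSSL_VERSION)
    print("TLS 1.2 capable:", supports_tls12(ssl.OPENSSL_VERSION_INFO))
```

Run it with the same interpreter that pip uses. If it reports an SSL library older than 1.0.1, that would be consistent with the rejected requests described above.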

TL;DR:

  1. Get a new version of pip (10.0); see Stack Overflow post.
  2. Install any dependencies for pip as necessary by doing tons of source compilations.
  3. Install desired package(s) as necessary.