Monthly Archives: March 2017

Interesting Antibody Papers

This time round, one older paper and one recent one. The older paper estimates how many H3s there can be in a human body, based on sequencing of two individuals (they cap it at 9 million, which is not that much!). The more recent one is an attempt to define what makes a good antibody in terms of its developability properties (a battery of biophysical assays on ~150 therapeutic antibodies: an amazing dataset to work with).

High resolution description of antibody heavy chain repertoires in (two) humans (Koralov lab at NYU). Here. Two individuals were sequenced and their VDJ frequencies measured. It is widely believed that VDJ recombination events are largely independent and random; here, however, the authors demonstrate biases/interplay between the D and J regions. Since H3 falls on the VDJ junction, such biases might constrain the total choice of H3. Another quite important point is that they compared productive and nonproductive sequences (out of frame or with stop codons). If there were significant differences between the VDJ frequencies of productive and nonproductive sequences, it would suggest selection at the VDJ recombination stage. However, they see no significant differences here, suggesting that VDJ combinations have little bearing on this initial selection step. Finally, they estimate the number of H3s in the repertoire. The technique is interesting: they sample 1000 H3s from their set and count how many unique sequences the sample contributes. Each successive sample contributes fewer and fewer new unique sequences, which leads to a log-decay curve. By extrapolating this curve they get a rough estimate of when no more new sequences would be added, and hence an estimate of diversity (think why do this rather than simply counting the number of uniques!). They allow themselves to extrapolate this estimate to the whole organism by multiplying up from their blood sample to the total human body volume, motivating this extrapolation by the fact that there was precious little overlap between the two human subjects.
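To make the sampling trick concrete, here is a minimal sketch of the idea in Python (not the authors' exact estimator), with a toy pool of labelled sequences standing in for real sequencing reads:

```python
import random

def saturation_curve(h3_pool, batch=1000, rounds=50):
    """Draw fixed-size batches of H3s (with replacement, like reads)
    and record how many NEW unique sequences each batch adds."""
    seen, new_per_batch = set(), []
    for _ in range(rounds):
        sample = random.choices(h3_pool, k=batch)
        before = len(seen)
        seen.update(sample)
        new_per_batch.append(len(seen) - before)
    return new_per_batch

# Toy pool standing in for real data: 20,000 distinct H3 sequences
pool = [f"H3_{i}" for i in range(20_000)]
print(saturation_curve(pool)[:10])  # contributions shrink batch by batch
```

Fitting a decay model to the new-uniques-per-batch curve and extrapolating to zero gives the diversity estimate; this is why the approach works even when the sample itself never contains every sequence, whereas a plain count of uniques only gives a lower bound.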

Biophysical landscape of clinical stage antibodies [here]. Paper from Adimab. Designing an antibody which binds its target is only the first step on the way to bringing a drug to market. The molecule needs to satisfy a variety of criteria: colloidal stability (it does not aggregate or ‘clump up’), it is not instantly cleared from the organism (which is usually down to off-target binding), it is stable, and it can be expressed in reasonable quantities. In an effort to delineate what makes a good antibody, the authors take inspiration from earlier work on small molecules, namely Lipinski’s Rule of Five. This set of rules describes what makes a ‘good’ small-molecule drug, and was derived by looking at ~2000 therapeutic drugs. The rules come down to certain numbers of hydrogen bond donors and acceptors, molecular weight, and lipophilicity. Jain et al. would like a similar methodology for antibodies: give them an antibody and, using the methodology/rules they define, they will tell you whether to carry on with development or maybe not. Since antibodies are far more complex and the data on therapeutic antibodies is orders of magnitude smaller (around 50 approved therapeutic antibodies to date), Jain et al. had to devise a more nuanced approach than simply counting hydrogen bond donors/acceptors, mass, etc. The underlying ‘good’ molecule data, though, is similar: they picked therapeutic antibodies and those in late clinical testing stages (phases 2 and 3). This resulted in ~150 antibodies. To devise the benchmark ‘rules/methodology’, they went for a battery of assays to serve as a benchmark: if your antibody raises too many red flags according to these assays, it is not great (what constitutes a red flag is defined below). The assays were chosen to be relatively accessible and easy to use, since the point is that an arbitrary antibody can easily be checked against them; they cover expression, cross-reactivity, self-reactivity, thermal stability, etc. To define red flags, they ran their therapeutic/clinical antibodies through the tests. To their surprise, quite a lot of these molecules turned out to have quite ‘undesirable characteristics’. Following the Lipinski approach, they define a red flag as being in the worst 10th percentile of the assay values as evaluated on the approved therapeutic antibodies. They show that antibodies which are approved or in more advanced clinical trial stages have fewer red flags. The take-home messages from this paper: a very nice dataset for any computational work, and raising red flags does not disqualify a molecule from becoming a therapeutic.
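The red-flag definition is easy to state in code. Below is a minimal sketch of the counting idea, with simulated numbers in place of the real assay data and the simplifying assumption that every assay is oriented so that lower values are worse:

```python
import numpy as np

# Simulated stand-in for the real data: 150 antibodies x 12 assays,
# oriented (by assumption) so that LOWER values are always worse.
rng = np.random.default_rng(0)
assay_values = rng.normal(size=(150, 12))

# Red flag = falling in the worst 10th percentile of an assay, with
# thresholds computed on the clinical-stage antibodies themselves.
thresholds = np.percentile(assay_values, 10, axis=0)
red_flags = (assay_values < thresholds).sum(axis=1)

# Antibodies with the most red flags are the least "developable"
for i in np.argsort(red_flags)[-3:]:
    print(f"antibody {i}: {red_flags[i]} red flags")
```

The worst-10% convention plays the role that Lipinski's fixed cutoffs play for small molecules: the thresholds are learned from molecules that have already succeeded.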

Biophysical Society 61st Annual Meeting – New Orleans, February 2017

As the sole representative of OPIG attending Biophys 2017 in New Orleans, I had to bear the heavy burden of a long and lonely flight and the fear of missing out on a week of the very grey Oxford winter. Having successfully crossed the border into the US, which was thankfully easier for me than it was for some of our scientific colleagues from around the world, I found my first time attending the conference to be full of very interesting and relevant science. Although the conference also covers a wide variety of experimental techniques and non-protein topics, it is so large and broad that there was more than enough to keep me busy over the five days: folding, structure prediction, docking, networks, and molecular dynamics.

There were several excellent talks on the subject of folding pathways, misfolding and aggregation. A common theme was the importance of the kinetic stability of the native state, and the mechanisms by which a protein may be prevented from reaching a non-native global thermodynamic minimum. This is particularly important for serpins, large protease inhibitors which inactivate proteases by a suicide mechanism. The native, active state is metastable: it can relax into a lower-energy conformation over long timescales. The same transition is triggered by cleavage near the C-terminal end, which allows insertion of the C-terminal tail into a beta sheet, holding the cleaving protease inactive; the energy stored in the metastable native state is therefore essential for function. Anne Gershenson described recent simulations and experiments to elucidate the order in which substructures of the complete fold assemble. There are many cooperative substructures in this case, and N-terminal helices form at an early stage. The overall topology appears to be consistent with a cotranslational folding mechanism inside the ER, but significant rearrangements after translation are required for adoption of the full native fold.

Cotranslational folding was also discussed by several others including the following: Patricia Clark is now using the YKB system of alternately folding fluorescent protein to find new translation stalling sequences; Anais Cassaignau described NMR experiments to show the interactions taking place between nascent chains and the ribosome at different stalled positions during translation; and Daniel Nissley presented a model to predict a shift in folding mechanism from post-translational to cotranslational due to specific designed synonymous codon changes, which agreed very well with experimental data.

To look more deeply into the evolution of folding mechanisms and protein stability, Susan Marqusee presented a study of the folding kinetics of RNases, comparing the properties of inferred ancestral sequences to those of a present-day thermophile and the mesophile E. coli. A number of reconstructed sequences were expressed, and it was found that, moving along either evolutionary branch from the ancestor to the modern day, folding and unfolding rates had both decreased, but the same three-state folding pathway via an intermediate is conserved across all ancestors. However, the free energy difference between the intermediate and the unfolded state has evolved in opposite directions along the two branches, even while the kinetic stability remains similar. This has led to the greater thermodynamic stability seen in the modern-day thermophile compared to the mesophile at higher temperatures and concentrations of denaturant.

Figure from the paper: Panel C shows that kinetic stability (a low unfolding rate) seems to be selected for in both environments. Panel D shows that the thermodynamic stability of the intermediate (relative to the unfolded state) accounts for the differences in the thermodynamic stability of the native state, when compared to the common ancestor at (0,0). Link to paper

There were plenty of talks discussing the problems and mechanisms of protein aggregation, with two focussing on light chain amyloidosis. Marina Ramirez-Alvarado is investigating how fibrils begin to grow, and showed using microscopy that both soluble light chains and fibrils (more slowly) are internalised by heart muscle cells. They can then be exposed at the cell surface and become a seed that recruits other soluble light chains to form fibrils. Shannon Esswein presented work on the enhancement of VL-VL dimerisation to prevent amyloid formation. The variable domain of the light chain (VL) can pair with itself in an orientation similar to its pairing with VH domains in normal antibodies, or in a non-canonical orientation. Since adding disulphide bonds to stabilise these dimers prevented fibril formation, they carried out a small-scale screen of 27 aromatic and hydrophobic ligands to find those which would favour dimer formation by binding at the interface. Sulfasalazine came out of this screen, was shown to significantly reduce fibril formation, and could therefore be used as a template for future drug design.

Figure from the paper: a ligand stabilises the dimer, so fewer light chains are present as monomers, slowing the only route by which fibrils can form. Link to paper

Among the posters, Alan Perez-Rathke presented loop modelling by DiSGro in beta-barrel membrane proteins, showing that the population of structures generated and scored favourably after relaxation at pH 7 led to an open pore more often than at pH 5, consistent with experimental observations. There were two posters on the prediction of membrane protein expression in bacteria and yeast, presented by students of Bill Clemons, who also gave a great talk. Shyam Saladi has carefully curated datasets of successes and failures in expression in E. coli and trained a linear SVM on features such as RNA secondary structure and transmembrane segment hydrophobicity to predict the outcome for unknown proteins. This simple approach (preprint available here) achieved an area under the ROC curve of around 0.6 on a separate test set, and using more complex machine learning techniques is likely to improve this. Samuel Schulte is adapting the same method for prediction of expression in yeast.
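As a rough illustration of that kind of pipeline (not Saladi's actual code; the features and data here are invented), a linear SVM for expression outcome could look like this:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Invented features: one row per protein (e.g. 5' mRNA structure
# stability, codon usage, transmembrane segment hydrophobicity)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = rng.integers(0, 2, size=500)  # 1 = expressed well, 0 = failed

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X_train, y_train)

# decision_function gives a continuous score, suitable for ROC analysis
print("AUC:", roc_auc_score(y_test, clf.decision_function(X_test)))
```

With informative features in place of the random ones, `roc_auc_score` reports the same area-under-ROC metric quoted on the poster.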

Overall, it was a great conference and it was nice to hear about plenty of experimental work alongside the more familiar computational work. I would also highly recommend New Orleans as an excellent place to find great food, jazz and sunshine!

Using Antibody Next Generation Sequencing data to aid antibody engineering

I consider myself a wet lab scientist, and I had not done any programming in a language like Python before starting my DPhil. My main interests lie in the development of improved antibody humanization campaigns, rational antibody phage display library construction, and antibody evolution. Having completed an industrial placement at MedImmune, I saw the biotechnology industry from the inside and realized that scientists who can bridge the computer science and wet lab fields are in high demand.

The title of my DPhil is very broad, and the research itself is data- rather than hypothesis-driven. Our research group collaborates with UCB Pharma, which has sequenced whole antibody repertoires across a number of species. Datasets might contain more than 10 million sequences of heavy and light variable chains. But even these datasets do not cover more than 1% of the theoretical repertoire, hence looking at entropies of sequences rather than mere sequences could provide insights into differences between intra- and inter-species datasets.
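As a concrete example of what "looking at entropies" means, here is a short sketch that computes the Shannon entropy of the amino acid distribution at each aligned position; the four toy sequences stand in for IMGT-aligned repertoire data:

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the amino acid distribution at one
    aligned position; higher means a more diverse position."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy aligned sequences standing in for IMGT-numbered repertoire data
repertoire = ["EVQLVES", "EVKLVES", "QVQLVQS", "EVQLLES"]
for i, column in enumerate(zip(*repertoire)):
    print(f"position {i + 1}: entropy = {column_entropy(column):.2f} bits")
```

Comparing such per-position entropy profiles between datasets is one way to quantify intra- versus inter-species differences without matching individual sequences.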

NGS of antibody repertoires provides snapshots of repertoire diversity and entropy as well as the sequences themselves. Reddy et al. (2010) showed that this information can be used to pull out target-specific variable chains, but most research groups believe that the main application of NGS is immunodiagnostics (Greiff et al., 2015).

My project involves applying software developed by our research group, namely ANARCI (Dunbar and Deane, 2016) and ABodyBuilder (Leem et al., 2016). The combination of the two tools allows analysis of NGS datasets at an unprecedented rate (1 million sequences per 7 hours). A number of manipulations can be performed on datasets to standardize them and make the data reproducible, which is a big issue in science. It is possible to re-assign germlines, numbering schemes and complementarity determining region (CDR) definitions of a 10-million-sequence dataset in less than a day. For instance, the data provided by UCB required the variable chains to be re-numbered according to the IMGT numbering scheme and CDR definition (Lefranc, 2011). The IMGT scheme was selected because it numbers CDR amino acids symmetrically, which allows improved assignment of positions to amino acids that occupy the same structural space in CDRs of different lengths (Figure 1).
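For illustration, numbering a single sequence with ANARCI's Python interface might look like the sketch below; the `number` helper and its return format are based on the version I have used, so consult the ANARCI documentation for your install, and the heavy chain here is just an example sequence:

```python
# Assumed interface: `number` and its return format follow the ANARCI
# version I have used; check the documentation for your install.
from anarci import number

# An example heavy chain variable domain sequence (illustrative only)
heavy_chain = ("EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVS"
               "AISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAK"
               "DRGYYFDYWGQGTLVTVSS")

numbering, chain_type = number(heavy_chain, scheme="imgt")
print("chain type:", chain_type)  # expected 'H' for a heavy chain
# Each element pairs an (IMGT position, insertion code) with a residue
for (pos, ins), aa in numbering[:5]:
    print(pos, ins.strip() or "-", aa)
```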

Figure 1. IMGT numbering and CDR definition of CDR3. Symmetrical assignment of positions to amino acids in HCDR3 allows for better localization of the V, D and J genes: the V gene encodes the amino terminus of CDR3, the J gene the carboxyl terminus, and the D gene the mid portion.

To sum up, analysis of CDR lengths and of CDR and framework amino acid compositions, and the discovery of novel patterns in antibody repertoires, will open up new rational steps in antibody humanization and affinity maturation. The key step will be to determine the amino acid scaffolds that define the humanness of an antibody, in other words scaffolds that are not immunogenic in humans.

References:

  1. Dunbar J. and Deane C.M. ANARCI: antigen receptor numbering and receptor classification. Bioinformatics (2016)
  2. Greiff V. et al. A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Medicine (2015)
  3. Leem J. et al. ABodyBuilder: automated antibody structure prediction with data-driven accuracy estimation. mAbs (2016)
  4. Lefranc M.P. IMGT, the International ImMunoGeneTics Information System. Cold Spring Harb Protoc. (2011)
  5. Reddy S.T. et al. Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells. Nat Biotech (2010)

Multiomics data analysis

Cells are the basic structural and functional units of living organisms. They are the site of many different biological processes, which can be probed by various experimental techniques. Until recently, the resulting data sets were analysed separately. To better understand the underlying biological processes and how they influence each other, however, techniques that integrate data from different sources are needed [1].

In the image below you can see the four main entities that are active throughout the cell: the genome, RNA, proteins, and metabolites. All of them are in constant interaction; for example, some proteins are transcription factors and influence the transcription of DNA into RNA. Metabolites present in the cell influence the activity of proteins as ligands, but at the same time are altered through enzymatic activity. This interconnectedness makes it clear that probing the system at a single level gives only limited insight into the structure and function of cellular processes.

 

[Image: multiomics schematic]

The different levels of biological information (genome, proteome, …) interact and influence each other through processes such as transcription regulation by transcription factors. All levels are influenced by external factors, such as drug treatment or nutrient availability. Multiomics is the measurement of several of these levels and their integrated analysis.

In recent years, different ways to integrate such data have been developed. Broadly speaking, there are three levels of data integration: conceptual integration, statistical integration, and model-based integration [2]. Conceptual integration means that the data sets are analysed separately and the conclusions are compared and integrated. This method can easily reuse existing analysis pipelines, but the way in which conclusions are compared and integrated is non-trivial. Statistical integration combines data sets and analyses them jointly, reaching conclusions that match all the data and potentially finding signals that are not observable with the conceptual approach. Model-based integration denotes the joint analysis of the data through the training of a model, which itself may incorporate prior beliefs about the system. A toy example of the statistical approach is sketched below.
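As a toy example of statistical integration (my own sketch, not taken from [2]): z-score each omics block so that no level dominates, concatenate the blocks for the same samples, and analyse them jointly, for instance with PCA:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: the same 50 samples measured at two omics levels
rng = np.random.default_rng(2)
transcriptome = rng.normal(size=(50, 200))  # e.g. gene expression
metabolome = rng.normal(size=(50, 40))      # e.g. metabolite levels

# Standardise each block so neither level dominates, then analyse jointly
joint = np.hstack([StandardScaler().fit_transform(block)
                   for block in (transcriptome, metabolome)])
components = PCA(n_components=2).fit_transform(joint)
print(components.shape)  # (50, 2): a shared low-dimensional view
```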

[1] Gehlenborg, Nils, Seán I. O’Donoghue, Nitin S. Baliga, Alexander Goesmann, Matthew A. Hibbs, Hiroaki Kitano, Oliver Kohlbacher et al. “Visualization of omics data for systems biology.” Nature Methods 7 (2010): S56-S68.

[2] Cavill, Rachel, Danyel Jennen, Jos Kleinjans, and Jacob Jan Briedé. “Transcriptomic and metabolomic data integration.” Briefings in bioinformatics 17, no. 5 (2016): 891-901.