Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

Because not all interesting biology is health-related!

Nowadays, biological research revolves around health: cancer, neuroscience, immunology, pharmacology, and many more health-related areas that are being studied in depth. It seems that everyone is keen to spend their lives looking for a cure for cancer or Alzheimer’s. What a drag! For this reason (and also to show that research in less popular and less well-funded sectors can also significantly improve human lives), I have decided to write about something completely different: the plant microbiome!

Indeed, I am going to write about bacteria. And no, they are not related to health at all. These bacteria live in the soil and infect plants. However, they are not “bad”. Actually, they favour the plant’s growth and development. This is possible thanks to a fascinating process which finishes (SPOILER ALERT!!) with the bacteria transforming atmospheric nitrogen into ammonia that can be used by the plant (nitrogen fixation).

The process starts with some kind of small talk between Rhizobium (the bacteria) and the legume (the plant): Legumes secrete compounds through their roots that the bacteria living close by can detect. In response to this stimulus, bacteria approach the root hairs of the plant and attach and secrete lipo-chitooligosaccharides known as Nod factors.

It continues with some action: The plants sense the Nod factors, which induce the root hairs to curl and trap the bacteria. The bacteria continue to grow and eventually form an infection thread whose growth allows the bacteria to reach other plant cells.

And it finishes with a happily-ever-after ending: A structure called a nodule is formed. Inside the nodule cells, the bacteria are enclosed in an organelle-like compartment called the symbiosome, within which they differentiate into a state called the bacteroid. In this state, the bacteroid fixes nitrogen for the plant.

I know… Everything has happened too fast (the process can take 1 – 2 weeks). And I have not been bothered to explain it in detail so you can enjoy reading this amazing review: https://www.ncbi.nlm.nih.gov/pubmed/23493145

But wait! I almost forgot to say why this is worth studying… The point is that plants need nitrogen to grow and they cannot use atmospheric nitrogen. Therefore, the more nitrogen they receive from the bacteria, the more they will grow. Consequently, we may increase the quantity of food available by improving this process.

Vim and I

Vim is great. Despite its steep learning curve, it has many advantages, and many loyal Vim followers will tell you that you should force yourself to use it.

Personally, I started using Vim when I was ssh-ing into the group servers or into my computer in the department. In such scenarios, I could not open the IDEs with the nice GUIs 🙁 However, as time passed, Vim started to grow on me. Now I can list a few reasons why I think it is great: it requires a small amount of memory to run, has a short start-up time and can handle large files pretty well.

Although I am definitely not a Vim expert, I will tell you about some of the things I have added to my .vimrc. The .vimrc file is very handy for holding all your favourite settings, such as key mappings, custom commands, formatting and syntax highlighting. The file uses Vimscript, which is a programming language in itself. However, there is a lot of help online that tells you what lines to add to your .vimrc. I would recommend installing Vundle, which is a Vim plugin manager.

Here I will list some cool things that I have discovered you can do with your .vimrc. They have certainly made my life a bit nicer.

  1. Code Folding
    Most IDEs provide a way to collapse functions and classes so that you only see the function/class definition and the code is hidden. To do this in Vim, add the following lines to your .vimrc:

    " Enable folding
    set foldmethod=indent
    set foldlevel=99
    " Enable folding with the spacebar
    nnoremap <space> za


    Alternatively, you can install the Vim plugin SimpylFold.

  2. Python indentation
    Vim does not do auto-indentation out of the box like many IDEs. To get PEP 8 indentation for Python automatically, add the following to your .vimrc:

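    " PEP 8 indentation
    " A single :setlocal takes all the options, so the continued lines
    " join into one valid command that only affects the Python buffer.
    au BufNewFile,BufRead *.py
        \ setlocal tabstop=4
        \ softtabstop=4
        \ shiftwidth=4
        \ textwidth=79
        \ expandtab
        \ autoindent
        \ fileformat=unix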
    

    You can also install the Vim plugin vim-flake8 which is a static syntax and style checker for Python source code. It shows errors in a quickfix window and lets you jump to their location inside your code.

  3. Turn line numbers on 
    Rather than typing
    :set nu
    every time you open a file, you can have line numbers always turned on by adding set nu to your .vimrc.
  4. Autocompletion 
    When I switched from PyCharm to Vim I felt a bit lost without autocompletion. However, after a quick search I found that many people are using the Vim plugin YouCompleteMe, and it is awesome.

Finding the lowest energy conformation of a given molecule!

Generating low-energy molecular conformers is important for many areas of computational chemistry, molecular modeling and cheminformatics. Many tools have been developed to generate conformers, including BALLOON (1), Confab (2), FROG2 (3), MOE (4), OMEGA (5) and RDKit (6). The search algorithms implemented in these tools can be broadly classified as either systematic or stochastic. These algorithms primarily focus on generating geometrically diverse low-energy conformers. Here, we are interested in finding the lowest-energy conformation of a molecule rather than achieving geometric diversity, and Bayesian optimization is used to find it (7).
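To make the systematic/stochastic distinction concrete, here is a minimal Python sketch on a toy problem. Everything in it is an assumption made for illustration: the one-dimensional energy function `toy_torsion_energy` is made up (it is not a force field), and the two search functions are not taken from any of the tools listed above. It is also not the Bayesian optimization approach from the post, just the two baseline search styles.

```python
import math
import random


def toy_torsion_energy(phi_deg):
    """Hypothetical single-torsion energy in arbitrary units (not a real force field)."""
    phi = math.radians(phi_deg)
    return 1.0 + math.cos(3.0 * phi) + 0.5 * math.cos(phi)


def systematic_search(energy, step_deg=10):
    """Evaluate the energy on a regular grid of torsion angles and keep the best."""
    angles = range(0, 360, step_deg)
    best = min(angles, key=energy)
    return best, energy(best)


def stochastic_search(energy, n_samples=2000, seed=0):
    """Evaluate the energy at randomly drawn torsion angles and keep the best."""
    rng = random.Random(seed)
    samples = [rng.uniform(0.0, 360.0) for _ in range(n_samples)]
    best = min(samples, key=energy)
    return best, energy(best)


if __name__ == "__main__":
    print("systematic:", systematic_search(toy_torsion_energy))
    print("stochastic:", stochastic_search(toy_torsion_energy))
```

For a real molecule the search space has one dimension per rotatable bond, so the grid grows exponentially while random sampling becomes sparse. That is where a smarter strategy for choosing the next point to evaluate starts to pay off.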

What can you do with the OPIG Antibody Suite?

OPIG has now developed a whole range of tools for antibody analysis. I thought it might be helpful to summarise all the different tools we are maintaining (some of which are brand new, and some are not hosted at opig.stats), and what they are useful for.

Immunoglobulin Gene Sequencing (Ig-Seq/NGS) Data Analysis

1. OAS
Link: http://antibodymap.org/
Required Input: N/A (Database)
Paper: http://www.jimmunol.org/content/201/8/2502

OAS (Observed Antibody Space) is a quality-filtered, consistently annotated database of all of the publicly available next-generation sequencing (NGS) data of antibodies.

Mol2vec: Finding Chemical Meaning in 300 Dimensions

Embeddings of Amino Acids

2D projections (t-SNE) of Mol2vec vectors of amino acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in the present example). The directions of the vectors provide a visual representation of similarities. Magnitudes reflect importance, i.e. more meaningful words. [Figure from Ref. 1]

Natural Language Processing (NLP) algorithms are usually used for analyzing human communication, often in the form of textual information such as scientific papers and Tweets. One aspect, coming up with a representation that clusters words with similar meanings, has been achieved very successfully with the word2vec approach. This involves training a shallow, two-layer artificial neural network on a very large body of words and sentences — the so-called corpus — to generate “embeddings” of the constituent words into a high-dimensional space. By computing the vector from “woman” to “queen”, and adding it to the position of “man” in this high-dimensional space, the answer, “king”, can be found.
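The vector arithmetic described above can be sketched in a few lines of Python. The three-dimensional vectors below are hand-picked, hypothetical values chosen so that the analogy works; real word2vec embeddings are learned from a corpus and have hundreds of dimensions. The helper names `cosine_similarity` and `analogy` are mine and do not come from any word2vec library.

```python
import math

# Hypothetical, hand-made "embeddings" for illustration only.
TOY_EMBEDDINGS = {
    "man": (1.0, 0.0, 0.0),
    "woman": (1.0, 1.0, 0.0),
    "king": (1.0, 0.0, 1.0),
    "queen": (1.0, 1.0, 1.0),
    "apple": (0.0, 0.2, 0.1),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(start, end, query, embeddings):
    """Add the vector from `start` to `end` onto `query` and return the nearest other word."""
    target = tuple(
        e - s + q
        for s, e, q in zip(embeddings[start], embeddings[end], embeddings[query])
    )
    candidates = [w for w in embeddings if w not in (start, end, query)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))


if __name__ == "__main__":
    # Vector from "woman" to "queen", added to the position of "man".
    print(analogy("woman", "queen", "man", TOY_EMBEDDINGS))
```

The three input words are excluded from the candidates, which is the usual convention for this kind of query since the nearest vector is otherwise often one of the inputs.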

A recent publication by one of my former InhibOx colleagues, Simone Fulle, and her co-workers, Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec” [1]. They also released a Python implementation, available on Samo Turk’s GitHub repository.
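As the figure caption says, a compound vector is obtained by summing the vectors of the substructures present in the molecule. A minimal sketch of that summation step is shown below. The substructure labels and three-dimensional vectors are hypothetical placeholders, and `molecule_vector` is an illustrative helper rather than a function from the mol2vec package, which works with learned 300-dimensional vectors for Morgan substructures.

```python
# Hypothetical substructure vectors; in mol2vec these would be learned
# embeddings of Morgan substructure identifiers.
SUBSTRUCTURE_VECTORS = {
    "sub_A": (0.5, 0.25, 0.0),
    "sub_B": (0.0, 0.5, 1.0),
    "sub_C": (1.0, 0.0, -0.5),
}


def molecule_vector(substructures, vectors):
    """Sum the vectors of all substructures found in a molecule.

    A substructure that occurs several times is counted several times.
    """
    dimension = len(next(iter(vectors.values())))
    total = [0.0] * dimension
    for name in substructures:
        for i, value in enumerate(vectors[name]):
            total[i] += value
    return tuple(total)


if __name__ == "__main__":
    # A made-up molecule containing sub_A once and sub_B twice.
    print(molecule_vector(["sub_A", "sub_B", "sub_B"], SUBSTRUCTURE_VECTORS))
```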


Cinder: Crystallographic Tinder

Protein structure determination is still dominated by X-ray diffraction. For diffraction studies, structural biologists need to grow and optimise protein crystals until they diffract to a usable and optimal resolution. A purified protein sample is exposed to a number of crystallisation screens, each comprising a selection of chemical conditions that are designed to explore a reasonably wide area of potential crystallisation conditions.

Many crystallography labs routinely image these in large plate storage systems, which reduces the human interaction to viewing a set of usually 100-1000 images at various time points. This is a slow and laborious process, and one well suited to machine learning approaches tailored to looking at images. TexRank, a texton analysis ranking software, was developed by Jia Tsing in OPIG and is used at the Structural Genomics Consortium (SGC). This ranking reduces the number of images that a human needs to search through, providing a quicker review process.

Rasmus Fonseca and GetContacts

We welcomed Rasmus Fonseca to last week’s OPIG Group Meeting. Rasmus is currently a Visiting Scholar at Stanford. He gave a fascinating talk about the interaction analysis of molecular structures and ensembles using the GetContacts package, one of many projects that he has contributed to that you can find on his GitHub repo.

Rasmus was kind enough to share his slides with us:

https://docs.google.com/presentation/d/1HmN9AuU4gL-jMlJdR6cMleueQ-nRWOE_hiWtO8OQEoo/edit?usp=sharing

He is looking for new users (and developers), so if you have questions, he would be very happy to help get you started.

ISMB 2018 (Chicago): Summary of Interesting Talks/Posters

Catherine’s Selection

Network approach integrates 3D structural and sequence data to improve protein structural comparison

Why: Current graph mapping in protein structural comparison ignores the sequence order of residues. Residues distant in sequence but close in 3D space are more important.
How: Introduce the sequence order of residues, set a sequence-distance cutoff to select structurally important residues, count the graphlet frequencies and embed them into PCA space.
Results: The new method is predictive of SCOP and CATH ‘groups’. Certain graphlets are enriched in alpha and beta folds.
Link: https://www.nature.com/articles/s41598-017-14411-y
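The idea of keeping only residue pairs that are close in 3D space but distant in sequence can be sketched as below. This is only an illustration of that filtering step under my own assumptions: one C-alpha coordinate per residue, an 8 Å spatial cutoff and a minimum sequence separation of 4 are placeholder choices, not the values or the implementation used in the paper, and graphlet counting and PCA are not shown.

```python
import math


def long_range_contacts(ca_coords, distance_cutoff=8.0, min_seq_separation=4):
    """Return index pairs (i, j) that are close in space but distant in sequence.

    ca_coords: list of (x, y, z) tuples, one per residue, in sequence order.
    Both cutoffs are illustrative assumptions.
    """
    edges = []
    n_residues = len(ca_coords)
    for i in range(n_residues):
        # Starting at i + min_seq_separation skips sequence neighbours.
        for j in range(i + min_seq_separation, n_residues):
            if math.dist(ca_coords[i], ca_coords[j]) <= distance_cutoff:
                edges.append((i, j))
    return edges


if __name__ == "__main__":
    # A made-up hairpin: two antiparallel strands 5 Å apart.
    hairpin = [
        (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
        (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
    ]
    print(long_range_contacts(hairpin))
```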

Investigating the molecular determinants of Ebola virus pathogenicity

Why: Reston virus is the only Ebola virus that is not pathogenic to humans.
What they do: multiple sequence alignment to look for specificity determining positions (SDPs) using s3det, then predict the effect of each individual SDP on the stability of the protein with mCSM.
Results: VP40 SDPs alter octamer formation and the structure of the hydrophobic core. VP24 SDPs are predicted to impair binding to human KPNA5, the interaction through which VP24 inhibits interferon signalling.
Impact: Only a few SDPs distinguish Reston VP24 from the VP24 of the other Ebola viruses. Human-pathogenic Reston viruses may emerge.
Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5558184/#__ffn_sectitle

Computational Analysis Highlights Key Molecular Interactions and Conformational Flexibility of a New Epitope on the Malaria Circumsporozoite Protein and Paves the Way for Vaccine Design

Why: An antibody with a strong binding affinity was found in a group of subjects. This antibody prevents cleavage of the surface protein.
What they do: They identified the linear epitope, crystallised the strong and medium binders, and ran molecular dynamics simulations to assess the flexibility of the structures.
Results: The strong binder is less flexible. Moreover, the strong binder is similar to the germline sequence, which may mean that this antibody could have been readily formed.
Link: https://www.nature.com/articles/nm.4512



Matt’s Selection

“Analysis of sequence and structure data to understand nanobody architectures and antigen interactions”
Laura S. Mitchell (Colwell Group)
University of Cambridge, UK

This poster detailed the work from Laura’s two most recent publications, which can be found here: https://doi.org/10.1002/prot.25497, https://doi.org/10.1093/protein/gzy017

They describe a comprehensive analysis of the binding properties of the 156 non-redundant nanobody-antigen (Nb-Ag) complexes in the PDB/SAbDab (October 2017). Their analyses include Nb sequence variability (both global and across the binding regions), contact maps of nanobody-antigen interactions by region, and the typical chemical properties of each paratope. Nb-Ag complexes are compared to a reference set of monoclonal antibody-antigen (mAb-Ag) complexes. This work is a key first step in advancing our understanding of Nb paratopes, and will aid the development of new diagnostics and therapeutics.

“OSPREY 3.0: Open-Source Protein Redesign for You, with Powerful New Features”
Jeffrey W. Martin (Donald Group)
Duke University, USA

OSPREY 3.0 (https://www.biorxiv.org/content/early/2018/04/23/306324) represents a large advance towards time-efficient continuous flexibility modelling of protein-protein interfaces.

Its new algorithms LUTE and BBK* allow for continuous rotamer flexibility searching and entropy-aware binding constant approximation in a much more efficient manner. The CATS algorithm also introduces local backbone flexibility as a long-awaited feature. This software now has an easy-to-use Python interface and is fully open source, making it an extremely attractive alternative to proprietary protein design tools.

“Functional annotation of chemical libraries across diverse biological processes”
Scott Simpkins
University of Minnesota-Twin Cities, USA

This interesting talk detailed the work published in Nature Chemical Biology in September 2017 (https://doi.org/10.1038/nchembio.2436).

310 yeast gene-deletion mutants were isolated to perform chemical-genetic profile studies across six diverse small molecule high-throughput screening libraries. By studying which gene-deletion mutants were hypersensitive or resistant to each compound, the researchers could assign most members of each chemical library a probable functional annotation. Mapping back to gene-interaction profile data also allowed them to infer likely targets for some compounds. The GO annotations associated with these genes could then be used to assess whether a given starting library is likely to contain promising starting points that affect a given biological function. For example, the authors highlighted a deficiency across all libraries against the cellular processes of cytokinesis and ribosome biogenesis. Conversely, they found a large enrichment across all libraries for compounds likely to affect glycosylation or cell wall biogenesis. Compounds that target transcription and chromatin organisation were found to be enriched in certain datasets and depleted in others. This kind of profiling gives researchers a way of judging a priori whether a given screening library is likely to contain promising lead compounds, given the functional role of the target of interest.

Protein Engineering and Structure Determination

Sometimes it can be advantageous to combine two proteins into one. One such technique was described by Jennifer Padilla, Christos Colovos, and Todd Yeates back in 2001 (Padilla, et al., 2001). By connecting two proteins, one that dimerized and another that trimerized, they were able to design synthetic ‘nanohedra’. The way they achieved this was by extending a C-terminal α-helix at the end of one protein by another α-helix ‘linker’, directly into the N-terminal α-helix of another protein.

Prague Protein Spring 2018

We, Constantin and Dominik, the newest members of OPIG (SABS rotation students, as usual), were lucky to have a conference relevant to our research within our rotation period and, granted an allowance from the powers that be, were able to visit this year’s Prague Protein Spring with the topic ‘Proteins at Work’. There, we spent four busy but very inspirational days with about 50 participants in a little palace, the Vila Lanna.

The general topic of this meeting led to a broad variety of talks representing a multitude of fields of protein research: from the origins of life, through fuzzy intrinsically disordered proteins and crowded cells, to metagenomics and functional sequence alignment annotation.

We picked four thought-provoking talks to present at the group meeting on 08/05/2018; here are their summaries:

Protein engineering and in vitro evolution studies for the origins of life

Kosuke Fujishima from Tokyo Institute of Technology presented several examples of the research he conducts in the area of origins of life. Research on the origins of life is generally based around the questions of how prebiotic monomers were created, how they condensed into polymers and how functionality emerged within these polymers.

The first example of his research deals with the condensation of prebiotic monomers at the interface between the ocean and the earth’s crust. Water cycling between the ocean and the outer layers of the earth’s crust provided an environment of high pressure and high temperatures (80 – 200 °C), which is necessary for amino acid polymerisation. The mineral olivine was found to attract amino acids to its surface, and the serpentinisation reaction that olivine undergoes might provide the necessary wet/dry cycle. Therefore, the researchers built a reactor to investigate this potential polymerisation mechanism. They found that when six prebiotic amino acids were provided, 28 out of 36 possible dipeptides could be found in the reactor. Furthermore, linear polypeptides of up to 10 residues could be detected as well, providing evidence for a mechanism by which polypeptides could have been generated on the early earth [unpublished].
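The figure of 36 possible dipeptides follows from counting ordered pairs of six amino acids (6 × 6). A quick Python check is shown below. The labels `aa1` to `aa6` are placeholders, since the talk summary does not say which six amino acids were used.

```python
from itertools import product

# Placeholder labels: the six amino acids are not named in the summary.
amino_acids = ["aa1", "aa2", "aa3", "aa4", "aa5", "aa6"]

# A dipeptide is an ordered pair (N-terminal residue, C-terminal residue),
# so homodipeptides and both orderings of each pair are counted.
possible_dipeptides = list(product(amino_acids, repeat=2))

n_possible = len(possible_dipeptides)  # 6 * 6 = 36
n_detected = 28                        # number reported in the talk
fraction_detected = n_detected / n_possible

if __name__ == "__main__":
    print(n_possible)
    print(f"{fraction_detected:.1%}")
```

So the reactor produced a little over three quarters of all possible dipeptides.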

The second project showed that both enzymes responsible for the present-day production of cysteine from serine, CysE and CysK, could be re-engineered to contain no cysteine in their sequence. Interestingly, cysteine-free CysE showed higher reaction rates than the wild type. Additional reduction to cysteine- and methionine-free enzyme sequences worked for CysE but not for CysK [Fujishima et al. (2018)]. Still, the experiments indicate that an enzyme world could have existed with a reduced number of amino acids compared to the 20(+) amino acids that we know today.

The third project we wanted to point out used a type of mRNA display that not only links the genotype (mRNA) with its corresponding phenotype (translated protein) but also allows the translated protein to interact with a randomised, non-translated part of the mRNA. This provided a framework for investigating the evolution of ribonucleotide-binding (RNP) proteins. When selecting for ATP-binding, it was observed that protein together with RNA had the best fitness landscape compared to protein selection or RNA selection alone. Further analysis revealed that most of the binding affinity of the ribonucleotide protein stemmed from its RNA part [unpublished]. These results suggest that RNA and proteins co-evolved, opposing the idea of a pure RNA world.

RNA-protein interactions and the structure of the genetic code

The next speaker added more to the research area of RNA-protein interaction and evolution. Bojan Zagrovic from the University of Vienna presented his research around the finding that the pyrimidine (PYR) density of RNA regions is correlated with the corresponding protein region’s affinity to pyrimidine-containing bases (running means of 21 amino acids or 63 bases were used), with the highest correlation between mRNA PYR density and guanine affinity, having an average ‘typical’ Pearson correlation coefficient of 0.80 [Polyansky & Zagrovic (2013)].
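The two ingredients of this analysis, a running-mean pyrimidine density along an mRNA and a Pearson correlation between two profiles, can be sketched as follows. This is a minimal illustration, not the authors’ pipeline: the function names are mine, and how the 63-base mRNA profile is aligned with the 21-residue protein affinity profile is not described in the talk summary, so that step is left out.

```python
import math


def running_pyrimidine_density(rna, window=63):
    """Fraction of pyrimidines (C or U) in each sliding window along a sequence.

    DNA-style input is accepted: T is treated as U.
    """
    sequence = rna.upper().replace("T", "U")
    flags = [1 if base in "CU" else 0 for base in sequence]
    return [
        sum(flags[start:start + window]) / window
        for start in range(len(flags) - window + 1)
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Tiny made-up sequence with a short window so the output is readable.
    print(running_pyrimidine_density("CCUUAAGG", window=4))
```

The window of 63 bases matches the 21 amino acids mentioned above because each amino acid is encoded by three bases.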

This correlation is specific to the current genetic code, as shown by randomly generated genetic codes, which could not reproduce such correlated behaviour, and by looking into three organisms with very different codon usage bias (Homo sapiens, E. coli, M. jannaschii). Even though the three averages of codon usage were very different, the highest correlating pairs of mRNA and cognate proteins clustered together, having very similar codon usage. This was also true for the worst correlating pairs [Hlevnjak & Zagrovic (2015)].

But the big question is: what does this correlation imply functionally?

Annotation analysis revealed that the highest correlating pairs were enriched in nucleotide-binding functions and intrinsically disordered proteins. Without claiming generalisability, Professor Zagrovic pointed out a case study done on RNA polymerase II, which has a long disordered C-terminus built up of 26 repeats of a 7 amino acid motif. 248 RNAs were found to interact with RNA polymerase II, and in all three reading frames of the interacting RNAs, amino acid codons of the polymerase’s C-terminus were enriched [unpublished].

This indicates some regulation over gene expression but also several other hypotheses were made: the correlation between the protein regions’ affinity for their cognate mRNA regions might be relevant in virus assembly, since coding RNA and translated proteins have to be in close proximity with each other. The same could be true for some non-membrane-bound compartments, e.g. P-bodies. Or is this correlation characteristic a hint to mRNAs acting as chaperones for their respective proteins? The functional implications of this correlation, while highly speculative, nevertheless suggest exciting research to come in the future.

Fuzziness in protein assemblies

Research from a different but equally thought-provoking field was presented by Mónika Fuxreiter from the University of Debrecen. Her talk on the concept of fuzziness in protein complexes, which she introduced 10 years ago [Tompa & Fuxreiter (2008)], shed light on some more recent developments in the field as well as explaining the underlying concept for those of us (ourselves included) who had not encountered it before.

Fuzziness in the context of protein complexes describes a phenomenon in which intrinsically disordered proteins, instead of folding upon binding as one would usually observe, can sample several conformational states with different propensities, leading to the sampled states contributing with different strengths to the function of the protein complex and further leading to varying degrees of disorder in the bound state.

This observation has several implications for the understanding of the functionality of disordered proteins, since the relative propensity for different ensemble states in the bound form is thought to be highly susceptible to milieu influences, such as tissue specific splicing and post-translational modifications. Fuzziness (a term that was borrowed from the mathematical theory of fuzzy sets) could thus be a driver of functional adaptability of disordered proteins to cell-cycle stage, environmental influences or tissue type.

Evidence for fuzziness has been curated by the Fuxreiter group since 2015 [Miskei et al. 2017] in the FuzDB database and has recently been used to develop a prediction algorithm [unpublished] that, according to Professor Fuxreiter, achieves highly accurate predictions of fuzziness on a comprehensive validation dataset.

The implications of fuzziness for the understanding of the mode of action of disordered proteins (and disordered regions in otherwise ordered proteins) certainly piqued our interest, not least due to the potential importance of a clear understanding of these modes of action for drug development.

Investigation of mutually exclusive splicing events using the CATH FunFam framework

The last of the 4 talks we would like to single out in this blogpost highlighted recent progress in using structure-based databases for the investigation of complex cellular events.

Christine Orengo from UCL presented her group’s work on mutually exclusive splicing, which employed the FunFam framework of the CATH database to probe the structural and functional implications of these splicing events [Lam et al. (2018), under review].

The FunFams are a subcategory of CATH’s homologous superfamilies, which further divide the superfamilies based on clusters of residue conservation within each family, thus creating groupings of functionally related proteins [Rentzsch & Orengo (2013)].

The mutually exclusive splicing events investigated using this framework are splicing events in which only one of several specific exons is present in the spliced mRNA. These exons usually show a high level of sequence similarity, leading to a low disruption of the protein structure by the splicing event. It is thought that this feature is a reason for the relative enrichment of mutually exclusive exons amongst alternative splicing events in the proteome.

This high degree of sequence similarity further enabled the mapping of the mutually exclusive exons to FunFams in the CATH database and thus further onto protein structures. This allowed the Orengo group to conduct a ‘large scale systematic study of the structural/functional effects of MXE splicing’.

Their analysis found that variable residues between the exons are significantly enriched at the protein surface, both compared to other stretches of the protein sequence and compared to non-variable residues in the exons, and in close proximity (< 6 Angstroms) to functional sites of the protein.
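The proximity criterion (closer than 6 Å to a functional site) can be illustrated with a short Python sketch. The names and data layout here are my own assumptions: each residue is represented by a single coordinate, and `residues_near_functional_site` is a hypothetical helper, not part of CATH or of the Orengo group’s analysis code.

```python
import math


def residues_near_functional_site(residue_coords, site_coords, cutoff=6.0):
    """Return IDs of residues closer than `cutoff` Å to any functional-site coordinate.

    residue_coords: dict mapping a residue ID to one representative (x, y, z)
                    coordinate (a simplifying assumption).
    site_coords:    list of (x, y, z) coordinates of functional-site residues.
    """
    near = []
    for residue_id, coord in residue_coords.items():
        if any(math.dist(coord, site) < cutoff for site in site_coords):
            near.append(residue_id)
    return near


if __name__ == "__main__":
    # Made-up coordinates in Å for three variable residues and one site residue.
    variable_residues = {
        "A10": (0.0, 0.0, 0.0),
        "A11": (5.0, 0.0, 0.0),
        "A12": (12.0, 0.0, 0.0),
    }
    functional_site = [(1.0, 0.0, 0.0)]
    print(residues_near_functional_site(variable_residues, functional_site))
```

Counting how many variable versus non-variable residues pass this filter is the kind of tally that an enrichment test would then be run on.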

The main conclusion drawn from these findings was that, as previously hypothesised, mutually exclusive exons are likely functional switches, since changes in the surface exposed area close to functional sites are likely to affect the protein function without strongly disrupting its structure.

In the eyes of the Orengo group, this makes these splicing events good candidates for drug targeting, particularly in cases where a tissue specific isoform can be drugged, since in that case off-target effects could potentially be significantly reduced.

Sources:

Fujishima et al. (2018). Reconstruction of cysteine biosynthesis using engineered cysteine-free enzymes. Scientific Reports

Hlevnjak & Zagrovic (2015). Malleable nature of mRNA-protein compositional complementarity and its functional significance. Nucleic Acids Res

Lam, S. D., Orengo, C., & Lees, J. (2018). Protein structure and function analyses to understand the implication of mutually exclusive splicing. BioRxiv

Miskei, M. et al (2017). FuzDB: Database of fuzzy complexes, a tool to develop stochastic structure-function relationships for protein complexes and higher-order assemblies. Nucleic Acids Research

Polyansky & Zagrovic (2013). Evidence of direct complementary interactions between messenger RNAs and their cognate proteins. Nucleic Acids Res

Rentzsch, R., & Orengo, C. A. (2013). Protein function prediction using domain families. BMC Bioinformatics

Tompa, P., & Fuxreiter, M. (2008). Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends in Biochemical Sciences