Author Archives: Clare West

Combining Inset Plots with Facets using ggplot2

I recently spent some time working out how to include mini inset plots within ggplot2 facets, and I thought I would share my code in case anyone else wants to achieve a similar thing. The resulting plot looks something like this:


A Brief Introduction to ggpairs

In this blog post I will introduce a fun R plotting function, ggpairs, that’s useful for exploring distributions and correlations.


Not-Proteins in Parliament

Last term I took a break from folding proteins to spend three months working in Westminster at the Parliamentary Office of Science and Technology (POST).

The UK Research and Innovation (UKRI) Policy Internships Scheme gives PhD students the opportunity to spend three months in a range of policy-relevant organisations, from Government departments to the Royal Society. Applications are open to research council funded PhD students (currently including EU students). The scheme includes a three-month stipend extension, and travel/accommodation expenses are covered either by the host partner or the training grant holder.


Introduction to R Markdown

Two of our esteemed OPIGlets presented a workshop on collaborative research using Jupyter Notebook this week at ISMB in Chicago. Their workshop highlights the importance of finding ways to share your work conveniently and reproducibly. So on a related note, I thought I would share a brief introduction to another useful tool, R Markdown with RStudio, which I use to present updates to various supervisors and to remember what I did three months (or three days) ago. This method of sharing work is highly readable, reproducible, and narrative-driven.

I use R for much of my data analysis and all of my visualisation, and I count the tidyverse among my most beloved friends. If you’re so inclined, it’s easy to execute python, bash, and more from within R Markdown. You also don’t need to use RStudio to use R Markdown, but that’s a whole other story.

Starting a new markdown file in RStudio will generate a template script explaining most of what you need to know. If I showed you that then I’d be out of a blog post, but I will at least link to the R Markdown Reference Guide.

R Markdown files consist of text written in markdown, and code chunks that can be individually executed and displayed inline within RStudio. To “knit” the whole thing together, the knitr package is used to execute and combine code chunks, then pandoc converts the whole thing into an attractive document.

Here’s an example. The metadata at the top sets up the document. I’ll be generating an html document here, but notice some other tempting examples commented out. Yes, you can use it for Latex (swoon). You can even make a Word document, but really, why would you?

---
title: "Informative Title"
author: "Clare E. West"
date: "10/07/2018"
output: html_document
#output: beamer_presentation
#output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(ggplot2)
library(tidyr)
library(dplyr)
```

## Big Title
### Smaller title

R Markdown scripts have the extension .Rmd

R Markdown is __so__ *fun*. You can read all about it [here](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf).

```{r}
print("Hello world")
```

Notice that chunks are enclosed within three backticks, with the language and options in braces. Single commands can be executed inline using single backticks that start with the letter r, e.g. `r nrow(mtcars)`.

As highlighted in the example above, global options are set like this:

knitr::opts_chunk$set(echo = TRUE)

“echo=TRUE” means that the code in each chunk is displayed in the final product; this is useful to show collaborators (or your future self) exactly how you did something. Change this option (“echo = FALSE”) globally or in individual blocks to prevent code from printing. This is useful to hide uninteresting commands, or when presenting to people who don’t have the time or inclination to read your code (hard to imagine). Notice I’ve also used “include = FALSE” for the library-loading code chunk, which means evaluate but don’t include in the output. Another useful option is “eval = FALSE”, which means don’t even run this chunk.

So let’s see what that looks like when we render it:

The above example output as HTML

The above example output as Latex

Plots generated in code chunks, or images from other sources, can be embedded. Set their size in the chunk options: “fig.width” sets the width (in inches) of the figure generated, while “out.width” scales the image in the final document, with units that depend on the output format. Within RStudio, these are previewed inline below the code chunk.

## Including plots/images
```{r fig.width = 4, fig.height = 3, out.width = "400px", echo=FALSE}
# t is assumed to be a data frame of match results loaded in an earlier chunk
t %>%
  group_by(Tour, Winner, N, Tournament) %>%
  filter(WRank <= 20) %>%
  summarise(WPts = max(WPts)) %>%
  ggplot(aes(x = N, y = WPts, group = Winner, colour = (Winner == "Murray A."))) +
  geom_point() +
  geom_line() +
  labs(x = "Tournament Number", y = "Ranking Points") +
  scale_colour_discrete("", labels = c("Not Andy Murray", "Andy Murray")) +
  theme_bw() +
  theme(legend.position = "bottom", legend.margin = margin(0, 0, 0, 0))
knitr::include_graphics("https://s.yimg.com/ny/api/res/1.2/69ZUzNSMYb09GKd8CNJeew--~A/YXBwaWQ9aGlnaGxhbmRlcjtzbT0xO3c9ODAwO2g9NjAw/http://media.zenfs.com/en_us/News/afp.com/0102e1f7d0d3c35303c8a62d56a5eb79c2c8b4d8.jpg")
```

Rather than just printing data R-style, you can nicely format it into a table using kable (part of knitr). I also style mine using kableExtra, which makes it look nice and gives you extra options. By default tables fill the full width; you can override this using e.g. kable_styling(full_width = FALSE, position = "left"). When making a Latex document, use kable(table, format = "latex", booktabs = TRUE) to get a (reproducible) Latex-style table.

Here’s how to use Python and bash. Thanks to the package reticulate, you can even share objects between your R and Python chunks. Exclude reticulate (knitr::opts_chunk$set(python.reticulate = FALSE)) if you prefer to keep your languages separate.

### Mix it up with python
```{python}
a='Wow python'
print(a.split()[0])
```

What a wild ride. 

### or bash

```{bash, echo=TRUE}
ls | head 
```

Oh look, there's our output, ready to share.

Finally, if you hate GUIs – and you know I do – you can ditch the interactive notebook part and just generate documents from R Markdown files like this:

rmarkdown::render("BlogExample.Rmd")

Biophysical Society 62nd Annual Meeting

In February I was very fortunate to attend the Biophysical Society 62nd Annual Meeting, which was held in San Francisco – my first real conference and my first trip to North America. Despite arriving with the flu, I had a great time! The conference took place over five days, during which there were manageable 15-minute talks covering a huge range of Biophysics-related topics, and a few thousand more posters on display (including mine). With almost 6,500 attendees, it was also large enough to slip across the road to the excellent SF Museum of Modern Art without anyone noticing.

The best presentation of the conference was, of course, Saulo’s talk on integrating biological folding features into protein structure prediction [1]. Aside from that, here are a few more of my favourites:

Folding proteins from one end to the other
Micayla A. Bowman, Patricia L. Clark [2]

Here in the COFFEE (COtranslational Folding Family of Expert Enthusiasts) office, we love to talk about the vectorial nature of cotranslational folding and how it contributes to the efficiency of protein folding in vivo. Micayla Bowman and Patricia Clark have created a novel technique that will allow the effects of this vectorial folding to be investigated specifically in vitro.

The Clp complex grabs, unfolds and degrades proteins (diagram from [3]). ClpX, the translocase unit of this complex, was used to recapitulate vectorial protein refolding in vitro for the first time.

ClpX is an AAA+ molecular motor that grabs proteins and translocates them through its pore. In vivo, its role is to denature substrates and feed them to an associated protease (ClpP) [3]. Bowman & Clark have used protein tags to initiate translocation of the target protein through ClpX, resulting in either N-C or C-N vectorial refolding.

The YKB construct used to demonstrate the vectorial folding mediated by ClpX (diagram from [4]).

They demonstrate the effect using YKB, a construct with two mutually exclusive native states: YK-B (fluoresces yellow) and Y-KB (fluoresces blue) [4]. In vitro refolding results in an equal proportion of yellow and blue states. Cotranslational folding, which proceeds in the N-C direction, biases towards the yellow (YK-B) state. C-N refolding in the presence of ClpX and ATP biases towards the blue (Y-KB) state. With this neat assay, they demonstrate that ClpX can mediate vectorial folding in vitro, and they plan to use the assay to investigate its effect on protein folding pathways and yields.

An ambiguous view of protein architecture
Guillaume Postic, Charlotte Périn, Yassine Ghouzam, Jean-Christophe Gelly [Poster abstract: 5, Paper: 6]

This work addresses the ambiguity of domain definition by assigning multiple possible domain boundaries to protein structures. Their automated method, SWORD (Swift and Optimised Recognition of Domains), performs protein partitioning via the hierarchical clustering of protein units (PUs) [7], which are smaller than domains and larger than secondary structures. The structure is first decomposed into protein units, which are then merged depending on the resulting “separation criterion” (relative contact probabilities) and “compactness” (contact density).

Their method is able to reproduce the multiple conflicting definitions that often exist between domain databases such as SCOP and CATH. Additionally, they present a number of cases for which the alternative domain definitions have interesting implications, such as highlighting early folding regions or functional subdomains within “single-domain” structures.
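To make the idea of several alternative delineations a bit more concrete, here is a toy Python sketch of agglomerative merging of protein units. It is not SWORD's implementation: the unit boundaries, the contact list and the scoring function (contacts per residue, a crude stand-in for the separation and compactness criteria) are all invented for illustration. The point is only that keeping every intermediate partition gives you a set of candidate domain assignments rather than a single answer.

```python
from itertools import combinations


def contacts_between(a, b, contacts):
    """Count residue-residue contacts linking unit a to unit b."""
    return sum(1 for i, j in contacts
               if (i in a and j in b) or (i in b and j in a))


def merge_units(units, contacts):
    """Greedily merge the two most strongly connected units at each step.

    Returns every intermediate partition, from the initial protein units
    down to a single domain, so alternative delineations can be compared.
    """
    parts = [frozenset(u) for u in units]
    levels = [parts[:]]
    while len(parts) > 1:
        # Score each pair by contacts per residue (toy stand-in for compactness).
        a, b = max(
            combinations(parts, 2),
            key=lambda p: contacts_between(p[0], p[1], contacts)
            / (len(p[0]) + len(p[1])),
        )
        parts = [p for p in parts if p not in (a, b)] + [a | b]
        levels.append(parts[:])
    return levels


# Hypothetical example: three protein units covering residues 1-9.
units = [range(1, 4), range(4, 7), range(7, 10)]
contacts = [(1, 5), (2, 4), (3, 6), (6, 7)]  # made-up residue contacts

for level in merge_units(units, contacts):
    print([sorted(p) for p in level])
```

Each printed line is one possible partition of the chain; in this made-up case the first two units merge before the third joins them.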

Alternative SWORD domain delineations identify (R) an ultrafast folding domain and (S,T) stable autonomous folding regions within proteins designated single-domain by other methods [6]

Dual function of the trigger factor chaperone in nascent protein folding
Kaixian Liu, Kevin Maciuba, Christian M. Kaiser [8]

The authors of this work used optical tweezers to study the cotranslational folding of the first two domains of the five-domain protein elongation factor G.

In agreement with a number of other presentations at the conference, they report that interactions with the ribosome surface during the early stages of translation slow folding by stabilising disordered states, preventing both native and misfolded conformations. They found that the N-terminal domain (G domain) folds independently, while the subsequent folding of the second domain (Domain II) requires the presence of the folded G domain. Furthermore, while partially extruded, unfolded domain II destabilises the native G domain conformation and leads to misfolding. This is prevented in the presence of the chaperone Trigger factor, which protects the G domain from unproductive interactions and unfolding by stabilising the native conformation. This work demonstrates interesting mechanisms by which Trigger factor and the ribosome can influence the cotranslational folding pathway.

Optical tweezers are used to interrogate the folding pathway of a protein during stalled cotranslational folding. Mechanical force applied to the ribosome and the N-terminus of the nascent chain causes unfolding events, which can be identified as sudden increases in the extension of the chain. (Figure from [9])
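As a rough illustration of what "sudden increases in the extension" means in practice, here is a minimal Python sketch that flags jumps in an extension trace. The trace values and the 5 nm threshold are invented, and real analyses have to deal with noise and changing force, so treat this as a cartoon of the idea rather than the authors' method.

```python
def find_unfolding_events(extension_nm, min_jump_nm=5.0):
    """Return (index, jump) for each sudden increase in extension.

    extension_nm: chain extension sampled at regular time points.
    min_jump_nm:  hypothetical threshold separating real events from noise.
    """
    events = []
    for i in range(1, len(extension_nm)):
        jump = extension_nm[i] - extension_nm[i - 1]
        if jump >= min_jump_nm:
            events.append((i, jump))
    return events


# Made-up trace: slow stretching interrupted by two abrupt lengthening steps.
trace = [100.0, 100.4, 100.9, 112.3, 112.8, 113.1, 131.0, 131.2]
for index, jump in find_unfolding_events(trace):
    print(f"unfolding event at sample {index}: +{jump:.1f} nm")
```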

Predicting protein contact maps directly from primary sequence without the need for homologs
Thrasyvoulos Karydis, Joseph M. Jacobson [10]

The prediction of protein contacts from primary sequence is an enormously powerful tool, particularly for predicting protein structures. A major limitation is that current methods using coevolution inference require a large multiple sequence alignment, which is not possible for targets without many known homologous sequences.

In this talk, Thrasyvoulos Karydis presented CoMET (Convolutional Motif Embeddings Tool), a tool to predict protein contact maps without a multiple sequence alignment or coevolution data. They extract structural and sequence motifs from known sequence-structure pairs, and use a Deep Convolutional Neural Network to associate sequence and structure motif embeddings. The method was trained on 137,000 sequence-structure pairs with a maximum of 256 residues, and is able to recreate contact map patterns with low resolution from primary sequence alone. There is no paper on this yet, but we’ll be looking out for it!
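For anyone who hasn't met contact maps before, here is a small Python sketch of the object these methods try to predict, computed the easy way from known coordinates. The 8 Å C-alpha cutoff and the minimum sequence separation are assumptions on my part (definitions vary between methods), and the coordinates are made up. This has nothing to do with how CoMET itself works.

```python
import math


def contact_map(ca_coords, cutoff=8.0, min_separation=3):
    """Binary contact map from C-alpha coordinates (in angstroms).

    Residues i and j are 'in contact' if their C-alpha atoms lie within
    `cutoff` and they are at least `min_separation` apart in sequence.
    Both defaults are assumptions; definitions vary between methods.
    """
    n = len(ca_coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


# Made-up coordinates for a six-residue hairpin-like chain.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
for row in contact_map(coords):
    print("".join("#" if c else "." for c in row))
```

A sequence-only predictor has to produce this kind of symmetric matrix without ever seeing the coordinates.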

1. de Oliveira, S.H. and Deane, C.M., 2018. Exploring Folding Features in Protein Structure Prediction. Biophysical Journal, 114(3), p.36a.
2. Bowman, M.A. and Clark, P.L., 2018. Folding Proteins From One End to the Other. Biophysical Journal, 114(3), p.200a.
3. Baker, T.A. and Sauer, R.T., 2012. ClpXP, an ATP-powered unfolding and protein-degradation machine. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research, 1823(1), pp.15-28.
4. Sander, I.M., Chaney, J.L. and Clark, P.L., 2014. Expanding Anfinsen’s principle: contributions of synonymous codon selection to rational protein design. Journal of the American Chemical Society, 136(3), pp.858-861.
5. Postic, G., Périn, C., Ghouzam, Y. and Gelly, J.C., 2018. An Ambiguous View of Protein Architecture. Biophysical Journal, 114(3), p.46a.
6. Postic, G., Ghouzam, Y., Chebrek, R. and Gelly, J.C., 2017. An ambiguity principle for assigning protein structural domains. Science advances, 3(1), p.e1600552.
7. Gelly, J.C. and de Brevern, A.G., 2010. Protein Peeling 3D: new tools for analyzing protein structures. Bioinformatics, 27(1), pp.132-133.
8. Liu, K., Maciuba, K. and Kaiser, C.M., 2018. Dual Function of the Trigger Factor Chaperone in Nascent Protein Folding. Biophysical Journal, 114(3), p.552a.
9. Liu, K., Rehfus, J.E., Mattson, E. and Kaiser, C., 2017. The ribosome destabilizes native and non‐native structures in a nascent multi‐domain protein. Protein Science.
10. Karydis, T. and Jacobson, J.M., 2018. Predicting Protein Contact Maps Directly from Primary Sequence without the Need for Homologs. Biophysical Journal, 114(3), p.36a.

Latexing with gvim

Here I’ll share my set-up for writing Latex with gvim instead of a separate Latex editor. If you are text-editor averse, this blog post is not for you. But if, like me, you love vim and hate useless GUIs, this might be helpful.

We’re lucky to have nice big screens in the Stats Department, but I tend to prefer writing on my MacBook (I find it’s easier to transport to e.g. a cafe, my home, etc). Until now, I’ve been happily using TexMaker for writing, but during a recent period of intense Latexing I started to find the useable screen space oppressively small. The unnecessary GUI had to go.

No offence TexMaker but I don’t like you

One of our good friends in Statistical Genetics recommended some things to help me with the transition to just using good old (g)vim, which I will now recommend to you.

The key thing is the LaTeX-Box plug-in for vim, which gives you the compilation commands, as well as the essentials such as smart indentation, highlight matching, command completion, etc. I used pathogen to install it (see the GitHub for instructions).

Of course, you can then customise your .vimrc file to add more helpful things. This can be the simple preferences, such as using a light background when using gvim:

if has("gui_running")
        set background=light
endif

You can also do more complicated magic, like tabbing through available commands or minimising (folding) sections. Sidenote: to make working with paragraphs easier, I recommend setting the up/down arrows to move the cursor to the next displayed (wrapped) line rather than the next actual line. I prefer overriding this behaviour only in gvim, while leaving the normal behaviour in vim (for actual coding). But each to their own.

To get started, open a .tex file, then compile and view the document with the command :Latexmk.

Command suggestions are an example of a magical feature added in .vimrc

The configurations for this command are set in the file .latexmkrc. Mine looks like this:

$recorder = 1;
$pdf_mode = 1;
$bibtex_use = 2;
$pdflatex = "pdflatex --shell-escape %O %S";
$pdf_previewer = "start open -a skim %O %S";

My pdf viewer of choice on Mac is Skim, which autoupdates. I view the source and preview at the same time using split view. Please admire the beauty below:

Wow what a beautiful screen

My favourite part is that whenever you save (:w), it recompiles and updates the preview. As someone who accidentally types :w everywhere that isn’t vim, it’s nice that this is now productive. It also recompiles automatically if the .bib file is updated. Note that if you have errors at compilation (I’m sure you don’t), you can view them with the command :LatexErrors.

Now you too can be a (nearly) GUI-free lightweight Latexer. Enjoy!

Protein Structure Classification: Order in the Chaos

The number of known protein structures has increased exponentially over the past decades; there are currently over 127,000 structures deposited in the PDB [1]. To bring order to this large volume of data, and to further our understanding of protein function and evolution, these structures are systematically classified according to sequence and structural similarity. Downloadable classification data can be used for annotating datasets, exploring the properties of proteins and for the training and benchmarking of new methods [2].

Yearly growth of structures in the PDB (adapted from [1])

Typically, proteins are grouped by structural similarity and organised using hierarchical clustering. Proteins are sorted into classes based on overall secondary structure composition, and grouped into related families and superfamilies. Although this process could originally be manually curated, as with Structural Classification of Proteins (SCOP) [3] (last updated in June 2009), the growing number of protein structures now requires semi- or fully-automated methods, such as SCOP-extended (SCOPe) [4] and Class, Architecture, Topology, Homology (CATH) [5]. These resources are comprehensive and widely used, particularly in computational protein research. There is a large proportion of agreement between these databases, but subjectivity of protein classification is to be expected. Variation in methods and hierarchical structure result in differences in classifications. For example, different criteria for defining and classifying domains result in inconsistencies between CATH and SCOPe.
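Because these classifications are hierarchical, downloaded annotations are easy to slice at whichever level you care about. As a minimal Python sketch, assuming CATH-style dotted codes (Class.Architecture.Topology.Homologous superfamily), the snippet below splits a code into its levels and groups domains by a shared prefix. The domain IDs are invented and the codes are only there to show the numbering scheme.

```python
from collections import defaultdict

CATH_LEVELS = ("class", "architecture", "topology", "homologous_superfamily")


def parse_cath_code(code):
    """Split a dotted CATH code such as '3.40.50.300' into its four levels."""
    parts = code.split(".")
    if len(parts) != len(CATH_LEVELS):
        raise ValueError(f"expected 4 dot-separated fields, got {code!r}")
    return dict(zip(CATH_LEVELS, (int(p) for p in parts)))


def group_by_level(domains, depth):
    """Group domain IDs by the first `depth` levels of their CATH code."""
    groups = defaultdict(list)
    for domain_id, code in domains.items():
        prefix = ".".join(code.split(".")[:depth])
        groups[prefix].append(domain_id)
    return dict(groups)


# Hypothetical domain IDs; the codes follow the C.A.T.H numbering scheme.
domains = {
    "domA": "3.40.50.300",
    "domB": "3.40.50.720",
    "domC": "3.30.70.100",
    "domD": "1.10.10.10",
}

print(parse_cath_code("3.40.50.300"))
print(group_by_level(domains, depth=3))  # group at the topology (fold) level
```

Changing depth from 3 to 1 groups the same domains by class instead, which is handy when annotating a dataset at different granularities.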

The arrangements of secondary structure elements in space are known as folds. As a result of evolution, the number of folds that exist in nature is thought to be finite, predicted to be between 1,000 and 10,000 [6]. Analysis of currently known structures appears to support this hypothesis, although solved structures in the PDB are likely to be a skewed sample of all protein structure space. Some folds are extremely commonly observed in protein structures.

In his ‘periodic table for protein structures’, William Taylor went one step further in his goal to find a comprehensive, non-hierarchical method of protein classification [7]. He attempted to identify a minimal set of building blocks, referred to as basic Forms, that can be used to assemble as many globular protein structures as possible. These basic Forms can be combined systematically in layers in a way analogous to the combination of electrons into valence shells to form the periodic table. An individual protein structure can then be described as the closest matching combination of these basic Forms. Related proteins can be identified by the largest combination of basic Forms they have in common.

The ‘basic Forms’ that make up Taylor’s ‘periodic table of proteins’. These secondary structure elements accounted for, on average, 80% of each protein in a set of 2,230 structures (all-alpha proteins were excluded from the dataset) [7]

The classification of proteins by sequence, secondary and tertiary structure is extensive. A relatively new frontier for protein classification is the quaternary structure: how proteins assemble into di-, tri- and multimeric complexes. In a recent publication by an interdisciplinary team of researchers, an analysis of multimeric protein structures in combination with mass spectrometry data was used to create a ‘periodic table of protein complexes’ [8]. Three main types of assembly steps were identified: dimerisation, cyclisation and heteromeric subunit addition. These types are systematically combined to predict many possible topologies of protein complexes, within which the majority of known complexes were found to reside. As has been the case with tertiary structure, this classification and exploration of quaternary structure space could lead to a better understanding of protein structure, function and evolutionary relationships. In addition, it may inform the modelling and docking of multimeric proteins.

1. RCSB PDB Statistics
2. Fox, N.K., Brenner, S.E., Chandonia, J.-M., 2015. The value of protein structure classification information-Surveying the scientific literature. Proteins Struct. Funct. Bioinforma. 83, 2025–2038.
3. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 247, 536–540.
4. Fox, N.K., Brenner, S.E., Chandonia, J.-M., 2014. SCOPe: Structural Classification of Proteins–extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, 304-9.
5. Dawson, N.L., Lewis, T.E., Das, S., et al., 2017. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res. 45, 289-295.
6. Woolfson, D.N., Bartlett, G.J., Burton, A.J., Heal, J.W., Niitsu, A., Thomson, A.R., Wood, C.W., 2015. De novo protein design: how do we expand into the universe of possible protein structures? Curr. Opin. Struct. Biol. 33, 16-26.
7. Taylor, W.R., 2002. A “periodic table” for protein structures. Nature. 416, 657–660.
8. Ahnert, S.E., Marsh, J.A., Hernandez, H., Robinson, C.V., Teichmann, S.A., 2015. Principles of assembly reveal a periodic table of protein complexes. Science. 350.

Start2Fold: A database of protein folding and stability data

Hydrogen/deuterium exchange (HDX) experiments are used to probe the tertiary structures and folding pathways of proteins. The rate of proton exchange between a given residue’s backbone amide proton and the surrounding solvent depends on the solvent exposure of the residue. By refolding a protein under exchange conditions, these experiments can identify which regions quickly become solvent-inaccessible, and which regions undergo exchange for longer, providing information about the refolding pathway.

Although there are many examples of individual HDX experiments in the literature, the heterogeneous nature of the data has deterred comprehensive analyses. Start2Fold (Start2Fold.eu) [1] is a curated database that aims to present protein folding and stability data derived from solvent-exchange experiments in a comparable and accessible form. For each protein entry, residues are classified as early/intermediate/late based on folding data, or strong/medium/weak based on stability data. Each entry includes the PDB code, length, and sequence of the protein, as well as details of the experimental method. The database currently includes 57 entries, most of which have both folding and stability data. Hopefully, this database will grow as scientists add their own experimental data, and reveal useful information about how proteins refold.
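To give a feel for what a per-residue classification like this looks like as data, here is a toy Python sketch. Everything in it is hypothetical: the field names, the idea of a single "protection time" per residue, the thresholds and the values are invented for illustration and are not taken from Start2Fold, which curates its classes from the original publications.

```python
from dataclasses import dataclass, field


@dataclass
class FoldingEntry:
    """Toy stand-in for one database entry (field names are hypothetical)."""
    pdb_code: str
    sequence: str
    # residue number -> time (ms) at which the amide becomes protected
    protection_time_ms: dict = field(default_factory=dict)


def classify_residues(entry, early_ms=10.0, late_ms=1000.0):
    """Bin residues as early/intermediate/late using made-up thresholds."""
    classes = {}
    for resnum, t in sorted(entry.protection_time_ms.items()):
        if t <= early_ms:
            classes[resnum] = "early"
        elif t < late_ms:
            classes[resnum] = "intermediate"
        else:
            classes[resnum] = "late"
    return classes


# Invented example values, not taken from Start2Fold.
entry = FoldingEntry(
    pdb_code="XXXX",
    sequence="MKTAYIAKQR",
    protection_time_ms={2: 4.0, 3: 6.5, 5: 120.0, 8: 2500.0},
)
for resnum, label in classify_residues(entry).items():
    print(resnum, entry.sequence[resnum - 1], label)
```

Residues that become protected quickly end up in the "early" bin, which is the kind of label you would then map onto a structure for plotting.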

The folding data available in Start2Fold is visualised in the figure below, with early, intermediate and late folding residues coloured light, medium and dark blue, respectively.

[1] Pancsa, R., Varadi, M., Tompa, P., Vranken, W.F., 2016. Start2Fold: a database of hydrogen/deuterium exchange data on protein folding and stability. Nucleic Acids Res. 44, D429-34.

Oxford Protein Informatics Group

or "OPIG" to friends
