Monthly Archives: July 2018

Seeing the Mesoscale

There’s a range of scales that is really hard for us to see. Techniques like X-ray crystallography and, increasingly, cryo-electron microscopy let us see molecules at an atomic level of detail. Microscopes reveal organelles in cells, but seeing the molecular ‘trees’ in the cellular ‘forest’ requires a synthesis of knowledge. David Goodsell was one of the first to show us the emergent beauty of the cell at the molecular level, and work carried out in the Molecular Graphics Laboratory at The Scripps Research Institute under the direction of Art Olson has led to 3D molecular modeling tools like ePMV, autoPACK, and cellPACK.

One of the fruits of this labor is the Visual Guide to the Cell, part of the Allen Cell Explorer. It’s well worth a look: you can explore 3D representations of the cell directly in a web browser.

Covariate Shift in Virtual Screening

In supervised learning, we assume that the training data and the test data are drawn from the same distribution, i.e. P_{train}(x,y) = P_{test}(x,y). However, this assumption is often violated in virtual screening. For example, a chemist initially focuses on a series of compounds, and information from this series is used to train a model. Later, the chemist shifts focus to a new, structurally distinct series, and we would not expect the model to accurately predict the labels of these new compounds. Here, we introduce some methods to address this problem.

Methods such as Kernel Mean Matching (KMM) and the Kullback-Leibler Importance Estimation Procedure (KLIEP) have been proposed. These methods typically assume that the concept remains unchanged and only the input distribution changes, i.e. P_{train}(y|x) = P_{test}(y|x) and P_{train}(x) \neq P_{test}(x). In general, these methods reweight instances in the training data so that the distribution of training instances is more closely aligned with the distribution of instances in the test set. The appropriate importance weighting factor w(x) for each instance x in the training set is:

w(x) = \frac{p_{test}(x)}{p_{train}(x)}

where p_{train}(x) is the training set density and p_{test}(x) is the test set density. Note that only the feature vector values (not their labels) are used in reweighting. The major difference between KMM and KLIEP is the objective function: KLIEP is based on minimisation of the Kullback-Leibler divergence, while KMM is based on minimisation of the Maximum Mean Discrepancy (MMD). For more detail, please see the references below.
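As a toy illustration of the weighting idea (this is plain density-ratio estimation with synthetic data, not KMM or KLIEP themselves), one could estimate w(x) for a single feature in R using kernel density estimates:

```r
set.seed(1)
x_train <- rnorm(500, mean = 0)   # hypothetical training-set feature values
x_test  <- rnorm(500, mean = 1)   # hypothetical, shifted test-set feature values

# Kernel density estimates of p_train(x) and p_test(x)
d_train <- density(x_train)
d_test  <- density(x_test)

# Evaluate both densities at the training points and form w(x) = p_test(x) / p_train(x)
p_train <- approx(d_train$x, d_train$y, xout = x_train)$y
p_test  <- approx(d_test$x,  d_test$y,  xout = x_train)$y
w <- p_test / p_train
w[!is.finite(w)] <- 0             # training points outside the test support get zero weight

# These weights can then be passed to a learner, e.g. via the weights argument of glm()
summary(w)
```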

References:

  1. Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Ann Inst Stat Math, 2008.
  2. Jiayuan Huang, Alex Smola, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf. Correcting Sample Selection Bias by Unlabeled Data. NIPS 2006.
  3. Georgia McGaughey, W Patrick Walters, Brian Goldman. Understanding covariate shift in model performance. F1000Research, 2016.


Protein Engineering and Structure Determination

Sometimes it can be advantageous to combine two proteins into one. One such technique was described by Jennifer Padilla, Christos Colovos, and Todd Yeates back in 2001 (Padilla et al., 2001). By connecting two proteins, one that dimerized and another that trimerized, they were able to design synthetic ‘nanohedra’. They achieved this by extending a C-terminal α-helix at the end of one protein with an α-helical ‘linker’ that runs directly into the N-terminal α-helix of the other protein.


Introduction to R Markdown

Two of our esteemed OPIGlets presented a workshop on collaborative research using Jupyter Notebook this week at ISMB in Chicago. Their workshop highlights the importance of finding ways to share your work conveniently and reproducibly. So on a related note, I thought I would share a brief introduction to another useful tool, R Markdown with RStudio, which I use to present updates to various supervisors and to remember what I did three months (or three days) ago. This method of sharing work is highly readable, reproducible, and narrative-driven.

I use R for much of my data analysis and all of my visualisation, and I count the tidyverse among my most beloved friends. If you’re so inclined, it’s easy to execute python, bash, and more from within R Markdown. You also don’t need to use RStudio to use R Markdown, but that’s a whole other story.

Starting a new markdown file in RStudio will generate a template script explaining most of what you need to know. If I showed you that then I’d be out of a blog post, but I will at least link to the R Markdown Reference Guide.

R Markdown files consist of text written in markdown, and code chunks that can be individually executed and displayed inline within RStudio. To “knit” the whole thing together, the knitr package executes and combines the code chunks, and then pandoc converts the result into an attractive document.

Here’s an example. The metadata at the top sets up the document. I’ll be generating an HTML document here, but notice some other tempting examples commented out. Yes, you can use it for LaTeX (swoon). You can even make a Word document, but really, why would you?

---
title: "Informative Title"
author: "Clare E. West"
date: "10/07/2018"
output: html_document
#output: beamer_presentation
#output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(ggplot2)
library(tidyr)
library(dplyr)
```

## Big Title
### Smaller title

R Markdown scripts have the extension .Rmd

R Markdown is __so__ *fun*. You can read all about it [here](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf).

```{r}
print("Hello world")
```

Notice that chunks are enclosed within three backticks, with the language and options in braces. Single commands can be executed inline using single backticks.
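For example, writing `r Sys.Date()` in the body text (a single-backtick chunk beginning with “r”) inserts today’s date directly into the rendered document.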

As highlighted in the example above, global options are set like this:

knitr::opts_chunk$set(echo = TRUE)

“echo=TRUE” means that the code in each chunk is displayed in the final product; this is useful to show collaborators (or your future self) exactly how you did something. Change this option (“echo = FALSE”) globally or in individual blocks to prevent code from printing. This is useful to hide uninteresting commands, or when presenting to people who don’t have the time or inclination to read your code (hard to imagine). Notice I’ve also used “include = FALSE”  for the library-loading code chunk, which means evaluate but don’t include in the output. Another useful option is “eval = FALSE”, which means don’t even run this chunk.
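For instance, a chunk written like this (an illustrative aside, not part of the example above) runs and shows its output in the rendered document, but hides the code itself:

```{r, echo=FALSE}
# echo=FALSE: the output of summary(cars) appears in the document, but this code does not
summary(cars)
```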

So let’s see what that looks like when we render it:

[Image: the above example rendered as HTML]

[Image: the above example rendered as LaTeX]

Plots generated in code chunks, or images from other sources, can be embedded. Set the width in the chunk options: “fig.width” sets the width (in inches) of the figure generated, while “out.width” scales the image in the final document, where the units depend on the document type. Within RStudio, these are previewed inline below the code chunk.

## Including plots/images
```{r fig.width = 4, fig.height = 3, out.width = "400px", echo=FALSE}
t %>%
  group_by(Tour, Winner, N, Tournament) %>%
  filter(WRank <= 20) %>%
  summarise(WPts = max(WPts)) %>%
  ggplot(aes(x = N, y = WPts, group = Winner, colour = (Winner == "Murray A."))) +
  geom_point() +
  geom_line() +
  labs(x = "Tournament Number", y = "Ranking Points") +
  scale_colour_discrete("", labels = c("Not Andy Murray", "Andy Murray")) +
  theme_bw() +
  theme(legend.position = "bottom", legend.margin = margin(0, 0, 0, 0))
knitr::include_graphics("https://s.yimg.com/ny/api/res/1.2/69ZUzNSMYb09GKd8CNJeew--~A/YXBwaWQ9aGlnaGxhbmRlcjtzbT0xO3c9ODAwO2g9NjAw/http://media.zenfs.com/en_us/News/afp.com/0102e1f7d0d3c35303c8a62d56a5eb79c2c8b4d8.jpg")
```

Rather than just printing data R-style, you can format it nicely into a table using kable (part of knitr). I also style mine using kableExtra, which makes tables look nice and gives you extra options. By default, tables fill the full width; you can override this using e.g. kable_styling(full_width = FALSE, position = "left"). When making a LaTeX document, use kable(table, "latex", booktabs = TRUE) to get a (reproducible) LaTeX-style table.
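For example, a minimal chunk using the built-in mtcars data might look like this (it assumes the kableExtra package is installed; kable itself comes with knitr):

```{r}
library(kableExtra)
# Display the first rows of mtcars as a compact, left-aligned table
kable(head(mtcars[, 1:4])) %>%
  kable_styling(full_width = FALSE, position = "left")
```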

Here’s how to use python and bash. Thanks to the reticulate package, you can even share objects between your R and Python chunks (see the example after the chunks below). Exclude reticulate (knitr::opts_chunk$set(python.reticulate = FALSE)) if you prefer to keep your languages separate.


### Mix it up with python
```{python}
a='Wow python'
print(a.split()[0])
```

What a wild ride. 
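Because the Python chunk above runs through reticulate, the object it created is also visible from R. A minimal illustration (assuming the reticulate Python engine is active, which is the default when reticulate is installed):

```{r}
# Access the Python variable `a` from R via reticulate's `py` object
reticulate::py$a
```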

### or bash

```{bash, echo=TRUE}
ls | head 
```

Oh look, there's our output, ready to share.

Finally, if you hate GUIs – and you know I do – you can ditch the interactive notebook part and just generate documents from R Markdown files like this:

rmarkdown::render("BlogExample.Rmd")


Maps are useful. But first, you need to build, store and read a map.

Recently we embarked on a project that required storing a relatively big dictionary with 10M+ key-value pairs. Unsurprisingly, Python took over two hours to build such a dictionary, taking into account all the time spent extending, accessing and writing to it, AND it eventually crashed. So I turned to C++ for help.

In C++, a map is one of the ways you can store string keys with integer values. Since we are concerned about data storage and access, I compared map and unordered_map.

An unordered_map stores its keys in a hash table together with the mapped values, while a map keeps its keys ordered. The important considerations here include:
  • Memory: a map does not maintain a hash table and is therefore smaller than an unordered_map.
  • Access: looking up an element in an unordered_map takes O(1) on average, while looking it up in a map takes O(log n).

I eventually chose to go with map, because it is more memory-efficient given the small amount of RAM I have access to. However, each map still takes up about 8 GB of RAM during its runtime (and I have 1800 objects to run through, each building a different dictionary). Saving these opens up another can of worms.

In Python, we could easily use pickle or JSON to serialise the dictionary. In C++, it is common to use the Boost library, which offers two kinds of archive: text and binary. Text archives are human-readable, but since I am not really going to open and read 10M+ lines of key-value pairs, I opted for binary archives, which are machine-readable and smaller. (Read more: https://stackoverflow.com/questions/1058051/boost-serialization-performance-text-vs-binary-format .)

To further reduce the size of the saved maps, I used zlib compression. Happily, there are ready-to-use code snippets available online for combining Boost serialization with zlib, which saved me some debugging.

Ultimately this brings the total down to 96 GB across the 1800 files, all done within 6 hours.