Monthly Archives: March 2018

Working with Jupyter notebook on a remote server

To celebrate the recent beta release of Jupyter Lab (try it out if you haven’t already), today we’re going to look at how to run a Jupyter session (Notebook or Lab) on a remote server.

Suppose you have lots of data which lives on a remote server and you want to play with it in a Jupyter notebook. You can’t copy the data to your local machine (well, you can, but you’re sensible so you won’t), but you can run your Jupyter session on the remote server. There’s just one problem – since Jupyter notebook is browser-based and works by connecting to the Jupyter session running locally, you can’t just run Jupyter remotely and forward X11 like you would a traditional graphical IDE. Fortunately, the solution is simple: we run Jupyter remotely, create an ssh tunnel connecting a local port to the one used by the Jupyter session, and connect directly to the Jupyter session using our local browser. The best part is that once the Jupyter session is set up and an ssh tunnel is created, you can connect from any browser on any machine, without worrying about X11 forwarding.

Here’s how to do it.

1. First, connect to the remote server if you haven’t already

ssh fergus@funkyserver

1.5. Jupyter takes browser security very seriously, so in order to access a remote session from a local browser we need to set up a password associated with the remote Jupyter session. This is stored in jupyter_notebook_config.py, which by default lives in ~/.jupyter. You can edit this manually, but the easiest option is to set the password by running Jupyter with the password argument:

jupyter notebook password
>>> Enter password:

This password will be used to access any Jupyter session running from this installation, so pick something sensible. You can set a new password at any time on the remote server in exactly the same way.

2. Launch a Jupyter session on the remote server. You can specify the access port using the --port option. This might be useful on a shared server where others might be doing the same thing. You’ll also want to run this without launching a browser on the remote server since this is of no use to you.

jupyter lab --port=9000 --no-browser &

Here I’m using Jupyter Lab, but this works in exactly the same way for Jupyter Notebook.

3. Now for the fun part. Jupyter is running on our remote server, but what we really want is to work in our favourite browser on our local machine. To do this we just need to create an ssh tunnel between a port on our machine and the port our Jupyter session is using on the remote server. On our local machine:

ssh -N -f -L 8888:localhost:9000 fergus@funkyserver

For those not familiar with ssh tunneling, we’ve just created a secure, encrypted connection between port 8888 on our local machine and port 9000 on our remote server.

  • -N tells ssh we won’t be running any remote processes using the connection. This is useful for situations like this where all we want to do is port forwarding.
  • -f runs ssh in the background, so we don’t need to keep a terminal session running just for the tunnel.
  • -L specifies that we’ll be forwarding a local port to a remote address and port. In this case, we’re forwarding port 8888 on our machine to port 9000 on the remote server. The name ‘localhost’ just means ‘this computer’. If you’re a Java programmer who lives for verbosity, you could equivalently pass -L localhost:8888:localhost:9000.

4. If you’ve done everything correctly, you should now be able to access your Jupyter session via port 8888 on your machine. Fire up your favourite browser and type localhost:8888 into the address bar. This should bring up a Jupyter session and prompt you for a password. Enter the password you specified for Jupyter on the remote server.

Congratulations! You now have a Jupyter session running remotely which you can connect to anytime, anywhere, from any machine.

Disclaimer: I haven’t tried this on Windows, nor do I intend to. I value my sanity.

Dealing with indexes when processing co-evolution signals (or how to navigate through “sequence hell”)

Co-evolution techniques provide a powerful way to extract structural information from the wealth of protein sequence data that we now have available. These techniques are predicated upon the notion that residues that share spatial proximity in a protein structure will mutate in a correlated fashion (co-evolve). This co-evolution signal can be inferred from a multiple sequence alignment, which tells us a bit about the evolutionary history of a particular protein family. If you want a better gauge of the power of co-evolution, you can refer to some of our previous posts (post1, post2).

This is more of a practical post, where I hope to illustrate an indexing problem (and how to circumvent it) that one commonly encounters when dealing with co-evolution signals.

Most of the co-evolution tools available today output pairs of residues (i,j) that were predicted to be co-evolving from a multiple sequence alignment. One of the main applications of these techniques is to predict protein contacts, that is, pairs of residues that are within a predetermined distance (quite often 8Å). Say you want to compare the precision of different co-evolution methods for a particular test set. Your test set would consist of a number of proteins for which the structure is known and for which sufficient sequence information is available for the contact prediction to be carried out. Great!

So you start with your target sequences, generate a number of contact predictions of the form (i,j) for each sequence and, for each pair, you check if the ith and jth residues are in contact (say, less than 8Å apart) on the corresponding known protein structure. If you actually carry out this test, you will obtain appalling precision for a large number of test cases. This is due to an index disparity that a friend of mine quite aptly described as “sequence hell”.

This indexing disparity occurs because there is a mismatch between the protein sequence that was used to produce the contact predictions and the sequence of residues that are represented in a protein structure. Ask a crystallographer friend if you have one, and you will find that in the process of resolving a protein’s structure experimentally, there are many reasons why residues would be missing in the final structure. What is more, there are even cases where residues had to be added to enable protein expression and/or crystallisation. This implies that the protein sequence (represented by a FASTA file) frequently has more (and sometimes fewer) residues than the protein structure (represented by a PDB file). This means that if the ith and jth residues in your sequence were predicted to be in contact, that does not mean that they correspond to the ith and jth residues in order of appearance in your protein structure. So what do we do now?

A true believer in the purity and innocence of the world would assume that the SEQRES entries in your PDB file, for instance, would come to the rescue. SEQRES describes the full sequence of each chain as it was used in the experiment. This would be a great way of mitigating the effects of added or altered residues, and would potentially mitigate the effects of residues that were not present in the construct. In other words, the sequences described by SEQRES would be a good candidate for validating whether your predicted contacts are present in the structure. There is, however, one limitation: SEQRES also describes any residues whose coordinates are missing from the PDB. This means that if you process the PDB sequentially and some residues could not be resolved, the ith residue to appear in the PDB could be different from the ith residue in the SEQRES.

An even more innocent person, shielded from all the ugliness of the universe, would simply hope that the indexing in the PDB is correct, i.e. that one can use the residue indexes presented in the “6th column” of the ATOM entries and that these would match perfectly to the (i,j) pair you obtained using your protein sequence. While, in theory, I believe this should be the case, in my experience this indexing is often incorrect and, more often than not, will lead to errors when validating protein contacts.

My solution to the indexing problem is to parse the PDB sequentially and extract the sequence of all the residues for which coordinates are actually present. To my knowledge, this is the only tried and tested way of obtaining this information. If you do that, you will be armed with a sequence and indexing that correctly represent the indexing of your PDB. From now on, I will refer to these as the PDB-sequence and PDB-sequence indexing.
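For the Pythonistas out there, a minimal sketch of this extraction using Biopython might look like the following – the file name and chain ID are placeholders, and in practice you would want to handle multiple chains:

from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import three_to_one, is_aa

# Parse the structure and walk chain A of the first model sequentially.
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "protein.pdb")
chain = structure[0]["A"]

pdb_seq = []      # PDB-sequence: one-letter codes of resolved residues only
pdb_numbers = []  # author-assigned residue numbers, kept for reference
for residue in chain:
    if is_aa(residue, standard=True):  # skip waters, ligands, etc.
        pdb_seq.append(three_to_one(residue.get_resname()))
        pdb_numbers.append(residue.get_id()[1])
pdb_seq = "".join(pdb_seq)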

All that is left is to find a correspondence (a mapping) between the sequence you used for the contact prediction and the PDB-sequence. I do that by performing a standard (global) sequence alignment using the Needleman–Wunsch algorithm. Once in possession of such an alignment, the indexes (i,j) of your original sequence can be matched to adjusted indexes (i',j') in your PDB-sequence indexing. In short, you extracted a sequential list of residues as they appeared in the PDB, aligned these to the original protein sequence, and created a new set of residue pairings of the form (i',j') which are representative of the indexing in PDB-sequence space. That means that the i'th residue to appear in the PDB was predicted to be in contact with the j'th residue to appear.
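Here is one way this mapping could be sketched with Biopython’s pairwise2 module. The scoring parameters are illustrative, and target_seq and predicted_contacts stand in for your prediction sequence and your (i,j) pairs (0-based here):

from Bio import pairwise2

# Global (Needleman-Wunsch style) alignment; scoring: match +2, mismatch -1,
# gap open -5, gap extend -1. Tune these to taste.
alignments = pairwise2.align.globalms(target_seq, pdb_seq, 2, -1, -5, -1)
aligned_target, aligned_pdb = alignments[0][0], alignments[0][1]

# Walk the two gapped strings in parallel to build the index map i -> i'.
index_map = {}
i = j = 0
for a, b in zip(aligned_target, aligned_pdb):
    if a != "-" and b != "-":
        index_map[i] = j
    if a != "-":
        i += 1
    if b != "-":
        j += 1

# Remap predicted contacts into PDB-sequence space; pairs involving
# unresolved residues have no entry in the map and are dropped here.
mapped_contacts = [(index_map[i], index_map[j])
                   for i, j in predicted_contacts
                   if i in index_map and j in index_map]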

The problem becomes a little more interesting when you hope to validate the contact predictions for other proteins with known structure in the same protein family. A more robust approach is to use the sequence alignment that is created as part of the co-evolution prediction as your basis. You then identify the sequence that best represents the PDB-sequence of your homologous protein by performing N global sequence alignments (where N is the number of sequences in your MSA), one per entry of the MSA. The highest-scoring alignment can then be used to map the indexing. This approach is robust enough that, even if your homologous PDB-sequence of interest was not present in the original MSA for whatever reason, you should still get a sensible mapping at the end (all limitations of sequence alignment considered).
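Continuing the sketch above, and assuming msa holds the (gapped) sequences of your MSA as strings, picking the best-matching entry could look like this:

# Remove alignment gaps, then score one global alignment per MSA entry
# and keep the sequence that best matches the PDB-sequence.
ungapped = [s.replace("-", "") for s in msa]
best_seq = max(
    ungapped,
    key=lambda s: pairwise2.align.globalms(s, pdb_seq, 2, -1, -5, -1,
                                           score_only=True),
)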

One final consideration should be brought to the reader’s attention. What happens if, using the sequence as a starting point, one obtains one or more (i,j) pairs where either i or j is not resolved/present in the protein structure? For validation purposes, these pairs are often disregarded. Yet, what does this co-evolutionary signal tell us about the missing residues in the structure? Are they disordered/flexible? Could the co-evolution help us identify low-occupancy conformations?

I’ll leave the reader with these questions to digest. I hope this post proves helpful to those braving the seas of “sequence hell” in the near future.

Interactive Illustration of Collaboration Networks with D3

D3 is a JavaScript library that allows the creation of interactive data visualisations. D3 stands for Data-Driven Documents, and its advantage is that it uses web standards such as HTML, CSS, and SVG as its foundation. This gives maximal compatibility across all modern browsers. It is widely used by journalists and data scientists, and is starting to be used by academic scientists, too.

Here I want to present a simple way of creating interactive illustrations of collaboration networks, e.g., for your own website. Before going into details, have a look at the example below, which presents some of Rosalind Franklin’s papers and her co-authors.

The network illustration is a so-called bipartite network because it consists of two types of nodes: blue nodes represent scientists and orange nodes represent publications. These nodes are connected if a scientist is an author on a publication. The presentation uses a so-called force layout, which simulates repelling Coulomb forces between all nodes, while spring-like attractive forces act on nodes that are connected via links. You can drag the nodes around and explore the behaviour of the network.

You will have noticed that you can also see the names of the publications and co-authors when you hover over the nodes. For some of them, a representative figure or photograph is shown, too. You can also double-click the nodes and will be directed to the webpage of the publication or author.

For larger collaboration networks the visualisation is still possible but might get a bit messy, so not showing any figures is advised. Furthermore, the names of the nodes should be shown either below the illustration or as a tooltip (the latter is not possible in WordPress, but is on normal web pages such as this one). The illustration below shows all 124 publications written by OPIG, together with the 310 authors. Unfortunately, I had to remove the head of the group, Charlotte Deane, as she is an author on almost all of these papers and would clutter the illustration. In network science, we call such a node an ego node.

If you want to play around with this, check out my GitHub repository. The creation of the network is fairly simple. You only need a BibTeX file and can then execute a Python script that creates a JSON file, which is read by the JavaScript. This can then be incorporated into any web page easily.
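To give a flavour of the idea (the actual script lives in the repository), a minimal sketch using the bibtexparser package might look like this – the file names and the node/link field names are assumptions, to be matched to whatever the JavaScript side expects:

import json
import bibtexparser

# Load the BibTeX file; each entry becomes a dict of fields.
with open("references.bib") as f:
    bib = bibtexparser.load(f)

nodes, links = [], []
seen_authors = set()
for entry in bib.entries:
    paper_id = entry["ID"]
    nodes.append({"id": paper_id, "type": "publication",
                  "name": entry.get("title", paper_id)})
    # BibTeX separates author names with " and "
    for author in entry.get("author", "").split(" and "):
        author = author.strip()
        if not author:
            continue
        if author not in seen_authors:
            seen_authors.add(author)
            nodes.append({"id": author, "type": "scientist", "name": author})
        links.append({"source": author, "target": paper_id})

# D3's force layout can consume this {nodes, links} structure directly.
with open("network.json", "w") as f:
    json.dump({"nodes": nodes, "links": links}, f, indent=2)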

PS: To enable D3 in WordPress you need a special plugin. Apparently, it is not up to date with the current stable D3 version (v4), but you can load any missing functions manually.

Fun with Proteins and 3D Printing!

When I’m not postdoc-ing, part of my job involves teaching at the Doctoral Training Centre here in Oxford. I mainly teach the first-year students of the Systems Approaches to Biomedical Science CDT – many members of this group are doing (or have done) their DPhils through this programme (including myself!). Recently, some other OPIGlets and I were responsible for two modules called Structural Biology and Structure-Based Drug Discovery, and as part of those modules we arranged a practical session on 3D printing.

Most of the time, the way we ‘see’ protein structures is through a computer screen, using visualisation software such as PyMOL. While useful, these virtual representations have their limitations – since the screen is flat, it’s difficult to get a proper feel for the structure [1], and seeing how your protein could interact and form assemblies with others is difficult. Physical, three-dimensional models, on the other hand, allow you to get hands-on with your structure, and understand aspects of your protein that couldn’t be gained from simply looking at images. Plus, they look pretty cool!

This year, I printed three proteins for myself (shown in the photo above). Since my most recent work has focused on transmembrane proteins, I felt it was only right to print one – these are proteins that cross membranes, usually to facilitate the transport of molecules in and out of the cell. I chose the structure of a porin (top of the photo), which (as the name suggests) forms a pore in the cell membrane to allow diffusion across it. This particular protein (1A0S) is a sucrose-specific porin from a type of bacteria called Salmonella typhimurium, and it has three chains (coloured blue, pink and purple in the printed model), each of which has a beta barrel structure. You can just about see in the photo that each chain has regions which are lighter in colour – these are the parts that sit in the cell membrane layer; the darker regions are therefore the parts that stick out from the membrane.

My second printed model was the infamous Zika virus (bottom right). Despite all the trouble it has caused in recent years, in my opinion the structure of the Zika virus is actually quite beautiful, with the envelope proteins forming star-like shapes in a highly symmetrical pattern. This sphere of proteins contains the viral RNA. The particular structure I used to create the model (5IRE) was solved using cryo-electron microscopy, and required aligning over 10,000 images of the virus.

Finally, I printed the structure of a six-residue peptide that’s probably only interesting to me… Can you tell why?! [2]

[1] However, take a look at this link for an example of viewing 3D structures using augmented reality!

[2] Hint: Cysteine, Leucine, Alanine, Isoleucine, Arginine, Glutamic Acid…

Teaching Network Science to High School Students

In recent years, a lot of effort has gone into outreach events, in particular for science and mathematics. Here, I am going to describe a summer course on network science which I organised and taught together with Benjamin F. Maier from the Humboldt University of Berlin.

The course was part of an established German summer school called Deutsche Schülerakademie (German Pupils Academy), an extracurricular event for highly motivated pupils. It lasts sixteen days and the participants join one of six courses, which cover the full range of academic disciplines, from philosophy through music to science.

Our course was titled Netzwerke und Komplexe Systeme (Networks and Complex Systems) and, rather than going into great depth in one particular area, we covered a broad selection of topics, as we wanted to give the students an overview and also an idea of how different disciplines approach complex phenomena. We discussed topics from pure mathematics, such as the colouring of graphs; algorithmic problems, such as the travelling salesman problem; social network analysis; computational neuroscience; dynamical systems; and fractals.

A network of the former monastery in Rossleben, where the summer school was held. The students created the network themselves. To parallelise the task they split up into four groups, each covering one level of the building. They then used this network to simulate the spread of a contagious disease, starting at the biological lab (A35, in red).

A couple of thoughts on what went well and which parts might need improvement for future events of this kind:

  • We did a questionnaire beforehand, asking the pupils some questions like “Do you know what a vector is?” as well as about their motivation for joining the course. This was very helpful in getting a rough idea of their knowledge level.
  • We gave them some material to read before the course. In retrospect, it probably would have been better to give them something to read as well as some problems to solve, so that the learning outcome is clearer and more effective.
  • The students gave presentations on topics we chose for them based on their answers to the questionnaire. The presentations were good, but a lot of students overran the allocated time because they were very enthusiastic about the topics.
  • The students were also enthusiastic about the programming exercises, for which we used Python and the NetworkX library. One challenge was the heterogeneity in programming experience, which made it necessary to split the students into two groups, beginner and advanced.
  • In contrast to courses covering similar topics at university-level, the students did not have the necessary mathematical background for the more complicated aspects of network science. Accordingly, it is better to choose less of these and allocate time to introduce the mathematical methods, for example, eigenvectors or differential equations, beforehand.
  • The students very much liked hands-on exercises, for example, the creation of random networks of different connection probabilities with the help of dice, or the creation of a network of the floor plan of the building in which the summer school was held, as shown in the Figure. (A code version of the dice exercise is sketched below.)
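To give a flavour of the programming exercises, here is a minimal NetworkX sketch of the dice exercise described above (the node count and seed are illustrative):

import networkx as nx

n = 20  # number of nodes; small, as in the classroom exercise
for p in [1/20, 1/10, 1/6, 1/3]:  # the connection probabilities from the dice
    G = nx.gnp_random_graph(n, p, seed=42)  # Erdos-Renyi G(n, p) random graph
    degrees = [d for _, d in G.degree()]
    largest_cc = max(nx.connected_components(G), key=len)
    print(f"p = {p:.3f}: mean degree = {sum(degrees) / n:.2f}, "
          f"largest component = {len(largest_cc)} nodes")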

It was great fun to introduce the students to the topic of network science and I can strongly recommend organising similar outreach events! You can find some of our teaching materials, including the worksheets and programming exercises, in the original German and a translated English version, online. A paper describing our endeavours is under review.

Four Erdős–Rényi random graphs as generated by the participants by rolling dice. A twenty-sided die was used for the probabilities p = 1/20 and p = 1/10, and a six-sided die for p = 1/6 and p = 1/3. This fun exercise allows the discussion of degree distributions, the size of the largest connected component, and similar topics for ER random graphs.

Young Entrepreneurs Scheme (YES) Competition

Fair warning: I’m going to use this BLOPIG post to promote the YES competition and talk about how semi-amazingly we did at it!

For those who don’t know, the YES competition runs yearly and is designed to develop the entrepreneurial spirit amongst graduates and post-graduates. The YES workshops come in two parts, the first being an intensive crash course in small business start-ups, delivered by financial experts, people from successful start-ups, and intellectual property teams. Carefully mixing theory with useful anecdotes, these talks were hugely insightful and all enthusiastically given by people passionate about science start-ups. We were lucky to have many of these speakers mentoring for the second part: the development of our own business plan.

Our team, the fantastically named Team SolOx, developed a licence-selling business for a theoretical catalyst which mimicked photosynthesis. Our product produced lightweight hydrocarbons from atmospheric gases quickly and efficiently. The comic value of our idea aside, we designed a 10-year business plan that saw SolOx develop and license our catalyst. This process was eye-opening, with the mentors highlighting the hurdles we would face and teaching us how to overcome them. Our pitch landed us a place in the final at the Royal Society in London, as one of the winners of the YES Industrial Challenges workshop 2017. Although the final judging panel didn’t find our plan as financially sound as others, we had a fantastic experience and would thoroughly recommend the competition to anyone interested in business start-ups.

Team SolOx: Winners of the YES Industrial Challenges 2017 Workshop. Left to right: Natasha Rhys, Tom Dixon, Joe Bluck, Sarah-Beth Amos and Alex Skates.

Finally, I would like to thank the Systems Approaches to Biomedical Science Centre for Doctoral Training for their financial support and their focus on promoting entrepreneurial skills.