Biophysical Society 62nd Annual Meeting

In February I was very fortunate to attend the Biophysical Society 62nd Annual Meeting, held in San Francisco – my first real conference and my first trip to North America. Despite arriving with the flu, I had a great time! The conference took place over five days, packed with manageable 15-minute talks covering a huge range of biophysics-related topics, plus a few thousand posters on display (including mine). With almost 6,500 attendees, it was also large enough to slip across the road to the excellent SF Museum of Modern Art without anyone noticing.

The best presentation of the conference was, of course, Saulo’s talk on integrating biological folding features into protein structure prediction [1]. Aside from that, here are a few more of my favourites:

Folding proteins from one end to the other
Micayla A. Bowman, Patricia L. Clark [2]

Here in the COFFEE (COtranslational Folding Family of Expert Enthusiasts) office, we love to talk about the vectorial nature of cotranslational folding and how it contributes to the efficiency of protein folding in vivo. Micayla Bowman and Patricia Clark have created a novel technique that will allow the effects of this vectorial folding to be investigated specifically in vitro.

The Clp complex grabs, unfolds and degrades proteins (diagram from [3]). ClpX, the translocase unit of this complex, was used to recapitulate vectorial protein refolding in vitro for the first time.

ClpX is a AAA+ molecular motor that grabs proteins and translocates them through its pore. In vivo, its role is to denature substrates and feed them to an associated protease (ClpP) [3]. Bowman & Clark have used protein tags to initiate translocation of the target protein through ClpX, resulting in either N-C or C-N vectorial refolding.

The YKB construct used to demonstrate the vectorial folding mediated by ClpX (diagram from [4]).

They demonstrate the effect using YKB, a construct with two mutually exclusive native states: YK-B (fluoresces yellow) and Y-KB (fluoresces blue) [4]. In vitro refolding results in an equal proportion of yellow and blue states. Cotranslational folding, which proceeds in the N-C direction, biases towards the yellow (YK-B) state. C-N refolding in the presence of ClpX and ATP biases towards the blue (Y-KB) state. With this neat assay, they demonstrate that ClpX can mediate vectorial folding in vitro, and they plan to use the assay to investigate its effect on protein folding pathways and yields.

An ambiguous view of protein architecture
Guillaume Postic, Charlotte Périn, Yassine Ghouzam, Jean-Christophe Gelly [Poster abstract: 5, Paper: 6]

This work addresses the ambiguity of domain definition by assigning multiple possible domain boundaries to protein structures. Their automated method, SWORD (Swift and Optimised Recognition of Domains), performs protein partitioning via the hierarchical clustering of protein units (PUs) [7], which are smaller than domains and larger than secondary structures. The structure is first decomposed into protein units, which are then merged depending on the resulting “separation criterion” (relative contact probabilities) and “compactness” (contact density).

Their method is able to reproduce the multiple conflicting definitions that often exist between domain databases such as SCOP and CATH. Additionally, they present a number of cases for which the alternative domain definitions have interesting implications, such as highlighting early folding regions or functional subdomains within “single-domain” structures.

Alternative SWORD domain delineations identify (R) an ultrafast folding domain and (S,T) stable autonomous folding regions within proteins designated single-domain by other methods [6]

Dual function of the trigger factor chaperone in nascent protein folding
Kaixian Liu, Kevin Maciuba, Christian M. Kaiser [8]

The authors of this work used optical tweezers to study the cotranslational folding of the first two domains of the five-domain protein elongation factor G.

In agreement with a number of other presentations at the conference, they report that interactions with the ribosome surface during the early stages of translation slow folding by stabilising disordered states, preventing both native and misfolded conformations. They found that the N-terminal domain (G domain) folds independently, while the subsequent folding of the second domain (domain II) requires the presence of the folded G domain. Furthermore, while partially extruded, unfolded domain II destabilises the native G domain conformation and leads to misfolding. This is prevented in the presence of the chaperone trigger factor, which protects the G domain from unproductive interactions and unfolding by stabilising the native conformation. This work demonstrates interesting mechanisms by which trigger factor and the ribosome can influence the cotranslational folding pathway.

Optical tweezers are used to interrogate the folding pathway of a protein during stalled cotranslational folding. Mechanical force applied to the ribosome and the N-terminus of the nascent chain causes unfolding events, which can be identified as sudden increases in the extension of the chain. (Figure from [9])

Predicting protein contact maps directly from primary sequence without the need for homologs
Thrasyvoulos Karydis, Joseph M. Jacobson [10]

The prediction of protein contacts from primary sequence is an enormously powerful tool, particularly for predicting protein structures. A major limitation is that current methods based on coevolution inference require a large multiple sequence alignment, which cannot be built for targets with few known homologous sequences.

In this talk, Thrasyvoulos Karydis presented CoMET (Convolutional Motif Embeddings Tool), a tool to predict protein contact maps without a multiple sequence alignment or coevolution data. They extract structural and sequence motifs from known sequence-structure pairs, and use a deep convolutional neural network to associate sequence and structure motif embeddings. The method was trained on 137,000 sequence-structure pairs of up to 256 residues, and is able to recreate low-resolution contact map patterns from the primary sequence alone. There is no paper on this yet, but we'll be looking out for it!


1. de Oliveira, S.H. and Deane, C.M., 2018. Exploring Folding Features in Protein Structure Prediction. Biophysical Journal, 114(3), p.36a.
2. Bowman, M.A. and Clark, P.L., 2018. Folding Proteins From One End to the Other. Biophysical Journal, 114(3), p.200a.
3. Baker, T.A. and Sauer, R.T., 2012. ClpXP, an ATP-powered unfolding and protein-degradation machine. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research, 1823(1), pp.15-28.
4. Sander, I.M., Chaney, J.L. and Clark, P.L., 2014. Expanding Anfinsen’s principle: contributions of synonymous codon selection to rational protein design. Journal of the American Chemical Society, 136(3), pp.858-861.
5. Postic, G., Périn, C., Ghouzam, Y. and Gelly, J.C., 2018. An Ambiguous View of Protein Architecture. Biophysical Journal, 114(3), p.46a.
6. Postic, G., Ghouzam, Y., Chebrek, R. and Gelly, J.C., 2017. An ambiguity principle for assigning protein structural domains. Science Advances, 3(1), p.e1600552.
7. Gelly, J.C. and de Brevern, A.G., 2010. Protein Peeling 3D: new tools for analyzing protein structures. Bioinformatics, 27(1), pp.132-133.
8. Liu, K., Maciuba, K. and Kaiser, C.M., 2018. Dual Function of the Trigger Factor Chaperone in Nascent Protein Folding. Biophysical Journal, 114(3), p.552a.
9. Liu, K., Rehfus, J.E., Mattson, E. and Kaiser, C., 2017. The ribosome destabilizes native and non‐native structures in a nascent multi‐domain protein. Protein Science.
10. Karydis, T. and Jacobson, J.M., 2018. Predicting Protein Contact Maps Directly from Primary Sequence without the Need for Homologs. Biophysical Journal, 114(3), p.36a.

Working with Jupyter notebook on a remote server

To celebrate the recent beta release of Jupyter Lab (try it out if you haven't already), today we're going to look at how to run a Jupyter session (Notebook or Lab) on a remote server.

Suppose you have lots of data which lives on a remote server and you want to play with it in a Jupyter notebook. You can't copy the data to your local machine (well, you can, but you're sensible so you won't), but you can run your Jupyter session on the remote server. There's just one problem: since Jupyter notebook is browser-based and works by connecting to a Jupyter session running locally, you can't simply run Jupyter remotely and forward X11 as you would with a traditional graphical IDE. Fortunately, the solution is simple: we run Jupyter remotely, create an ssh tunnel connecting a local port to the one used by the Jupyter session, and connect directly to the Jupyter session using our local browser. The best part is that once the Jupyter session is set up and an ssh tunnel is created, you can connect from any browser on any machine without worrying about X11 forwarding.

Here’s how to do it.

1. First, connect to the remote server if you haven’t already

ssh fergus@funkyserver

1.5. Jupyter takes browser security very seriously, so in order to access a remote session from a local browser we need to set up a password associated with the remote Jupyter session. This is stored in jupyter_notebook_config.py which by default lives in ~/.jupyter. You can edit this manually, but the easiest option is to set the password by running Jupyter with the password argument:

jupyter notebook password
>>> Enter password:

This password will be used to access any Jupyter session running from this installation, so pick something sensible. You can set a new password at any time on the remote server in exactly the same way.

2: Launch a Jupyter session on the remote server. You can specify the access port using the --port option. This might be useful on a shared server where others might be doing the same thing. You’ll also want to run this without launching a browser on the remote server since this is of no use to you.

jupyter lab --port=9000 --no-browser &

Here I’m using Jupyter Lab, but this works in exactly the same way for Jupyter Notebook.

3: Now for the fun part. Jupyter is running on our remote server, but what we really want is to work in our favourite browser on our local machine. To do this we just need to create an ssh tunnel between a port on our machine and the port our Jupyter session is using on the remote server. On our local machine:

ssh -N -f -L 8888:localhost:9000 fergus@funkyserver

For those not familiar with ssh tunneling, we’ve just created a secure, encrypted connection between port 8888 on our local machine and port 9000 on our remote server.

  • -N tells ssh we won’t be running any remote processes using the connection. This is useful for situations like this where all we want to do is port forwarding.
  • -f runs ssh in the background, so we don’t need to keep a terminal session running just for the tunnel.
  • -L specifies that we’ll be forwarding a local port to a remote address and port. In this case, we’re forwarding port 8888 on our machine to port 9000 on the remote server. The name ‘localhost’ just means ‘this computer’. If you’re a Java programmer who lives for verbosity, you could equivalently pass -L localhost:8888:localhost:9000.
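If you find yourself creating this tunnel regularly, you can save retyping the flags with a host entry in your local ~/.ssh/config. This is optional, and the alias below is just a placeholder matching the example above:

Host jupyter-tunnel
    HostName funkyserver
    User fergus
    LocalForward 8888 localhost:9000

With that in place, ssh -N -f jupyter-tunnel sets up the same forwarding as the full command.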

4: If you’ve done everything correctly, you should now be able to access your Jupyter session via port 8888 on your machine. Fire up your favourite browser and type localhost:8888 into the address bar. This should bring up a Jupyter session and prompt you for a password. Enter the password you specified for Jupyter on the remote server.

Congratulations! You now have a Jupyter session running remotely which you can connect to anytime, anywhere, from any machine.

Disclaimer: I haven’t tried this on Windows, nor do I intend to. I value my sanity.

Dealing with indexes when processing co-evolution signals (or how to navigate through “sequence hell”)

Co-evolution techniques provide a powerful way to extract structural information from the wealth of protein sequence data that we now have available. These techniques are predicated upon the notion that residues that share spatial proximity in a protein structure will mutate in a correlated fashion (co-evolve). This co-evolution signal can be inferred from a multiple sequence alignment, which tells us a bit about the evolutionary history of a particular protein family. If you want to get a better sense of the power of co-evolution, you can refer to some of our previous posts (post1, post2).

This is more of a practical post, where I hope to illustrate an indexing problem (and how to circumvent it) that one commonly encounters when dealing with co-evolution signals.

Most of the co-evolution tools available today output pairs of residues (i,j) that were predicted to be co-evolving from a multiple sequence alignment. One of the main applications of these techniques is to predict protein contacts, that is, pairs of residues that are within a predetermined distance of each other (quite often 8Å). Say you want to compare the precision of different co-evolution methods for a particular test set. Your test set would consist of a number of proteins for which the structure is known and for which sufficient sequence information is available for the contact prediction to be carried out. Great!

So you start with your target sequences, generate a number of contact predictions of the form (i,j) for each sequence and, for each pair, you check if the ith and jth residues are in contact (say, less than 8Å apart) on the corresponding known protein structure. If you actually carry out this test, you will obtain appalling precision for a large number of test cases. This is due to an index disparity that a friend of mine quite aptly described as “sequence hell”.

This indexing disparity occurs because there is a mismatch between the protein sequence that was used to produce the contact predictions and the sequence of residues that are represented in a protein structure. Ask a crystallographer friend if you have one, and you will find that in the process of determining a protein's structure experimentally, there are many reasons why residues might be missing from the final structure. What's more, there are even cases where residues had to be added to enable protein expression and/or crystallisation. This implies that the protein sequence (represented by a FASTA file) frequently has more (and sometimes fewer) residues than the protein structure (represented by a PDB file). This means that if the ith and jth residues in your sequence were predicted to be in contact, they do not necessarily correspond to the ith and jth residues in order of appearance in your protein structure. So what do we do now?

A true believer in the purity and innocence of the world would assume that the SEQRES entries in your PDB file, for instance, would come to the rescue. SEQRES records describe the full sequence of each chain in the construct that was used in the experiment, including residues that were added or altered to aid expression and/or crystallisation. In other words, the sequence described by SEQRES would be a good candidate against which to validate your predicted contacts. It does, however, have one limitation: SEQRES also lists residues whose coordinates are missing from the PDB. This means that if you process the PDB sequentially and some residues could not be resolved, the ith residue to appear in the coordinates can be different from the ith residue in the SEQRES.

An even more innocent person, shielded from all the ugliness of the universe, would simply hope that the indexing in the PDB is correct, i.e. that one can use the residue indexes presented in the "6th column" of the ATOM entries and that these would match perfectly the (i,j) pairs you obtained using your protein sequence. While, in theory, I believe this should be the case, in my experience this indexing is often incorrect and, more often than not, will lead to errors when validating protein contacts.

My solution to the indexing problem is to parse the PDB sequentially and extract the sequence of all the residues for which coordinates are actually present. To my knowledge, this is the only tried and tested way of obtaining this information. If you do that, you will be armed with a sequence and indexing that correctly represent the contents of your PDB. From now on, I will refer to these as the PDB-sequence and PDB-sequence indexing.
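To make this concrete, here is a minimal sketch of the extraction step using Biopython (assuming it is installed); the function name pdb_sequence and the choice of chain are just for illustration, not code from the original analysis:

from Bio.PDB import PDBParser
from Bio.SeqUtils import seq1

def pdb_sequence(pdb_file, chain_id):
    """Return the one-letter PDB-sequence and the author residue numbering
    for all residues of a chain that actually have coordinates."""
    structure = PDBParser(QUIET=True).get_structure("target", pdb_file)
    chain = structure[0][chain_id]               # first model, chosen chain
    seq, numbering = [], []
    for residue in chain:
        if residue.id[0] != " ":                 # skip heteroatoms and waters
            continue
        seq.append(seq1(residue.get_resname()))  # three-letter to one-letter code
        numbering.append(residue.id[1])          # author numbering (often unreliable, see above)
    return "".join(seq), numbering

# pdb_seq, pdb_numbering = pdb_sequence("structure.pdb", "A")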

All that is left is to find a correspondence (a mapping) between the sequence you used for the contact prediction and the PDB-sequence. I do that by performing a standard global sequence alignment using the Needleman–Wunsch algorithm. Once in possession of such an alignment, the indexes (i,j) of your original sequence can be matched to adjusted indexes (i',j') in your PDB-sequence indexing. In short, you extract a sequential list of residues as they appear in the PDB, align these to the original protein sequence, and create a new set of residue pairings of the form (i',j') which are representative of the indexing in PDB-sequence space. That means that the i'th residue to appear in the PDB was predicted to be in contact with the j'th residue to appear.
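As an illustration (again a sketch rather than the exact code used), the remapping can be done by walking along the two gapped strings returned by a global alignment, here with Biopython's pairwise2; contacts are assumed to be 1-based positions in the prediction sequence:

from Bio import pairwise2

def map_contacts_to_pdb(full_seq, pdb_seq, contacts):
    """Convert (i, j) pairs indexed on full_seq into (i', j') pairs indexed
    on pdb_seq, dropping pairs that involve unresolved residues."""
    aln = pairwise2.align.globalxx(full_seq, pdb_seq)[0]
    aligned_full, aligned_pdb = aln[0], aln[1]
    full_to_pdb = {}
    i_full = i_pdb = 0
    for a, b in zip(aligned_full, aligned_pdb):
        if a != "-":
            i_full += 1
        if b != "-":
            i_pdb += 1
        if a != "-" and b != "-":
            full_to_pdb[i_full] = i_pdb          # position in full_seq -> position in pdb_seq
    return [(full_to_pdb[i], full_to_pdb[j])
            for i, j in contacts
            if i in full_to_pdb and j in full_to_pdb]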

The problem becomes a little more interesting when you want to validate contact predictions for other proteins with known structure in the same protein family. A more robust approach is to use the multiple sequence alignment created as part of the co-evolution prediction as your basis. You then identify the sequence that best represents the PDB-sequence of your homologous protein by performing N global sequence alignments (where N is the number of sequences in your MSA), one per entry of the MSA. The highest-scoring alignment can then be used to map the indexing. This approach is robust enough that even if your homologous PDB-sequence of interest was not present in the original MSA for whatever reason, you should still get a sensible mapping at the end (all limitations of sequence alignment considered).

One final consideration should be brought to the reader's attention. What happens if, using the sequence as a starting point, one obtains one or more (i,j) pairs where either i or j is not resolved/present in the protein structure? For validation purposes, these pairs are often disregarded. Yet, what does this co-evolutionary signal tell us about the missing residues in the structure? Are they disordered/flexible? Could co-evolution help us identify low-occupancy conformations?

I’ll leave the reader with these questions to digest. I hope this post proves helpful to those braving the seas of “sequence hell” in the near future.

Interactive Illustration of Collaboration Networks with D3

D3 is a JavaScript library for creating interactive data visualisations. D3 stands for Data-Driven Documents, and its advantage is that it builds on web standards such as HTML, CSS, and SVG, which gives maximal compatibility across all modern browsers. It is widely used by journalists and data scientists, and is starting to be used by academic scientists, too.

Here I want to present a simple way of creating interactive illustrations of collaboration networks, e.g. for your own website. Before going into details, have a look at the example below; it presents some of Rosalind Franklin's papers and her co-authors.

[d3-source canvas=”wpd3-3903-2″]

The network illustration is a so-called bipartite network because it consists of two types of nodes: blue nodes represent scientists and orange nodes represent publications. These nodes are connected if a scientist is an author on a publication. The presentation is a so-called force layout, which simulates repelling Coulomb forces between all nodes, while spring-like attractive forces act on nodes that are connected via links. You can drag the nodes around and explore the behaviour of the network.

You will have noticed that you can also see the name of the publications and co-authors when you hover over the nodes. For some of them, a representative figure or photograph is shown, too. You can also double-click the nodes and will be directed to the webpage of the publication or author.

For larger collaboration networks the visualisation is still possible but can get a bit messy, so not showing any figures is advised. Furthermore, the names of the nodes should be shown either below the illustration or as a tooltip (the latter is not possible in WordPress, but works on normal web pages, as here). The illustration below shows all 124 publications written by OPIG, together with their 310 authors. Unfortunately, I had to remove the head of the group, Charlotte Deane, as she is an author on almost all of these papers and would clutter the illustration. In network science, we call such a node an ego node.

[d3-source canvas=”wpd3-3903-0″]

If you want to play around with this, check out my GitHub repository. The creation of the network is fairly simple: you only need a BibTeX file and can then execute a Python script that creates a JSON file, which is read by the JavaScript. This can then be incorporated into any web page easily.
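For a flavour of what that conversion looks like, here is a rough sketch (not the script from the repository) that crudely parses author and title fields and writes the {"nodes": [...], "links": [...]} structure a D3 force layout typically expects; the exact fields your D3 code needs may differ:

import json
import re

def bibtex_to_network(bibtex_file, json_file):
    with open(bibtex_file) as f:
        entries = f.read().split("@")[1:]        # crude split into BibTeX entries
    nodes, links, seen_authors = [], [], set()
    for k, entry in enumerate(entries):
        title = re.search(r'title\s*=\s*[{"](.+?)[}"]', entry, re.I | re.S)
        authors = re.search(r'author\s*=\s*[{"](.+?)[}"]', entry, re.I | re.S)
        if not (title and authors):
            continue
        paper_id = "paper_%d" % k
        nodes.append({"id": paper_id, "name": title.group(1), "type": "publication"})
        for author in authors.group(1).split(" and "):
            author = author.strip()
            if author not in seen_authors:       # one node per unique author
                seen_authors.add(author)
                nodes.append({"id": author, "name": author, "type": "scientist"})
            links.append({"source": author, "target": paper_id})
    with open(json_file, "w") as f:
        json.dump({"nodes": nodes, "links": links}, f, indent=2)

# bibtex_to_network("opig.bib", "network.json")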

PS: To enable D3 in WordPress you need a special plugin. Apparently, it is not up to date with the current stable D3 version 4, but you can load any missing functions manually.

Fun with Proteins and 3D Printing!

When I'm not postdoc-ing, part of my job involves teaching at the Doctoral Training Centre here in Oxford. I mainly teach the first-year students of the Systems Approaches to Biomedical Science CDT – many members of this group are doing (or have done) their DPhils through this program (including myself!). Recently, some other OPIGlets and I were responsible for two modules called Structural Biology and Structure-Based Drug Discovery, and as part of those modules we arranged a practical session on 3D printing.

Most of the time, the way we 'see' protein structures is through a computer screen, using visualisation software such as PyMOL. While useful, these virtual representations have their limitations – since the screen is flat, it's difficult to get a proper feel for the structure1, and it's hard to see how your protein could interact and form assemblies with others. Physical, three-dimensional models, on the other hand, allow you to get hands-on with your structure and understand aspects of your protein that couldn't be gained from simply looking at images. Plus, they look pretty cool!

This year, I printed three proteins for myself (shown in the photo above). Since my most recent work has focused on transmembrane proteins, I felt it was only right to print one – these are proteins that cross membranes, usually to facilitate the transport of molecules in and out of the cell. I chose the structure of a porin (top of the photo), which (as the name suggests) forms a pore in the cell membrane to allow diffusion across it. This particular protein (1A0S) is a sucrose-specific porin from a type of bacteria called Salmonella typhimurium, and it has three chains (coloured blue, pink and purple in the printed model), each of which has a beta barrel structure. You can just about see in the photo that each chain has regions which are lighter in colour – these are the parts that sit in the cell membrane layer; the darker regions are therefore the parts that stick out from the membrane.

My second printed model was the infamous Zika virus (bottom right). Despite all the trouble it has caused in recent years, in my opinion the structure of the Zika virus is actually quite beautiful, with the envelope proteins forming star-like shapes in a highly symmetrical pattern. This sphere of proteins contains the viral RNA. The particular structure I used to create the model (5IRE) was solved using cryo-electron microscopy, and required aligning over 10,000 images of the virus.

Finally, I printed the structure of a six-residue peptide that's probably only interesting to me… Can you tell why?!2

 

1 – However, look at this link for an example of looking at 3D structures using augmented reality!

2 – Hint: Cysteine, Leucine, Alanine, Isoleucine, Arginine, Glutamic Acid…

Teaching Network Science to High School Students

In recent years, a lot of effort has gone into outreach events, in particular for science and mathematics. Here, I am going to describe a summer course on network science which I organised and taught together with Benjamin F. Maier from the Humboldt University of Berlin.

The course was part of an established German summer school called Deutsche Schülerakademie (German Pupils' Academy), an extracurricular event for highly motivated pupils. It lasts sixteen days, and the participants join one of six courses covering the whole range of academic disciplines, from philosophy through music to science.

Our course was titled Netzwerke und Komplexe Systeme (Networks and Complex Systems). Rather than going into too much depth in one particular area, we covered a broad selection of topics, as we wanted to give the students an overview and an idea of how different disciplines approach complex phenomena. We discussed pure mathematics topics such as graph colouring, algorithmic questions such as the travelling salesman problem, social network analysis, computational neuroscience, dynamical systems, and fractals.

A network of the former monastery in Rossleben, where the summer school was held. The students created the network themselves. To parallelise the task they split up into four groups, each covering one level of the building. They then used this network to simulate the spread of a contagious disease, starting at the biological lab (A35, in red).

A couple of thoughts on what went well and which parts might need improvement for future events of this kind:

  • We did a questionnaire beforehand, asking the pupils some questions like "Do you know what a vector is?" as well as about their motivation for joining the course. This was very helpful in getting a rough idea of their knowledge level.
  • We gave them some material to read before the course. In retrospect, it would probably have been better to give them something to read as well as some problems to solve, so that the learning outcome is clearer and more effective.
  • The students gave presentations on topics we chose for them based on their answers to the questionnaire. The presentations were good, but a lot of students overran the allocated time because they were very enthusiastic about their topics.
  • The students were also enthusiastic about the programming exercises, for which we used Python and the NetworkX library. One challenge was the heterogeneity in programming experience, which made it necessary to split them into two groups, beginner and advanced.
  • In contrast to courses covering similar topics at university level, the students did not have the mathematical background necessary for the more complicated aspects of network science. Accordingly, it is better to include fewer of these and to allocate time to introduce the mathematical methods, for example eigenvectors or differential equations, beforehand.
  • The students very much liked hands-on exercises, for example the creation of random networks with different connection probabilities with the help of dice, or the creation of a network of the floor plan of the building in which the summer school was held, as shown in the figure.

It was great fun to introduce the students to the topic of network science, and I can strongly recommend organising similar outreach events to others! You can find some of our teaching materials, including the worksheets and programming exercises, in the original German and a translated English version, online. A paper describing our endeavours is under review.

Four Erdős–Rényi random graphs as generated by the participants by rolling dice. A twenty-sided die was used for the probabilities p = 1/20 and p = 1/10, and a six-sided die for p = 1/6 and p = 1/3. This fun exercise allows the discussion of degree distributions, the size of the largest connected component, and similar topics for ER random graphs.
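For anyone who wants to reproduce the dice exercise on a computer, here is a small NetworkX sketch (the parameter values are just an example) that generates an Erdős–Rényi graph and computes the quantities we discussed with the students:

import networkx as nx

n, p = 20, 1.0 / 6                      # 20 nodes, edge probability of one face of a six-sided die
G = nx.erdos_renyi_graph(n, p, seed=42)

degrees = [d for _, d in G.degree()]
largest_cc = max(nx.connected_components(G), key=len)

print("mean degree:", sum(degrees) / n)
print("size of largest connected component:", len(largest_cc))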

Young Entrepreneurs Scheme (YES) Competition

Fair warning: I’m going to use this BLOPIG post to promote the YES competition and talk about how semi-amazingly we did at it!

For those who don't know, the YES competition runs yearly and is designed to develop the entrepreneurial spirit of graduates and post-graduates. The YES workshops come in two parts, the first being an intensive crash course in small business start-ups, delivered by financial experts, successful start-ups, and intellectual property teams. Carefully mixing theory with useful anecdotes, these talks were hugely insightful and all enthusiastically given by people passionate about science start-ups. We were lucky to have many of these speakers mentoring for the second part: the development of our own business plan.

Our team, the fantastically named Team SolOx, developed a licence-selling business for a theoretical catalyst which mimicked photosynthesis, producing lightweight hydrocarbons from atmospheric gases quickly and efficiently. The comic value of our idea aside, we designed a 10-year business plan that saw SolOx develop and licence the catalyst. The process was eye-opening, with the mentors highlighting the hurdles we would face and teaching us how to overcome them. Our pitch landed us a place in the final at the Royal Society in London, as one of the winners of the YES Industrial Challenges workshop 2017. Although the final judging panel didn't find our plan as financially sound as others, we had a fantastic experience and would thoroughly recommend the competition to anyone interested in business start-ups.

Team SolOx: Winners of the YES Industrial Challenges 2017 Workshop. Left to right: Natasha Rhys, Tom Dixon, Joe Bluck, Sarah-Beth Amos and Alex Skates.

Finally, I would like to thank the Systems Approaches to Biomedical Science Centre for Doctoral Training for their financial support and their focus on promoting entrepreneurial skills.

A short intro to machine precision and how to beat it

Most people who’ve ever sat through a Rounding Error is Bad lecture will be familiar with the following example:

> (0.1+0.1+0.1) == 0.3
FALSE

The reason this is so unsettling is that most of the time we think about numbers in base-10. This means we use ten digits \{0, 1, \dots, 9\}, and we perform arithmetic based on this ten-digit notation. This doesn't always matter much for pen-and-paper maths, but it's an integral part of how we think about more complex operations and, in particular, how we think about accuracy. We see 0.1 as a finite decimal fraction, and so it's only natural that we should be able to do accurate sums with it. And if we can do simple arithmetic, then surely computers can too? In this blog post I'm going to try and briefly explain what causes rounding errors such as the one above, and how we might get away with going beyond machine precision.

Take a number x \in [0, 1), say x=1/3. The decimal representation of x is of the form x=\sum_{i=1}^{\infty} a_i \times 10^{-i}. The a_i \in \{0, 1, \dots, 9\} here are the digits that go after the radix point. In the case of x=1/3 these are all equal: a_i=3, or x=0.333\dots _{10}. Some numbers, such as our favourite x, don't have a finite decimal expansion. Others, such as 0.3, do, meaning that after some i \in \mathbb{N}, all subsequent digits a_{i+j}=0. When we talk about rounding errors and accuracy, what we actually mean is that we only care about the first few digits, say i\leq 5, and we're happy to approximate to x\approx \sum_{i=1}^{5} a_i \times 10^{-i}=0.33333, potentially rounding up at the last digit.

Computers, on the other hand, store numbers in base-2 rather than base-10, which means that they use a different series expansion, x=\sum_{i=1}^{\infty} b_i \times 2^{-i} with b_i \in \{0, 1\}, to represent the same number. Our favourite number x is actually stored as 0.0101010\dots _{2} rather than 0.3333333\dots _{10}, despite the fact it appears as the latter on a computer screen. Crucially, arithmetic is done in base-2 and, since only a finite number of binary digits are stored (i\leq 52 for most purposes these days), rounding errors also occur in base-2.

All numbers with a finite binary expansion, such as 0.25_{10}=0\times 1/2+1\times 1/4=0.01_{2}, also have a finite decimal expansion, meaning we can do accurate arithmetic with them in both systems. However, the reverse isn't true, which is what causes the issue with 0.1+0.1+0.1\neq 0.3. In binary, the nice and tidy 0.1_{10} becomes the infinitely repeating 0.00011001100\dots _{2}. We observe the rounding error because, unlike us, the computer is trying to sum an infinite series.

While it's not possible to do infinite sums with finite resources, there is a way to go beyond machine precision if you want to, at least for rationals x=p/q, where p, q \in \mathbb{N}. In the example above, the issue comes from dividing by 10 on each side of the (in)equality. Luckily for us, we can avoid doing so. Integer arithmetic is easy in any base, and so

> (1+1+1) == 3 
TRUE

Shocking, I know. On a more serious note, it is possible to write an algorithm which calculates the binary expansion of x=p/q using only integer arithmetic. Usually binary expansion happens in this way:

set x, maxIter
initialise b, i=1
while x>0 AND i<=maxIter {
 if 2*x>=1
    b[i]=1
 else
    b[i]=0
 x = 2*x-b[i]
 i = i+1 
}
return b

Problems arise whenever we try to compute something non-integer (the highlighted lines 3, 4, and 8). However, we can rewrite these using x = p/q, shifting the division by q to the right-hand side of each inequality or assignment operator:

set p, q, maxIter
initialise b, i=1 
while p>0 AND i<=maxIter { 
 if 2*p>=q 
    b[i]=1 
 else 
    b[i]=0 
 p = 2*p-b[i]*q 
 i = i+1 
} 
return b

Provided we’re not dealing with monstrously large integers (i.e. as long as we can safely double p), implementing the above lets us compute p/q with arbitrary precision given by maxIter. So we can beat machine precision for rationals! And the combination of arbitrarily accurate rationals and arbitrarily accurate series approximations (think Riemann zeta function, for example) means we can also get the occasional arbitrarily accurate irrational.
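For completeness, here is one way the integer-only version might look in Python; this is my own sketch of the pseudocode above rather than a canonical implementation:

def binary_expansion(p, q, max_iter=64):
    """Binary digits of p/q computed with integer arithmetic only."""
    b = []
    while p > 0 and len(b) < max_iter:
        if 2 * p >= q:              # plays the role of "2*x >= 1"
            b.append(1)
        else:
            b.append(0)
        p = 2 * p - b[-1] * q       # plays the role of "x = 2*x - b[i]"
    return b

# The culprit from the opening example:
# binary_expansion(1, 10, 12) returns [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
# i.e. the start of the repeating expansion 0.000110011..._2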

To sum up, rounding errors are annoying, partly because it’s not always intuitive when and how they happen. As a general rule the best way to avoid them is to make your computer do as little work as possible, and to avoid non-integer calculations whenever you can. But you already knew that, didn’t you?

This post was partially inspired by the undergraduate course on Simulation and Statistical Programming lectured by Prof Julien Berestycki and Prof Robin Evans. It was also inspired by my former maths teacher who used to mark us down for doing more work than necessary even when our solutions were correct. He had a point.

The Seven Summits

Last week my boyfriend Ben Rainthorpe returned from Argentina having successfully climbed Aconcagua – the highest mountain in South America. At a staggering 6963 m above sea level, it is the highest peak outside Asia. The climb took 20 days in total, with a massive 14 hours of hiking and climbing on summit day.

Aconcagua is part of the mountaineering challenge known as the Seven Summits, which is achieved by summiting the highest mountain on each of the seven continents. The feat was first completed in 1985 by Richard Bass, and in 1992 Junko Tabei became the first woman to achieve it. In December, Ben quit his job as a primary school teacher to follow his dream of completing the challenge. Which mountains constitute the Seven Summits is debated and there are a number of different lists; in addition, the challenge can be extended by including the highest volcano on each continent.

The Peaks:

1. Kilimanjaro – Africa (5895 m)

Kilimanjaro is usually the starting point for the challenge. At 5895 m above sea level, with no technical climbing required, it is a good introduction to high-altitude trekking. However, this often means it is underestimated, and the most common cause of death on the mountain is altitude sickness.

2. Aconcagua – South America (6963 m)

The next step up from Kilimanjaro, Aconcagua is the second highest of the Seven Summits. The lack of technical climbing required makes it a good second peak to ascend, although this time crampons and ice axes are needed, and the trek takes three weeks instead of one.

3. Elbrus – Europe (5,642 m)

Heralded as the Kilimanjaro of Europe, Elbrus even has a chair lift part of the way up! This mountain is regularly underestimated, causing a high number of fatalities per year. Due to the snowy conditions, crampons and ice axes are once again required. Some believe that Elbrus should not count as the European peak and that Mont Blanc should be summited instead – a much more technical and dangerous climb.

4. Denali – North America (6190 m).

Denali is a difficult mountain to summit. Although slightly lower than some of the other peaks, its distance from the equator means the effects of altitude are more keenly felt, and more technical skills are needed. In addition, there are no porters to help carry additional gear, so climbers must carry a full pack and drag a sled.

5. Vinson Massif – Antarctica (4892 m).

Vinson is difficult because of its location rather than any technical climbing. The cost of travelling to Antarctica is great, and the conditions are something to be battled with.

6. Puncak Jaya – Australasia (4884 m) or Kosciuszko – Australia (2228 m)

The original Seven Summits list included Mount Kosciuszko in Australia – the shortest and easiest climb of the seven. However, it is now generally agreed that Puncak Jaya is the offering from the Australasian continent. Despite being smaller than the others on the list, it is the hardest of the seven to climb, with the highest technical rating. It is also located in an area that is highly inaccessible to the public due to a large mine, and is one of the few where rescue by helicopter is not possible.

7. Everest – Asia (8848 m).

Everest is the highest mountain in the world at 8848 m above sea level. Many regard the trek to Everest Base Camp as challenge enough. Some technical climbing is required, as well as bottled oxygen to safely reach altitudes of that level. One of the most dangerous parts is the Khumbu Icefall, which must be traversed every time the climbers leave Base Camp. As of 2017, at least 300 people have died on Everest – most of their bodies still remain on the mountain.

Ben has now climbed two of the Seven Summits. His immediate plans are to tackle Elbrus in July (which I might try and tag along to) and Vinson next January. If you are interested in his progress, check out his Instagram (@benrainthorpe).

TCR Database

Back-to-back posting – I wanted to talk about the growing volume of TCR structures in the PDB. A couple of weeks ago I presented my database, STCRDab, to the group; it is now available at http://opig.stats.ox.ac.uk/webapps/stcrdab.

Unlike other databases, STCRDab is fully automated and updates on Fridays at 9AM (GMT), downloading new TCR structures and annotating them with the IMGT numbering (this also applies to MHCs!). Although the dataset is significantly smaller than, say, the number of antibody structures (currently 3000+ and growing), the recent approval of CAR-T therapies (Kymriah, Yescarta) and the rise of interest in TCR engineering (e.g. Glanville et al., Nature, 2017; Dash et al., Nature, 2017) point toward the value of TCR structures.

Feel free to read more in the paper, and here are some screenshots. 🙂

STCRDab front page.

Look! 5men, literally.

Possibly my new favourite PDB code.

STCRDab annotates structures automatically every Friday!