Monthly Archives: February 2018

A short intro to machine precision and how to beat it

Most people who’ve ever sat through a “Rounding Error is Bad” lecture will be familiar with the following example:

> (0.1+0.1+0.1) == 0.3
FALSE

The reason this is so unsettling is that most of the time we think about numbers in base-10. This means we use ten digits \{0, 1, \dots, 9\}, and we perform arithmetic based on this ten-digit notation. This doesn’t always matter much for pen-and-paper maths, but it’s an integral part of how we think about more complex operations and, in particular, how we think about accuracy. We see 0.1 as a finite decimal fraction, and so it’s only natural that we should be able to do accurate sums with it. And if we can do simple arithmetic, then surely computers can too? In this blog post I’m going to try to briefly explain what causes rounding errors such as the one above, and how we might get away with going beyond machine precision.

Take a number x \in [0, 1), say x=1/3. The decimal representation of x is of the form x=\sum_{i=1}^{\infty} a_i \times 10^{-i}. The a_i \in \{0, 1, \dots, 9\} here are the digits that go after the radix point. In the case of x=1/3 these are all equal, a_i=3, i.e. x=0.333\dots _{10}. Some numbers, such as our favourite x, don’t have a finite decimal expansion. Others, such as 0.3, do, meaning that after some i \in \mathbb{N} all subsequent digits vanish: a_{i+j}=0 for every j \geq 1. When we talk about rounding errors and accuracy, what we actually mean is that we only care about the first few digits, say i\leq 5, and we’re happy to approximate to x\approx \sum_{i=1}^{5} a_i \times 10^{-i}=0.33333, potentially rounding up at the last digit.

Computers, on the other hand, store numbers in base-2 rather than base-10, which means that they use a different series expansion, x=\sum_{i=1}^{\infty} b_i \times 2^{-i} with b_i \in \{0, 1\}, to represent the same number. Our favourite number x is actually stored as 0.0101010\dots _{2} rather than 0.3333333\dots _{10}, despite the fact that it appears as the latter on a computer screen. Crucially, arithmetic is done in base-2 and, since only a finite number of binary digits are stored (i\leq 52 for most purposes these days), rounding errors also occur in base-2.

All numbers with a finite binary expansion, such as 0.25_{10}=0\times 1/2+1\times 1/4=0.01_{2}, also have a finite decimal expansion, meaning we can do accurate arithmetic with them in both systems. However, the reverse isn’t true, which is what causes the issue with 0.1+0.1+0.1\neq 0.3. In binary, the nice and tidy 0.1_{10} becomes the infinite 0.00011001100\dots _{2}. We observe the rounding error because, unlike us, the computer is effectively trying to sum an infinite series with only finitely many digits to hand.
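You can see the culprits for yourself by asking for more digits than are printed by default (assuming, as the prompts above suggest, that we’re in R; on a machine using standard 64-bit doubles the output should look something like this):

> sprintf("%.20f", 0.1)
"0.10000000000000000555"
> sprintf("%.20f", 0.1+0.1+0.1)
"0.30000000000000004441"
> sprintf("%.20f", 0.3)
"0.29999999999999998890"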

While it’s not possible to do infinite sums with finite resources, there is a way to go beyond machine precision if you want to, at least for rationals x=p/q, where p, q \in \mathbb{N}. In the example above, the issue comes from dividing by 10 on each side of the (in)equality. Luckily for us, we can avoid doing so: integer arithmetic is exact in any base, and so

> (1+1+1) == 3 
TRUE

Shocking, I know. On a more serious note, it is possible to write an algorithm which calculates the binary expansion of x=p/q using only integer arithmetic. The binary expansion is usually computed like this:

set x, maxIter
initialise b, i=1
while x>0 AND i<=maxIter {
 if 2*x>=1
    b[i]=1
 else
    b[i]=0
 x = 2*x-b[i]
 i = i+1 
}
return b

Problems arise whenever we try to compute something non-integer: the while condition x>0, the test 2*x>=1, and the update x = 2*x-b[i] all operate on the non-integer x. However, we can rewrite them by substituting x = p/q and shifting the division by q to the other side of each inequality / assignment operator:

set p, q, maxIter
initialise b, i=1 
while p>0 AND i<=maxIter { 
 if 2*p>=q 
    b[i]=1 
 else 
    b[i]=0 
 p = 2*p-b[i]*q 
 i = i+1 
} 
return b

Provided we’re not dealing with monstrously large integers (i.e. as long as we can safely double p), implementing the above lets us compute p/q with arbitrary precision given by maxIter. So we can beat machine precision for rationals! And the combination of arbitrarily accurate rationals and arbitrarily accurate series approximations (think Riemann zeta function, for example) means we can also get the occasional arbitrarily accurate irrational.
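For anyone who wants to try it, here is a minimal R sketch of the integer-only pseudocode above (the function name and the default maxIter are my own choices, not from any package):

binary_expansion <- function(p, q, maxIter = 64) {
  # binary digits of p/q, using integer arithmetic only
  b <- integer(0)
  i <- 1
  while (p > 0 && i <= maxIter) {
    if (2 * p >= q) b[i] <- 1 else b[i] <- 0
    p <- 2 * p - b[i] * q
    i <- i + 1
  }
  b
}

> binary_expansion(1, 10, 12)
0 0 0 1 1 0 0 1 1 0 0 1

which is the start of the repeating expansion 0.000110011\dots _{2} of one tenth, computed without a single division.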

To sum up, rounding errors are annoying, partly because it’s not always intuitive when and how they happen. As a general rule the best way to avoid them is to make your computer do as little work as possible, and to avoid non-integer calculations whenever you can. But you already knew that, didn’t you?
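One more practical trick worth mentioning: when floating-point comparisons can’t be avoided, the usual fix is to test equality up to a small tolerance rather than exactly. In R, for instance, all.equal does just that:

> isTRUE(all.equal(0.1+0.1+0.1, 0.3))
TRUE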

This post was partially inspired by the undergraduate course on Simulation and Statistical Programming lectured by Prof Julien Berestycki and Prof Robin Evans. It was also inspired by my former maths teacher who used to mark us down for doing more work than necessary even when our solutions were correct. He had a point.

The Seven Summits

Last week my boyfriend Ben Rainthorpe returned from Argentina having successfully climbed Aconcagua – the highest mountain in South America. At a staggering 6963 m above sea level, it is the highest peak outside Asia. The climb took 20 days in total, with a massive 14 hours of hiking and climbing on summit day.

Aconcagua is part of the mountaineering challenge known as the Seven Summits, which is achieved by summiting the highest mountain on each of the seven continents. The challenge was first completed in 1985 by Richard Bass, and in 1992 Junko Tabei became the first woman to complete it. In December Ben quit his job as a primary school teacher to follow his dream of achieving this feat. Which mountains constitute the Seven Summits is debated, and there are a number of different lists. In addition, the challenge can be extended by including the highest volcano on each continent.

The Peaks:

1. Kilimanjaro – Africa (5895 m)

Kilimanjaro is usually the starting point for the challenge. At 5895 m above sea level, and with no technical climbing required, it is a good introduction to high-altitude trekking. However, this often means it is underestimated, and the most common cause of death on the mountain is altitude sickness.

2. Aconcagua – South America (6963 m)

The next step up from Kilimanjaro, Aconcagua is the second highest of the seven summits. The lack of technical climbing required makes it a good second peak to ascend after Kilimanjaro; however, crampons and ice axes are needed, and the trek takes three weeks instead of one.

3. Elbrus – Europe (5642 m)

Heralded as the Kilimanjaro of Europe, Elbrus even has a chair lift part of the way up! This mountain is regularly underestimated, causing a high number of fatalities each year. Due to the snowy conditions, crampons and ice axes are once again required. Some believe that Elbrus should not count as the European peak and that Mont Blanc should be summited instead – a much more technical and dangerous climb.

4. Denali – North America (6190 m)

Denali is a difficult mountain to summit. Although slightly lower than some of the other peaks, its distance from the equator means the effects of altitude are more keenly felt, and more technical skills are needed. In addition, there are no porters to help carry additional gear, so climbers must carry a full pack and drag a sled.

5. Vinson Massif – Antarctica (4892 m)

Vinson is difficult because of its location rather than any technical climbing. The costs of getting to Antarctica are great, and the conditions are something to be battled with.

6. Puncak Jaya – Australasia (4884 m) or Kosciuszko – Australia (2228 m)

The original Seven Summits list included Mount Kosciuszko in Australia – the shortest and easiest climb on the list. However, it is now generally agreed that Puncak Jaya is the offering from the Australasian continent. Despite being smaller than others on the list, it is the hardest of the seven to climb, with the highest technical rating. It is also located in an area that is highly inaccessible to the public due to a large mine, and it is one of the few where rescue by helicopter is not possible.

7. Everest – Asia (8848 m)

Everest is the highest mountain in the world at 8848 m above sea level. Many regard the trek to Everest Base Camp as challenge enough. Some technical climbing is required, as well as bottled oxygen to safely reach altitudes of that level. One of the most dangerous parts is the Khumbu Icefall, which must be traversed every time climbers leave Base Camp. As of 2017, at least 300 people had died on Everest – most of their bodies remain on the mountain.

Ben has now climbed two of the Seven Summits. His immediate plans are to tackle Elbrus in July (which I might try and tag along to) and Vinson next January. If you are interested in his progress, check out his Instagram (@benrainthorpe).

TCR Database

Back-to-back posting – I wanted to talk about the growing volume of TCR structures in the PDB. A couple of weeks ago I presented my database, STCRDab, to the group; it is now available at http://opig.stats.ox.ac.uk/webapps/stcrdab.

Unlike other databases, STCRDab is fully automated and updates on Fridays at 9 AM (GMT), downloading new TCR structures and annotating them with IMGT numbering (this also applies to MHCs!). Although the dataset is significantly smaller than, say, the number of antibody structures (currently at 3000+ and growing), the recent approval of CAR therapies (Kymriah, Yescarta) and the rising interest in TCR engineering (e.g. Glanville et al., Nature, 2017; Dash et al., Nature, 2017) point toward the value of these structures.

Feel free to read more in the paper, and here are some screenshots. 🙂

STCRDab front page.

Look! 5men, literally.

Possibly my new favourite PDB code.

STCRDab annotates structures automatically every Friday!

ABodyBuilder and model quality

Currently I’m working on developing a new strategy to use FREAD within the ABodyBuilder pipeline. While running some tests, I realised that there were some minor miscalculations of the CDR loops’ RMSD values in my paper.

To start with, the main message of the paper remains the same: the overall quality of the models (Fv RMSD) was correct, and still is. ABodyBuilder isn’t necessarily the most accurate modelling methodology per se, but it’s unique in its ability to estimate the RMSD of its models. It would still be capable of this calculation regardless of what the CDR loops’ RMSD may be, because the accuracy estimation looks at the RMSD data and assigns a probability that a new model structure will have some RMSD value “x”, given the CDR loop’s length. Our website has now been updated in light of these changes too.
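To give a rough flavour of what such an estimate can look like (a purely illustrative sketch of the idea, not the actual ABodyBuilder code; the function and variable names below are made up), one could take the RMSDs of previously modelled loops of the same length and read a probability off their empirical distribution:

estimate_accuracy <- function(x, loop_length, past_rmsd, past_length) {
  # illustrative only: empirical P(RMSD <= x | CDR loop length),
  # where past_rmsd / past_length are hypothetical records of earlier models
  relevant <- past_rmsd[past_length == loop_length]
  if (length(relevant) == 0) return(NA)  # no data for this loop length
  mean(relevant <= x)
}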

Update to Figure 2 of the paper.

Update to Figure S4 of the paper.

Update to Figure S5 of the paper.