Monthly Archives: June 2018

Le Tour de Farce v6.0

Tuesday the 12th of June brought sun, cycling and beer to the land of OPIG. It was once again time for the annual Tour de Farce.

Le tour, now in its highly refined 6.0 version, covered a route which took us from the statistics department at Oxford (home to many OPIGlets) to our first port of call, The Head of the River pub. We then followed the river Isis (or Thames if you prefer) from the head of the river towards Osney mead.

Passing though Osney we soon arrived at our second waypoint. The Punter. One of the OPIGlets lived locally and so we were met by their trusty companion, who was better behaved than many of the others on Le Tour.

Departing the punter on two wheels (or in one case, on one) we followed the river upstream to The Perch.

 

Our arrival at The Perch was slightly hampered by a certain OPIGlet taking out anything in her path in her excitement.  Mr Sulu.. Ramming speed.

 

Those that survived soon left the perch, as we were once again headed upstream, this time to The Trout.

Having braved about half the journey it was now time for another restorative beverage and to take on supplies.  Sustenance was provided by Jacob’s Inn.   Jacob’s Inn has the advantage of goats, chickens and pigs in the back garden.  Having spent most of the afternoon in each other’s company, the company of pigs was preferable for some.

As we finished dinner, the sun was beginning to set and so we abandoned the original plan of finishing off at The Fishes.  Instead we returned southwards where we closed off the evening with a drink at The Royal Oak, mere yards from where we started the day.

The route of the 2018 v6.0 Tour de Farce.

 

 

Storing your stuff with clever filesystems: ZFS and tmpfs

The filesystem is a a critical component of just about any operating system, however it’s often overlooked. When setting up a new server, the default filesystem options are often ticked and never thought about again. However, there exist a couple of filesystems which can provide some extraordinary features and speed. I’m talking about ZFS and tmpfs.

ZFS was originally developed by Oracle for their Solaris operating system, but has now been open-sourced and is freely available on linux. Tmpfs is a temporary file system which uses system memory to provide fast temporary storage for files. Together, they can provide outstanding reliability and speed for not very much effort.

Hard disk capacity has increased exponentially over the last 50 years. In the 1960s, you could rent a 5MB hard disk from IBM for the equivalent of $130,000 per month. Today you can buy for less than $600 a 12TB disk – a 2,400,000 times increase.

As storage technology has moved on, the filesystems which sit on top of them ideally need to be able to access the full capacity of those ever increasing disks. Many relatively new, or at least in-use, filesystems have serious limitations. Akin to “640K ought to be enough for anybody”, the likes of the FAT32 filesystem supports files which are at most 4GB on a chunk of disk (a partition) which can be at most 16TB. Bear in mind that arrays of disks can provide a working capacity of many times that of a single disk. You can buy the likes of a supermicro sc946ed shelf which will add 90 disks to your server. In an ideal world, as you buy bigger disks you should be able to pile them into your computer and tell your existing filesystem to make use of them, your filesystem should grow and you won’t have to remember a different drive letter or path depending on the hardware you’re using.

ZFS is a 128-bit file system, which means a single installation maxes out at 256 quadrillion zettabytes. All metadata is allocated dynamically so there isn’t the need to pre-allocate inodes and directories can have up to 2^48 (256 trillion) entries. ZFS provides the concept of “vdevs” (virtual devices) which can be a single disk or redundant/striped collections of multiple disks. These can be dynamically added to a pool of vdevs of the same type and your storage will grow onto the fresh hardware.

A further consideration is that both disks of the “spinning rust” variety and SSDs are subject to silent data corruption, i.e. “bit rot”. This can be caused by a number of factors even including cosmic rays, but the consequence is read errors when it comes time to retrieve your data. Manufacturers are aware of this and buried in the small print for your hard disk will be values for “unrecoverable read errors” i.e. data loss. ZFS works around this by providing several mechanisms:

  • Checksums for each block of data written.
  • Checksums for each pointer to data.
  • Scrub – Automatically validates checksums when the system is idle.
  • Multiple copies – Even if you have a single disk, it’s possible to provide redundancy by setting a copies=n variable during filesystem creation.
  • Self-healing – When a bad data block is detected, ZFS fetches the correct data from a redundant copy and replaces it with the correct data.

An additional bonus to ZFS is its ability to de-duplicate data. Should you be working with a number of very similar files, on a normal filesystem, each file will take up space proportional to the amount of data that’s contained. As ZFS keeps checksums of each block of data, it’s able to determine if two blocks contain identical data. ZFS therefore provides the ability to have multiple pointers to the same file and only store the differences between them.

 

ZFS also provides the ability to take a point in time snapshot of the entire filesystem and roll it back to a previous time. If you’re a software developer, got a package that has 101 dependencies and you need to upgrade it? Afraid to upgrade it in case it breaks things horribly? Working on code and you want to roll back to a previous version? ZFS snapshots can be run with cron or manually and provide a version of the filesystem which can be used to extract previous versions of overwritten or deleted files or used to roll everything back to a point in time when it worked.

Similar to deduplication, a snapshot won’t take up any disk extra space until the data starts to change.

The other filesystem worth mentioning is tmpfs. Tmpfs takes part of the system memory and turns it into a usable filesystem. This is incredibly useful for systems which create huge numbers of temporary files and attempt to re-read them. Tmpfs is also just about as fast as a filesystem can be. Compared to a single SSD or a RAID array of six disks, tmpfs blows them out of the water speed wise.

Creating a tmpfs filesystem is simple:
First create your mountpoint for the disk:

mkdir /mnt/ramdisk

Then mount it. The options are saying make it 1GB in size, it’s of type tmpfs and to mount it at the previously created mount point:

mount –t tmpfs -o size=1024m tmpfs /mnt/ramdisk

At this point, you can use it like any other filesystem:

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1  218G 128G   80G  62% /
/dev/sdb1  6.3T 2.4T  3.6T  40% /spinnyrust
tank       946G 3.5G  942G   1% /tank
tmpfs      1.0G 391M  634M  39% /mnt/ramdisk

OPIG at the Oxford Maths Festival

Men with glasses poring over long columns of numbers. Tabulation of averages and creation of data tables. Lots of counting. The public image of statistics hardly corresponds to what OPIG do – even where OPIG’s work is at its most formally statistical.

OPIG exhibited a street stall at the Oxford Maths Festival to try to change that perception. How do you interest passers-by in real statistics without condescending and without oversimplifying? Data is becoming more important in the lives of all kinds of people and we need to be clear that it isn’t magic, but neither is it trivial. We need to prove that the kind of thoughtful reasoning that people put into managing their lives is the same kind of thing we do in data analysis.

Let’s look at one activity that OPIG did on the street at the Oxford Maths Festival.

The idea

We started with a compelling story: rumors were flying during the Second World War about how many tanks the Germans were producing. Allied intelligence needed to figure out if these numbers were true. In its simple retelling, the Germans simply numbered their tanks, but in truth there were two sequences of gearbox numbers, complicated chassis and engine numbers, and a set of numbered wheel moulds which left numbers imprinted in each of the wheels of the tanks.

However you parse it, the relevant problem is finding out how many gearboxes, chassis, wheels, or engines were in use, which tells you how many tanks are being produced, since no tank can have, for example, more than one engine. It’s a little hard to get close to a German tank, so Allied soldiers collected these numbers when they were destroyed.

This problem – finding the largest number of a range 1, 2, 3, 4, … N – is known generally as the German Tank Problem. Most mathematics educators have some familiarity with it, but on the street we found it works really well. This observation presumably speaks to the paucity of mathematics educators on the street.

How it works

The demonstration, while straightforward and quick, has a few subtleties. We cut open a box and cut 4750 slips of paper, numbered from 1 to 4750.

The first step is to ask how many slips of paper are in the box. Answers varied from one hundred to more than 10,000. Shaking the box helped encourage unreliable guesses and prevented people from reading the numbers off the slips in the box.

Then we asked people to pick one piece of paper out of the box and update their guess. We found a few aspects of this to be interesting. It is trivial to convince people that they will not get the top number in their guess. We all, then, have an intuition that the number drawn at this stage should be ‘somewhere in the middle’, which is completely wrong. There’s no particular reason to think that the guess should be in any interval over the distribution. It is true, however, that the mean of the first number chosen after many runs of the demonstration is 2375 and reasoning about average behaviour in this way turns out to be very powerful.

We then would ask people to draw up to four more slips. The point that people should absorb at this stage is that those guesses should assort evenly over the possible distribution, and you only need to add ‘a little bit more’ to compensate for this effect. We then precomputed the best guess for various numbers because the calculation is too tedious for a streetcorner – from which it becomes obvious that having more than five slips is nearly unnecessary to guess the maximum well. (See the histogram of estimates from five-slip draws below.) And 60% of guesses were slightly high of the true number, but the guesses that were too low tended to be much too low.

Going from blind indifference to a really solid guess is a powerful experience, and we can take people through it with every step of the reasoning on display. It shows what data analysis looks like at the research level and can be a great experience for the public.

How to parse OAS data

We have recently released the Observed Antibody Space database – collection of cleaned and annotated antibody sequence (Ig-seq or AIRR-seq) data from 53 studies. We have formatted the data in the format that should facilitate data mining and since release we had several queries on how to parse the data out. Therefore here we give a small example of how to parse the data and make sense of it.

You should download the bulk data file from OAS, available here.

The datasets are separated into ‘data units‘  – collections of sequences that can be uniquely assigned to a range of metadata parameters such as study, organism etc. Our task therefore is to iterate through all those files and read sequences from each of these. Firstly we will attempt to iterate through the files and I will assume that you uncompressed the bulk data file into ../data/json folder. We will write a helper function that will simply list all files with its full paths in a directory and call it list_file_paths

import os

#Fetch all files in directory and subdirectories.
def list_file_paths(directory):
   for dirpath,_,filenames in os.walk(directory):
       for f in filenames:
           yield os.path.abspath(os.path.join(dirpath, f))

if __name__ == '__main__':
    #Replace this with the location of where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        print f

The code above will list all the files in ../data/json which incidentally are all the ‘data units’. Now our task is to parse out the output from each of the data units. They are gzipped files with data element on each line. Therefore we will use gzip library to stream the contents of a gzipped file rather than uncompressing each one of them separately. This is achieved by function parse_single_file

import os,gzip

#Fetch all files in directory and subdirectories.
def list_file_paths(directory):
   for dirpath,_,filenames in os.walk(directory):
       for f in filenames:
           yield os.path.abspath(os.path.join(dirpath, f))

#Parse out the contents of a single file.
def parse_single_file(src):
    #The first line are the meta entries.
    meta_line = True
    for line in gzip.open(src,'rb'):
        print line
    

if __name__ == '__main__':
    #Replace this with the location of where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        parse_single_file(f)

The code above will simply go through all the data unit files, stream the gzipped lines and print each one of them separately. Each line however is formatted as json – meaning it can be parsed using pythons json library and act as a pythonic dictionary. below we have parsed out the basic elements in the final incarnation of the code:

import os,gzip,json,pprint

#Fetch all files in directory and subdirectories.
def list_file_paths(directory):
   for dirpath,_,filenames in os.walk(directory):
       for f in filenames:
           yield os.path.abspath(os.path.join(dirpath, f))

#Parse out the contents of a single file.
def parse_single_file(src):
    #The first line are the meta entries.
    meta_line = True
    for line in gzip.open(src,'rb'):
        if meta_line == True:
                metadata = json.loads(line)
                meta_line=False
                print "Metadata:"
                pprint.pprint(metadata)
                continue
        #Parse actual sequence data.
        basic_data = json.loads(line)
        print "Basic data:"
        pprint.pprint(basic_data)

        #IMGT-Numbered sequence.
        print "IMGT-numbered sequence"
        d = json.loads(basic_data['data'])
        pprint.pprint(d)
        print "==========="
    
if __name__ == '__main__':
    #Replace this with the location of where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        parse_single_file(f)

The first line of each data unit are meta entries. These look as follows:

{u'Age': u'22-70',
 u'Author': u'Halliley et al., (2015)',
 u'BSource': u'Bone-Marrow',
 u'BType': u'Plasma-B-Cells',
 u'Chain': u'Heavy',
 u'Disease': u'None',
 u'Isotype': u'IGHM',
 u'Link': u'https://doi.org/10.1016/j.immuni.2015.06.016',
 u'Longitudinal': u'no',
 u'Size': 934,
 u'Species': u'human',
 u'Subject': u'no',
 u'Vaccine': u'Tetanus/Flu'}

The attributes should be self-explanatory and the existence of this data on top of each file is supposed to streamline searching through data-units if you wish to parse sequences given a particular configuration of meta-data entries (e.g. organism).

Next, the code parses out data on each sequence that is associated with its genes, full sequence, CDR3 and numbered sequence. Therefore the output for this will look something like this:

{u'cdr3': u'ARHQGVYWVTTAGLSH',
 u'data': u'{"fwh1": {"11": "G", "24": "T", "13": "V", "12": "L", "15": "P", "14": "K", "17": "E", "16": "S", "19": "L", "18": "T", "22": "T", "26": "S", "25": "V", "21": "L", "20": "S", "23": "C"}, "fwh3": {"68": "N", "88": "S", "89": "L", "66": "Y", "67": "Y", "82": "T", "83": "S", "80": "V", "81": "D", "86": "Q", "87": "F", "84": "K", "85": "N", "92": "S", "79": "S", "69": "P", "104": "C", "78": "I", "77": "T", "76": "V", "75": "R", "74": "S", "72": "K", "71": "L", "70": "S", "102": "Y", "90": "K", "100": "A", "101": "V", "95": "T", "94": "V", "97": "A", "96": "A", "91": "L", "99": "T", "98": "D", "93": "S", "103": "Y"}, "fwh2": {"52": "W", "39": "W", "48": "Q", "49": "G", "46": "P", "47": "G", "44": "Q", "45": "P", "51": "E", "43": "R", "40": "G", "42": "I", "55": "S", "53": "I", "54": "G", "41": "W", "50": "L"}, "fwh4": {"120": "Q", "121": "G", "122": "T", "123": "L", "124": "V", "125": "P", "126": "V", "127": "S", "128": "S", "119": "G", "118": "W"}, "cdrh1": {"27": "G", "37": "Y", "31": "S", "30": "I", "28": "G", "29": "S", "35": "S", "34": "S", "38": "Y", "36": "S"}, "cdrh2": {"59": "S", "58": "Y", "57": "S", "56": "I", "63": "G", "64": "T", "65": "T"}, "cdrh3": {"111A": "W", "109": "G", "108": "Q", "115": "L", "114": "G", "117": "H", "116": "S", "111": "Y", "110": "V", "113": "A", "112": "T", "112A": "T", "112B": "V", "106": "R", "107": "H", "105": "A"}}',
 u'j': u'IGHJ1*01',
 u'name': 12,
 u'redundancy': 1,
 u'seq': u'GLVKPSETLSLTCTVSGGSISSSSYYWGWIRQPPGQGLEWIGSISYSGTTYYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARHQGVYWVTTAGLSHWGQGTLVPVSS',
 u'v': u'IGHV4-39*07'}

Above, redundancy refers to how many times we see a given sequence (seq) in a particular study. We also store the IMGT-numbered data (the data attribute) which needs a second round of json parsing and its output is a dictionary of IMGT-number – amino acid associations grouped by the regions of an antibody (cdrs and framework regions):

{u'cdrh1': {u'27': u'G',
            u'28': u'G',
            u'29': u'S',
            u'30': u'I',
            u'31': u'S',
            u'34': u'S',
            u'35': u'S',
            u'36': u'S',
            u'37': u'Y',
            u'38': u'Y'},
 u'cdrh2': {u'56': u'I',
            u'57': u'S',
            u'58': u'Y',
            u'59': u'S',
            u'63': u'G',
            u'64': u'T',
            u'65': u'T'},
 u'cdrh3': {u'105': u'A',
            u'106': u'R',
            u'107': u'H',
            u'108': u'Q',
            u'109': u'G',
            u'110': u'V',
            u'111': u'Y',
            u'111A': u'W',
            u'112': u'T',
            u'112A': u'T',
            u'112B': u'V',
            u'113': u'A',
            u'114': u'G',
            u'115': u'L',
            u'116': u'S',
            u'117': u'H'},
 u'fwh1': {u'11': u'G',
           u'12': u'L',
           u'13': u'V',
           u'14': u'K',
           u'15': u'P',
           u'16': u'S',
           u'17': u'E',
           u'18': u'T',
           u'19': u'L',
           u'20': u'S',
           u'21': u'L',
           u'22': u'T',
           u'23': u'C',
           u'24': u'T',
           u'25': u'V',
           u'26': u'S'},
 u'fwh2': {u'39': u'W',
           u'40': u'G',
           u'41': u'W',
           u'42': u'I',
           u'43': u'R',
           u'44': u'Q',
           u'45': u'P',
           u'46': u'P',
           u'47': u'G',
           u'48': u'Q',
           u'49': u'G',
           u'50': u'L',
           u'51': u'E',
           u'52': u'W',
           u'53': u'I',
           u'54': u'G',
           u'55': u'S'},
 u'fwh3': {u'100': u'A',
           u'101': u'V',
           u'102': u'Y',
           u'103': u'Y',
           u'104': u'C',
           u'66': u'Y',
           u'67': u'Y',
           u'68': u'N',
           u'69': u'P',
           u'70': u'S',
           u'71': u'L',
           u'72': u'K',
           u'74': u'S',
           u'75': u'R',
           u'76': u'V',
           u'77': u'T',
           u'78': u'I',
           u'79': u'S',
           u'80': u'V',
           u'81': u'D',
           u'82': u'T',
           u'83': u'S',
           u'84': u'K',
           u'85': u'N',
           u'86': u'Q',
           u'87': u'F',
           u'88': u'S',
           u'89': u'L',
           u'90': u'K',
           u'91': u'L',
           u'92': u'S',
           u'93': u'S',
           u'94': u'V',
           u'95': u'T',
           u'96': u'A',
           u'97': u'A',
           u'98': u'D',
           u'99': u'T'},
 u'fwh4': {u'118': u'W',
           u'119': u'G',
           u'120': u'Q',
           u'121': u'G',
           u'122': u'T',
           u'123': u'L',
           u'124': u'V',
           u'125': u'P',
           u'126': u'V',
           u'127': u'S',
           u'128': u'S'}}

We hope this quick intro to our data format will allow you to do great science with this data.