Maps are useful. But first, you need to build, store and read a map.

Recently we embarked on a project that required storing a relatively large dictionary with 10M+ key-value pairs. Unsurprisingly, Python took over two hours to build such a dictionary, counting all the time spent extending, accessing and writing to it, AND it eventually crashed. So I turned to C++ for help.

In C++, a map is one of the ways to store string keys with integer values. Since we are concerned with data storage and access, I compared map and unordered_map.

An unordered_map stores a hash table of the keys and mapped values, while a map keeps its keys in sorted order. The important considerations here are:
  • Memory: a map does not carry the hash table and is therefore smaller than an unordered_map.
  • Access: accessing an unordered_map takes O(1) on average, while accessing a map takes O(log n).
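Both containers share essentially the same interface, so they are drop-in replacements for one another. A minimal sketch of the comparison (illustrative only):

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

int main() {
    // Ordered: keys kept sorted in a tree, O(log n) access, no hash table overhead.
    std::map<std::string, int> ordered_counts;
    // Unordered: hash table, O(1) average access, larger memory footprint.
    std::unordered_map<std::string, int> hashed_counts;

    ordered_counts["the"]++;  // value-initialised to 0, then incremented
    hashed_counts["the"]++;

    std::cout << ordered_counts["the"] << "\n";  // prints 1
    return 0;
}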
I eventually chose map, because it is more memory-efficient given the small amount of RAM I have access to. However, each map still takes up about 8GB of RAM at runtime (and I have 1800 objects to run through, each building a different dictionary). Saving these opened another can of worms.
In Python, we could easily use pickle or JSON to serialise the dictionary. In C++, it is common to use the Boost library. Boost offers two kinds of archive: text and binary. Text archives are human-readable, but since I am not really going to open and read 10M+ lines of key-value pairs, I opted for binary archives, which are machine-readable and smaller. (Read more: https://stackoverflow.com/questions/1058051/boost-serialization-performance-text-vs-binary-format .)
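Saving and loading a map this way takes only a few lines. A sketch assuming Boost.Serialization (the function names are mine):

#include <fstream>
#include <map>
#include <string>

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/map.hpp>
#include <boost/serialization/string.hpp>

// Write the map to disk as a Boost binary archive.
void save_map(const std::map<std::string, int>& m, const std::string& path) {
    std::ofstream ofs(path, std::ios::binary);
    boost::archive::binary_oarchive oa(ofs);
    oa << m;
}

// Read the archive back into a map.
void load_map(std::map<std::string, int>& m, const std::string& path) {
    std::ifstream ifs(path, std::ios::binary);
    boost::archive::binary_iarchive ia(ifs);
    ia >> m;
}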
To further shrink the files when saving the maps, I used zlib compression. Conveniently, there is ready-to-use code for this posted online, which saved me some debugging.
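The idea is to push a zlib filter between the archive and the file stream. A sketch assuming Boost.Iostreams (again, the function names are mine):

#include <fstream>
#include <map>
#include <string>

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/serialization/map.hpp>
#include <boost/serialization/string.hpp>

namespace io = boost::iostreams;

// Serialise the map through a zlib compressor into the output file.
void save_compressed(const std::map<std::string, int>& m, const std::string& path) {
    std::ofstream ofs(path, std::ios::binary);
    io::filtering_ostream out;
    out.push(io::zlib_compressor());
    out.push(ofs);
    boost::archive::binary_oarchive oa(out);
    oa << m;
}

// Decompress and deserialise the map from the input file.
void load_compressed(std::map<std::string, int>& m, const std::string& path) {
    std::ifstream ifs(path, std::ios::binary);
    io::filtering_istream in;
    in.push(io::zlib_decompressor());
    in.push(ifs);
    boost::archive::binary_iarchive ia(in);
    ia >> m;
}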
Ultimately this got the total down to 96GB across the 1800 files, all done within 6 hours.

Le Tour de Farce v6.0

Tuesday the 12th of June brought sun, cycling and beer to the land of OPIG. It was once again time for the annual Tour de Farce.

Le tour, now in its highly refined 6.0 version, covered a route which took us from the statistics department at Oxford (home to many OPIGlets) to our first port of call, The Head of the River pub. We then followed the river Isis (or Thames, if you prefer) from the head of the river towards Osney Mead.

Passing through Osney we soon arrived at our second waypoint, The Punter. One of the OPIGlets lived locally and so we were met by their trusty companion, who was better behaved than many of the others on Le Tour.

Departing The Punter on two wheels (or in one case, on one), we followed the river upstream to The Perch.


Our arrival at The Perch was slightly hampered by a certain OPIGlet taking out anything in her path in her excitement. Mr Sulu… ramming speed!


Those that survived soon left The Perch, as we were once again headed upstream, this time to The Trout.

Having braved about half the journey, it was now time for another restorative beverage and to take on supplies. Sustenance was provided by Jacob’s Inn, which has the advantage of goats, chickens and pigs in the back garden. Having spent most of the afternoon in each other’s company, the company of pigs was preferable for some.

As we finished dinner, the sun was beginning to set and so we abandoned the original plan of finishing off at The Fishes.  Instead we returned southwards where we closed off the evening with a drink at The Royal Oak, mere yards from where we started the day.

The route of the 2018 v6.0 Tour de Farce.



Storing your stuff with clever filesystems: ZFS and tmpfs

The filesystem is a critical component of just about any operating system, yet it’s often overlooked. When setting up a new server, the default filesystem options are often ticked and never thought about again. However, there are a couple of filesystems which can provide some extraordinary features and speed: ZFS and tmpfs.

ZFS was originally developed by Sun Microsystems for their Solaris operating system (and is now owned by Oracle), but it has since been open-sourced and is freely available on Linux. Tmpfs is a temporary filesystem which uses system memory to provide fast temporary storage for files. Together, they can provide outstanding reliability and speed for not very much effort.

Hard disk capacity has increased exponentially over the last 50 years. In the 1960s, you could rent a 5MB hard disk from IBM for the equivalent of $130,000 per month. Today you can buy a 12TB disk for less than $600 – a 2,400,000-fold increase in capacity.

As storage technology has moved on, the filesystems which sit on top of it ideally need to be able to access the full capacity of those ever-increasing disks. Many relatively new, or at least still in-use, filesystems have serious limitations. Akin to “640K ought to be enough for anybody”, the likes of the FAT32 filesystem support files of at most 4GB on a chunk of disk (a partition) of at most 16TB. Bear in mind that arrays of disks can provide a working capacity many times that of a single disk – you can buy the likes of a Supermicro SC946ED shelf, which will add 90 disks to your server. In an ideal world, as you buy bigger disks you should be able to pile them into your computer and tell your existing filesystem to make use of them: the filesystem should simply grow, without you having to remember a different drive letter or path depending on the hardware you’re using.

ZFS is a 128-bit filesystem, which means a single installation maxes out at 256 quadrillion zettabytes. All metadata is allocated dynamically, so there is no need to pre-allocate inodes, and directories can have up to 2^48 (about 280 trillion) entries. ZFS provides the concept of “vdevs” (virtual devices), which can be single disks or redundant/striped collections of multiple disks. These can be dynamically added to a pool of vdevs of the same type, and your storage will grow onto the fresh hardware.
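For example (pool and device names are placeholders), creating a redundant pool and later growing it with a second vdev looks like this:

zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd
zpool add tank raidz2 /dev/sde /dev/sdf /dev/sdg /dev/sdh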

A further consideration is that both disks of the “spinning rust” variety and SSDs are subject to silent data corruption, i.e. “bit rot”. This can be caused by a number of factors, even including cosmic rays, but the consequence is read errors when it comes time to retrieve your data. Manufacturers are aware of this, and buried in the small print for your hard disk will be values for “unrecoverable read errors”, i.e. data loss. ZFS works around this by providing several mechanisms (example commands below):

  • Checksums for each block of data written.
  • Checksums for each pointer to data.
  • Scrub – automatically validates checksums when the system is idle.
  • Multiple copies – even if you have a single disk, it’s possible to provide redundancy by setting the copies=n property at filesystem creation.
  • Self-healing – when a bad data block is detected, ZFS fetches the correct data from a redundant copy and repairs the bad block.
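For instance, keeping two copies of everything in a dataset and kicking off a manual scrub (pool and dataset names are illustrative):

zfs create -o copies=2 tank/important
zpool scrub tank
zpool status tank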

An additional bonus of ZFS is its ability to de-duplicate data. Should you be working with a number of very similar files, on a normal filesystem each file will take up space proportional to the amount of data it contains. As ZFS keeps checksums of each block of data, it is able to determine whether two blocks contain identical data. ZFS can therefore keep multiple references to the same block and store only the differences between files.
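Deduplication is switched on per dataset (bear in mind the checksum table it maintains is hungry for RAM):

zfs set dedup=on tank/important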


ZFS also provides the ability to take a point-in-time snapshot of the entire filesystem and roll it back to a previous time. If you’re a software developer with a package that has 101 dependencies and you’re afraid that upgrading it will break things horribly, or you’re working on code and want to return to a previous version, ZFS snapshots are for you. They can be taken via cron or manually, and provide a version of the filesystem which can be used to extract previous versions of overwritten or deleted files, or to roll everything back to a point in time when it worked.

Similar to deduplication, a snapshot won’t take up any extra disk space until the data starts to change.
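Taking a snapshot and rolling back to it looks like this (dataset and snapshot names are illustrative):

zfs snapshot tank/home@before_upgrade
zfs rollback tank/home@before_upgrade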

The other filesystem worth mentioning is tmpfs. Tmpfs takes part of the system memory and turns it into a usable filesystem. This is incredibly useful for workloads which create huge numbers of temporary files and then re-read them. Tmpfs is also just about as fast as a filesystem can be: compared to a single SSD or a RAID array of six disks, it blows them out of the water speed-wise.

Creating a tmpfs filesystem is simple:
First, create your mount point:

mkdir /mnt/ramdisk

Then mount it. The options say to make it 1GB in size, of type tmpfs, and to mount it at the previously created mount point:

mount -t tmpfs -o size=1024m tmpfs /mnt/ramdisk

At this point, you can use it like any other filesystem:

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1  218G 128G   80G  62% /
/dev/sdb1  6.3T 2.4T  3.6T  40% /spinnyrust
tank       946G 3.5G  942G   1% /tank
tmpfs      1.0G 391M  634M  39% /mnt/ramdisk

OPIG at the Oxford Maths Festival

Men with glasses poring over long columns of numbers. Tabulation of averages and creation of data tables. Lots of counting. The public image of statistics hardly corresponds to what OPIG do – even where OPIG’s work is at its most formally statistical.

OPIG exhibited a street stall at the Oxford Maths Festival to try to change that perception. How do you interest passers-by in real statistics without condescending and without oversimplifying? Data is becoming more important in the lives of all kinds of people and we need to be clear that it isn’t magic, but neither is it trivial. We need to prove that the kind of thoughtful reasoning that people put into managing their lives is the same kind of thing we do in data analysis.

Let’s look at one activity that OPIG did on the street at the Oxford Maths Festival.

The idea

We started with a compelling story: during the Second World War, rumours were flying about how many tanks the Germans were producing, and Allied intelligence needed to figure out whether these numbers were true. In the simple retelling, the Germans simply numbered their tanks; in truth there were two sequences of gearbox numbers, complicated chassis and engine numbers, and a set of numbered wheel moulds which left numbers imprinted in each of the wheels of the tanks.

However you parse it, the relevant problem is finding out how many gearboxes, chassis, wheels or engines were in use, which tells you how many tanks were being produced, since no tank can have, for example, more than one engine. It’s a little hard to get close to a working German tank, so Allied soldiers collected these numbers from tanks that had been destroyed.

This problem – estimating the largest number N of a sequence 1, 2, 3, 4, …, N from a sample of its members – is known generally as the German Tank Problem. Most mathematics educators have some familiarity with it, but on the street we found it works really well. This observation presumably speaks to the paucity of mathematics educators on the street.

How it works

The demonstration, while straightforward and quick, has a few subtleties. We took a box and cut 4750 slips of paper, numbered from 1 to 4750.

The first step is to ask how many slips of paper are in the box. Answers varied from one hundred to more than 10,000. Shaking the box helped encourage unreliable guesses and prevented people from reading the numbers off the slips in the box.

Then we asked people to pick one slip of paper out of the box and update their guess. We found a few aspects of this interesting. It is easy to convince people that they will not draw the top number. We all seem to share an intuition that the number drawn at this stage should be ‘somewhere in the middle’, which is completely wrong: the draw is equally likely to land anywhere in the range. It is true, however, that averaged over many runs of the demonstration, the mean of the first number drawn is about 2375, and reasoning about average behaviour in this way turns out to be very powerful.

We would then ask people to draw up to four more slips. The point to absorb at this stage is that the draws spread out evenly over the range, so you only need to add ‘a little bit more’ to the largest number seen to compensate. Because the calculation is too tedious for a street corner, we precomputed the best guess for various draws – from which it becomes obvious that more than five slips are hardly needed to guess the maximum well. (See the histogram of estimates from five-slip draws below.) About 60% of guesses came out slightly above the true number, but the guesses that were too low tended to be much too low.
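For the curious, the standard minimum-variance unbiased estimator makes the ‘little bit more’ precise: with k slips drawn and m the largest number seen,

\hat{N} = m + \frac{m}{k} - 1

so drawing five slips with a maximum of 4000, say, gives an estimate of 4000 + 4000/5 - 1 = 4799.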

Going from blind indifference to a really solid guess is a powerful experience, and we can take people through it with every step of the reasoning on display. It shows what data analysis looks like at the research level and can be a great experience for the public.

How to parse OAS data

We have recently released the Observed Antibody Space database – a collection of cleaned and annotated antibody sequence (Ig-seq or AIRR-seq) data from 53 studies. We have formatted the data in a way that should facilitate data mining, and since the release we have had several queries on how to parse the data. Therefore, here we give a small example of how to parse the data and make sense of it.

You should download the bulk data file from OAS, available here.

The datasets are separated into ‘data units’ – collections of sequences that can be uniquely assigned to a set of metadata parameters such as study, organism, etc. Our task is therefore to iterate through all those files and read the sequences from each. First we will iterate through the files; I will assume that you uncompressed the bulk data file into the ../data/json folder. We will write a helper function, list_file_paths, that simply lists all files in a directory with their full paths:

import os

# Fetch all files in a directory and its subdirectories.
def list_file_paths(directory):
    for dirpath, _, filenames in os.walk(directory):
        for f in filenames:
            yield os.path.abspath(os.path.join(dirpath, f))

if __name__ == '__main__':
    # Replace this with the location where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        print(f)

The code above will list all the files in ../data/json, which are precisely the ‘data units’. Now our task is to parse out the contents of each data unit. They are gzipped files with one data element per line, so we will use the gzip library to stream the contents of each file rather than uncompressing them separately first. This is achieved by the function parse_single_file:

import os
import gzip

# Fetch all files in a directory and its subdirectories.
def list_file_paths(directory):
    for dirpath, _, filenames in os.walk(directory):
        for f in filenames:
            yield os.path.abspath(os.path.join(dirpath, f))

# Print the raw contents of a single gzipped data unit, line by line.
def parse_single_file(src):
    for line in gzip.open(src, 'rb'):
        print(line)

if __name__ == '__main__':
    # Replace this with the location where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        parse_single_file(f)

The code above will simply go through all the data unit files, stream the gzipped lines and print each one separately. Each line, however, is formatted as JSON – meaning it can be parsed using Python’s json library into a dictionary. Below we have parsed out the basic elements in the final incarnation of the code:

import os
import gzip
import json
import pprint

# Fetch all files in a directory and its subdirectories.
def list_file_paths(directory):
    for dirpath, _, filenames in os.walk(directory):
        for f in filenames:
            yield os.path.abspath(os.path.join(dirpath, f))

# Parse out the contents of a single file.
def parse_single_file(src):
    # The first line holds the metadata entries.
    meta_line = True
    for line in gzip.open(src, 'rb'):
        if meta_line:
            metadata = json.loads(line)
            meta_line = False
            print("Metadata:")
            pprint.pprint(metadata)
            continue
        # Parse the actual sequence data.
        basic_data = json.loads(line)
        print("Basic data:")
        pprint.pprint(basic_data)

        # IMGT-numbered sequence: the 'data' attribute is itself a JSON string.
        print("IMGT-numbered sequence:")
        d = json.loads(basic_data['data'])
        pprint.pprint(d)
        print("===========")

if __name__ == '__main__':
    # Replace this with the location where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        parse_single_file(f)

The first line of each data unit holds the metadata entries. These look as follows:

{'Age': '22-70',
 'Author': 'Halliley et al., (2015)',
 'BSource': 'Bone-Marrow',
 'BType': 'Plasma-B-Cells',
 'Chain': 'Heavy',
 'Disease': 'None',
 'Isotype': 'IGHM',
 'Link': 'https://doi.org/10.1016/j.immuni.2015.06.016',
 'Longitudinal': 'no',
 'Size': 934,
 'Species': 'human',
 'Subject': 'no',
 'Vaccine': 'Tetanus/Flu'}

The attributes should be self-explanatory, and having this metadata at the top of each file is meant to streamline searching through data units if you wish to parse only sequences matching a particular configuration of metadata entries (e.g. organism).
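For instance, a minimal sketch building on list_file_paths above (the helper names and query interface here are my own, not part of OAS) that only visits data units whose metadata matches a query:

import gzip
import json

# Read just the metadata line (the first line) of a data unit.
def read_metadata(src):
    with gzip.open(src, 'rb') as handle:
        return json.loads(handle.readline())

# Yield paths of all data units whose metadata matches the given values.
def filter_data_units(directory, **query):
    for f in list_file_paths(directory):
        metadata = read_metadata(f)
        if all(metadata.get(k) == v for k, v in query.items()):
            yield f

# Usage: parse only human, heavy-chain data units.
# for f in filter_data_units('../data/json', Species='human', Chain='Heavy'):
#     parse_single_file(f)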

Next, the code parses out the data on each sequence: its genes, full sequence, CDR3 and IMGT-numbered sequence. The output will look something like this:

{'cdr3': 'ARHQGVYWVTTAGLSH',
 'data': '{"fwh1": {"11": "G", "24": "T", "13": "V", "12": "L", "15": "P", "14": "K", "17": "E", "16": "S", "19": "L", "18": "T", "22": "T", "26": "S", "25": "V", "21": "L", "20": "S", "23": "C"}, "fwh3": {"68": "N", "88": "S", "89": "L", "66": "Y", "67": "Y", "82": "T", "83": "S", "80": "V", "81": "D", "86": "Q", "87": "F", "84": "K", "85": "N", "92": "S", "79": "S", "69": "P", "104": "C", "78": "I", "77": "T", "76": "V", "75": "R", "74": "S", "72": "K", "71": "L", "70": "S", "102": "Y", "90": "K", "100": "A", "101": "V", "95": "T", "94": "V", "97": "A", "96": "A", "91": "L", "99": "T", "98": "D", "93": "S", "103": "Y"}, "fwh2": {"52": "W", "39": "W", "48": "Q", "49": "G", "46": "P", "47": "G", "44": "Q", "45": "P", "51": "E", "43": "R", "40": "G", "42": "I", "55": "S", "53": "I", "54": "G", "41": "W", "50": "L"}, "fwh4": {"120": "Q", "121": "G", "122": "T", "123": "L", "124": "V", "125": "P", "126": "V", "127": "S", "128": "S", "119": "G", "118": "W"}, "cdrh1": {"27": "G", "37": "Y", "31": "S", "30": "I", "28": "G", "29": "S", "35": "S", "34": "S", "38": "Y", "36": "S"}, "cdrh2": {"59": "S", "58": "Y", "57": "S", "56": "I", "63": "G", "64": "T", "65": "T"}, "cdrh3": {"111A": "W", "109": "G", "108": "Q", "115": "L", "114": "G", "117": "H", "116": "S", "111": "Y", "110": "V", "113": "A", "112": "T", "112A": "T", "112B": "V", "106": "R", "107": "H", "105": "A"}}',
 'j': 'IGHJ1*01',
 'name': 12,
 'redundancy': 1,
 'seq': 'GLVKPSETLSLTCTVSGGSISSSSYYWGWIRQPPGQGLEWIGSISYSGTTYYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARHQGVYWVTTAGLSHWGQGTLVPVSS',
 'v': 'IGHV4-39*07'}

Above, redundancy refers to how many times we see a given sequence (seq) in a particular study. We also store the IMGT-numbered data (the data attribute), which needs a second round of JSON parsing. Its output is a dictionary of IMGT number to amino acid associations, grouped by the regions of an antibody (CDRs and framework regions):

{'cdrh1': {'27': 'G',
           '28': 'G',
           '29': 'S',
           '30': 'I',
           '31': 'S',
           '34': 'S',
           '35': 'S',
           '36': 'S',
           '37': 'Y',
           '38': 'Y'},
 'cdrh2': {'56': 'I',
           '57': 'S',
           '58': 'Y',
           '59': 'S',
           '63': 'G',
           '64': 'T',
           '65': 'T'},
 'cdrh3': {'105': 'A',
           '106': 'R',
           '107': 'H',
           '108': 'Q',
           '109': 'G',
           '110': 'V',
           '111': 'Y',
           '111A': 'W',
           '112': 'T',
           '112A': 'T',
           '112B': 'V',
           '113': 'A',
           '114': 'G',
           '115': 'L',
           '116': 'S',
           '117': 'H'},
 'fwh1': {'11': 'G',
          '12': 'L',
          '13': 'V',
          '14': 'K',
          '15': 'P',
          '16': 'S',
          '17': 'E',
          '18': 'T',
          '19': 'L',
          '20': 'S',
          '21': 'L',
          '22': 'T',
          '23': 'C',
          '24': 'T',
          '25': 'V',
          '26': 'S'},
 'fwh2': {'39': 'W',
          '40': 'G',
          '41': 'W',
          '42': 'I',
          '43': 'R',
          '44': 'Q',
          '45': 'P',
          '46': 'P',
          '47': 'G',
          '48': 'Q',
          '49': 'G',
          '50': 'L',
          '51': 'E',
          '52': 'W',
          '53': 'I',
          '54': 'G',
          '55': 'S'},
 'fwh3': {'100': 'A',
          '101': 'V',
          '102': 'Y',
          '103': 'Y',
          '104': 'C',
          '66': 'Y',
          '67': 'Y',
          '68': 'N',
          '69': 'P',
          '70': 'S',
          '71': 'L',
          '72': 'K',
          '74': 'S',
          '75': 'R',
          '76': 'V',
          '77': 'T',
          '78': 'I',
          '79': 'S',
          '80': 'V',
          '81': 'D',
          '82': 'T',
          '83': 'S',
          '84': 'K',
          '85': 'N',
          '86': 'Q',
          '87': 'F',
          '88': 'S',
          '89': 'L',
          '90': 'K',
          '91': 'L',
          '92': 'S',
          '93': 'S',
          '94': 'V',
          '95': 'T',
          '96': 'A',
          '97': 'A',
          '98': 'D',
          '99': 'T'},
 'fwh4': {'118': 'W',
          '119': 'G',
          '120': 'Q',
          '121': 'G',
          '122': 'T',
          '123': 'L',
          '124': 'V',
          '125': 'P',
          '126': 'V',
          '127': 'S',
          '128': 'S'}}

We hope this quick intro to our data format will allow you to do great science with this data.

Prague Protein Spring 2018

We, Constantin and Dominik, the newest members of OPIG (SABS rotation students, as usual), were lucky to have a conference suited to our research fall within our rotation period and, granted an allowance from the powers that be, were able to visit this year’s Prague Protein Spring, on the topic ‘Proteins at Work’. There, we spent four busy but very inspirational days with about 50 participants in a little palace, the Vila Lanna.

The general topic of this meeting led to a broad variety of talks representing a multitude of fields of protein research: from the origins of life, through fuzzy intrinsically disordered proteins and crowded cells, to metagenomics and functional annotation of sequence alignments.

We picked four thought-provoking talks to present at the group meeting on 08/05/2018; here are their summaries:

Protein engineering and in vitro evolution studies for the origins of life

Kosuke Fujishima from the Tokyo Institute of Technology presented several examples of the research he conducts on the origins of life. Research on the origins of life is generally based around the questions of how prebiotic monomers were created, how they condensed into polymers, and how functionality emerged within these polymers.

The first example of his research deals with the condensation of prebiotic monomers at the interface between the ocean and the earth’s crust. Water cycling between the ocean and the outer crust provided an environment of high pressure and high temperature (80–200 °C), which is necessary for amino acid polymerisation. The mineral olivine was found to attract amino acids to its surface, and the serpentinisation reaction that olivine undergoes might provide the necessary wet/dry cycle. The researchers therefore built a reactor to investigate this potential polymerisation mechanism. They found that, when provided with six prebiotic amino acids, 28 out of the 36 possible dipeptides could be found in the reactor. Furthermore, linear polypeptides of up to 10-mers could be detected as well, providing evidence for a mechanism by which the early earth could have generated polypeptides [unpublished].

The second project showed that CysE and CysK, the two enzymes responsible for the present-day production of cysteine from serine, could both be re-engineered to contain no cysteine in their sequences. Interestingly, cysteine-free CysE showed higher reaction rates than the wild type. Further reduction to cysteine- and methionine-free enzyme sequences only worked for CysE, not for CysK [Fujishima et al. (2018)]. Still, the experiments indicate that an enzyme world could have existed with a reduced set of amino acids compared to the 20(+) amino acids we know today.

The third project we wanted to point out used a type of mRNA display that not only links the genotype (mRNA) with its corresponding phenotype (the translated protein) but also allows the translated protein to interact with a randomised, non-translated part of the mRNA. This provided a framework for investigating the evolution of ribonucleoproteins (RNPs). When selecting for ATP binding, it was observed that protein together with RNA had the best fitness landscape, compared to selection on protein or RNA alone. Further analysis revealed that most of the binding affinity of the ribonucleoprotein stemmed from its RNA part [unpublished]. These results suggest that RNA and proteins co-evolved, opposing the idea of a pure RNA world.

RNA-protein interactions and the structure of the genetic code

The next speaker added more to the research area of RNA-protein interaction and evolution. Bojan Zagrovic from the University of Vienna presented his research around the finding that the pyrimidine (PYR) density of RNA regions correlates with the corresponding protein region’s affinity for pyrimidines (running means of 21 amino acids or 63 bases were used), with the highest correlation found between mRNA PYR density and guanine affinity, with an average ‘typical’ Pearson correlation coefficient of 0.80 [Polyansky & Zagrovic (2013)].

This correlation is specific to the current genetic code, as shown by randomly generated genetic codes, which could not reproduce such correlated behaviour, and by looking at three organisms with very different codon usage biases (Homo sapiens, E. coli, M. jannaschii). Even though the three organisms’ average codon usages were very different, the highest-correlating pairs of mRNAs and cognate proteins clustered together, having very similar codon usage. This was also true for the worst-correlating pairs [Hlevnjak & Zagrovic (2015)].

But the big question remains: what does this correlation imply functionally?

Annotation analysis revealed that the highest-correlating pairs were enriched in nucleotide-binding functions and intrinsically disordered proteins. Without claiming generalisability, Professor Zagrovic pointed to a case study on RNA polymerase II, which has a long disordered C-terminus built up of 26 repeats of a seven-amino-acid motif. 248 RNAs were found to interact with RNA polymerase II, and in all three reading frames of the interacting RNAs, codons for the amino acids of the polymerase’s C-terminus were enriched [unpublished].

This hints at some regulation of gene expression, but several other hypotheses were raised as well: the affinity of protein regions for their cognate mRNA regions might be relevant in virus assembly, since coding RNA and translated proteins have to be in close proximity to each other. The same could be true for some non-membrane-bound compartments, e.g. P-bodies. Or is this correlation a hint that mRNAs act as chaperones for their respective proteins? The functional implications of this correlation, while highly speculative, nevertheless suggest exciting research to come.

Fuzziness in protein assemblies

Research from a different, but equally thought-provoking, field was presented by Mónika Fuxreiter from the University of Debrecen. Her talk on the concept of fuzziness in protein complexes, which she introduced 10 years ago [Tompa & Fuxreiter (2008)], shed light on some recent developments in the field, as well as explaining the underlying concept for those of us (ourselves included) who had not encountered it before.

Fuzziness, in the context of protein complexes, describes a phenomenon in which intrinsically disordered proteins, instead of folding upon binding as one would usually expect, sample several conformational states with different propensities. The sampled states contribute with different strengths to the function of the protein complex, leading to varying degrees of disorder in the bound state.

This observation has several implications for understanding the functionality of disordered proteins, since the relative propensities of the different ensemble states in the bound form are thought to be highly susceptible to milieu influences, such as tissue-specific splicing and post-translational modifications. Fuzziness (a term borrowed from the mathematical theory of fuzzy sets) could thus be a driver of the functional adaptability of disordered proteins to cell-cycle stage, environmental influences or tissue type.

Evidence for fuzziness has been curated by the Fuxreiter group in the FuzDB database since 2015 [Miskei et al. (2017)], and has recently been used to develop a prediction algorithm [unpublished] which, according to Professor Fuxreiter, achieves highly accurate predictions of fuzziness on a comprehensive validation dataset.

The implications of fuzziness for understanding the mode of action of disordered proteins (and of disordered regions in otherwise ordered proteins) certainly piqued our interest, not least because of the potential importance of a clear understanding of these modes of action for drug development.

Investigation of mutually exclusive splicing events using the CATH FunFam framework

The last of the four talks we would like to single out in this blog post highlighted recent progress in using structure-based databases for the investigation of complex cellular events.

Christine Orengo from UCL presented her group’s work on mutually exclusive splicing, which employed the FunFam framework of the CATH database to probe the structural and functional implications of these splicing events [Lam et al. (2018), under review].

The FunFams are a subcategory of CATH’s homologous superfamilies, which further divides the superfamilies based on clusters of residue conservation within each family, thus creating groupings of functionally related proteins [Rentzsch & Orengo (2013)].

The mutually exclusive splicing events investigated using this framework are a group of splicing events in which exactly one of several specific exons is present in the spliced mRNA. These exons usually show a high level of sequence similarity, so the splicing event causes little disruption of the protein structure. This feature is thought to be one reason for the relative enrichment of mutually exclusive exons amongst alternative splicing events in the proteome.

This high degree of sequence similarity further enabled the mapping of the mutually exclusive exons to FunFams in the CATH database, and thus onto protein structures. This allowed the Orengo group to conduct a ‘large scale systematic study of the structural/functional effects of MXE splicing’.

Their analysis found that variable residues between the exons are significantly enriched at the protein surface, both compared to other stretches of the protein sequence and compared to non-variable residues in the exons, and in close proximity (< 6 Angstroms) to functional sites of the protein.

The main conclusion drawn from these findings was that, as previously hypothesised, mutually exclusive exons are likely functional switches, since changes in the surface exposed area close to functional sites are likely to affect the protein function without strongly disrupting its structure.

In the eyes of the Orengo group, this makes these splicing events good candidates for drug targeting, particularly in cases where a tissue specific isoform can be drugged, since in that case off-target effects could potentially be significantly reduced.

Sources:

Fujishima et al. (2018). Reconstruction of cysteine biosynthesis using engineered cysteine-free enzymes. Scientific Reports

Hlevnjak & Zagrovic (2015). Malleable nature of mRNA-protein compositional complementarity and its functional significance. Nucleic Acids Res

Lam, S. D., Orengo, C., & Lees, J. (2018). Protein structure and function analyses to understand the implication of mutually exclusive splicing. BioRxiv

Miskei, M. et al (2017). FuzDB: Database of fuzzy complexes, a tool to develop stochastic structure-function relationships for protein complexes and higher-order assemblies. Nucleic Acids Research

Polyansky & Zagrovic (2013). Evidence of direct complementary interactions between messenger RNAs and their cognate proteins. Nucleic Acids Res

Rentzsch, R., & Orengo, C. A. (2013). Protein function prediction using domain families. BMC Bioinformatics

Tompa, P., & Fuxreiter, M. (2008). Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends in Biochemical Sciences


Neuronal Complexity: A Little Goes a Long Way…To Clean My Apartment

The classical model of signal integration in both biological and Artificial Neural Networks looks something like this,

f(\mathbf{s})=g\left(\sum\limits_i\alpha_is_i\right)

where g is some linear or non-linear output function whose parameters \alpha_i adapt to feedback from the outside world through changes to protein-dense structures near the point of signal input, namely the Post-Synaptic Density (PSD). In this simple model, integration is implied to occur at the soma (cell body), where the input signals s_i are combined and the result broadcast to other neurons through downstream synapses via the axon. Generally speaking, neurons (both artificial and otherwise) exist in multilayer networks, composing the inputs of one neuron with the outputs of the others, creating cross-linked chains of computation that have been shown to be universal in their ability to approximate any desired input-output behaviour.
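A minimal sketch of this classical model in Python, taking g to be the logistic sigmoid (the particular choice of g and the example numbers are mine):

import math

def neuron(signals, alphas):
    # Weighted sum of input signals s_i with adaptive parameters alpha_i...
    activation = sum(a * s for a, s in zip(alphas, signals))
    # ...passed through a non-linear output function g (here a sigmoid).
    return 1.0 / (1.0 + math.exp(-activation))

print(neuron([0.5, 1.0, -0.2], [0.8, -0.4, 1.1]))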

See more at Khan Academy

Models of learning and memory have relied heavily on modifications to the PSD to explain modifications in behaviour. Physically, these changes result from alterations in the concentration and density of the neurotransmitter receptors and ion channels that occur in abundance at the PSD; but in actuality these channels occur all along the membrane of the dendrite on which the PSD is located. Dendrites are something like a multi-branched mass of input connections belonging to each neuron. This raises the question of whether learning might in fact occur all along the length of each densely branched dendritic tree.


The Ten Commandments of OPIG

In OPIG one must learn, and one must learn FAST! However, sometimes stupidity in OPIG knows no limits (*cough* James *cough* Anne *cough*), so for the newer (and prospective) members of the group, I thought it wise to share some ground rules, a.k.a. The Ten Commandments of OPIG.

Vaguely adhering to these will drastically improve your time in OPIG (see Exhibit A), and let’s face it, none of them are particularly challenging.

  1. No touchy the supervisor.
  2. No touchy other students.
  3. You’re not late unless you’re after Charlotte. Don’t be late.
  4. All prizes are subject to approval by The Party.
  5. Thou shalt not tomate.
  6. Any and all unattended food is fair game.
  7. Meetings (especially the one before yours) will go on as long as they have to.
  8. Finish your DPhil or die.
  9. This is not a democracy.
  10. NO TOUCHY THE SUPERVISOR!

Bonus (and final rule): if this is your first time at Group Meeting, you have to present (well, at least introduce yourself).

P.S. We’re not that bad, I promise!

Disclaimer: while I’ve categorised this post as “humour”, I take no responsibility for your enjoyment.

My experience with (semi-)automating organic synthesis in the lab

After three years of not touching a single bit of glassware, I have recently donned the white coat and stepped back into the chemistry lab. I am doing this for my PhD project, to make some of the follow-up compounds that my pipeline suggests. However, this time there is a slight difference – I am doing reactions with the aid of a liquid handling robot, the Opentrons. This is my first encounter with (semi-)automated synthesis and definitely a very exciting opportunity! (Thanks to my industrial sponsor, Diamond Light Source!)

A picture of the Opentrons machine I have been using to do some organic reactions. Picture taken from https://opentrons.com/robots.

Opentrons is primarily used by biologists, and their goal is to make a platform on which protocols can easily be shared and each other’s work reproduced (I think we can all agree how nice this would be!). They provide a very easy-to-use API, intended to be accessible to any bench scientist with basic computer skills. From my experience so far, this has been the case, as I found it extremely easy to pick up and write my own protocols for chemical reactions. Here is the command that will: (1) pick up a new pipette tip; (2) transfer a volume from source1 to destination1; (3) drop the pipette tip in the trash; (4) pick up a new pipette tip; (5) transfer a volume from source2 to destination2; and (6) drop the pipette tip in the trash.

pipette.transfer(volume, [source1, source2], [destination1, destination2], new_tip='always')
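For context, a complete protocol built around this call might look like the sketch below, written against the current Opentrons Python API. The labware names, deck slots and volumes here are my own assumptions for illustration, not the actual protocol I ran:

from opentrons import protocol_api

metadata = {'apiLevel': '2.8', 'protocolName': 'Two-transfer sketch'}

def run(protocol: protocol_api.ProtocolContext):
    # Hypothetical deck layout: a reagent plate, a reaction plate and a tip rack.
    reagents = protocol.load_labware('corning_96_wellplate_360ul_flat', 1)
    reactions = protocol.load_labware('corning_96_wellplate_360ul_flat', 2)
    tips = protocol.load_labware('opentrons_96_tiprack_300ul', 3)
    pipette = protocol.load_instrument('p300_single', 'right', tip_racks=[tips])

    # One fresh tip per transfer, exactly as in the command above.
    pipette.transfer(100,
                     [reagents['A1'], reagents['A2']],
                     [reactions['A1'], reactions['A2']],
                     new_tip='always')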

But of course not everything is plain sailing – there are many challenges you will encounter when using an automated pipette. The robot is a liquid handler – it cannot handle solids, so solids need to be pre-weighed and/or made into solution beforehand. Further difficulties lie in the properties of the solvents being handled, for example:

  • Dripping – solvents with low boiling points tend to drip from the tip.
  • Viscosity – viscous liquids take longer to aspirate, and if aspiration is too quick, air pockets may be drawn up, so the correct volume of liquid is not transferred.

Here is a GIF I made of a dry run I was doing with the robot (sorry for the slight shake, this was recorded on my phone in the lab… See their website for professional footage of the robot!)

My (shaky) footage of a dry run I was performing with the Opentrons.

The Curious Case Of A Human Chimera

In my role as a PhD student in OPIG, I integrate and analyse data from various biological and chemical data sources. As I am interested in the intersection between chemistry, biology and daily life, it seems fitting that my next BLOPIG posts will discuss and highlight how biological phenomena have influenced either law or history.

Connection between Law and Biology – The Curious Case Of A Human Chimera
Our scene opens in a dark lab, where a scientist injects himself with an unknown substance. The voiceover notes that they created a monster named “Chimera” while searching for their hero “Bellerophon”. This is the famous opening scene of the movie “Mission: Impossible II”, in which we are introduced to the dangerous bioweapon “Chimera”, a combination of multiple diseases. As the Chimera is a beast from Ancient Greek mythology, with a lion’s head, a goat’s body and a serpent’s tail, the naming of this bioweapon seems appropriate.

What does this dangerous mixture of multiple diseases, an ancient mythological monster and the promised connection between law and biology have in common?

Apart from making for a really bad joke, “chimera” is an actual term in biology, describing a biological entity composed of multiple distinct components – for example, a human organism whose cells are of distinct genotypes.
In the case of tetragametic chimerism, human chimeras thus possess two distinct sets of DNA rather than the usual one, and their organs and tissues are constructed according to whichever genotype the cells of the respective organ or tissue carry.
Tetragametic chimerism arises from the fertilization of two ova by two spermatozoa, which develop into zygotes. These zygotes subsequently fuse into one organism, which continues to develop with two sets of DNA.1-2

But how did such a biological phenomenon like a chimera enter the court of law?

The Romans famously defined the mother of a child as the one who gives birth to it (Mater semper certa est, which can be translated as “the mother is always certain”). I would like to point out that in the age of in-vitro fertilization this principle is no longer reliable, since a child can now have both a genetic mother and a birth mother.3
This principle was put to the test in 2002, when Lydia Fairchild applied for state welfare support for her two children and her third, unborn child. Paternity tests were conducted on all the children to prove her ex-partner’s paternity. While the tests proved the father’s paternity beyond doubt, Lydia was shown to be no genetic match to her children.

Accused of welfare fraud or of being a surrogate, Lydia was ordered by the judge to give birth to her third child in front of witnesses. Blood samples were taken immediately, and they revealed that Lydia did not share DNA with this child either, despite having just given birth to it. Now accused of being a surrogate, Lydia was in a dire position.
Fortunately, Lydia’s lawyer read a journal article about a similar case involving a woman named Karen Keegan.2, 4-5 Karen, a 52-year-old woman, had renal failure. As she needed a kidney transplant, Karen’s sons underwent histocompatibility testing as potential donors. Yet the genetic tests showed that only one of her three sons was related to her.1 Material from her entire body was tested for genetic matches to her sons’ DNA, but only the genetic material of her thyroid matched.2
Ultimately, the researchers concluded that Karen was a tetragametic chimera, born of the fusion of her zygote with that of her twin sibling in her mother’s womb. As Dr. Lynne Uhl, a pathologist and doctor of transfusion medicine at Beth Israel Deaconess Medical Center in Boston, said:
“In her blood, she was one person, but in other tissues, she had evidence of being a fusion of two individuals.”6

Subsequently, scientists collected Lydia’s cell material from various body parts and tested it for a genetic match with her children. The DNA from her cervical smear was found to be a match, while the DNA collected from her skin and hair was not. Additionally, DNA samples from Lydia’s mother matched her children’s DNA.4-5

Interestingly, while both Lydia and Karen carried two sets of DNA as a result of prenatal fusion with their twins, neither showed any phenotypic sign of being a chimera, e.g. different skin types or the so-called Blaschko lines.7-8


  1. https://www.scientificamerican.com/article/3-human-chimeras-that-already-exist/
  2. Yu, N. et al. (2002). Disputed maternity leading to identification of tetragametic chimerism. New England Journal of Medicine, 346.
  3. https://en.wikipedia.org/wiki/Mater_semper_certa_est
  4. https://pictorial.jezebel.com/one-person-two-sets-of-dna-the-strange-case-of-the-hu-1689290862
  5. https://web.archive.org/web/20140301211020/http://www.essentialbaby.com.au/life-style/nutrition-and-wellbeing/when-your-unborn-twin-is-your-childrens-mother-20140203-31woi.html
  6. http://abcnews.go.com/Primetime/shes-twin/story?id=2315693
  7. https://jamanetwork.com/journals/jamadermatology/fullarticle/419529
  8. http://biologicalexceptions.blogspot.co.uk/2015/09/when-youre-not-just-yourself.html

All links were last viewed on 24.04.2018.

My next blog post: Can a mismatch in maternal DNA threaten a government? How Biology can Influence History.