Author Archives: Konrad Krawczyk

How to parse OAS data

We have recently released the Observed Antibody Space database – collection of cleaned and annotated antibody sequence (Ig-seq or AIRR-seq) data from 53 studies. We have formatted the data in the format that should facilitate data mining and since release we had several queries on how to parse the data out. Therefore here we give a small example of how to parse the data and make sense of it.

You should download the bulk data file from OAS, available here.

The datasets are separated into ‘data units‘  – collections of sequences that can be uniquely assigned to a range of metadata parameters such as study, organism etc. Our task therefore is to iterate through all those files and read sequences from each of these. Firstly we will attempt to iterate through the files and I will assume that you uncompressed the bulk data file into ../data/json folder. We will write a helper function that will simply list all files with its full paths in a directory and call it list_file_paths

import os

#Fetch all files in directory and subdirectories.
def list_file_paths(directory):
   for dirpath,_,filenames in os.walk(directory):
       for f in filenames:
           yield os.path.abspath(os.path.join(dirpath, f))

if __name__ == '__main__':
    #Replace this with the location of where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        print f

The code above will list all the files in ../data/json which incidentally are all the ‘data units’. Now our task is to parse out the output from each of the data units. They are gzipped files with data element on each line. Therefore we will use gzip library to stream the contents of a gzipped file rather than uncompressing each one of them separately. This is achieved by function parse_single_file

import os,gzip

#Fetch all files in directory and subdirectories.
def list_file_paths(directory):
   for dirpath,_,filenames in os.walk(directory):
       for f in filenames:
           yield os.path.abspath(os.path.join(dirpath, f))

#Parse out the contents of a single file.
def parse_single_file(src):
    #The first line are the meta entries.
    meta_line = True
    for line in gzip.open(src,'rb'):
        print line
    

if __name__ == '__main__':
    #Replace this with the location of where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        parse_single_file(f)

The code above will simply go through all the data unit files, stream the gzipped lines and print each one of them separately. Each line however is formatted as json – meaning it can be parsed using pythons json library and act as a pythonic dictionary. below we have parsed out the basic elements in the final incarnation of the code:

import os,gzip,json,pprint

#Fetch all files in directory and subdirectories.
def list_file_paths(directory):
   for dirpath,_,filenames in os.walk(directory):
       for f in filenames:
           yield os.path.abspath(os.path.join(dirpath, f))

#Parse out the contents of a single file.
def parse_single_file(src):
    #The first line are the meta entries.
    meta_line = True
    for line in gzip.open(src,'rb'):
        if meta_line == True:
                metadata = json.loads(line)
                meta_line=False
                print "Metadata:"
                pprint.pprint(metadata)
                continue
        #Parse actual sequence data.
        basic_data = json.loads(line)
        print "Basic data:"
        pprint.pprint(basic_data)

        #IMGT-Numbered sequence.
        print "IMGT-numbered sequence"
        d = json.loads(basic_data['data'])
        pprint.pprint(d)
        print "==========="
    
if __name__ == '__main__':
    #Replace this with the location of where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        parse_single_file(f)

The first line of each data unit are meta entries. These look as follows:

{u'Age': u'22-70',
 u'Author': u'Halliley et al., (2015)',
 u'BSource': u'Bone-Marrow',
 u'BType': u'Plasma-B-Cells',
 u'Chain': u'Heavy',
 u'Disease': u'None',
 u'Isotype': u'IGHM',
 u'Link': u'https://doi.org/10.1016/j.immuni.2015.06.016',
 u'Longitudinal': u'no',
 u'Size': 934,
 u'Species': u'human',
 u'Subject': u'no',
 u'Vaccine': u'Tetanus/Flu'}

The attributes should be self-explanatory and the existence of this data on top of each file is supposed to streamline searching through data-units if you wish to parse sequences given a particular configuration of meta-data entries (e.g. organism).

Next, the code parses out data on each sequence that is associated with its genes, full sequence, CDR3 and numbered sequence. Therefore the output for this will look something like this:

{u'cdr3': u'ARHQGVYWVTTAGLSH',
 u'data': u'{"fwh1": {"11": "G", "24": "T", "13": "V", "12": "L", "15": "P", "14": "K", "17": "E", "16": "S", "19": "L", "18": "T", "22": "T", "26": "S", "25": "V", "21": "L", "20": "S", "23": "C"}, "fwh3": {"68": "N", "88": "S", "89": "L", "66": "Y", "67": "Y", "82": "T", "83": "S", "80": "V", "81": "D", "86": "Q", "87": "F", "84": "K", "85": "N", "92": "S", "79": "S", "69": "P", "104": "C", "78": "I", "77": "T", "76": "V", "75": "R", "74": "S", "72": "K", "71": "L", "70": "S", "102": "Y", "90": "K", "100": "A", "101": "V", "95": "T", "94": "V", "97": "A", "96": "A", "91": "L", "99": "T", "98": "D", "93": "S", "103": "Y"}, "fwh2": {"52": "W", "39": "W", "48": "Q", "49": "G", "46": "P", "47": "G", "44": "Q", "45": "P", "51": "E", "43": "R", "40": "G", "42": "I", "55": "S", "53": "I", "54": "G", "41": "W", "50": "L"}, "fwh4": {"120": "Q", "121": "G", "122": "T", "123": "L", "124": "V", "125": "P", "126": "V", "127": "S", "128": "S", "119": "G", "118": "W"}, "cdrh1": {"27": "G", "37": "Y", "31": "S", "30": "I", "28": "G", "29": "S", "35": "S", "34": "S", "38": "Y", "36": "S"}, "cdrh2": {"59": "S", "58": "Y", "57": "S", "56": "I", "63": "G", "64": "T", "65": "T"}, "cdrh3": {"111A": "W", "109": "G", "108": "Q", "115": "L", "114": "G", "117": "H", "116": "S", "111": "Y", "110": "V", "113": "A", "112": "T", "112A": "T", "112B": "V", "106": "R", "107": "H", "105": "A"}}',
 u'j': u'IGHJ1*01',
 u'name': 12,
 u'redundancy': 1,
 u'seq': u'GLVKPSETLSLTCTVSGGSISSSSYYWGWIRQPPGQGLEWIGSISYSGTTYYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARHQGVYWVTTAGLSHWGQGTLVPVSS',
 u'v': u'IGHV4-39*07'}

Above, redundancy refers to how many times we see a given sequence (seq) in a particular study. We also store the IMGT-numbered data (the data attribute) which needs a second round of json parsing and its output is a dictionary of IMGT-number – amino acid associations grouped by the regions of an antibody (cdrs and framework regions):

{u'cdrh1': {u'27': u'G',
            u'28': u'G',
            u'29': u'S',
            u'30': u'I',
            u'31': u'S',
            u'34': u'S',
            u'35': u'S',
            u'36': u'S',
            u'37': u'Y',
            u'38': u'Y'},
 u'cdrh2': {u'56': u'I',
            u'57': u'S',
            u'58': u'Y',
            u'59': u'S',
            u'63': u'G',
            u'64': u'T',
            u'65': u'T'},
 u'cdrh3': {u'105': u'A',
            u'106': u'R',
            u'107': u'H',
            u'108': u'Q',
            u'109': u'G',
            u'110': u'V',
            u'111': u'Y',
            u'111A': u'W',
            u'112': u'T',
            u'112A': u'T',
            u'112B': u'V',
            u'113': u'A',
            u'114': u'G',
            u'115': u'L',
            u'116': u'S',
            u'117': u'H'},
 u'fwh1': {u'11': u'G',
           u'12': u'L',
           u'13': u'V',
           u'14': u'K',
           u'15': u'P',
           u'16': u'S',
           u'17': u'E',
           u'18': u'T',
           u'19': u'L',
           u'20': u'S',
           u'21': u'L',
           u'22': u'T',
           u'23': u'C',
           u'24': u'T',
           u'25': u'V',
           u'26': u'S'},
 u'fwh2': {u'39': u'W',
           u'40': u'G',
           u'41': u'W',
           u'42': u'I',
           u'43': u'R',
           u'44': u'Q',
           u'45': u'P',
           u'46': u'P',
           u'47': u'G',
           u'48': u'Q',
           u'49': u'G',
           u'50': u'L',
           u'51': u'E',
           u'52': u'W',
           u'53': u'I',
           u'54': u'G',
           u'55': u'S'},
 u'fwh3': {u'100': u'A',
           u'101': u'V',
           u'102': u'Y',
           u'103': u'Y',
           u'104': u'C',
           u'66': u'Y',
           u'67': u'Y',
           u'68': u'N',
           u'69': u'P',
           u'70': u'S',
           u'71': u'L',
           u'72': u'K',
           u'74': u'S',
           u'75': u'R',
           u'76': u'V',
           u'77': u'T',
           u'78': u'I',
           u'79': u'S',
           u'80': u'V',
           u'81': u'D',
           u'82': u'T',
           u'83': u'S',
           u'84': u'K',
           u'85': u'N',
           u'86': u'Q',
           u'87': u'F',
           u'88': u'S',
           u'89': u'L',
           u'90': u'K',
           u'91': u'L',
           u'92': u'S',
           u'93': u'S',
           u'94': u'V',
           u'95': u'T',
           u'96': u'A',
           u'97': u'A',
           u'98': u'D',
           u'99': u'T'},
 u'fwh4': {u'118': u'W',
           u'119': u'G',
           u'120': u'Q',
           u'121': u'G',
           u'122': u'T',
           u'123': u'L',
           u'124': u'V',
           u'125': u'P',
           u'126': u'V',
           u'127': u'S',
           u'128': u'S'}}

We hope this quick intro to our data format will allow you to do great science with this data.

Interesting Antibody Papers

Here we highlight two antibody papers, one from the past one more recent. The more recent one talks about developing an affinity maturation model. The older one is a refresher on the Developability Index — how to computationally harness hydrophobicity and accessible surface areas to predict aggregation.

Mouse antibody maturation model — the most expanded (common) clones might not be the ones with highest affinities here (van Kampen lab). The authors of the paper define a model of affinity maturation. The main take-home message of the paper is that the ‘most expanded’ clones might not be the ones with highest affinity — expanded clones are assumed to be the ones ‘responding’ to the antigenic challenge. The model is based on Ordinary Differential Equations, tracing cell fate in a germinal center. The model was compared to experimental expansion data from lymph nodes for accuracy. In each such model one needs to assume a lot of parameters, such as which day post-immunization do we start somatic hypermuatation? The paper is a very nice example of a model of maturation and a good starting point for tracing references citing germinal center biology and numbers for parameters used for models (also the general canon of construction of such models!).

Developability index here. (Trout lab at MIT). The authors touch on a very important subject of antibody developability: after you produced your ab binder, does it have physicochemical characteristics which are suitable to carry on with it as a therapeutic. Such characteristics include stability, expression yields and aggregation propensity. Aggregation propensity is one of the most important factors here as it affects the pharmacokinetics of the drug as well as shelf life. In this manuscript, authors address attempt to predict the aggregation propensity of antibodies. As background data, they use twelve antibodies whose long term stability has been measured over several years. To develop a computational method to predict antibody aggregation propensity, they use a score which combines hydrophobicity and electrostatic factors. The hydrophobicity is an adapted SAP score which the authors developed previously, and whose main parameters are the exposed residue area and hydrophobicity of the residue as defined by Black and Mould. The electrostatics are calculated using PROPKA. Since combining the scores into a predictive model involved parametrization, they use seven of the antibodies to adjust the coefficients. They use the rest to demonstrate that their model has predictive power. Calculation of their models requires a structure of an antibody which they obtain using WAM. Take home messages? It is a nice dataset to play with aggregation prediction and it demonstrates how to calculate electrostatics and hydrophobicity of a molecule.

 

Interesting Antibody Papers

This time round, one older paper, one recent paper. The older one talks about estimating how many H3s can there be in a human body based on sequencing of two individuals (they cap it at 9 million — not that much!). The more recent one is an attempt to define what makes a good antibody in terms of its developability properties (a battery of biophys assays on 150 therapeutic antibodies- amazing dataset to work with).

High resolution description of antibody heavy chain repertoires in (two) humans (Koralov lab at NYU). Here. Two individuals were sequenced and their VDJ frequencies measured. It is widely believed that the VDJ recombination events are largely independent and random. Here however they demonstrate some biases/interplay between the D and J regions. Since H3 falls on the VDJ junction, it might suggest that it affects the total choice of H3. Another quite important point is that they compared the productive vs nonproductive sequences (out of frame or with stop codons). If there were significant differences between the VDJ frequencies of productive vs nonproductive sequences, it would suggest selection at the VDJ frequency stage. However they do not see any significant differences here suggesting that VDJ combinations have little bearing on this initial selection step. Finally, they estimate the number of H3 in repertoire. The technique is interesting — they sample 1000 H3s from their set and see how many unique sequences it contributes. Each next sample contributes less and less unique sequences which leads to a log-decay curve. By doing so they get a rough estimate of when there will be no more new sequences added and hence an estimate of diversity (think why do this rather than counting the number of uniques!). They allow themselves to extrapolate this estimate to the whole organism by multiplying their blood sample by the total human body volume — they motivate this extrapolation by the fact that there were precious little overlaps between the two human subjects.

Biophysical landscape of clinical stage antibodies [here]. Paper from Adimab. Designing an antibody which binding its target is only a first step on the way to bring the drug on the market. The molecule needs to fulfill a variety of characteristics such as colloidal stability (does not aggregate or ‘clump up’), does not instantly clear from the organism (which is usually down to off target binding), is stable and can be expressed in reasonable quantities. In an effort to delineate what makes a good antibody, the authors take inspiration from earlier work on small molecules, namely the Lipinski Rules of Five. This set of rules describes what makes a ‘good’ small molecule drug, which was assessed by looking at ~2000 therapeutic drugs. The rules came down to certain numbers of hydrogen bond donors, acceptors, molecular weight & lipophilicity. Therefore, Jain et al would like a similar methodology, but for antibodies: give me an antibody and using methodology/rules we define, we will tell you either to carry on with development or maybe not. Since antibodies are far more complex and the data on therapeutic abs orders of magnitude smaller (around 50 therapeutic abs to date) Jain et al, had to devise a more nuanced approach than simply counting hb donors/acceptors mass etc. The underlying ‘good’ molecule data though is similar: they picked therapeutic antibodies and those in late clinical testing stages (2,3). This resulted in ~150 antibodies. So as to devise the benchmark ‘rules/methodology’, they went for a battery of assays to serve as a benchmark — if your ab raises too many red flags according to these assays, it’s not great (what constitutes a red flag to be defined). These assays were supposed to not be obscure and relatively easy to use as the point was that an arbitrary antibody can be relatively easy checked against them. The assays are a range of expression, cross reactivity, self reactivity, thermal stability etc. To define red flags, they run their therapeutic/clinical antibodies through the tests. To their surprise quite a lot of these molecules turn out to have quite ‘undesirable characteristics’. Following the Lipinski Rules, they define a red flag as being in the bottom 10th percentile of the assay values as evaluated on the therapeutic abs. They show that the antibodies which are approved or in more advanced clinical trials stages have less red flags. Therefore, the take-home messages from this paper: very nice dataset for any computational work, raising red flags does not disqualify you from being a therapeutic.

Interesting Antibody Papers

Hints how broadly neutralizing antibodies arise (paper here). (Haynes lab here) Antibodies can be developed to bind virtually any antigen. There is a stark difference however between the ‘binding’ antibodies and ‘neutralizing’ antibodies. Binding antibodies are those that make contact with the antigen and perhaps flag it for elimination. This is in contrast to neutralizing antibodies, whose binding eliminates the biological activity of the antigen. A special class of such neutralizing antibodies are ‘broad neutralizing antibodies’. These are molecules which are capable of neutralizing multiple strains of the antigen. Such broadly neutralizing antibodies are very important in the fight against highly malleable diseases such as Influenza or HIV.

The process how such antibodies arise is still poorly understood. In the manuscript of Williams et al., they make a link between the memory and plasma B cells of broadly neutralizing antibodies and find their common ancestor. The common ancestor turned out to be auto-reactive, which might suggest that some degree of tolerance is necessary to allow for broadly neutralizing abs (‘hit a lot of targets fatally’). From a more engineering perspective, they create chimeras of the plasma and memory b cells and demonstrate that they are much more powerful in neutralizing HIV.

Ineresting data: their crystal structures are different broadly neutralizing abs co-crystallized with the same antigen (altought small…). Good set for ab-specific docking or epitope prediction — beyond the other case like that in the PDB (lysozyme)! At the time of writing the structures were still on hold in the PDB so watch this space…

Interesting Antibody Papers

Below are two somewhat recent papers that are quite relevant to those doing ab-engineering. The first one takes a look at antibodies as a collection — software which better estimates a diversity of an antibody repertoire. The second one looks at each residue in more detail — it maps the mutational landscape of an entire antibody, showing a possible modulating switch for VL-CL interface.

Estimating the diversity of an antibody repertoire. (Arnaout Lab) paper here. High Throughput Sequencing (or next generation sequencing…) of antibody repertoires allows us to get snapshots of the overall antibody population. Since the antibody population ‘diversity’ is key to their ability to find a binder to virtually any antigen, it is desirable to quantify how ‘diverse’ the sample is as a way to see how broad you need to cast the net. Firstly however, we need to know what we mean by ‘diversity’. One way of looking at it is akin to considering ‘species diversity’, studied extensively in ecology. For example, you estimate the ‘richness’ of species in a sample of 100 rabbits, 10 wolves and 20 sheep. Diversity measures such as Simpson’s index or entropy were used to calculate how biased the diversity is towards one species. Here the sample is quite biased towards rabbits, however if instead we had 10 rabbits, 10 wolves and 10 sheep, the ‘diversity’ would be quite uniform. Back to antibodies: it is desirable to know if a given species of an antibody is more represented than others or if one is very underrepresented. This might indicate healthy vs unhealthy immune system, indicate antibodies carrying out an immune response (when there is more of a type of antibody which is directing the immune response). Problem: in an arbitrary sample of antibody sequences/reads tell me how diverse they are. We should be able to do this by estimating the number of cell clones that gave rise to the antibodies (referred to as clonality). People have been doing this by grouping sequences by CDR3 similarity. For example, sequences with CDR3 identical or more than >95% identity, are treated as the same cell — which is tantamount to being the same ‘species’. However since the number of diverse B cells in a human organism is huge, HTS only provides a sample of these. Therefore some rarer clones might be underrepresented or missing altogether. To address this issue, Arnaout and Kaplinsky developed a methodology called Recon which estimates the antibody sample diversity. It is based on the expectation-maximization algorithm: given a list of species and their numbers, iterate adding parameters until they have a good agreement between the fitted distributions and the given data. They have validated this methodology firstly on the simulated data and then on the DeKosky dataset. The code is available from here subject to their license agreement.

Thorough analysis of the mutational landscape of the entire antibody. [here]. (Germaine Fuh from Affinta/Genentech/Roche). The authors aimed to see how malleable the variable antibody domains are to mutations by introducing all possible modifications at each site in an example antibody. As the subject molecule they have used high-affinity, very stable anti-VEGF antibody G6.31. They argue that this antibody is a good representative of human antibodies (commonly used genes Vh3, Vk1) and that its optimized CDRs might indicate well any beneficial distal mutations. They confirm that the positions most resistant to mutation are the core ones responsible for maintaining the structure of the molecule. Most notably here, they have identified that Kabat L83 position correlates with VL-CL packing. This position is most frequently a phenylalanine and less frequently valine or alanine. This residue is usually spatially close to isoleucine at position LC-106. They have defined two conformations of L83F — in and out:

  1. Out: -50<X1-100 interface.
  2. In: 50<X1<180

Being in either of these positions correlates with the orientation of LC-106 in the elbow region. This in turn affects how big the VL-CL interface is (large elbow angle=small  tight interface; small elbow angle=large interface). The L83 position often undergoes somatic hypermutation, as does the LC-106 with the most common mutation being valine.

Interesting Antibody Papers.

Below are several antibody papers that should be of interest to those dealing with antibody engineering, be it computational or experimental. The running motif in this post will be humanization, or the process of engineering a mouse antibody sequence which binds to a target to look ‘more human’ so as to reduce the immune response (if you need an early citation on this issue, here it is).

We present two papers which talk about antibody humanization directly, one from structural point of view (Choi et al. 2015), the other one highlighting issues facing antibody engineers  mining for information (Martin & Rees, 2016). The third paper (Collins et al. 2015) takes a step back from the issues presented in the other papers and talks broadly about the nature of mouse sequences raised in the lab.

Humanization via structural means [here] (Bailey-Kellogg group). The authors introduce a novel methodology named CoDAH to facilitate humanization of antibodies. They design an approach which makes a tradeoff between sequence and structural humanization scores. The sequence score used is the Human String Content (Laza et al. 2007, Mol Immunol), which calculates how similar the query (murine) sequence is to short stretches of human sequences (mostly germilne). In line with the fact that T-Cells are one of main drivers of anti-biologics immunity, they define the sequences stretches to be 9-mer, as recognized by T-Cells. For the structural score, they use Rotameric energy as calculated by Amber. They demonstrate that constructs designed using their score express and retain affinity towards the target antigen, however they do not appear to prove that the new sequences are not immunogenic.

Extracting data from databases for humanization [here] (Martin group and Rees consulting). The main purpose of this manuscript is to warn potential antibody engineers of the pitfalls of species mis-annotations. They point out that in a routine ‘humanization’ pipeline where we aim to find human sequences given a mouse sequence, a great number of seemingly good ‘human’ templates are not human at all (sources as diverse as IMGT or PDB). This might lead to errors down the line if the engineer does not double check the annotations (unfortunate but true). Many of such annotations arise because the cells in which mouse antibodies are expressed are human cells or because the sequences are chimeric — in either case the annotation would not read mouse or chimeric, but erroneously ‘human’. NB. Another thing to watch in this publication is the fact that authors are working on a sequence database of their own: EMBLIG which is said to collect data from EMBL-ENA (nucleotide repository from EMBL). Hopefully in their database, authors will address the issues that they point out here.

What can we say about antibodies produces by laboratory mice? [here] (Collins group). Authors of this manuscript have addressed the issue that the now available High Throughput Sequencing (HTS) overlooked mouse repertoires. Different mice strains have different susceptibilities to diseases (Houpt, 2002, J Imunol; which might mean that you need to think twice which mice strain to choose for a given target). Currently known antibody repertoire of mice is based on the sequencing of two strains, BALB/c and C57BL/6. Here the authors apply HTS to two strains (BALB/c and C57BL/6) of laboratory mice (eight mice per strain) to get a better snapshot of antibody gene usage. Specifically, they pay close attention to the different genes combinations (VDJ) in the sequences that they obtain. Authors conclude that the repertoires between the two strains are strikingly different and quite restricted — which might mean that the laboratory mice were under very specific pressures (read inbred/overbred). All in all, the VDJ usage numbers that they produce in this publication are a useful reference to know which sequence combinations might be used by antibody engineers.

Interesting Antibody Papers

De Novo H3 prediction by C-terminal kink-biasing (Gray Lab) [here].

Authors introduce an improvement to the prediction of CDR-H3 in the form of a constraint for de-novo decoy generation. Working from the observation that 80% of CDR-H3 have kinked C-Terminal (Weitzner et al., 2015, Structure), they bias the loops to assume this conformation (they prove that it does not force ALL loops to do so!). The constraint is in the form of a pseudo bond angle between Ca for the three C-terminal residues and a pseudo dihedral angle for the three C-terminal residues and one adjacent residue in the framework. The bias takes the form of a penalty score if the generated angle falls outside mean +/- 1s. They use a quite stringent H3 loop benchmark of only 49 loops. Using this constraint on this dataset improves prediction for majority of the loops. They also demonstrate the utility of the score for full Fv homology modeling and Ab-Ag docking.

Therapeutic vs synthetic vs natural antibodies (Ofran Lab) [here].

The authors analyzed 137 Ab-Ag complexes from the PDB. Those from hybridoma and synthetic libraries were classified as ‘Natural’ and those coming from ‘synthetic’ libraries. They demonstrate that synthetic libraries overuse H3 in the number of contacts the antibody forms with the antigen, whereas natural constructs share the paratope with H1& H2 to a larger extent. This, together with their tool, CDRs analyzer (analysis of structural & biochemical properties of ab-ag complex) can be a useful method to inform the design of antibodies.

From the past: TABHU, tools for antibody humanization (Tramontano Lab) [here]. Authors have created a tool to aid antibody humanization. Given a sequence of an antibody, the system would look for the most suitable template from their extensive sequence databases (DIGIT) and germline sequences from IMGT. The templates are assessed on sequence similarity to the query and the similarity of the ‘binding’ mode which is assessed by their paratope prediction tool proABC. After the template had been chosen, the user can produce a structural model of the sequence.

What happens to the Human Immune Repertoire over time?

Last week during the group meeting we talked about a pre-print publication from the Ippolito group in Austin TX (here). The authors were monitoring the antibody repertoire from Bone Marrow Plasma Cells (80% of circulating abs) over a period of 6.5 years. For comparison they have monitored another individual over the period of 2.3 years. In a nutshell, the paper is like Picture 1 — just with antibodies

This is what the paper talks about in a nutshell. How the antibody repertoire looks like taken at different timepoints in an individual's lifetime.

This is what the paper talks about in a nutshell. How the antibody repertoire looks like taken at different timepoints in an individual’s lifetime.

.

The main question that they aimed to answer was: ‘Is the Human Antibody repertoire stable over time‘? It is plausible to think that there should be some ‘ground distribution’ of antibodies that are present over time which act as a default safety net. However we know that the antibody makeup can change radically especially when challenged by antigen. Therefore, it is interesting to ask, does the immune repertoire maintain a fairly stable distribution or not?

Firstly, it is necessary to define what we consider a stable distribution of the human antibody repertoire. The antibodies undergo the VDJ recombination as well as Somatic Hypermutation, meaning that the >10^10 estimated antibodies that a human is capable of producing have a very wide possible variation. In this publication the authors mostly focused on addressing this question by looking at how the usage of possible V, D and J genes and their combinations changes over time.

Seven snapshots of the immune repertoire were taken from the individual monitored over 6.5 years and two from the individual monitored over 2.3 years. Looking at the usage of the V, D and J genes over time, it appears that the proportion in each of the seven time points appears quite stable (Pic 2). Authors claim similar result looking at the combinations.  This would suggest that our antibody repertoire is biased to sample ‘similar’ antibodies over time. These frequencies were compared to the individual who was sampled over the period of 2.3 years and it appears that the differences might not be great between the two.

How the frequencies of V, D  and J genes change (not) over 6.5 years in a single individual

How the frequencies of V, D and J genes change (not) over 6.5 years in a single individual

It is a very interesting study which hints that we (humans) might be sampling the antibodies from a biased distribution — meaning that our bodies might have developed a well-defined safety net which is capable of raising an antibody towards an arbitrary antigen. It is an interesting starting point and to further check this hypothesis, it would be necessary to carry out such a study on multiple individuals (as a minimum to see if there are really no differences between us — which would at the same time hint that the repertoire do not change over time).

 

Convergent affinity maturation.

Antibodies are the first line of defense of our organisms against noxious substances. They are the proteins which we ‘train’ to recognize noxious substances when we get immunized. Therefore understanding the immune response after being presented with an antigen is instrumental in developing novel vaccines.

One hypothesis relating to immune response to an antigen is that different organisms are likely to raise similar or even identical antibodies against the same antigen. Testing this hypothesis has become more realistic recently with the advent of Next Generation Sequencing technologies (NGS). Using NGS techniques it is possible to interrogate the sequential makeup of a large set of B-cells.

Such a study was conducted not that long time ago by Trueck et al. They have analysed antibody repertoires of five individuals, pre and post immunization to check if the immune systems converged on similar antibody sequences. The five individuals were immunized with a conjugate vaccine of HiB, MenC and TT. The antibodies were sequenced from cells extracted pre-vaccination and seven days after vaccination.

Firstly, the antibody repertoire appeared to reflect the fact that an organism was mounting an immune response as the clonality post vaccination was higher than before vaccination (more cells producing similar antibodies). Secondly, authors focused on identifying sequences from the public repertoire — those antibodies that are shared between individuals. This analysis focused on the CDR3 only, of which 47 were shared between at least two of the five individuals. Quite a large proportion of those sequences were known to be specific towards HiB and the enrichment of these was muhch higher in the post-vaccination sample. Only one sequence in this set was known previously to target TT. Nevertheless, relaxing the sequence similarity condition, a lot of sequences related to those known to be TT-specific were found among the five individuals. Most importantly, the number of such sequences was much higher in the post-vaccination samples, indicating that these might indeed have been raised in response to TT stimulation. The same was not true for MenC as hardly any sequences related to this antigen were found in the immune response of the five individuals.

Therefore, authors claim that looking at the enrichment of such shared sequences can be an indicator of the effectiveness of the immune response. They correlate statistics coming from looking at the number of shared sequences which appear to have moderate correlation to the antibody avidity data (even though p-values in some cases are quite high). This indicates that even in such a small set of individuals, antibodies are capable of converging on similar solutions. This might provide clues as to the characteristics antibodies that recognize specific antigens and thus facilitate novel vaccine design.

 

 

 

Drawing Custom Unrooted Trees from Sequence Alignments

Multiple Sequence Alignments can provide a lot of information relating to the relationships between proteins. One notable example was the map of the kinome space published in 2002 (Figure 1).

 

Figure 1. Kinase space as presented by Manning et al. 2002;

Such images organize our thinking about the possible space of such proteins/genes going beyond long lists of multiple sequence alignments. The image in Figure 1, got a revamp later which now is the popular ‘kinome poster’ (Figure 2).

Revamped dendrogram of the kinome fro Fig. 1. Downloaded from http://i.imgur.com/BPLUvfc.png.

Here we have created a script to produce similar dendrograms straight from the multiple sequence alignment files (although clearly not as pretty as Fig 2!). It is not difficult to find software that would produce ‘a dendrogram’ from an MSA but making it do the simple thing of annotating the nodes with colors, shapes etc. with respect to the labels of the genes/sequences is slightly more problematic. Sizes might correspond to the importance of given nodes and colors can organize by their tree branches. The script uses the Biopython module Phylo to construct a tree from an arbitrary MSA and networkx to draw it:

python Treebeard.py
import networkx, pylab
from networkx.drawing.nx_agraph import graphviz_layout
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO

#What color to give to the edges?
e_color = '#ccccff'
#What colors to give to the nodes with similar labels?
color_scheme = {'RSK':'#e60000','SGK':'#ffff00','PKC':'#32cd32','DMPK':'#e600e6','NDR':'#3366ff','GRK':'#8080ff','PKA':'magenta','MAST':'green','YANK':'pink'}
#What sizes to give to the nodes with similar labels?
size_scheme = {'RSK':200,'SGK':150,'PKC':350,'DMPK':400,'NDR':280,'GRK':370,'PKA':325,'MAST':40,'YANK':200}

#Edit this to produce a custom label to color mapping
def label_colors(label):
	color_to_set = 'blue'
	for label_subname in color_scheme:
		if label_subname in label:
			color_to_set = color_scheme[label_subname]
	return color_to_set

#Edit this to produce a custom label to size mapping
def label_sizes(label):
	#Default size
	size_to_set = 20
	for label_subname in size_scheme:
		if label_subname in label:
			size_to_set = size_scheme[label_subname]
	return size_to_set

#Draw a tree whose alignment is stored in msa.phy
def draw_tree():
	
	#This loads the default kinase alignment that should be in the same directory as this script
	aln = AlignIO.read('agc.aln', 'clustal')
	#This will construct the unrooted tree.
	calculator = DistanceCalculator('identity')
	dm = calculator.get_distance(aln)
	constructor = DistanceTreeConstructor()
	tree = constructor.nj(dm)
	G = Phylo.to_networkx(tree)
	node_sizes = []
	labels = {}
	node_colors = []
	for n in G:
		label = str(n)
		if 'Inner' in label:
			#These are the inner tree nodes -- leave them blank and with very small sizes.
			node_sizes.append( 1 )
			labels[n] = ''
			node_colors.append(e_color)
		else:
			#Size of the node depends on the labels!
			node_sizes.append( label_sizes(label) )
			#Set colors depending on our color scheme and label names
			node_colors.append(label_colors(label))
			#set the label that will appear in each node			
			labels[n] = label
	#Draw the tree given the info we provided!
	pos = graphviz_layout(G)
	networkx.draw(G, pos,edge_color=e_color,node_size = node_sizes, labels=labels, with_labels=True,node_color=node_colors)
	#Showing	
	pylab.show()
	#Saving the image -- uncomment
	#pylab.savefig('example.png')

if __name__ == '__main__':
	
	draw_tree()

We are going to use the kinase alignment example to demonstrate how the script can be used. The kinase alignment we use can be found here on the kinase.com website. We load the alignment and construct the unrooted tree using the Bio.Phylo module. Note that on each line of the alignment there is a name. These names are the labels that we use to define the colors and sizes of nodes. There are two dummy functions that achieve that label_nodes() and label_sizes() — if you look at them it should be clear how to define your own custom labeling.

If you download the code and the alignment and run it by:

python Treebeard.py

You should see a similar image as in Fig 3.

Fig 3. Size-color-customized unrooted tree straight from a multiple sequence alignment file of protein kinases. Constructed using the script Treebeard.py