Monthly Archives: December 2013

Journal Club: Large-scale analysis of somatic hypermutations

This week I presented a paper by Burkovitz et al from Bar Ilan University in Israel.  The study investigates the mutations that occur in B-cell maturation and how the propensity for a change to be selected is affected by where in the antibody structure it is located. It nicely combines analysis of both DNA and amino-acid sequence with structural considerations to inform conclusions about how in vivo affinity maturation occurs.

Before being exposed to an antigen, an antibody has a sequence determined by a combination of genes (V and J for the light chain; V, D and J for the heavy chain). Once exposed, B-cells (the cells that produce antibodies), undergo somatic hyper-mutation (SHM) to optimise the antibody-antigen (ab-ag) interaction. These mutations are commonly thought to be promoted at activation-induced deaminase (AID) hotspots.

The authors’ first finding is that the locations of SHMs do not correlate well with the positions of AID hotspots and that the distribution of their distance to a hotspot is not much different to that of the background distribution. They conclude that although perhaps a mechanism to promote mutation, AID hotspots are not a strong factor that indicate whether a mutation will fix.

Motivated to find other determinants for SHM preferences, the study turns to examining structural features and energetics of the molecules. SHMs are found to be more prevalent on the VH domain of an Fv than the VL. However, when present, the energetic importance of an SHM is not related to the domain it is on. In contrast, the contribution an SHM makes to the binding energy is related to its structural location. As one might perhaps expect, those SHMs in positions that can make contact with the antigen have more affect than those that do not. Consideration of their propensity instead of raw frequency also shows that SHMs are more prevalent in antibody-antigen interfaces than in the rest of the molecule. However, they are also likely to occur in the VH-VL interface suggesting an importance for this region in fine-tuning the geometry and flexibility of the binding site.


Figure taken from Burkovitz et al shows a) the location of different structural regions on the Fv b) the energetic contribution of the SHMs in each region c) the fraction of SHMs in the regions and their relative size d) the propensity for an SHM to occur in each of the five structural regions.

Perhaps the most interesting result of this study is the authors’ conclusions about the propensity of SHMs to mutate germline residues to particular amino-acids. It is found that whilst germline amino-acid usage in binding sites is distinctive from other protein-protein interfaces, the residue profiles of SHMs are less diverged. They therefore act to bring the properties ab-ag interaction towards those seen in normal interactions. This may suggest, as proposed by other studies, that the somatic hyper-mutation process is similar to mutation properties observed in evolution. In addition, it is found that five amino-acids, asparagine, arginine, serine, threonine and aspartic acid are the most common substitutions made in SHM. Finally, positions where SHMs most often have an important effect on binding energy are presented. These positions, and the amino-acid preferences provide promising targets for use in rational antibody design procedures.

Journal Club: Quantification and functional analysis of modular protein evolution in a dense phylogenetic tree

For journal club this week I decided to look at a paper by Moore et al. on the modular evolution of proteins.

Modular evolution, or the rearrangement of the domain architecture of a protein, is one of the key drivers behind functional diversification in the protein universe. I used the example in my talk of the multi-domain protein Peptidase T, which contains a catalytic domain homologous to Carboxypeptidase A, a zinc dependent protease. The additional domain in Peptidase T induces the formation of a dimer, which restricts the space around the active site and so affects the specificity of the enzyme.


The multi-domain protein Peptidase T in a dimer (taken from Bashton and Chothia 2007). The active site is circled in green. Carboxypeptidase A is made up of a single domain homologous to the catalytic domain (in blue) of Peptidase T.

I took this case study from a really interesting paper, The generation of new protein functions by the combination of domains (Bashton and Chothia, 2007), which explores several other comparisons between the functions of multi-domain proteins and their single domain homologues.

What this paper does not address however is the directionality of such domain reorganisations. In all these examples, it is not clear whether the multi-domain organisation has evolved from the single domain enzyme or vice versa. Which brings me back to the paper I was presenting on, which attempts a reconstruction of domain arrangements followed by a categorisation of rearrangement events.

Essentially, given a phylogenetic tree of 20 closely related pancrustacean species, the paper takes the different domain arrangements on the genomes (step 1), assigns the presence or absence of each arrangement at interior nodes on the tree (step 2), and then assigns each gained arrangement to one of four possible rearrangement events (step 3).

1. Domain Annotation
The authors use different methods to annotate domains on the genomes. They conclude the most effective methodology is to use the clan level (where families with suspected homologies are joined together… similar to our beloved superfamily classification) of Pfam-A (high quality family alignments with a manually chosen seed sequence). Moreover, they collapse any consecutive stretches of identical domains into one “pseudo-domain”, eliminating the effect of the domain repeat number on an arrangement’s definition.

2. Ancestral State Reconstruction
The ancestral state reconstruction of each domain arrangement (it’s presence/absence at each internal node on the tree) is a result of a 2-pass sweep across the tree: the first from leaves to root, and the second from the root to the leaves. On the first pass the presence of an arrangement on a parent node is decided by majority rule on the state of its children. If the arrangement is present in one child node but absent in the other, the state at the parent node is defined as uncertain. Any uncertain child nodes have a neutral impact on their parent node’s state (i.e. if a parent has a child with the arrangement and a child with an uncertain state the arrangement will be annotated as present in the parent node). On the second pass (from root to leaves) uncertain nodes are decided by the state at their parent node. An uncertain arrangement at the root will be annotated as present. For more details and a clearer explanation see Box 1 in the figure below.


A schematic for the assignment of domain recombination events. Box 1 gives the algorithm for the ancestral state reconstruction. Figure S2 from Moore et al. 2013.

3. Rearrangement events
For each gained event on a particular branch, the authors then postulated one of four simple rearrangement events dependent on the arrangements on the parent’s predicted proteome.

i) Fusion: A gained domain arrangement (A,B,C) on a child’s proteome is the result of fusion if the parent’s proteome contains both the arrangements (A,B) AND (C) (as one example).
ii) Fission: A gained arrangement (A,B,C) is the result of fission if the parent contains the arrangement (A,B,C,D) AND the child also contains the arrangement (D).
iii) Terminal Loss: A gained arrangement (A,B,C) is the result of terminal loss if the parent contains the arrangement (A,B,C,D) AND the child does not contain the arrangement (D).
iv) Domain gain: A gained arrangement (A,B,C) is the result of domain gain if the parent contains (A,B) but not (C).

Any gained arrangement which cannot be explained by these cases (as a single-step solution) is annotated as having no solution.


The authors find, roughly speaking, that the domain arrangements they identify fall into a bimodal distribution. The vast majority are those which are seen on only one genome, of which over 80% are multi-domain arrangements. There are also a sizeable number of arrangements seen on every single genome, the vast majority of which are made up of a single domain. I do wonder though, how much of this signal is due to the relative difficulty of identifying and assigning multiple different domains compared to just a single domain. While it seems unlikely that this would explain the entirety of this observation (on average, 75% of proteins per genome were assigned) it would be interesting to have seen how the authors address this possible bias.

Interestingly, the authors also find a slight inflation in fusion events over fission events across the tree (around 1 more per million years), although there are more fusion events nearer the root of the tree, with fission dominating nearer the leaves, and in particular, on the dense Drosophila subtree.

Finally, the authors performed a functional term enrichment analysis on the domain arrangements gained by fusion and fission events and showed that, in both cases, terms relating to signalling were significantly overrepresented in these populations, emphasising the potential importance that modular evolution may play in this area.