Crystallographic programming: Super short tour of the cctbx

Two of the leading packages in crystallography are Phenix and CCP4. For most practicing crystallographers they will interact via with these to progress a single crystallographic data-set from diffraction images, through integration, merging, phasing, model building and hopefully deposition.

However, if you want to develop crystallographic software, you will likely need to decide on a framework to build upon. Phenix is built on the comprehensive cctbx library, whereas CCP4 programs are typically standlone, although common crystallographic libraries such as clipper and cctbx are utilised.

CCTBX is written mainly in python, with core crystallographic functionality written in C++. My usual starting place for understanding functionality is through the pdb parser tutorial. This introduces the concept of a hierarchy, a iterative way to represent a macromolecule:

from iotbx.pdb import hierarchy
pdb_in = hierarchy.input(file_name="model.pdb")
for chain in pdb_in.hierarchy.only_model().chains() :
  for residue_group in chain.residue_groups() :
    for atom_group in residue_group.atom_groups() :
      for atom in atom_group.atoms() :
        if (atom.element.strip().upper() == "ZN") :
      if (atom_group.atoms_size() == 0) :
    if (residue_group.atom_groups_size() == 0) :
f = open("model_Zn_free.pdb", "w")

Although there are many ways to parse a pdb file, the introduction to iotbx.pdb, gives a view of how xray structure data can be associated to the model. The tour of the cctbx can be helpful starting place, especially for understanding how the python and c++ functionality interact through boost and the scitbx.array_family.flex. Unfortunately, documentation on cctbx tends to vary in quality and quantity throughout the modules:

Other components of the library include ways to simulate crystallographic data through simtbx,  and tools for processing xfel data.

As the library is open source, github hosted source code allows exploration of previously written routines, which can be very helpful for understanding the inner workings of the library. Note that there are also bulletin boards for users and developers of phenix and cctbx respectively. A few tutorials can also be found.

Hopefully this post will give someone other than me a reminder of where to find resources to get started developing within CCTBX.

Paper review: “Inside the black box”

There are nearly 17,000 Oxford students on taught courses. They turn up reliably every October. We send them to an army of lecturers and tutors, drawn from every rank of the research hierarchy. As members of that hierarchy, we owe it to the students – all 17,000 of them – to teach them as best we can.

And where can we learn the most about how to teach? There are 438,000 professional teachers in the UK. Maybe people who spend all of their working time on the subject might have good strategies to help people learn.

The context of the paper

Teachers obsess over assessment. Assessment is the process by which teachers figure out what students have learned. It is probably true that assessment is the only reason we have classrooms at all.

Inside the Black Box is of the vanguard of recent changes in educational thinking. Modern teaching regards good pedagogy as a practical skill. Like other types of performance, it depends on a specific set of concrete actions which can be taught and learned. Not everyone is a natural teacher – but nearly everyone can become a competent teacher.

Formative assessment is the focus of Inside the Black Box. The article argues that this process, in which teachers figure what students know and tell them how it’s going wrong, is essential to good classroom practice.

What is the black box?

The black box is the classroom. After societal convulsions over class sizes, funding deficits, curriculum reforms, and examination structure, it’s time – says the article, in 2001 – that we focus on what actually goes on inside the classroom. These social changes, it says, adjust the inputs to the black box, and society expects better things out of the black box. But what if changing the inputs makes the work inside the black box harder? Don’t we have an obligation to figure out what needs to happen to get students to learn?

The article touches three questions:

  • Is there evidence that improving formative assessment raises standards?
  • Is there evidence that there is room for improvement?
  • Is there evidence about how to improve formative assessment?

The answers are yes, yes, and yes. In meta-analyses of educational experiments, formative assessment consistently raises standards. These experiments match the experience of teachers, who know that the least effective lessons are those which do not respond to students’ needs. Standard observations – such as those from Ofsted – ask teachers to answer what are they learning, and then how do you know, and then what are you doing about it?

The second question – is there room for improvement? – is one they address in great detail in the context of primary and secondary education. Some criticisms (the giving of grades for its own sake, unintentional encouragement of “rote or superficial learning”, relentless competition between students) seem applicable in different parts of our university context. A greater weakness is a lack of emphasis. People engaged in university teaching frequently center the delivery of knowledge instead of learning, an idea exacerbated by our obsession with lectures and masked by the long lag between those lectures and the exams in which we assess them.


Inside the Black Box makes specific recommendations for instructors about how to engage in formative assessment. Those recommendations – unusually, for an item in the educational literature – are specific and detailed. But rather than focus on them, it is worth examining three themes which run across the article.

The overriding focus is the importance of formative assessment. If we care about what students learn, then we’ve got to be checking what it is that they actually are learning. Opportunities for formative assessment should be “designed into any piece of teaching”. In extremis, this idea has interesting implications for the institution of lectures, which generally lack them entirely.

A subsidiary idea is the importance of setting clear objectives for learning. Too many students view learning as a series of exercises rather than a step in the formation of a coherent body of knowledge. The overarching direction should be made clear. And on a more detailed level, we need to be explicit about what outcomes we want our students to obtain so that they know whether they are making satisfactory progress. Formative assessment must make reference to expectations, and formative self- or peer assessment becomes impossible if those expectations are not well-understood.

And this discussion ties into a final point: when students truly apply themselves to the task of learning, their self-perception and self-esteem becomes bound up in it. Ineffective expectation-setting and insufficient clarity about the means for improvement result in students feeling demotivated, which causes them to revise their goals downward. They put in less effort and achieve outcomes that are worse. These effects are costly and can be avoided by effective formative assessment.

Inside the Black Box is a diversion from our diet of scientific articles, but I think it is worth our attention. Pedagogy is difficult to get right. In the university context, good practice is the subject of little attention and rarely assessed. Thinking about good asssessment means that our students benefit.

But all communication activities are a form of teaching. Really good teachers communicate really well. When good communication happens, everyone benefits, inside and outside the black box.

Journal Club: Large-scale structure prediction by improved contact predictions and model quality assessment.

With the advent of statistical techniques to infer protein contacts from multiple sequence alignments (which you can read more about here), accurate protein structure prediction in the absence of a template has become possible. Taking advantage of this fact, there have been efforts to brave the sea of protein families for which no structure is known (about 8,500 – over 50% of known protein families) in an attempt to predict their topology. This is particularly exciting given that protein structure prediction has been an open problem in biology for over 50 years and, for the first time, the community is able to perform large-scale predictions and have confidence that at least some of those predictions are correct.

Based on these trends, last group meeting I presented a paper entitled “Large-scale structure prediction by improved contact predictions and model quality assessment”. This paper is the culmination of years of work, making use of a large number of computational tools developed by the Elofsson Lab at Stockholm University. With this blog post, I hope to offer some insights as to the innovative findings reported in their paper.

Let me begin by describing their structure prediction pipeline, PconsFold2. Their method for large-scale structure prediction can be broken down into three components: contact prediction, model generation and model quality assessment. As the very name of their article suggests, most of the innovation of the paper stems from improvements in contact prediction and the quality assessment protocols used, whereas for their model generation routine, they opted to sacrifice some quality in favour of speed. I will try and dissect each of these components over the next paragraphs.

Contact prediction relates to the process in which residues that share spatial proximity in a protein’s structure are inferred from multiple sequence alignments by co-evolution. I will not go into the details of how these protocols work, as they have been previously discussed in more detail here and here. The contact predictor used in PconsFold2 is PconsC3, which is another product of the Elofsson Lab. There was some weirdness with the referencing of PconsC3 on the PconsFold2 article, but after a quick google search, I was able to retrieve the article describing PconsC3 and it was worth a read. Other than showcasing PconsC3’s state-of-the-art contact prediction capabilities, the original PconsC3 paper also provides figures for the number of protein families for which accurate contact prediction is possible (over 5,000 of the ~8,500 protein families in Pfam without a member of known structure). I found the PconsC3 article feels like a prequel to the paper I presented. The bottom line here is that PconsC3 is a reliable tool for predicting contacts from multiple sequence alignments and is a sensible choice for the PconsFold2 pipeline.

Another aspect of contact prediction that the authors explore is the idea that the precision of contact prediction is dependent on the quality of the underlying multiple sequence alignment (MSA). They provide a comparison of the Positive Predicted Value (PPV) of PconsC3 using different MSAs on a test set of 626 protein domains from Pfam. To my knowledge, this is the first time I have encountered such a comparison and it serves to highlight the importance the MSA has on the quality of resulting contact predictions. In the PconsFold2 pipeline, the authors use consensus approach; they identify the consensus of four predicted contact maps each using a different alignment. Alignments were generated using Jackhmmer and HHBlits at E-Value cutoffs of 1 and 10^-4.

Now, moving on to the model generation routine. PconsFold2 makes use of CONFOLD to perform model generation. CONFOLD, in turn, uses the simulated annealing routine of the Crystallographic and NMR System (CNS) to produce models based on spatial and geometric constraints. To derive those constraints, predicted secondary structure and the top 2.5 L predicted contacts are given as input. The authors do note that the refinement stage of CONFOLD is omitted, which is a convenience I assume was adopted to save computational time. The article also acknowledges that models generated by CONFOLD are likely to be less accurate than the ones produced by Rosetta, yet a compromise was made in order to make the large-scale comparison feasible in terms of resources.

One particular issue that we often discuss when performing structure prediction is the number of models that should be produced for a particular target. The authors performed a test to assess how many decoys should be produced and, albeit simplistic in their formulation, their results suggest that 50 models per target should be sufficient. Increasing this number further did not lead to improvements in the average quality of the best models produced for their test set of 626 proteins.

After producing 50 models using CONFOLD, the final step in the PconsFold2 protocol is to select the best possible model from this ensemble. Here, they present a novel method, PcombC, for ranking models. PcombC combines the clustering-based method Pcons, the single-model deep learning method ProQ3D, and the proportion of predicted contacts that are present in the model. These three scores are combined linearly, and are given weights that were optimised via a parameter sweep. One of my reservations relating to this paper is that little detail is given regarding the data set that was used to perform this training. It is unclear from their methods section if the parameter sweep was trained on the test set with 626 proteins used throughout the manuscript. Given that no other data set (with known structures) is ever introduced, this scenario seems likely. Therefore, all the classification results obtained by PcombC, and all of the reported TM-score Top results should be interpreted with care since performance on validation set tends to be poorer than on a training set.

Recapitulating the PconsFold2 pipeline:

  • Step 1: generate four multiple sequence alignments using HHBlits and Jackhmmer.
  • Step 2: generate four predicted contact maps using PconsC3.
  • Step 3: Use CONFOLD to produce 50 models using a consensus of the contact maps from step 2.
  • Step 4: Use PCombC to rank the models based on a linear combination of the Pcons and ProQ3D scores and the proportion of predicted contacts that are present in the model.

So, how well does PconsFold2 perform? The conclusion is that it depends on the quality of the contact predictions. For the protein families where abundant sequence information is available, PconsFold2 produces a correct model (TM-Score > 0.5) for 51% of the cases. This is great news. First, because we know which cases have abundant sequence information beforehand. Second, because this comprises a large number of protein families of unknown structure. As the number of effective sequence (a common way to assess the amount of information available on an MSA) decreases, the proportion of families for which a correct model has been generated also decreases, which restricts the applicability of their method to protein families with abundant sequence information. Nonetheless, given that protein sequence databases are growing exponentially, it is possible that over the next years, the number of cases where protein structure prediction achieves success is likely to increase.

One interesting detail that I was curious about was the length distribution of the cases where modelling was successful. Can we detect the cases for which good models were produced simply by looking at a combination of length and number of effective sequences? The authors never address this question, and I think it would provide some nice insights as to which protein features are correlated to modelling success.

We are still left with one final problem to solve: how do we separate the cases for which we have a correct model from the ones where modelling has failed? This is what the authors address with the last two subsections of their Results. In the first of these sections, the authors compare four ways of ranking decoys: PcombC, Pcons, ProQ3D, and the CNS contact score. They report that, for the test set of 626 proteins, PcombC obtains the highest Pearson’s Correlation Coefficient (PCC) between the predicted and observed TM-Score of the highest ranking models. As mentioned before, this measure could be overestimated if PcombC was, indeed, trained on this test set. Reported PCCs are as follows: PcombC = 0.79, Pcons = 0.73, ProQ3D = 0.67, and CNS-contact = -0.56.

In their final analysis, the authors compare the ability of each of the different Quality Assessment (QA) scores to discern between correct and incorrect models. To do this, they only consider the top-ranked model for each target according to different QA scores. They vary the false positive rate and note the number of true positives they are able to recall. At a 10% false positive rate, PcombC is able to recall about 50% of the correct models produced for the test set. This is another piece of good news. Bottomline is: if we have sufficient sequence information available, PconsFold2 can generate a correct model 51% of the time. Furthermore, it can detect 50% of these cases, meaning that for ~25% of the cases it produced something good and it knows the model is good. This opens the door for looking at these protein families with no known structure and trying to accurately predict their topology.

That is exactly what the authors did! On the most interesting section of the paper (in my opinion), the authors predict the topology of 114 protein families (at FPR of 1%) and 558 protein families (at FPR of 10%). Furthermore, the authors compare the overlap of their results with the ones reported by a similar study from the Baker group (previously presented at group meeting here) and find that, at least for some cases, the predictions agree. These large-scale efforts  force us to revisit the way we see template-free structure prediction, which can no longer be dismissed as a viable way of obtaining structural models when sufficient sequences are available. This is a remarkable achievement for the protein structure prediction community, with the potential to change the way we conduct structural biology research.

Latexing with gvim

Here I’ll share my set-up for writing Latex with gvim instead of a separate Latex editor. If you are text-editor averse, this blog post is not for you. But if, like me, you love vim and hate useless GUIs, this might be helpful.

We’re lucky to have nice big screens in the Stats Department, but I tend to prefer writing on my MacBook (I find it’s easier to transport to e.g. a cafe, my home, etc). Until now, I’ve been happily using TexMaker for writing, but during a recent period of intense Latexing I started to find the useable screen space oppressively small. The unnecessary GUI had to go.   

No offence TexMaker but I don’t like you

One of our good friends in Statistical Genetics recommended some things to help me with the transition to just using good old (g)vim, which I will now recommend to you.

The key thing is the LaTex-Box plug-in for vim, which gives you the compilation commands, as well as the essentials such as smart indentation, highlight matching, command completion, etc. I used pathogen to install it (see the GitHub for instructions).

Of course, you can then customise your .vimrc file to add more helpful things. This can be the simple preferences, such as using a light background when using gvim:


if has(“gui_running”)

        set background=light


You can also do more complicated magic like tabbing through available commands, and the ability to minimise sections, etc. Sidenote: to make working with paragraphs easier, I recommend setting the up/down arrows to move the cursor to the next line in the GUI rather than the next actual line. I prefer overriding this behaviour only in gvim, while leaving the normal behaviour in vim (for actual coding). But each to their own.

To get started, open a .tex file, then compile and view the document with the command Latexmk.

Command suggestions are an example of a magical feature added in .vimrc

The configurations for this command are set in the file .latexmkrc. Mine looks like this:


$recorder = 1;
$pdf_mode = 1;
$bibtex_use = 2;
$pdflatex = "pdflatex --shell-escape %O %S";
$pdf_previewer = "start open -a skim %O %S";

My pdf viewer of choice on Mac is Skim, which autoupdates. I view the source and preview at the same time using split view. Please admire the beauty below:

Wow what a beautiful screen

My favourite part is that whenever you save (w), it recompiles and updates the preview. As someone who accidentally types :w everywhere that isn’t vim, it’s nice that this is now productive. It also recompiles automatically if the .bib file is updated. Note that if you have errors at compilation (I’m sure you don’t), you can view them with the command LatexErrors.

Now you too can be a (nearly) GUI-free lightweight Latexer. Enjoy!


Journal Club post: Interface between Ig-seq and antibody modelling

Hi everyone! In this blog post, I would like to review a couple of relatively recent papers about antibody modelling and immunoglobulin gene repertoire NGS, also known as Ig-seq. Previously I used to work as a phage display scientist and I initially struggled to understand all new terminology about computational modelling when I joined Charlotte’s group last January. Hence, the paramount aim of my blog post is to decipher commonly used jargon in the computational world into less complicated text.

The three-dimensional structure of an antibody dictates its function. Antibody sequences obtained from Ig-seq cannot be directly translated into antibody folding, aggregation and function. Several ways exist to interrogate antibody structure, including X-ray crystallography and NMR spectroscopy, expression, and computational modelling. These methods vary in throughput as well as precision. Here, I will concentrate my attention on computational modelling. First of all, the most commonly confused term is a decoy. In antibody structure prediction, a decoy is a modelled antibody structure that can be ranked and selected by a tool as the closest to the native antibody structure. A number of antibody modelling tools exist, each employing a different methodology and a number of generated decoys. Good reviews on the antibody structure prediction are here (1,2). I will try to draw a very gross summary about how all these unique modelling tools work. To do so, I assume that people are familiar with antibody sequence/structure relationship – if not please check (3). Antibody framework region are sequence invariant, hence their structure can be deduced from sequence identity with high confidence. PDB (4) act as the source of structures for antibody modelling. Canonical CDRs (all CDRs except for CDR-H3) can be put into a limited number of structures. Several definitions of canonical classes exist (5,6), but, in essence, the canonical CDR must contain residues that define a particular class. Next, antibody orientation is calculated or copied from PDB. CDR-H3 modelling is very challenging and different approaches have been devised (7–9). The structure space of CDR-H3 is very vast (10) and hence, this loop cannot be put into a canonical class. Once CDR-H3 is modelled, the resultant decoy is checked for clashes (like impossible orientation of side chains).

Here, I would like to mention several examples on how antibody modelling can help to accelerate drug discovery. Dekosky et al. (11) mapped two Ig-seq datasets to antibody structures to interrogate how an antibody paratope changes in response to antigenic stimulation. The knowledge of paired full length VH-VL is crucial for the best antibody structure prediction. In this study they employed paired chain Ig-seq (12). However, this technique cannot sequence full length VH/VL, hence the V gene sequence had to be approximated. Computational paratope identification was employed to examine paratope convergences. There were several drawbacks of this paper: only 2,000 models (~1% of Ig-seq data) were modelled in 570,000 CPU time, and antibody sequences with longer than 16 aa long CDR-H3 were not included into analysis. The generation of a reliable configuration of long CDR-H3 is considered a hard task at this moment. Recently, Laffy et al. (13) investigated antibody promiscuity by mapping sequence to structure and validating the results with ELISA. The cohort of 10 antibodies, all with long CDR-H3 >= 15 aa were interrogated. They used a homology modelling tool to devise CDR-H3 structures. However, the availability of the appropriate structural template can be questioned, since CDR-H3 loops deposited in the PDB are predominantly shorter due to crystallographic constraints. As mentioned before, the paired VH/VL data is crucial for structure determination. Here, they used Dekosky et al. (11) data to devise the pairing. The approach can be streamlined once more paired data become available.

In conclusion, antibody modelling enables researchers to circumvent the cost and time associated with experimental approaches of antibody characterizations. The field of antibody modelling still needs improvements for faster and better structure prediction to achieve tasks such as modelling the entirety of Ig-seq data or long CDR-H3 loops. Currently, the fastest tool of antibody modelling is ABodyBuilder (8). It generates a model in 30 sec and its version is available online ( The availability of more structural information as well as algorithm improvements will facilitate more confident antibody modelling.


  1. Kuroda D, Shirai H, Jacobson MP, Nakamura H. Computer-aided antibody design. Protein Eng Des Sel (2012) 25:507–521. doi:10.1093/protein/gzs024
  2. Krawczyk K, Dunbar J, Deane CM. “Computational Tools for Aiding Rational Antibody Design,” in Methods in molecular biology (Clifton, N.J.), 399–416. doi:10.1007/978-1-4939-6637-0_21
  3. Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotech (2014) 32:158–168. doi:10.1038/nbt.2782
  4. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res (2007) 35: doi:10.1093/nar/gkl971
  5. Nowak J, Baker T, Georges G, Kelm S, Klostermann S, Shi J, Sridharan S, Deane CM. Length-independent structural similarities enrich the antibody CDR canonical class model. MAbs (2016) 8:751–760. doi:10.1080/19420862.2016.1158370
  6. North B, Lehmann A, Dunbrack RL. A new clustering of antibody CDR loop conformations. J Mol Biol (2011) 406:228–256. doi:10.1016/j.jmb.2010.10.030
  7. Weitzner BD, Jeliazkov JR, Lyskov S, Marze N, Kuroda D, Frick R, Adolf-Bryfogle J, Biswas N, Dunbrack RL, Gray JJ. Modeling and docking of antibody structures with Rosetta. Nat Protoc (2017) 12:401–416. doi:10.1038/nprot.2016.180
  8. Leem J, Dunbar J, Georges G, Shi J, Deane CM. ABodyBuilder: Automated antibody structure prediction with data–driven accuracy estimation. MAbs (2016) 8:1259–1268. doi:10.1080/19420862.2016.1205773
  9. Marks C, Nowak J, Klostermann S, Georges G, Dunbar J, Shi J, Kelm S, Deane CM. Sphinx: merging knowledge-based and ab initio approaches to improve protein loop prediction. Bioinformatics (2017) 33:1346–1353. doi:10.1093/bioinformatics/btw823
  10. Regep C, Georges G, Shi J, Popovic B, Deane CM. The H3 loop of antibodies shows unique structural characteristics. Proteins Struct Funct Bioinforma (2017) 85:1311–1318. doi:10.1002/prot.25291
  11. DeKosky BJ, Lungu OI, Park D, Johnson EL, Charab W, Chrysostomou C, Kuroda D, Ellington AD, Ippolito GC, Gray JJ, et al. Large-scale sequence and structural comparisons of human naive and antigen-experienced antibody repertoires. Proc Natl Acad Sci U S A (2016)1525510113-. doi:10.1073/pnas.1525510113
  12. Dekosky BJ, Kojima T, Rodin A, Charab W, Ippolito GC, Ellington AD, Georgiou G. In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire. Nat Med (2014) 21:1–8. doi:10.1038/nm.3743
  13. Laffy JMJ, Dodev T, Macpherson JA, Townsend C, Lu HC, Dunn-Walters D, Fraternali F. Promiscuous antibodies characterised by their physico-chemical properties: From sequence to structure and back. Prog Biophys Mol Biol (2016) doi:10.1016/j.pbiomolbio.2016.09.002

Journal club: Human enterovirus 71 protein interaction network prompts antiviral drug repositioning

Viruses are small infectious agents, which possess genetic code but have no independent metabolism. They propagate by infecting host cells and hijacking their machinery, often killing the cells in the process. One of the key challenges in developing effective antiviral therapies is the high mutation rate observed in viral genomes. A way to circumvent this issue is to target host proteins involved in virion assembly (also known as essential host factors, or EHFs), rather than the virion itself.

In their recent paper, Lu Han et al. [1] consider human virus protein-protein interactions in order to explore possible host drug targets, as well as drugs which could potentially be re-purposed as antivirals. Their study focuses on enterovirus 71 (EV71), one of the leading causes of hand, foot, and mouth disease.

Human virus protein-protein interactions and target identification

EHFs are typically detected by knocking out genes in the host organism and determining which of the knockouts result in virus control. Low repeat rates and high costs make this technique unsuitable for large scale studies. Instead, the authors use an extensive yeast two-hybrid screen to identify 37 unique protein-protein interactions between 7 of the 11 virus proteins and 29 human proteins. Pathway enrichment suggests that the human proteins interacting with EV71 are involved in a wide range of functions in the cell. Despite this range in functionality, as many as 17 are also associated with other viruses, either through known physical interactions, or as EHFs (Fig 1).

Fig. 1. Interactions between viral and human proteins (denoted as EIPs), and their connection to different viruses.

One of these is ATP6V0C, a subunit of vacuole ATP-ase. It interacts with the EV71 3A protein, and is a known essential host factor for five other viruses. The authors analyse the interaction further, and show that downregulating ATP6V0C gene expression inhibits EV71 propagation, while overexpressing it enhances virus propagation. Moreover, treating cells with bafilomycin A1, a selective inhibitor for vacuole ATP-ase, inhibits EV71 infection in a dose-dependent manner. The paper suggests that therefore ATP6V0C may be a suitable drug target, not only against EV71, but also perhaps even for a broad-spectrum antiviral. While this is encouraging, bafilomycin A1 is a toxic antibiotic used in research, but not suitable for human or drug use. Rather than exploring other compounds targeting ATP6V0C, the paper shifts focus to re-purposing known drugs as antivirals.

Drug prediction using CMap

A potential antiviral will ideally disturb most or all interactions between host cell and virion. One way to do this would be to inhibit the proteins known to interact with EV71. In order to check whether any known compounds already do so, the authors apply gene set enrichment analysis (GSEA) to data from the connectivity map (CMap). CMap is a database of gene expression profiles representing cellular response to a set of 1309 different compounds.  Enrichment analysis of the database reveals 27 potential EV71 drugs, of which the authors focus on the top ranking result, tanespimycin.

Tanespimycin is an orphan cancer drug, originally designed to target tumor cells by inhibiting HSP90. Its complex effects on the cell, however, may make it an effective antiviral. Following their CMap analysis, the authors show that tanespimycin reduces viral count and virus-induced cytopathic effects in a dose-dependent manner, without evidence of cytotoxicity.

Overall, the paper presents two different ways to think about target investigation and drug choice in antiviral therapeutics — by integrating different types of known host virus protein-protein interactions, and by analysing cell response to known compounds. While extensive further study is needed to determine whether the results are directly clinically relevant to the treatment of EV71, the paper shows how  interaction data analysis can be employed in drug discovery.


[1] Han, Lu, et al. “Human enterovirus 71 protein interaction network prompts antiviral drug repositioning.” Scientific Reports 7 (2017).


Drawing Networks in LaTeX with tikz-network

While researching on protein interaction networks it is often important to illustrate networks. For this many different tools are available, for example, Python’s NetworkX and Matlab, that allow the export of figures as pixelated images or vector graphics. Usually, these figures are then incorporated in the papers, which are commonly written in LaTeX. In this post, I want to present `tikz-network’, which is a novel tool to code and illustrate networks directly in LaTeX.

To create an illustration you define the network’s nodes with their positions and edges between these nodes. An example of a simple network is

   \Vertex[color = blue]{A}

The illustrations can be much more complex and allow dashed lines, opacity, and many other features. Importantly, the properties do not need to be specified in the LaTeX file itself but can also be saved in an external file and imported with the  \Vertices{data/vertices.csv}command. This allows the representation of more complex networks, for example the multilayer network below is created from the two files, the first representing the nodes

id, x, y ,size, color,opacity,label,layer 
A, 0, 0, .4 , green, .9 , a , 1
B, 1, .7, .6 , , .5 , b , 1
C, 2, 1, .8 ,orange, .3 , c , 1
D, 2, 0, .5 , red, .7 , d , 2
E,.2,1.5, .5 , gray, , e , 1
F,.1, .5, .7 , blue, .3 , f , 2
G, 2, 1, .4 , cyan, .7 , g , 2
H, 1, 1, .4 ,yellow, .7 , h , 2

and the second having the edge information:

u,v,label,lw,color ,opacity,bend,Direct
A,B, ab  ,.5,red   ,   1   ,  30,false
B,C, bc  ,.7,blue  ,   1   , -60,false
A,E, ae  , 1,green ,   1   ,  45,true
C,E, ce  , 2,orange,   1   ,   0,false
A,A, aa  ,.3,black ,  .5   ,  75,false
C,G, cg  , 1,blue  ,  .5   ,   0,false
E,H, eh  , 1,gray  ,  .5   ,   0,false
F,A, fa  ,.7,red   ,  .7   ,   0,true
D,F, df  ,.7,cyan  ,   1   ,   30,true
F,H, fh  ,.7,purple,   1   ,   60,false
D,G, dg  ,.7,blue  ,  .7   ,   60,false

For details, please see the extensive manual on the GitHub page of the project. It is a very new project and I only started using it but I like it so far for a couple of reasons:

  • it is easy to use, especially for small example graphs
  • the multilayer functionality is very convenient
  • included texts are automatically in the correct size and font with the rest of the LaTeX document
  • it can be combined with regular tikz commands to create more complex illustrations

Comparing naive and immunised antibody repertoire

Hi! This is my first post on Blopig as I joined OPIG in July 2017 for my second rotation project and DPhil.

During immune reactions to foreign molecules known as antigens, surface receptors of activated B-cells undergo somatic hypermutation to attain its high binding affinity and specificity to the target antigen. To discover how somatic hypermutation occurs to adapt the antibody from its germline conformation, we can compare the naive and antigen-experienced antibody repertoires. In this paper, the authors developed a protocol to carry out such comparison, detected, synthesised, expressed and validated the observed antibody genes against their target antigen.

What they have done:

  1. Mice immunisation: Naive (no antigens), CGG (a large protein), NP-CGG (hapten attached to a large protein).
  2. Sequencing: Total RNA was extracted from each spleen, cDNA was synthesised according to standard procedures, and amplified with the universal 5’-RACE primer (as oppose to the degenerate 5’-Vh primers) and the 3’-CH1 primer to distinguish between immunoglobulin-classes (IgG1, IgG2c and IgM). High throughput pyrosequencing was then used to recover the heavy chain sequences only.
  3. VDJ recombination analysis: V, D and J segments were assigned and the frequency of the VDJ combinations were plotted in a 3D graph.
  4. Commonality of the VDJ combination: For each VDJ combination, the “commonality” was counted from the average occurrence if n mice have the combination: if n=1, it’s the average occurrence if any 1 mouse has the combination; if n=5, the combination must be observed in all mice to generate a degree of commonality – otherwise it’s 0.
    • The effect of increasing n on commonality scores in IgG1 class: As we tighten the requirement for the commonality calculation, it becomes clear that IGHV9-3 is likely to target the CGG carrier, while IGHV1-72 is against the NP hapten.
    • IGHV9-3 can accommodate a wider range of D gene when targeting CGG alone. IGHV1-72 only uses IGHD1-1.
  5. Clustering V gene usage: Sequences were aligned to the longest sequence in the set (of VDJ combination), and the pairwise distance between sequences in the set were used to cluster the sequences using the UPGMA method.
    • A number of sequences were commonly found in different individuals. Among these sequences, one was randomly selected to proceed to the next step.
  6. Synthesis and validation of the detected antibody against the NP hapten: by comparing the antibody repertoires against the CGG and NP-CGG, the gene of the antibody against NP can be recovered. The authors in this paper chose to pair 3 different light chains to the chosen heavy chain, and assess the binding of the 3 antibodies.
    • NP-CGG bind well to both IGHV1-72 and IGHV9-3 antibodies; NP-BSA to IGHV1-72 only; and CGG to IGHV9-3 only.
    • The binding capabilities are affected by the light chain in the pair.

Key takeaway:

This work presented a metric of defining the “commonality” between individuals’ antibody repertoire and validated the identified antibody against its target antigen by combining with different light chains.

Proteins evolve on the edge of supramolecular self-assembly

Inspired by Eoin’s interesting talks on prions and prion diseases, and Nick’s discussion of how Cyro-Electron microscopy is going to be the end of an era for Crystallography. I thought I’d look at a paper that discusses aggregation of protein complexes, with some cryo-electron microscopy thrown in for good measure.

Supramolecular assembbly

a, A molecule gaining a single self-interacting patch forms a finite dimer. A self-interacting patch repeated on opposite sides of a symmetric molecule can result in infinite assembly. b, A point mutation in a dihedral octamer creates a new self-interacting patch (red), triggering assembly into a fibre.

Supramolecular assemblies are folded protein complexes forming into much larger units. This formation can be triggered by a mutation on a copy of the constituent homomers of the complex, acting as a self-interacting patch. If this patch were to form in a non-symmetric complex, it would likely form a finite assemble with a limited number of copies of the complex. However, if the complex has dihedral symmetry such that a patch is accessible at multiple separated locations, then complex can potentially form near infinite supramolecular assemblies. Continue reading

Slowing the progress of prion diseases

At present, the jury is still out on how prion diseases affect the body let alone how to cure them. We don’t know if amyloid plaques cause neurodegeneration or if they’re the result of it. Due to highly variable glycophosphatidylinositol (GPI) anchors, we don’t know the structure of prions. Due to their incredible resistance to proteolysis, we don’t know a simple way to destroy prions even using in an autoclave. The current recommendation[0] by the World Health Organisation includes the not so subtle: “Immerse in a pan containing 1N sodium hydroxide and heat in a gravity displacement autoclave at 121°C”.

There are several species including Water Buffalo, Horses and Dogs which are immune to prion diseases. Until relatively recently it was thought that rabbits were immune too. “Despite rabbits no longer being able to be classified as resistant to TSEs, an outbreak of ‘mad rabbit disease’ is unlikely”.[1] That being said, other than the addition of some salt bridges and additional H-bonds, we don’t know if that’s why some animals are immune.

We do know at least two species of lichen (P. sulcata and L. plumonaria) have not only discovered a way to naturally break down prions, but they’ve evolved two completely independent pathways to do so. How they accomplish this? We’re still not sure in fact, it was only last year that it was discovered that lichens may be composed of three symbiotic partnerships and not two as previously thought.[3]

With all this uncertainty, one thing is known: PrPSc, the pathogenic form of the Prion converts PrPC, the cellular form. Just preventing the production of PrPC may not be a good idea, mainly because we don’t know what it’s there for in the first place. Previous studies using PrP-knockout have shown hints that:

  • Hematopoietic stem cells express PrP on their cell membrane. PrP-null stem cells exhibit increased sensitivity to cell depletion. [4]
  • In mice, cleavage of PrP proteins in peripheral nerves causes the activation of myelin repair in Schwann Cells. Lack of PrP proteins caused demyelination in those cells. [5]
  • Mice lacking genes for PrP show altered long-term potentiation in the hippocampus. [6]
  • Prions have been indicated to play an important role in cell-cell adhesion and intracellular signalling.[7]

However, an alternative approach which bypasses most of the unknowns above is if it were possible to make off with the substrate which PrPSc uses, the progress of the disease might be slowed. A study by R Diaz-Espinoza et al. was able to show that by infecting animals with a self-replicating non-pathogenic prion disease it was possible to slow the fatal 263K scrapie agent. From their paper [8], “results show that a prophylactic inoculation of prion-infected animals with an anti-prion delays the onset of the disease and in some animals completely prevents the development of clinical symptoms and brain damage.”

[4] “Prion protein is expressed on long-term repopulating hematopoietic stem cells and is important for their self-renewal”. PNAS. 103 (7): 2184–9. doi:10.1073/pnas.0510577103
[5] Abbott A (2010-01-24). “Healthy prions protect nerves”. Nature. doi:10.1038/news.2010.29
[6] Maglio LE, Perez MF, Martins VR, Brentani RR, Ramirez OA (Nov 2004). “Hippocampal synaptic plasticity in mice devoid of cellular prion protein”. Brain Research. Molecular Brain Research. 131 (1-2): 58–64. doi:10.1016/j.molbrainres.2004.08.004
[7] Málaga-Trillo E, Solis GP, et al. (Mar 2009). Weissmann C, ed. “Regulation of embryonic cell adhesion by the prion protein”. PLoS Biology. 7 (3): e55. doi:10.1371/journal.pbio.1000055