# New toys for OPIG

OPIG recently acquired 55 additional computers, all of the same make and model; they are of a decent specification (for 2015), each with a quad-core i5 processor and 8GB of RAM, but what to do with them? Cluster computing time!

Along with a couple of support servers, this provides us with 228 computation cores, 440GB of RAM and >40TB of storage. Whilst this would be a tremendous specification for a single computer, parallel computing on a cluster is a significantly different beast.

This kind of architecture and parallelism really lends itself to certain classes of problems, especially those that have:

• Independent data
• Parameter sweeps
• Multiple runs with different random seeds
• Dirty great data sets
• Work that can be snipped up and requires little inter-processor communication

With a single processor and a single core, a computer looks like this:

These days, when multiple processor cores are integrated onto a single die, the cores are normally independent but share a last-level cache and both can access the same memory. This gives a layout similar to the following:

Add more cores or more processors to a single computer and you start to tessellate the above. Each pair of cores has access to its own shared cache and its own memory, and can also access the memory attached to any other cores. However, accessing memory physically attached to other cores comes at the cost of increased latency.

Cluster computing, on the other hand, rarely exhibits this flat memory architecture, as no node can directly access another node’s memory. Instead we use the Message Passing Interface (MPI) to pass messages between nodes. Though it takes a little time to wrap your head around working this way, effectively every processor simultaneously runs the exact same piece of code, the sole difference being the “Rank” of the execution core. A simple example of MPI is getting every core to greet us with the traditional “Hello World” and tell us its rank. A single execution with mpirun simultaneously executes the code on multiple cores:

    $ mpirun -n 4 ./helloworld_mpi
    Hello, world, from processor 3 of 4
    Hello, world, from processor 1 of 4
    Hello, world, from processor 2 of 4
    Hello, world, from processor 0 of 4

Note that the responses aren’t in order; some cores may have been busy (for example handling the operating system) so couldn’t run their code immediately.

Another simple example of this would be a sort. We could, for example, tell every processor to take several million values, find the smallest value and pass that number in a message to whichever core has “Rank 0”. The core at Rank 0 will then sort that much smaller set of values.

Below is the kind of speedup which was achieved by simply splitting the same problem over 4 physically independent computers of the cluster.

As not everyone in the group will have the time or inclination to MPI-ify their code, there is also HTCondor. HTCondor is a workload management system for compute-intensive jobs which allows jobs to be queued, scheduled, assigned priorities and distributed from a single head node to processing nodes, with the results copied back on demand.

The server OPIG provides the job distribution system, whilst SkyOctopus provides shared storage on every computation node. Should the required package currently not be available on all of the computation nodes, SkyOctopus can reach down and remotely modify the software installations on all of the lesser computation nodes.

# Loop Model Selection

As I have talked about in previous blog posts (here and here, if interested!), the majority of my research so far has focussed on improving our ability to generate loop decoys, with a particular focus on the H3 loop of antibodies. The loop modelling software that I have been developing, Sphinx, is a hybrid of two other methods – FREAD, a knowledge-based method, and our own ab initio method. By using this hybrid approach we are able to produce a decoy set that is enriched with near-native structures.
However, while the ability to produce accurate loop conformations is a major advantage, it is by no means the full story – how do we know which of our candidate loop models to choose? In order to choose which model is the best, a method is required that scores each decoy, thereby producing a ranked list with the conformation predicted to be best at the top. There are two main approaches to this problem – physics-based force fields and statistical potentials.

Force fields are functions used to calculate the potential energy of a structure. They normally include terms for bonded interactions, such as bond lengths, bond angles and dihedral angles; and non-bonded interactions, such as electrostatics and van der Waals forces. In principle, they can be very accurate; however, they have certain drawbacks. Since some terms have a very steep dependency on interatomic distance (in particular the non-bonded terms), very slight conformational differences can have a huge effect on the score. A loop conformation that is very close to the native could therefore be ranked poorly. In addition, solvation terms have to be used – this is especially important in loop modelling applications since loop regions are generally found on the surface of proteins, where they are exposed to solvent molecules.

The alternatives to physics-based force fields are statistical potentials. In this case, a score is obtained by comparing the model structure (i.e. its interatomic distances and contacts) to experimentally-derived structures. As a very simple example, if the distance between the backbone N and Cα of a residue in a loop model is 2Å, but this distance has not been observed in known structures, we can assume that a distance of 2Å is energetically unfavourable, and so we can tell that this model is unlikely to be close to the native structure. Advantages of statistical potentials over force fields are their relative ‘smoothness’ (i.e. small variations in structure do not affect the score as much), and the fact that not all interactions need to be completely understood – if examples of these interactions have been observed before, they will automatically be taken into account.

I have tested several statistical potentials (including calRW, DFIRE, DOPE and SoapLoop) by using them to rank the loop decoys generated by our hybrid method, Sphinx. Unfortunately, none of them were consistently able to choose the best decoy out of the set. The average RMSD (across 70 general loop targets) of the top-ranked decoy ranged between 2.7Å and 4.74Å for the different methods – the average RMSD of the actual best decoy was much lower at 1.32Å.

Other researchers have also found loop ranking challenging – for example, in the latest Antibody Modelling Assessment (AMA-II), ranking was seen as an area for significant improvement. In fact, model selection is seen as such an issue that protein structure prediction competitions like AMA-II and CASP allow the participants to submit more than one model. Loop model selection is therefore an unsolved problem, which must be investigated further to enable reliable predictions to be made.

# Network Pharmacology

The dominant paradigm in drug discovery has been one of finding small molecules (or more recently, biologics) that bind selectively to one target of therapeutic interest. This reductionist approach conveniently ignores the fact that many drugs do, in fact, bind to multiple targets. Indeed, systems biology is uncovering an unsettling picture for comfortable reductionists: the so-called ‘magic bullet’ of Paul Ehrlich, a single compound that binds to a single target, may be less effective than a compound with multiple targets. This new approach, network pharmacology, offers new ways to improve drug efficacy, to rescue orphan drugs, re-purpose existing drugs, predict targets, and predict side-effects.

Building on work Stuart Armstrong and I did at InhibOx, a spinout from the University of Oxford’s Chemistry Department, and inspired by the work of Shoichet and colleagues (Keiser et al., 2007), Álvaro Cortes-Cabrera and I took our ElectroShape method, designed for ultra-fast ligand-based virtual screening (Armstrong et al., 2010 & 2011), and built a new way of exploring the relationships between drug targets (Cortes-Cabrera et al., 2013). Ligand-based virtual screening is predicated on the molecular similarity principle: similar chemical compounds have similar properties (see, e.g., Johnson & Maggiora, 1990). ElectroShape built on the earlier pioneering USR (Ultra-fast Shape Recognition) work of Pedro Ballester and Prof. W. Graham Richards at Oxford (Ballester & Richards, 2007).

Our new approach addressed two inherent limitations of the network pharmacology approaches available at the time:

• Chemical similarity is calculated on the basis of the chemical topology of the small molecule; and
• Structural information about the macromolecular target is neglected.

Our method addressed these issues by taking into account 3D information from both the ligand and the target.

The approach involved comparing the similarity of each set of ligands known to bind to a protein to the equivalent sets of ligands of all other known drug targets in DrugBank. DrugBank is a tremendous “bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information” (Wishart et al., 2006). This analysis generated a network of related proteins, connected by the similarity of the sets of ligands known to bind to them.

We looked at two different kinds of ligand similarity metric: the inverse Manhattan distance between our ElectroShape descriptors, and, for comparison, the similarity of 2D Morgan fingerprints, calculated using the wonderful open source cheminformatics toolkit, RDKit, from Greg Landrum (Landrum, 2011).
Morgan fingerprints use connectivity information similar to that used for the well known ECFP family of fingerprints, which had been used in the SEA method of Keiser et al. We also looked at the problem from the receptor side, comparing the active sites of the proteins. These complementary approaches produced networks that shared a minimal fraction (0.36% to 6.80%) of nodes: while the direct comparison of target ligand-binding sites could give valuable information in order to achieve some kind of target specificity, ligand-based networks may contribute information about unexpected interactions for side-effect prediction and polypharmacological profile optimization.

Our new target-fishing approach was able to predict drug adverse effects, build polypharmacology profiles, and relate targets from two complementary viewpoints: ligand-based and target-based networks. We used the DUD and WOMBAT benchmark sets for on-target validation, and the results were directly comparable to those obtained using other state-of-the-art target-fishing approaches. Off-target validation was performed using a limited set of non-annotated secondary targets for already known drugs. Comparison of the predicted adverse effects with data contained in the SIDER 2 database showed good specificity and reasonable selectivity. All of these features were implemented in a user-friendly web interface that: (i) can be queried for both polypharmacology profiles and adverse effects, (ii) links to related targets in ChEMBLdb in the three networks (2D, 4D ligand and 3D receptor), and (iii) displays the 2D structure of already annotated drugs.

## References

Armstrong, M. S., G. M. Morris, P. W. Finn, R. Sharma, L. Moretti, R. I. Cooper and W. G. Richards (2010). “ElectroShape: fast molecular similarity calculations incorporating shape, chirality and electrostatics.” J Comput Aided Mol Des, 24(9): 789-801. 10.1007/s10822-010-9374-0.

Armstrong, M. S., P. W. Finn, G. M. Morris and W. G. Richards (2011). “Improving the accuracy of ultrafast ligand-based screening: incorporating lipophilicity into ElectroShape as an extra dimension.” J Comput Aided Mol Des, 25(8): 785-790. 10.1007/s10822-011-9463-8.

Cortes-Cabrera, A., G. M. Morris, P. W. Finn, A. Morreale and F. Gago (2013). “Comparison of ultra-fast 2D and 3D ligand and target descriptors for side effect prediction and network analysis in polypharmacology.” Br J Pharmacol, 170(3): 557-567. 10.1111/bph.12294.

Johnson, M. A. and G. M. Maggiora (1990). “Concepts and Applications of Molecular Similarity.” New York: John Wiley & Sons.

Keiser, M. J., B. L. Roth, B. N. Armbruster, P. Ernsberger, J. J. Irwin and B. K. Shoichet (2007). “Relating protein pharmacology by ligand chemistry.” Nat Biotechnol, 25(2): 197-206. 10.1038/nbt1284.

Landrum, G. (2011). “RDKit: Open-source cheminformatics.” from http://www.rdkit.org.

Wishart, D. S., C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang and J. Woolsey (2006). “DrugBank: a comprehensive resource for in silico drug discovery and exploration.” Nucleic Acids Res, 34(Database issue): D668-672. 10.1093/nar/gkj067.

# Co-translational insertion and folding of membrane proteins

The alpha-helical bundle is the most common type of fold for membrane proteins. Their diverse functions include transport, signalling, and catalysis. While structure determination is much more difficult for membrane proteins than it is for soluble proteins, it is accelerating and there are now 586 unique proteins in the database of Membrane Proteins of Known 3D Structure. However, we still have quite a poor understanding of how membrane proteins fold. There is increasing evidence that it is more complicated than the two-stage model proposed in 1990 by Popot and Engelman.

The machinery that inserts most alpha-helical membrane proteins is the Sec apparatus. In prokaryotes, it is located in the plasma membrane, while eukaryotic Sec is found in the ER.
Sec itself is an alpha-helical bundle in the shape of a pore, and its structure is able both to allow peptides to pass fully across the membrane, and also to open laterally to insert transmembrane helices into the membrane. In both cases, this occurs co-translationally, with translation halted by the signal recognition particle until the ribosome is associated with the Sec complex.

If helices are inserted during the process of translation, does folding only begin after translation is finished? On what timescale are these folding processes occurring? There is evidence that a hairpin of two transmembrane helices forms on a timescale of milliseconds in vitro. Are helices already interacting during translation to form components of the native structure? It has also been suggested that helices may insert into the membrane in pairs, via the Sec apparatus.

There are still many aspects of the insertion process which are not fully understood, and even the topology of an alpha-helical membrane protein can be affected by the last part of the protein to be translated. I am starting to investigate some of these questions by using computational tools to learn more about the membrane proteins whose structures have already been solved.

# Network Hubs

Sometimes real networks contain a few nodes that are connected to a large portion of the nodes in the network. These nodes, often called ‘hubs’ (or global hubs), can change global properties of the network drastically; for example, the length of the shortest path between two nodes can be significantly reduced by their presence.

The presence of hubs in real networks is easy to observe. For example, in flight networks, airports such as Heathrow (UK) or Beijing Capital International Airport (China) have a very large number of incoming and outgoing flights in comparison to other airports in the world. Now, if in addition to the network there is a partition of the nodes into different groups, ‘local hubs’ can appear.
For example, assume that the political division is a partition of the nodes (airports) into different countries. Then, some capital city airports can be local hubs as they have incoming and outgoing flights to most other airports in that same country. Note that a local hub might not be a global hub.

There are several ways to classify nodes based on different network properties. Take, for example, hub nodes and non-hub nodes. One way to classify nodes as hub or non-hub uses the participation coefficient and the standardised within-module degree (Guimerà & Amaral, 2005).

Consider a partition of the nodes into $latex N_M$ groups. Let $latex k_i$ be the degree of node $latex i$ and $latex k_{is}$ the number of links or edges from node $latex i$ to nodes in group $latex s$. Then, the participation coefficient of node $latex i$ is:

$latex P_i = 1 - \sum_{s=1}^{N_M} k_{is}^2 / k_i^2$

Note that if node $latex i$ is connected only to nodes within its own group, then its participation coefficient is 0. If instead its links are uniformly distributed across all groups, then its participation coefficient is close to 1 (Guimerà & Amaral, 2005).

Now, the standardised within-module degree is:

$latex z_i = (k_{i s_i} - \bar{k}_{s_i}) / \sigma_{k_{s_i}}$

where $latex s_i$ is the group node $latex i$ belongs to, $latex k_{i s_i}$ is the number of links from node $latex i$ to other nodes in its own group, $latex \bar{k}_{s_i}$ is the average of this within-group degree over all nodes in $latex s_i$, and $latex \sigma_{k_{s_i}}$ is its standard deviation in that group.
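
To make the two statistics concrete, here is a minimal Python sketch that computes them for an undirected network given as an edge list and a node-to-group mapping. The function name `participation_and_z`, the toy airport-style network, and the choice of the population standard deviation (and of returning 0 when a group has no spread) are assumptions made for illustration, not details taken from Guimerà & Amaral.

```python
from collections import Counter
from statistics import mean, pstdev


def participation_and_z(edges, group):
    """Return ({node: P_i}, {node: z_i}) for an undirected edge list.

    `group` maps every node to the label of the group it belongs to.
    """
    neighbours = {node: set() for node in group}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    participation, within = {}, {}
    for node, nbrs in neighbours.items():
        per_group = Counter(group[n] for n in nbrs)  # k_is for every group s
        k_i = len(nbrs)                              # total degree k_i
        within[node] = per_group[group[node]]        # links inside own group
        if k_i:
            participation[node] = 1.0 - sum(k * k for k in per_group.values()) / k_i ** 2
        else:
            participation[node] = 0.0                # isolated node (assumed convention)

    z = {}
    for node in group:
        peers = [within[n] for n in group if group[n] == group[node]]
        spread = pstdev(peers)                       # population SD (an assumption)
        z[node] = (within[node] - mean(peers)) / spread if spread > 0 else 0.0
    return participation, z


# Hypothetical toy network: two "countries", A and B, joined by one link.
toy_group = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
toy_edges = [("a1", "a2"), ("a1", "a3"), ("a1", "b1"), ("b1", "b2"), ("b1", "b3")]
P, z = participation_and_z(toy_edges, toy_group)
print(P["a1"], z["a1"])
```

In this toy network `a1` and `b1` play the role of local hubs: they have the highest within-group degree in their group and are the only nodes with a non-zero participation coefficient.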

Guimerà & Amaral (2005) proposed a classification of the nodes of the network based on their values of the two statistics above. In particular, they proposed a heuristic classification of the nodes, depicted in the following plot:

Image taken from the paper “Functional cartography of complex metabolic networks” by Guimerà and Amaral, 2005.

Guimerà and Amaral (2005) named regions R1-R4 as non-hub regions and R5-R7 as hub regions. Nodes belonging to R1 are labelled as ultra-peripheral nodes, R2 as peripheral nodes, R3 as non-hub connector nodes, R4 as non-hub kinless nodes, R5 as provincial hubs, R6 as connector hubs and R7 as kinless hubs. For more details on this categorisation please see Guimerà and Amaral (2005).
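
As a rough sketch of how the seven regions could be assigned in code, the function below maps a node's within-module degree z and participation coefficient P to a region label. The numerical boundaries (z ≥ 2.5 for hubs, and the P cut-offs) are not given in this post; they are the values I recall from Guimerà & Amaral (2005) and should be checked against the paper before being relied upon. The function name `cartography_role` is hypothetical.

```python
def cartography_role(z, p, hub_z=2.5):
    """Assign a node to one of the regions R1-R7 from its z and P values.

    The thresholds are assumptions recalled from Guimerà & Amaral (2005),
    not values stated in this post.
    """
    if z < hub_z:  # non-hub regions R1-R4
        if p <= 0.05:
            return "R1 ultra-peripheral"
        if p <= 0.62:
            return "R2 peripheral"
        if p <= 0.80:
            return "R3 non-hub connector"
        return "R4 non-hub kinless"
    # hub regions R5-R7
    if p <= 0.30:
        return "R5 provincial hub"
    if p <= 0.75:
        return "R6 connector hub"
    return "R7 kinless hub"


print(cartography_role(3.0, 0.5))
```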

The previous regions give an intuitive classification of network nodes according to their connectivity under a given partition of the nodes. In particular, they give an easy way to differentiate hub nodes from non-hub nodes. However, the classification of the nodes into these seven regions (R1-R7) depends on the initial partition of the nodes.

1. R. Guimerà, L.A.N. Amaral, Functional cartography of complex metabolic networks, Nature 433 (2005) 895–900

# Next generation sequencing of paired heavy and light chain sequences

At the last meeting before Christmas I covered the article by DeKosky et al. describing a new methodology, developed by the authors, for sequencing the paired VH-VL repertoire.

In recent years there has been an exponential growth in the number of available antibody sequences, caused mainly by the development of cheap and high-throughput Next Generation Sequencing (NGS) technologies. This trend led to the creation of several publicly available antibody sequence databases such as the DIGIT database and the abYsis database, containing hundreds of thousands of unpaired light chain and heavy chain sequences from over 100 species. Nevertheless, the sequencing of the paired VH-VL repertoire remained a challenge, with the available techniques suffering from low throughput (<700 cells) and high cost. In contrast, the method developed by DeKosky et al. allows for relatively cheap paired sequencing of most of the 10^6 B cells contained within a typical 10-ml blood draw.

The workflow is as follows: first, the isolated cells, dissolved in water, and magnetic poly(dT) beads mixed with cell lysis buffer are pushed through a narrow opening into a rapidly moving annular oil phase, resulting in a thin jet that coalesces into droplets, in such a way that each droplet has a very low chance of having a cell inside it. This ensures that the vast majority of droplets that do contain cells contain only one cell each. Next, cell lysis occurs within the droplets and the mRNA fragments coding for the antibody chains attach to the poly(dT) beads. Following that, the mRNA fragments are recovered and linkage PCR is used to generate 850 bp cDNA fragments for NGS.
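
The reasoning behind the dilute loading can be illustrated with a short calculation. The sketch below assumes that the number of cells per droplet follows a Poisson distribution, which is a standard modelling assumption for this kind of encapsulation and is not stated in this post; the mean occupancy values used are hypothetical and are not taken from DeKosky et al.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k cells in a droplet with mean occupancy lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def single_cell_fraction(lam):
    """Of the droplets holding at least one cell, the fraction holding exactly one."""
    p_empty = poisson_pmf(0, lam)
    p_one = poisson_pmf(1, lam)
    return p_one / (1.0 - p_empty)


# Hypothetical mean occupancies: the lower the loading, the purer the occupied droplets.
for lam in (1.0, 0.2, 0.05):
    print(f"mean cells per droplet {lam}: single-cell fraction {single_cell_fraction(lam):.3f}")
```

With a mean of 0.05 cells per droplet, roughly 97.5% of the occupied droplets hold a single cell, at the price of most droplets being empty. That trade-off is what keeps heavy and light chains from different cells from being linked together.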

To analyse the accuracy of their methodology, the authors sequenced paired CDR-H3 – CDR-L3 sequences from blood samples obtained from three different human donors, filtering the results by 96% clustering and read quality, and removing sequences with fewer than two reads. Overall, this resulted in ~200,000 paired CDR-H3 – CDR-L3 sequences. The authors found that the pairing accuracy of their methodology was ~98%.

The article also contained some bioinformatics analysis of the data. The authors first analysed CDR-L3 sequences that tend to pair up with many diverse CDR-H3 sequences, and asked whether such “promiscuous” CDR-L3s are also “public”, i.e. promiscuous and common in all three donors. Their results show that out of the 50 most common promiscuous CDR-L3s, 49 are also public. The results also show that the promiscuous CDR-L3s show little to no modification, being very close to the germline sequence.

Illustration of the sequencing pipeline

The sequencing data also contained examples of allelic inclusion, where one B cell expresses two B cell receptors (almost always one VH gene and two distinct VL genes). It was found that about 0.5% of all analysed B cells showed allelic inclusion.

Finally, the authors looked at the occurrence of traits commonly associated with broadly neutralizing antibodies (bNAbs), produced to fight rapidly mutating pathogens (such as the influenza virus). These traits were short (<6 aa) CDR-L3s and long (11 – 18 aa) CDR-H3s. In total, the authors found 31 sequences with these features, suggesting that bNAbs can be found in the repertoire of healthy donors.

Overall, this article presents a very interesting and promising method that should allow for large-scale sequencing of paired VH-VL sequences.

# We can model everything, right…?

### First, happy new year to all our Blopig fans, and we all hope 2016 will be awesome!

A couple of months ago, I was covering this article by Shalom Rackovsky. The big question that jumps out of the paper is, has modelling reached its limits? Or, in other words, can bioinformatics techniques be used to model every protein? The author argues that protein structures have an inherent level of variability that cannot be fully captured by computational methods; thus, he raises some scepticism on what modelling can achieve. This isn’t entirely news; competitions such as CASP show that there’s still lots to work on in this field. This article takes a very interesting spin when Rackovsky uses a theoretical basis to justify his claim.

For a pair of proteins P and Q, Rackovsky defines their relationship depending on their sequence and structural identity. If P and Q share a high level of sequence identity but have little structural resemblance, P and Q are considered to be a conformational switch. Conversely, if P and Q share a low level of sequence identity but have high structural resemblance, they are considered to be remote homologues.

Case of a conformational switch – two DNAPs with 100% sequence identity but 5.3Å RMSD.

Haemoglobins are ‘remote homologues’ – despite 19% sequence identity, these two proteins have 1.9Å RMSD.

From here on comes the complex maths. Rackovsky’s work here (and in prior papers, for example) assumes that there are periodicities in the properties of proteins, and thus applies Fourier transforms to compare protein sequences and structures.

In the case of comparing protein sequences, instead of treating sequences as strings of letters, protein sequences are characterised by an N × 10 matrix, where N is the number of amino acids in protein P (or Q) and each amino acid has 10 biophysical properties. The matrix then undergoes Fourier transformation (FT), and the resulting sine and cosine coefficients for proteins P and Q are used to calculate the Euclidean distance between them.
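
A rough Python sketch of this idea is given below. It is not Rackovsky's implementation: the post does not say which 10 properties are used or how proteins of different lengths are made comparable, so the sketch assumes that each property column is transformed separately, that coefficients are divided by the sequence length, and that only the `n_freq` lowest frequencies are kept. The function names and the toy two-property matrices are hypothetical.

```python
import cmath
import math


def fourier_coefficients(profile, n_freq):
    """Length-normalised DFT of one property profile at the n_freq lowest
    frequencies, flattened into cosine (real) and sine (imaginary) parts."""
    n = len(profile)
    coeffs = []
    for k in range(n_freq):
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(profile)) / n
        coeffs.extend((c.real, c.imag))
    return coeffs


def sequence_distance(matrix_p, matrix_q, n_freq=4):
    """Euclidean distance between the Fourier coefficients of two proteins.

    Each matrix has one row per residue and one column per property.
    """
    def features(matrix):
        columns = zip(*matrix)  # one profile per biophysical property
        return [v for col in columns
                for v in fourier_coefficients(list(col), n_freq)]

    return math.dist(features(matrix_p), features(matrix_q))


# Hypothetical toy proteins with 2 properties per residue instead of 10.
protein_p = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
protein_q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
print(sequence_distance(protein_p, protein_q))
```

Because the comparison happens in coefficient space, no alignment between the two sequences is needed, which is the property the paper relies on.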

When comparing structures, proteins are initially truncated into length-L fragments, and the dihedral angles, bond lengths and bond angles for each fragment are collected into a matrix. The distribution of matrices allows us to project proteins onto a pre-parameterised principal components space. The Euclidean distance between the newly-projected proteins is then used to quantify protein structural similarity.

For both sequence and structure, the distances are normalised and centred around (0, 0) by calculating the average distance between P and its M nearest neighbours, and then adjusting by the global average. Effectively, if a protein has an average structural distance, it will tend toward (0, 0).
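
One way to read this normalisation is sketched below. It assumes that “adjusted by the global average” means subtracting the mean of the per-protein neighbour averages, which is my interpretation and not something the post spells out; the function name and the three-protein distance matrix are hypothetical.

```python
from statistics import mean


def centred_neighbour_distance(dist, m):
    """For each protein, average its distance to its m nearest neighbours,
    then subtract the global mean of those averages (assumed adjustment).

    `dist` is a square, symmetric distance matrix given as a list of lists.
    """
    local = []
    for i, row in enumerate(dist):
        others = sorted(d for j, d in enumerate(row) if j != i)
        local.append(mean(others[:m]))
    overall = mean(local)
    return [value - overall for value in local]


# Hypothetical distances between three proteins.
toy_dist = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
print(centred_neighbour_distance(toy_dist, m=1))
```

Applying the same function to the sequence distances and to the structure distances gives each protein one coordinate per axis, with a protein that sits at a typical distance from its neighbours landing near 0 on that axis.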

The author uses a dataset of 12000 proteins from the CATH set to generate the following diagram; the Y-axis represents sequence similarity and the X-axis represents structural similarity. Since these axes are scaled to the mean, the closer a protein is to 0, the closer it is to the global average sequence or structure distance.

The four quadrants: along the diagonal is a typical linear relationship (greater sequence identity = more structural similarity). The lower-right quadrant represents proteins with LOW sequence similarity yet HIGH structural similarity. In the upper-left quadrant, proteins have LOW structural similarity but HIGH sequence similarity.

Rackovsky argues that, while remote homologues and conformational switches seem like rare phenomena, they account for approximately 50% of his dataset. Although he does account for the high density of proteins around (0, 0), the paper does not clearly address the meaning of these new metrics. In other words, the author does not translate these values to something we’re more familiar with (e.g. RMSD for structural distance, and percentage sequence identity for sequence distance). Although the whole idea is that his method is supposed to be alignment-free, it’s still difficult to relate it to what we already use as the gold standard in traditional protein structure prediction problems. Also, note that the structure distance spans between -0.1 and 0.1 units whereas the sequence distance spans between -0.3 and 0.5. The differences in scale are also not covered – i.e., is a difference of 0.01 units an expected value for protein structure distance, and why are the jumps in protein structure distance so much smaller than jumps in sequence space?

The author makes more interesting observations in the dataset (e.g. α/β mixed proteins are more tolerant to mutations in comparison to α- or β-only proteins) but the observations are not discussed in depth. If α/β-mixed proteins are indeed more resilient to mutations, why is this the case? Conversely, if small mutations change α- or β-only proteins’ structures to make new folds, any speculation on the underlying mechanism (e.g. maybe α-only proteins are only sensitive to radically different amino acid substitutions, such as ALA->ARG) would only help our prediction methods. Overall I had the impression that the author was a bit too pessimistic about what modelling can achieve. Though we definitely cannot model all proteins that are out there at present, I believe the surge of new sources of data (e.g. cryo-EM structures) will provide an alternative inference route for better prediction methods in the future.