Network Hubs

Sometimes real networks contain a few nodes that are connected to a large portion of the other nodes in the network. These nodes, often called ‘hubs’ (or global hubs), can drastically change global properties of the network; for example, the length of the shortest path between two nodes can be significantly reduced by their presence.

The presence of hubs in real networks is easy to observe. In flight networks, for example, airports such as Heathrow (UK) or Beijing Capital International Airport (China) have a very large number of incoming and outgoing flights compared to most other airports in the world. If, in addition to the network, there is a partition of the nodes into different groups, ‘local hubs’ can appear. For example, assume that the political division of the world provides a partition of the nodes (airports) into countries. Then some capital-city airports are local hubs, as they have incoming and outgoing flights to most other airports in the same country. Note that a local hub is not necessarily a global hub.

There are several ways to classify nodes based on different network properties; take, for example, the distinction between hub and non-hub nodes. One way to classify nodes as hubs or non-hubs uses the participation coefficient and the standardised within-module degree (Guimerà & Amaral, 2005).

Consider a partition of the nodes into N_M groups. Let k_i be the degree of node i and k_{is} the number of links (edges) from node i to nodes in group s. Then the participation coefficient of node i is:

P_i = 1 - \sum_{s=1}^{N_M} k_{is}^2 / k_i^2 .

Note that if node i is connected only to nodes within its own group, then its participation coefficient is 0. If instead its links are uniformly distributed across all groups, the participation coefficient is close to 1 (Guimerà & Amaral, 2005).

Now, the standardised within-module degree is:

z_i = (k_{i s_i} - \bar{k}_{s_i}) / \sigma_{k_{s_i}},

where s_i is the group node i belongs to, k_{i s_i} is the number of links from node i to other nodes in its own group, and \bar{k}_{s_i} and \sigma_{k_{s_i}} are the mean and standard deviation of the within-group degree over all nodes in group s_i.
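
As an illustration, here is a minimal sketch of how both statistics could be computed with networkx for a toy graph and a node-to-group partition; the graph, the partition and the function name are made up for this example, and only the two formulas above are taken from the text.

```python
import networkx as nx
import numpy as np

def hub_statistics(G, partition):
    """Participation coefficient P_i and within-module z-score z_i for every
    node of G, given a dict mapping node -> group label."""
    groups = set(partition.values())
    P, Z = {}, {}

    # within-group degree: links from each node to nodes in its own group
    k_within = {n: sum(1 for nb in G[n] if partition[nb] == partition[n]) for n in G}

    for n in G:
        k_i = G.degree(n)
        # P_i = 1 - sum_s (k_is / k_i)^2
        k_is = {s: 0 for s in groups}
        for nb in G[n]:
            k_is[partition[nb]] += 1
        P[n] = (1.0 - sum((k / k_i) ** 2 for k in k_is.values())) if k_i > 0 else 0.0

        # z_i = (k_{i s_i} - mean) / std, statistics taken over nodes in the same group
        vals = np.array([k_within[m] for m in G if partition[m] == partition[n]], dtype=float)
        sigma = vals.std()
        Z[n] = (k_within[n] - vals.mean()) / sigma if sigma > 0 else 0.0

    return P, Z

# toy usage: two triangles joined by a single edge, partitioned into two groups
G = nx.Graph([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)])
partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
P, Z = hub_statistics(G, partition)
print(P[2], Z[2])  # node 2 has one out-of-group link, so P[2] > 0
```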

Guimerà & Amaral (2005) proposed a classification of the nodes of the network based on their values of these two statistics. In particular, they proposed a heuristic classification of the nodes depicted by the following plot.

Image taken from the paper "Functional cartography of complex metabolic networks" by Guimerà and Amaral, 2005.

Guimerà and Amaral (2005) named regions R1–R4 non-hub regions and R5–R7 hub regions. Nodes belonging to R1 are labelled ultra-peripheral nodes, R2 peripheral nodes, R3 non-hub connector nodes, R4 non-hub kinless nodes, R5 provincial hubs, R6 connector hubs and R7 kinless hubs. For more details on this categorisation please see Guimerà and Amaral (2005).

These regions give an intuitive classification of network nodes according to their connectivity under a given partition. In particular, they provide an easy way to differentiate hub nodes from non-hub nodes. However, the classification of the nodes into these seven regions (R1–R7) depends on the initial partition of the nodes.

  1. R. Guimerà, L.A.N. Amaral, Functional cartography of complex metabolic networks, Nature 433 (2005) 895–900

Next generation sequencing of paired heavy and light chain sequences

At the last meeting before Christmas I covered the article by DeKosky et al. describing a new methodology, developed by the authors, for sequencing the paired VH–VL repertoire.

In recent years there has been an exponential growth in the number of available antibody sequences, driven mainly by the development of cheap, high-throughput Next Generation Sequencing (NGS) technologies. This trend led to the creation of several publicly available antibody sequence databases, such as the DIGIT database and the abYsis database, containing hundreds of thousands of unpaired light-chain and heavy-chain sequences from over 100 species. Nevertheless, sequencing of the paired VH–VL repertoire has remained a challenge, with the available techniques suffering from low throughput (<700 cells) and high cost. In contrast, the method developed by DeKosky et al. allows relatively cheap paired sequencing of most of the 10^6 B cells contained within a typical 10-ml blood draw.

The workflow is as follows: first, the isolated cells, suspended in water, and magnetic poly(dT) beads mixed with cell lysis buffer are pushed through a narrow opening into a rapidly moving annular oil phase, producing a thin jet that coalesces into droplets, each with a very low chance of containing a cell. This ensures that the vast majority of droplets that do contain cells contain only one cell each. Next, cell lysis occurs within the droplets and the mRNA fragments coding for the antibody chains attach to the poly(dT) beads. Following that, the mRNA fragments are recovered and linkage PCR is used to generate 850 bp cDNA fragments for NGS.

To analyse the accuracy of their methodology the authors sequenced paired CDR-H3–CDR-L3 sequences from blood samples obtained from three different human donors, filtering the results by 96% clustering and read quality, and removing sequences with fewer than two reads. Overall, this resulted in ~200,000 paired CDR-H3–CDR-L3 sequences. The authors found that the pairing accuracy of their methodology was ~98%.

The article also contained some bioinformatics analysis of the data. The authors first looked at CDR-L3 sequences that tend to pair with many diverse CDR-H3 sequences, and asked whether such “promiscuous” CDR-L3s are also “public”, i.e. promiscuous and common in all three donors. Their results show that, of the 50 most common promiscuous CDR-L3s, 49 are also public. The results also show that the promiscuous CDR-L3s carry little to no modification, being very close to the germline sequence.

Illustration of the sequencing pipeline

The sequencing data also contained examples of allelic inclusion, where one B cell expresses two B-cell receptors (almost always one VH gene and two distinct VL genes). About 0.5% of all analysed B cells showed allelic inclusion.

Finally, the authors looked at the occurrence of traits commonly associated with broadly neutralizing antibodies (bNAbs), which are produced to fight rapidly mutating pathogens (such as the influenza virus). These traits were short (<6 aa) CDR-L3s and long (11–18 aa) CDR-H3s. In total, the authors found 31 sequences with these features, suggesting that bNAbs can be found in the repertoire of healthy donors.

Overall, this article presents a very interesting and promising method that should allow large-scale sequencing of paired VH–VL sequences.

We can model everything, right…?

First, happy new year to all our Blopig fans, and we all hope 2016 will be awesome!

A couple of months ago, I covered this article by Shalom Rackovsky. The big question that jumps out of the paper is: has modelling reached its limits? Or, in other words, can bioinformatics techniques be used to model every protein? The author argues that protein structures have an inherent level of variability that cannot be fully captured by computational methods; thus, he raises some scepticism about what modelling can achieve. This isn’t entirely news; competitions such as CASP show that there’s still a lot to work on in this field. The article takes a very interesting spin when Rackovsky uses a theoretical basis to justify his claim.

For a pair of proteins P and Q, Rackovsky defines their relationship depending on their sequence and structural identity. If P and Q share a high level of sequence identity but have little structural resemblance, P and Q are considered to be a conformational switch. Conversely, if P and Q share a low level of sequence identity but have high structural resemblance, they are considered to be remote homologues.

Case of a conformational switch – two DNAPs with 100% seq identity but 5.3A RMSD.

Haemoglobins are ‘remote homologues’ – despite 19% sequence identity, these two proteins have 1.9A RMSD.

From here on comes the complex maths. Rackovsky’s work here (and in prior papers) assumes that there are periodicities in the properties of proteins, and thus applies Fourier transforms to compare protein sequences and structures.

In the case of comparing protein sequences, instead of treating sequences as strings of letters, each protein sequence is characterised by an N × 10 matrix, where N is the number of amino acids in protein P (or Q) and each amino acid is described by 10 biophysical properties. The matrix then undergoes a Fourier transform (FT), and the resulting sine and cosine coefficients for proteins P and Q are used to calculate the Euclidean distance between them.
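
To make the idea concrete, here is a rough sketch (not Rackovsky’s actual implementation) of encoding a sequence as an N × 10 property matrix, Fourier-transforming each property column, and comparing two proteins by the Euclidean distance between their coefficient vectors. The property table, the zero-padding length and the function names are all assumptions for illustration.

```python
import numpy as np

def property_matrix(seq, prop_table):
    """Encode a sequence as an (N x n_props) matrix of biophysical properties."""
    return np.array([prop_table[aa] for aa in seq], dtype=float)

def fourier_features(seq, prop_table, n_fft=64):
    """Real FFT of each property column; zero-pad to a fixed length so that
    proteins of different length yield comparable coefficient vectors."""
    M = property_matrix(seq, prop_table)
    F = np.fft.rfft(M, n=n_fft, axis=0)          # complex Fourier coefficients
    return np.concatenate([F.real.ravel(), F.imag.ravel()])

def sequence_distance(seq_p, seq_q, prop_table):
    """Euclidean distance between the Fourier feature vectors of P and Q."""
    return np.linalg.norm(fourier_features(seq_p, prop_table)
                          - fourier_features(seq_q, prop_table))

# toy example with a made-up 2-property table (real work would use 10 properties)
prop_table = {"A": [1.8, 0.0], "G": [-0.4, 0.0], "K": [-3.9, 1.0], "D": [-3.5, -1.0]}
print(sequence_distance("AAGKD", "AGGKD", prop_table))
```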

When comparing structures, proteins are first divided into length-L fragments, and the dihedral angles, bond lengths and bond angles of each fragment are collected into a matrix. The distribution of these matrices allows proteins to be projected onto a pre-parameterised principal-component space, and the Euclidean distance between the projected proteins is then used to quantify structural similarity.
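
In the same spirit, a minimal sketch of the structural side might look as follows, assuming we already have per-residue geometric features (dihedral angles, bond lengths, bond angles) and fit a PCA basis on fragments from a reference set; the use of scikit-learn and all names and dimensions here are my own placeholders, not the paper’s.

```python
import numpy as np
from sklearn.decomposition import PCA

def fragment_features(geometry, L=5):
    """Slide a window of L residues over per-residue geometry
    (array of shape (n_residues, n_features)) and flatten each window."""
    return np.array([geometry[i:i + L].ravel()
                     for i in range(len(geometry) - L + 1)])

# fit a PCA basis on fragments pooled from a reference set of structures
reference_fragments = np.random.rand(1000, 5 * 3)   # placeholder reference data
pca = PCA(n_components=3).fit(reference_fragments)

def structure_descriptor(geometry, L=5):
    """Project a protein's fragments onto the PCA space and average them."""
    return pca.transform(fragment_features(geometry, L)).mean(axis=0)

def structure_distance(geom_p, geom_q):
    return np.linalg.norm(structure_descriptor(geom_p) - structure_descriptor(geom_q))

# toy usage with random per-residue (phi, psi, omega)-style features
geom_p, geom_q = np.random.rand(60, 3), np.random.rand(80, 3)
print(structure_distance(geom_p, geom_q))
```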

For both the sequence and structure distances, the values are normalised and centred around (0,0) by calculating the average distance between each protein and its M nearest neighbours, and then adjusting by the global average. Effectively, a protein with an average structural distance will tend towards (0,0).
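
A hedged sketch of what such a centring step might look like, given a precomputed pairwise distance matrix; the value of M and the exact normalisation are assumptions on my part.

```python
import numpy as np

def centred_scores(D, M=10):
    """For each protein, average the distances to its M nearest neighbours,
    then subtract the global mean of those averages so typical proteins sit near 0."""
    nearest = np.sort(D, axis=1)[:, 1:M + 1]   # drop the self-distance in column 0
    local_avg = nearest.mean(axis=1)
    return local_avg - local_avg.mean()

# toy usage with a random symmetric distance matrix
X = np.random.rand(50, 50)
D = (X + X.T) / 2
np.fill_diagonal(D, 0.0)
print(centred_scores(D)[:5])
```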

The author uses a dataset of 12,000 proteins from the CATH set to generate the following diagram: the y-axis represents sequence similarity and the x-axis structural similarity. Since these axes are scaled to the mean, being closer to 0 means being closer to the global average sequence or structure distance.

Rackovsky’s plot of sequence similarity (y-axis) against structural similarity (x-axis) for the CATH dataset.

The plot divides into four quadrants: along the diagonal is the typical linear relationship (greater sequence identity = more structural similarity). The lower-right quadrant contains proteins with LOW sequence similarity yet HIGH structural similarity, while in the upper-left quadrant proteins have LOW structural similarity but HIGH sequence similarity.

Rackovsky argues that, while remote homologues and conformational switches seem like rare phenomena, they account for approximately 50% of his dataset. Although he does account for the high density of proteins near (0,0), the paper does not clearly address the meaning of these new metrics. In other words, the author does not translate these values into something we’re more familiar with (e.g. RMSD for structural distance and % sequence identity for sequence distance). Although the whole idea is that his methods are alignment-free, it is still difficult to relate them to what we already use as the gold standard in traditional protein structure prediction problems. Also, note that the structure distance spans between -0.1 and 0.1 units whereas the sequence measure spans between -0.3 and 0.5. The differences in scale are also not covered: is a difference of 0.01 units an expected value for protein structure distance, and why are the jumps in protein structure distance so much smaller than the jumps in sequence space?

The author makes other interesting observations about the dataset (e.g. α/β mixed proteins are more tolerant of mutations than α-only or β-only proteins), but these are not discussed in depth. If α/β-mixed proteins are indeed more resilient to mutations, why is this the case? Conversely, if small mutations change α-only or β-only proteins’ structures enough to make new folds, then any speculation on the underlying mechanism (e.g. perhaps α-only proteins are only sensitive to radically different amino acid substitutions, such as ALA->ARG) would help our prediction methods. Overall I had the impression that the author was a bit too pessimistic about what modelling can achieve. Although we definitely cannot model all proteins out there at present, I believe the surge of new sources of data (e.g. cryo-EM structures) will provide an alternative inference route for better prediction methods in the future.

Isoform-resolved protein interaction networks and poker

Every time I talk about protein interaction networks, I put up the nice little figure below. The figure suggests that experiments are done to detect which proteins interact with each other, and lines are drawn between them to create a network. Sounds pretty easy, right? It does, and it probably should, because otherwise you wouldn’t listen past the words “Yeast Two-Hybrid”. However, sometimes it’s important to hear about how it’s not at all easy, how there are compromises made, and how those compromises impose some inherent limitations. This post is one of those things that it is important to listen to if you care about protein interaction networks (and if you don’t… they’re pretty cool, so don’t be a downer!).

Schematic image showing the assembly of a human protein interaction network with ~25 000 interactions

So what’s wrong with that nice figure? Just to deal with the obvious: the colour scheme isn’t great… there’s a red protein and the interactions are also red, and a yellow protein with red interactions isn’t that easy on the eyes either. But you’re not here to judge my choice of colours (I hope… otherwise you’d be better off criticising the graphs in this beast). You’re here to hear me rant about networks… much more fun ;). So here goes:

  1. Not all interactions come from the same experiments; they investigate different “types” of binding.
  2. Each experiment has experimental errors associated with it; not all of the reported interactions are correct.
  3. The network is not complete (estimates of the total number of interactions range between 150k and 600k, depending on which paper you read).
  4. People tend to investigate proteins that they know are associated with a disease, so more interactions are known for certain proteins and their neighbours, resulting in an “inspection bias”.
  5. That’s a stupid depiction of a network; it’s just a blue blob and you can’t see anything (hence termed a “ridiculogram” by Mark Newman).
  6. The arrow is wrong: you are not reporting interactions between proteins!

Points 1-5 should more or less make sense as they’re written. Point 6, however, sounds a little cryptic. They’re called PROTEIN interaction networks, so why would they not be interactions between proteins, you might ask… and you’d be right to ask, because it’s really quite annoying. The problem is one of isoforms. The relation of gene to protein is not 1-to-1. After a gene is transcribed into RNA, that piece of RNA is cut up and rearranged into mRNA, which is then translated into a protein. This rearranging process can occur in different ways, so a single gene may encode different proteins, or, as they are called, different protein isoforms. Testing for interactions between isoforms is difficult, so what tends to be done is that people test for interactions for THE ONE isoform of a protein to rule them all (the “reference isoform”) and then report these interactions as interactions for the gene. Sneaky! What you end up seeing are interactions mostly tested between reference isoforms (or whichever isoforms happened to be in the soup) and reported as interactions for the gene product.
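
To make the point concrete, here is a toy sketch (with entirely hypothetical gene and isoform names) of how isoform-level interaction data gets collapsed into a gene-level network, and what is silently discarded in the process.

```python
# Hypothetical isoform-level interactions: (gene, isoform) pairs that were tested
isoform_interactions = [
    (("GENE_A", "A-iso1"), ("GENE_B", "B-iso1")),   # reference isoforms interact
    (("GENE_A", "A-iso2"), ("GENE_C", "C-iso1")),   # only the minor isoform of A binds C
]

# Collapsing to a gene-level network, as most databases effectively do
gene_network = {tuple(sorted((a[0], b[0]))) for a, b in isoform_interactions}
print(gene_network)
# {('GENE_A', 'GENE_B'), ('GENE_A', 'GENE_C')}  (set order may vary)
# Both gene-level edges look equivalent, but the fact that A-iso1 does NOT bind C
# (only A-iso2 does) has been lost -- exactly the isoform information at stake.
```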

So how much does it matter if we don’t know isoform interaction information? Are there even different interaction partners for different isoforms? Do they have different functions? Well… yes, they can have different interactions and different functions. As to whether they matter… according to Corominas et al. the answer is also a resounding yes… or at least in Autism Spectrum Disorder (ASD) it is.

The paper is the result of a five-year study investigating isoform interactions and the effect of knowing them (versus not knowing them) on predicting candidate ASD genes of interest. And seeing as a bunch of people spent a lot of time on this stuff, it was definitely worth a read. Corominas et al. found that, in an ASD-related protein interaction network, there is a significant number of interactions that would not be found if only the reference isoform interactions were used. Furthermore, compared to a “high-quality” literature-curated network, the generated isoform-resolved ASD network added a lot of interactions. They then went on to show that these additional interactions played an important role in deciding which genes to prioritise as important “players in autism”.

Should these results make us renounce the use of currently available non-isoform-resolved protein interaction networks, lest we burn in the depths of bioinformatics hell? Well… probably not. While the paper is very interesting and shows the importance of isoforms, it does so in the context of autism only. The paper itself states that ASD is a brain-related disease, and the brain is an environment known for many isoforms. In many cases, it will likely be the case that the “dominant isoform” is just that… dominant. Moreover, the results may sound a little stronger than they are. The literature-curated network used for comparison, to argue that the isoform-resolved network is really important, was quoted as being “high-quality”. It is likely that many of the isoform interactions would be included in lower-quality networks; they have simply not been as well studied as the dominant isoforms. Thus, the isoform-resolved network would partly be confirming lower-quality interactions as high-quality ones. That being said, if you want to look at the specific mechanisms causing a phenotype, isoform information will likely be necessary. It really depends on what you want to achieve.

Let’s say you’re playing Texas Hold’em poker and you have two pairs. You’d like to have that full house, but it’s elusive and you’re stuck with the hand you have when your opponent bets high. That’s the situation we were in with protein interaction networks: you know you’re missing something, but you don’t know how bad it is that you’re missing it. This paper addresses part of that problem. We now know that your opponent could have the flush, but possibly only if you’re in Vegas. If you only want to play in a local casino, you’ll likely be fine.

Molecular Diversity and Drug Discovery

For my second short project I have developed Theox, a piece of molecular diversity software, to aid the selection of synthetically viable molecules from a subset of diverse molecules. The selection of molecules for synthesis is currently based on synthetic intuition. The software indicates whether the selection is an adequate representation of the initial dataset, or whether molecular diversity has been compromised. Theox plots the distribution of diversity indices for 10,000 randomly generated subsets of the same size as the chosen subset. The diversity index of the chosen subset can then be compared to this distribution, to determine whether the molecular diversity of the chosen subset is sufficient. The figure shows the distribution of Tanimoto diversity indices, with the diversity index of the chosen subset of molecules shown in green.
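
As an illustration of the general idea (this is not Theox’s actual code), one could estimate such a null distribution with RDKit by repeatedly drawing random subsets and computing a Tanimoto-based diversity index, here taken as one minus the mean pairwise Tanimoto similarity of Morgan fingerprints; the SMILES list, the chosen subset and all parameter choices below are placeholders.

```python
import random
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def diversity_index(fps):
    """1 - mean pairwise Tanimoto similarity: higher means more diverse."""
    sims = [DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
    return 1.0 - sum(sims) / len(sims)

# placeholder dataset and chosen subset
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "c1ccncc1", "CCCC", "CC(C)O", "CCOC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024) for m in mols]

chosen = [0, 1, 4]                      # indices of the subset picked by intuition
chosen_div = diversity_index([fps[i] for i in chosen])

# null distribution from random subsets of the same size
null = [diversity_index(random.sample(fps, len(chosen))) for _ in range(10000)]
print(f"chosen subset diversity: {chosen_div:.3f}")
print(f"fraction of random subsets more diverse: "
      f"{sum(d > chosen_div for d in null) / len(null):.3f}")
```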

Racing along transcripts: Correlating ribosome profiling and protein structure.

A long, long time ago, in a galaxy far away, I gave a presentation about the state of my research to the group (can you tell I’m excited for the new Star Wars!). Since then, little has changed due to my absence from Oxford, which means ((un)luckily) the state of the work is by and large the same. My work focusses on the effect that the translation speed of a given mRNA sequence can have on the eventual protein product, specifically through the phenomenon of cotranslational folding. I’ve discussed the evidence behind this in prior posts (see here and here), though I find the video below a good reminder of why we can’t always just go as fast as we like.

So, given that translation speed is important, how do we actually measure it? Traditional measures, such as tAI and CAI, infer it from codon bias within the genome or by comparing the counts of tRNA genes in a genome. However, while these have been shown to relate somewhat to speed, they are still purely theoretical constructions. An alternative is ribosome profiling, which I’ve discussed in depth before (see here), and which provides an experimental measure of the time taken to translate each codon in an mRNA sequence. In my latest work, I have compiled ribosome profiling data from seven different experiments covering six diverse organisms and processed them all in the same fashion from their respective raw data. Combined, the dataset gives ribosome profiling “speed” values for approximately 25 thousand genes across the various organisms.

Correlations between ribosome profiling translation speeds and the traditional measures (CAI, MinMax, nTE and tAI).

Our first task with this dataset was to see how well the traditional measures compare to the ribosome profiling data. For this, we calculated the correlation against CAI, MinMax, nTE and tAI, with the results presented in the figure above. We find that essentially no measure adequately captures the entirety of the translation speed: some measures fail completely, others obviously capture part of the behaviour, and some even predict the reverse! Given that no measure captured the behaviour adequately, we realised that existing results relating translation speed to protein structure may, in fact, be wrong. We therefore decided to recreate the analysis using our dataset, to either validate or correct the original observations. To do this we combined our ribosome profiling dataset with matching PDB structures, such that we had the sequence, the structure and the translation speed for approximately 4,500 genes over the six species. While I won’t go into details here (see upcoming paper – touch wood), we analysed the relationship between the speed and the solvent accessibility, the secondary structure, and linker regions. We found striking differences from the observations in the literature that I’ll be excited to share in the near future.
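
For illustration only (this is not our actual pipeline), comparing an experimental per-codon speed profile with a theoretical per-codon measure for a single gene might look roughly like this; the toy data and the choice of Spearman correlation are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def gene_correlation(profiling_speeds, codon_measure):
    """Spearman correlation between per-codon ribosome profiling 'speeds'
    and a theoretical per-codon measure (e.g. tAI-like values) for one gene."""
    rho, pvalue = spearmanr(profiling_speeds, codon_measure)
    return rho, pvalue

# toy per-codon values for a single hypothetical gene
rng = np.random.default_rng(0)
profiling_speeds = rng.random(300)                        # experimental values
codon_measure = profiling_speeds * 0.3 + rng.random(300)  # weakly related measure

rho, p = gene_correlation(profiling_speeds, codon_measure)
print(f"Spearman rho = {rho:.2f} (p = {p:.2g})")
```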

Journal Club: Mechanical force releases nascent chain-mediated ribosome arrest in vitro and in vivo

For this week’s journal club, I presented the paper by Goldman et al., “Mechanical force releases nascent chain-mediated ribosome arrest in vitro and in vivo”. I chose this paper because it discusses an influence on protein folding/creation/translation that is not considered in any of today’s modelling efforts, and I think it is massively important that every so often we, as a community, step back and appreciate the complexity of the system we attempt to understand. This work focuses on the SecM protein, which is known to regulate SecA (part of the translocon), which in turn regulates SecM. The biomechanical manner in which this regulation takes place is not fully understood. However, SecM contains within its sequence a peptide motif that binds so strongly to the ribosome tunnel wall that translation is stopped. It is hypothesised that SecA regulates SecM by applying a force to the nascent chain to pull it past this stalling point and hence allow translation to continue.

To begin their study, Goldman et al. wanted to confirm that translation could advance past the stall point merely by the application of force. They could do this by attaching the nascent chain and the ribosome to optical tweezers and a micropipette, respectively. However, to confirm that the system was stalled before applying a (larger) force, they created a construct that included CaM, a protein which periodically hops between a folded and an unfolded state when pulled at 7 pN, followed by the section of SecM which causes the stalling. The optical tweezers were able to sense the slight variations in length at 7 pN from the unfolding and refolding of CaM, but no continuing extension, which would indicate translation, was found. This indicated the system had truly stalled due to the SecM sequence. Once at this point, Goldman et al. increased the applied force, at which point the distance between the pipette and the optical tweezers slowly increased until detachment when the stop codon was reached. As well as confirming that force on the nascent chain could make the SecM system proceed past the stalling point, they also noted a force dependence in the speed with which it overcame this barrier.

Protein folding near the ribosome tunnel exit can rescue SecM-mediated stalling

With this force dependence established, they pondered whether a domain folding up-chain of the stall point could generate enough force to cause translation to continue. To investigate, Goldman et al. created a protein that contained Top7 followed by a linker of variable length, followed by the SecM stalling motif, which was in turn followed by GFP. As shown in the figure above, altering the length of the linker region defines the location of Top7 while it attempts to fold. A long linker allows Top7 to fold completely clear of the ribosome tunnel. With a short linker it cannot fold, because many of its residues are still inside the ribosome tunnel. Between these extremes, however, the protein may have only a few residues within the tunnel, and by stretching the nascent chain it may access them and so be able to fold. In addition, Top7 was chosen specifically because it is known to fold even under light force. Hence, by Newton’s third law, as Top7 folds while its C-terminus is under strain into the ribosome, it generates an equal and opposite force on the stalling peptide sequence within the heart of the ribosome tunnel, which should allow translation to proceed past the stall. Crucially, if Top7 folds too far away from the ribosome, this interaction does not occur and translation does not continue.

Goldman’s experiments showed that this is in fact the case: they found that only linkers of 15 to 22 amino acids would successfully complete translation. This confirms that a protein folding at the mouth of the ribosome tunnel can generate a sizeable force (they calculate roughly 12 pN in this instance). I find this whole system especially interesting, as I wonder how it may generalise to translation as a whole, both in terms of interactions of the nascent chain with the tunnel wall and of domains folding at the ribosome tunnel mouth. Should I consider these when I calculate translation speeds, for example? Oh well, we need a reasonable model for translation that ignores these special cases before I really need to worry!

A designed conformational shift to control protein binding specificity

Proteins and their binding partners with complementary shapes form complexes. Fischer was onto something when he introduced the “lock and key” mechanism in 1894. For the first time, the shape of molecules was considered crucial for molecular recognition. Since then there have been various attempts to improve upon this theory in order to incorporate the structural diversity of proteins and their binding partners.

Koshland proposed the “induced fit” mechanism in 1958, which states that interacting partners undergo local conformational changes upon complex formation to strengthen binding. An additional mechanism, “conformational selection”, was introduced by Monod, Wyman and Changeux, who argued that the conformational change occurs before binding, driven by the inherent conformational heterogeneity of proteins. Given a protein that fluctuates between two states, A and B, and a substrate C that only interacts with one of these states, the chance of complex formation depends on the probability of the protein being in state A or B. Furthermore, one can imagine a scenario in which a protein has multiple binding partners, each binding to a different conformational state. This means that some proteins exist in an equilibrium of different structural states, which determines the prevalence of interactions with different binding partners.

Figure 1. The “pincer mode”.

Based on this observation, Michielssens et al. used various in silico methods to manipulate the populations of the binding-competent states of ubiquitin in order to change its protein-binding behaviour. Ubiquitin is known to occupy two roughly equally populated states along the “pincer mode” (the collective motion described by the first PCA eigenvector): closed and open.

Figure 2. A schematic of the conformational equilibrium of ubiquitin, which can take on a closed or an open state. Depending on its conformation it can bind different substrates.

Different binding partners prefer the closed state, the open state, or both. By introducing point mutations in the core of ubiquitin, away from the binding interface, Michielssens et al. managed to shift the conformational equilibrium between the open and closed states, thereby changing binding specificity.

Point mutations were selected according to the following criteria:

⁃ residues must be located in the hydrophobic core
⁃ binding interface must be unchanged by the mutation
⁃ only hydrophobic residues (as well as serine/threonine) may be introduced
⁃ glycine and tryptophan were excluded because of their size

Figure 3. Conformational preference of ubiquitin mutants. ddG_mut = dG_open – dG_closed.

Fast growth thermodynamic integration (FGTI), a method based on molecular dynamics, was used to calculate the relative de-/stabilisation of the open and closed states caused by each mutation. Interestingly, most mutations that stabilised the open state were concentrated on one residue, isoleucine 36. For the 15 most significant mutations, a complete free energy profile was calculated using umbrella sampling.
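
As a quick reminder of how such a free energy difference translates into a shift in conformational populations (a textbook two-state relation, not a calculation from the paper), a Boltzmann estimate looks like this:

```python
import math

# gas constant in kcal/(mol K) and a typical simulation temperature
R, T = 0.001987, 300.0

def open_fraction(ddG_open_minus_closed):
    """Fraction of molecules in the open state for a two-state system,
    given ddG = dG_open - dG_closed in kcal/mol (negative favours open)."""
    K = math.exp(-ddG_open_minus_closed / (R * T))   # [open]/[closed]
    return K / (1.0 + K)

# e.g. a mutation that stabilises the open state by up to 2 kcal/mol
for ddG in (0.0, -0.5, -1.0, -2.0):
    print(f"ddG = {ddG:+.1f} kcal/mol -> open fraction = {open_fraction(ddG):.2f}")
```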

Figure 4. Free energy profiles for six different ubiquitin mutants, calculated using umbrella sampling simulations. Mutants preferring the closed substate are shown in red, open substate stabilizing mutants are depicted in blue, those without a preference are shown in gray.

Figure 5. Prediction of binding free energy differences between wild-type ubiquitin and different point mutants (ddG_binding = dG_binding,mutant – dG_binding,wild-type).

To further validate that they had correctly categorised their mutants as stabilising the open or the closed state, six X-ray structures of ubiquitin in complex with a binding partner preferring either the open or the closed state were simulated with each of the mutations. Figure 5 shows the change in binding free energy caused by each mutation in compatible, neutral and incompatible complexes (compatible refers, for example, to an “open-favouring mutation” in an open-preferring complex, and vice versa).

Figure 6. Comparison of change in binding free energy predicted from the calculated results for ubiquitin and the experimental result.

In the last step, a selection of open and closed mutations was introduced into an open complex, and the change in binding free energy was compared between experiment (NMR) and simulation. In this example, the mutants behaved as expected: an increase in binding free energy was observed when the incompatible closed mutations were introduced into the open complex, while only subtle changes were seen when the “compatible” open mutations were introduced.

The authors suggest that in the future this computational protocol may be a cornerstone of designing allosteric switches. However, given that the approach requires pre-existing knowledge and is tailored to proteins with well-defined conformational states, it may take some time until we discover its full potential.

Short project: “Network Approach to Identifying the Mode of Action of Environmental Changes in Yeast”

Comparison of gene-expression correlation distributions for interacting versus non-interacting gene pairs, and for pairs within versus across communities.

I recently had the pleasure of working for 11 weeks with the wonderful people in OPIG. I studied protein interaction networks and how we might discern the parts of the network that are important for disease (and otherwise). In the past, people have looked at differential gene expression or used community detection to this end, but both of these approaches have drawbacks. The former misses the fact that biological systems are rarely just binary systems of interactions. Community detection addresses this, but it in turn does not take into account the dynamic nature of proteins in the cell: how do their interactions change over time? What about interactions or proteins that are only present in some cells? Community detection tries to look at all proteins at once and ignores important context like this.

My aim was to develop approaches that combined these elements. We used Pearson’s correlation coefficient on gene expression data and community detection on an interaction network. We showed that the distribution of the correlation of pairs of genes is weighted towards 1.0 for pairs that interact compared to those that do not, and for pairs in the same community compared to those that are not (see the figure above). We went on to assign a “score” to communities based on their correlation in each set of expression data. For example, one community might have a high score in expression data from cells undergoing amino acid starvation. We ended up with a list of communities which seemed to be important in certain environmental conditions. We made use of functional enrichment, drawing on the lovely Malte’s work, to try and verify these scores.
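
As a rough sketch of the correlation comparison described above (hypothetical data and names, not the project’s actual code), one could compare the Pearson correlations of interacting and non-interacting gene pairs like this:

```python
import numpy as np
from scipy.stats import pearsonr

def pair_correlations(expression, pairs):
    """Pearson correlation across conditions for each gene pair.
    expression: dict gene -> 1D array of expression values over conditions."""
    return [pearsonr(expression[a], expression[b])[0] for a, b in pairs]

# toy expression data: 20 conditions, a few hypothetical genes
rng = np.random.default_rng(1)
base = rng.random(20)
expression = {
    "G1": base + 0.1 * rng.random(20),     # G1 and G2 co-vary (interacting-like)
    "G2": base + 0.1 * rng.random(20),
    "G3": rng.random(20),                  # G3 and G4 are unrelated
    "G4": rng.random(20),
}

interacting = [("G1", "G2")]
non_interacting = [("G3", "G4"), ("G1", "G3")]

print("interacting:    ", np.round(pair_correlations(expression, interacting), 2))
print("non-interacting:", np.round(pair_correlations(expression, non_interacting), 2))
```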

I had a great time with some lovely people and produced something that I thought was very interesting. I really hope I see this work pop up again and get taken to interesting places! So long, and thanks for all the cookies!

Click here for some more pretty plots and a code repository (by request only).

Journal Club: Accessing Protein Conformational Ensembles using RT X-ray Crystallography

This week I presented a paper that investigates the differences between crystallographic datasets collected from crystals at RT (room temperature) and at CT (cryogenic temperatures). Full paper here.

The cooling of protein crystals to cryogenic temperatures is widely used as a method of reducing radiation damage and enabling collection of whole datasets from a single crystal. In fact, this approach has been so successful that approximately 95% of structures in the PDB have been collected at CT.

However, the main assumption of cryo-cooling is that the “freezing”/cooling process happens quickly enough that it does not disturb the conformational distributions of the protein, and that the RT ensemble is “trapped” when cooled to CT.

Although it is well established that cryo-cooling of the crystal does not distort the overall structure or fold of the protein, this paper investigates some of the more subtle changes that cryo-cooling can introduce, such as the distortion of sidechain conformations or the quenching of dynamic contact networks. These features of proteins could be important for understanding phenomena such as binding or allosteric modulation, and so accurate information about them is essential. If this information is regularly lost in the cryo-cooling process, it would be a strong argument for a return to collection at RT where feasible.

Using the RINGER method, the authors find that sidechain conformations are commonly affected by the cryo-cooling process: the conformers present at CT are sometimes completely different from those observed at RT. In total, they find that cryo-cooling affects a significant number of residues (predominantly those on the surface of the protein, but also some that are buried): 18.9% of residues have rotamer distributions that change between RT and CT, and 37.7% of residues have a conformer whose occupancy changes by 20% or more.

Overall, the authors conclude that, where possible, datasets should be collected at RT, as the derived models offer a more realistic description of the biologically-relevant conformational ensemble of the protein.