GO annotations: Warning – secondary IDs!

I’m writing this post after a bit of a scare that has cost me at least 2 hours of my life so far, and possibly may cost me a lot more. This is something I wish I had read about before starting to use gene ontology annotations (GO terms), so hopefully this will reach some people who are in the position I was in a couple of years ago. The post assumes you know about the GO directed acyclic graph (DAG), which describes the relationships between GO terms, and a bit about functional similarity. Good reviews on these topics were written by Pesquita et al (2009) and Guzzi et al (2012).

 

Secondary GO-terms

Yesterday evening I discovered that GO terms can have secondary accession numbers. For example, GO:0060555 is a secondary ID for GO:0070266, which describes the biological process term “necroptotic process”. The existence of secondary IDs is not in itself particularly surprising, given that ontologies like the gene ontology are at the forefront of research and thus have to be updated with the latest knowledge. As is the case for UniProtKB and RefSeq, identifiers are merged if they are found to refer to the same thing. RefSeq has a good overview of when and why this occurs in their FAQ. Keeping a record of secondary IDs in the database is important for compatibility with past applications of the ontology.

 

Why is this a problem?

This becomes a problem when backward compatibility is not fully committed to. For example, the downloadable annotation set for human proteins contains GO terms that are secondary IDs, while the downloadable ontology structure does not include these secondary IDs in the directed acyclic graph (DAG). Someone may therefore download all the annotations and look up their parent terms in the DAG, but because the secondary IDs are not in the DAG, they have no parent term set.

 

So why the scare?

It seems like this behaviour should become obvious very fast: either you get a key error because you are looking for something that isn’t there, or an empty set rings some alarm bells. The scare comes when neither happens and you just don’t notice it at all, only to find out a couple of years later…

In order to calculate the functional similarity between two proteins, I need the full GO term sets associated with every protein including ancestral terms. I have a local copy of the ontology structure which I can use to query for these ancestral terms based on the GO terms directly associated with the proteins. As I need to do this for between 6800 and 13000 proteins depending on the network, I automated the process. The problem is that MySQL returns an empty set when asking for a match on an accession number that isn’t there. Returning an empty set has the effect that terms not in the ontology structure are automatically filtered out. This came to light now that I’m looking to explicitly calculate the information content of each GO term (a metric for how prevalent a term is in a data set) and three terms could not be found in the ancestral term sets for any proteins.

 

How do I prevent this from happening to me?

As only a few secondary ID terms are associated with proteins in the downloadable association file, it is easy to manually look up the mappings to the respective primary IDs in the EBI QuickGO tool. Then, before querying for ancestral term sets, just map the secondary IDs across to the primary IDs and you’re all set. It’s really not a hassle if you know it’s there. Sadly, there is only a sentence on the gene ontology site about secondary IDs and no mention whatsoever that these IDs are not incorporated in the GO-basic ontology download, which is the recommended download for finding ancestral term sets.
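To make the fix concrete, here is a minimal sketch in Python. The mapping table is hypothetical and would be filled in by hand from QuickGO; the one entry shown is the example from above.

```python
# Hypothetical secondary-to-primary GO ID map, compiled by hand from QuickGO.
# GO:0060555 -> GO:0070266 is the example from above; add your own entries.
SECONDARY_TO_PRIMARY = {
    "GO:0060555": "GO:0070266",  # necroptotic process
}

def canonicalise(go_terms):
    """Replace secondary GO IDs with their primary IDs before querying the DAG."""
    return {SECONDARY_TO_PRIMARY.get(term, term) for term in go_terms}

print(canonicalise({"GO:0060555", "GO:0008150"}))
```

Running every annotation set through a function like this before the ancestral-term query means a secondary ID can never silently vanish from the DAG lookup.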

 

Are you okay now?

As I hinted at before, very few proteins are annotated with secondary ID terms. In my case I found three secondary ID GO terms annotated to five different proteins, out of a total of 6861 proteins I was checking. One protein was also annotated with the corresponding primary ID term, so there was no effect, and another was annotated with the only parent term of the secondary ID, which no other proteins were annotated to. Thus, the effect really boils down to three proteins with the same secondary ID. Of those three, only one is heavily affected in the worst case, with its similarity scores to 10 other proteins changing from 9.3 to 2.6 without the primary ID annotation (scores range from ~1.5 to 12). Those are 10 scores out of a total of approximately 24,000,000… I think I will survive. But I have yet to test on a further data set.

 

The effect is basically that you ignore a couple of annotations. Given that we do not have the full set of annotations anyway, this change is the same as if an experiment or two hadn’t been conducted. Furthermore, the three proteins I found to be affected had a lot of annotations, so missing one should not have the worst-case effect I investigated. Nonetheless, you may want to do it properly the first time round and take all the data into account when you start out. It should save you a lot of hassle later.

 

Stay vigilant, my friends!

Ways to compare networks

Nowadays network comparison is becoming increasingly relevant. Why? Mainly because many complex systems can be represented as networks, and comparing the networks is then a natural way to compare the systems.

Network comparison aims to account for properties that are generated by the simultaneous interaction of all units rather than the properties of each single individual. Here are some cases where network comparison could be helpful:

– Showing and highlighting “significant” changes in network evolution. A particular characteristic of interest could be the speed with which information flows.

– Quantifying how “close” two networks are, even when the networks have different numbers of nodes and edges, or, in the case of spatially embedded networks, different scales.

As an example, look at the following two networks. Are the structures of these two road networks similar?

[Image: two road network maps]

(Images obtained from the mac Maps app)

Or what about the structure of these two other networks?
[Image: two geometric networks]

One of the difficulties in comparing networks is that there is no clear way to compare them as complete, whole entities. Network comparison methods only compare certain attributes of a network, among them: edge density, global clustering coefficient, degree distribution, counts of smaller graphs embedded in the network, and others. Here are some ideas and methods that have been used to compare networks.

  • Networks can be compared by their global properties and summary statistics, like network density, degree distribution, transitivity, average shortest path length and others. Usually a statistic combining the differences across several global properties is used.
  • Another way to compare networks is based on the fit of a statistical network model (e.g. an ERGM) to each network; the coefficients of the fitted models can then be compared. Note that in this case a good fit of the models is required.
  • Statistics built directly for network comparison via subgraph counts. These statistics do not make any assumptions about the network generation process. For example, Netdis and GCD are two network comparison statistics that try to measure the structural difference between networks. These measures are based on counts of small subgraphs (3-5 nodes), like triangles, 2-stars, 3-stars, squares, cliques and others. They create frequency vectors of subgraph counts and then compare these frequencies between the networks to obtain an idea of the similarity of the networks relative to their subgraph counts.
  • Lastly, another way to indirectly compare networks is via network alignment methods. The objective of these methods is to create a “mapping”, f, from the node set of one network to the node set of another. [Figure: two networks, light and dark blue, with an alignment between them shown in red; an edge demarcated by a green oval is a non-conserved interaction.] One of the objectives of an alignment is to maximise the number of conserved interactions: if (u,v) is an edge in the first network and f is an alignment, then the edge (u,v) is conserved if (f(u),f(v)) is an edge in the second network.
    NETAL is a commonly used network alignment method, although there are several, more recent, network alignment methodologies.
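Netdis and GCD are considerably more sophisticated, but the flavour of the first and third ideas can be sketched in a few lines of plain Python: summarise each network by a small vector (here just edge density and triangle count) and compare the vectors. The toy graphs below are my own examples.

```python
from itertools import combinations

def summary_stats(edges):
    """Edge density and triangle count for an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)                               # number of nodes
    m = len({frozenset(e) for e in edges})     # number of undirected edges
    density = 2.0 * m / (n * (n - 1)) if n > 1 else 0.0
    triangles = sum(1 for a, b, c in combinations(adj, 3)
                    if b in adj[a] and c in adj[a] and c in adj[b])
    return density, triangles

# Two toy networks with the same density but different subgraph counts:
g1 = [(1, 2), (2, 3), (1, 3), (3, 4)]   # contains one triangle
g2 = [(1, 2), (2, 3), (3, 4), (4, 1)]   # a square, no triangles
print(summary_stats(g1))   # density 2/3, one triangle
print(summary_stats(g2))   # density 2/3, no triangles
```

Note that the two graphs are indistinguishable by density alone; the subgraph count is what separates them, which is the intuition behind subgraph-based statistics.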

In the end, despite the variety of ways to compare networks, saying that two networks are similar or different is not that easy, as all methods face their own particular challenges, such as networks that come from the same model but have different numbers of nodes.

 

Diseases exploiting the Power of Networks

Networks are pretty amazing. If you have a bunch of nodes that sit there and do very simple tasks that you can train a dog to do, and then you connect them in the right way, you can create a programme that can get through a level of Mario! In neural networks, all the nodes are doing is getting an input, then deciding what works best based on that input, and sending that information somewhere else. It’s like teaching your dog to get help when you’re in trouble. You say “Help!”, it makes a decision where to go to find help, and then someone else is alerted to your situation. Now you link those inputs and outputs, and suddenly you can solve some very interesting problems. Like making your dog get another dog to do something, who tells another dog… okay… at some point the analogy may break down. What I want to get across, is that networks can take very simple things and create complexity by just adding a few connections between them.

Getting back to the biological theme, proteins aren’t really all that simple. From what you’ll have read on other posts in this blog there’s a lot of effort that goes into modelling them. The position of even a small peptide loop can have a big effect on how a protein functions. So if you have a network of 15 – 25 thousand proteins you can imagine that the complexity of that system is quite incredible. That system is the human body.

Looking at systems from the perspective of interactions between components creating complex behaviour, it’s really not that surprising that human diseases often function by disrupting the network of protein interactions as found by this paper recently published in Cell. Assuming there’s a human disease which has survived for generations and generations, it is likely to be quite effective in how it functions. And if you have a complex system like the human body, the easiest way to disrupt that arising behaviour called “a healthy life”, is to alter a few of the interactions. Showing this for a large group of disease-associated missense mutations is however pretty impressive and time-consuming, which is probably why there are about 50 people from over 20 different institutions in the author list.

So what exactly did they show? They showed what happens when there is a mutation in a gene that causes a disease. Such a mutation replaces an amino acid in the protein sequence encoded by the gene. The resulting protein is then somehow different. The group around Marc Vidal and Susan Lindquist showed that rather than affecting the stability of the protein, the system tends to be targeted via the interactions of the protein. What is more, they showed that different mutations of the same gene can affect different sets of interactions. Using their “edgotyping” approach it is possible to characterize the type of effect a mutation will have and possibly predict the change in phenotype (i.e. the disease state).

Edgotyping classifies the effect of disease associated mutations (Sahni et al)

Now if the possibility of predicting how a single mutation (1 in ~300 amino acids), in a single protein (1 in ~25,000 proteins), affects the human body doesn’t convince you that networks are pretty powerful, I really don’t know what will. But now back to the important stuff… how do you make a real-life neural network with communicating dogs? And could you make them reach level two of Mario? Or… Super Mario Dog?

Journal Club: AbDesign. An algorithm for combinatorial backbone design guided by natural conformations and sequences

Computational protein design methods often use a known molecule with a well-characterised structure as a template or scaffold. The chosen scaffold is modified so that its function (e.g. what it binds) is repurposed. Ideally, one wants to be confident that the expressed protein’s structure is going to be the same as the designed conformation. Therefore, successful designed proteins tend to be rigid, formed of collections of regular secondary structure (e.g. α-helices and β-sheets) and have active site shapes that do not perturb far from the scaffold’s backbone conformation (see this review).

A recent paper (Lapidoth et al 2015) from the Fleishman group proposes a new protocol to incorporate backbone variation (read: loop conformations) into computational protein design (Figure 1). Using an antibody as the chosen scaffold, their approach aims to design a molecule that binds a specific patch (epitope) on a target molecule (antigen).


Figure 1 from Lapidoth et al 2015 shows an overview of the AbDesign protocol

Protein design works in the opposite direction to structure prediction: i.e. given a structure, tell me what sequence will allow me to achieve that shape and bind a particular patch in the way I have chosen. To do this one first needs to select a shape that could feasibly be achieved in vivo. We would hope that if a backbone conformation has previously been seen in the Protein Data Bank, it is one of such a set of feasible shapes.

Lapidoth et al sample conformations by constructing a backbone torsion angle database derived from known antibody structures in the PDB. From the work of North et al and others we also know that certain loop shapes can be achieved with multiple different sequences (see KK’s recent post). The authors therefore reduce the number of possible backbone conformations by clustering them by structural similarity. Each conformational cluster is represented by a representative structure and a position-specific substitution matrix (PSSM). The PSSM represents how the sequence can vary whilst maintaining the shape.
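The paper’s PSSMs live inside the Rosetta pipeline, but the idea of sampling sequence variation position by position can be illustrated with a short sketch. The per-position residue frequencies below are entirely made up; they stand in for the frequencies a real cluster alignment would provide.

```python
import random

# Hypothetical per-position residue probabilities for a 3-residue loop,
# standing in for a PSSM column derived from a conformational cluster.
pssm = [
    {"G": 0.7, "A": 0.2, "S": 0.1},
    {"D": 0.5, "N": 0.5},
    {"Y": 0.9, "F": 0.1},
]

def sample_sequence(pssm, rng=random):
    """Draw one loop sequence, position by position, from the PSSM frequencies."""
    seq = []
    for column in pssm:
        residues, weights = zip(*column.items())
        seq.append(rng.choices(residues, weights=weights)[0])
    return "".join(seq)

print(sample_sequence(pssm))  # e.g. "GDY"
```

Sampling mutations this way biases the design towards sequences that nature has already used to build the chosen loop shape.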

The Rosetta design pipeline that follows uses the pre-computed torsion database to make a scaffold antibody structure (1x9q) adopt different backbone conformations. Proposed sequence mutations are sampled from the PSSM corresponding to the conformation. Shapes, and the sequences that can adopt them, are ranked with respect to a docked pose with the antigen using several structure-based filters and Rosetta energy scores. A trade-off is made between predicted binding and stability energies using a ‘fuzzy logic’ scheme.

After several rounds of optimisation the pipeline produces a predicted structure and sequence that should bind the chosen epitope patch and fold to form a stable protein when expressed. The benchmark results show promise in terms of structural similarity to known molecules that bind the same site (polar interactions, buried surface area). Sequence similarity between the predicted and known binders is perhaps lower than expected. However, as different natural antibody molecules can bind the same antigen, convergence between a ‘correct’ design and the known binder may not be guaranteed anyway.

In conclusion, my take home message from this paper is that to sensibly sample backbone conformations for protein design use the variation seen in known structures. The method presented demonstrates a way of predicting more structurally diverse designs and sampling the sequences that will allow the protein to adopt these shapes.  Although, as the authors highlight, it is difficult to assess the performance of the protocol without experimental validation, important lessons can be learned for computational design of both antibodies and general proteins.

5 Thoughts For… Comparing Crystallographic Datasets

Most of the work I do involves comparing diffraction datasets from protein crystals. We often have two or more different crystals of the same crystal system, and want to spot differences between them. The crystals are nearly isomorphous, so that the structure of the protein (and crystal) is almost identical between the two datasets. However, it’s not just a case of overlaying the electron density maps, subtracting them and looking at the difference. Nor do we necessarily want to calculate Fo-Fo maps, where we calculate the difference by directly subtracting the diffraction data before calculating maps. By the nature of the crystallographic experiment, no two crystals are the same, and two (nearly identical) crystals can lead to two quite different datasets.

So, here’s a list of things I keep in mind when comparing crystallographic datasets…

Control the Resolution Limits

1) Ensure that the resolution limits in the datasets are the same, both at the high AND the low resolution limits.

The high resolution limit. The best known, and usually the most important, statistic of a dataset. This is a measure of the amount of information that has been collected about the dataset. Higher resolution data give more detail in the electron density. Therefore, if you compare a 3A map to a 1A map, you’re comparing fundamentally different objects, and the differences between them will predominantly come from the different amounts of information in each dataset. It is then very difficult to ascertain what is interesting and what is an artefact of this difference. As a first step, truncate all datasets at the resolution you wish to compare them at.

The low resolution limit. At the other end of the dataset, there can be differences in the low resolution data collected. Low resolution reflections correspond to much larger-scale features in the electron density. It is therefore just as important to have the same low resolution limit for both datasets; otherwise you get large “waves” of electron density (low-frequency Fourier terms) in one dataset that are not present in the other. Because low resolution reflections are much stronger than high resolution reflections, these features stand out very strongly, and can also obscure “real” differences between the datasets you are trying to compare. Truncate all datasets at the same low resolution limit as well.

Consider the Unit Cell

2) Even if the resolution limits are the same, the number of reflections in maps can be different.

The unit cell size and shape. Even if the crystals you’re using are the same crystal form, no two crystals are the same. The unit cell (the building block of the crystal) can have slightly different sizes and shapes between crystals, varying by a few percent. This can occur for a variety of reasons, from the unpredictable process of cooling the crystal to cryogenic temperatures to entirely stochastic differences in the process of crystallisation. Since the “resolution” of a reflection depends on the size of the unit cell, two reflections with the same Miller index can have different “resolutions” when it comes to selecting reflections for map calculation. Therefore, if you’re calculating maps from nearly-isomorphous but non-identical crystals, consider calculating maps based on a high and a low Miller index cutoff, rather than a resolution cutoff. This ensures the same amount of information (number of free parameters) in each map.
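To see why the same Miller index can have a different “resolution” in two crystals, it helps to compute the d-spacing explicitly. This sketch handles only the simple orthorhombic case (a general cell needs the full reciprocal metric tensor); the cell dimensions are made-up example values.

```python
import math

def resolution(h, k, l, a, b, c):
    """d-spacing (in Angstrom) of reflection (h,k,l) for an orthorhombic cell
    with edges a, b, c:  1/d^2 = h^2/a^2 + k^2/b^2 + l^2/c^2."""
    inv_d2 = (h / a) ** 2 + (k / b) ** 2 + (l / c) ** 2
    return 1.0 / math.sqrt(inv_d2)

# The same Miller index in two near-isomorphous crystals whose a-axis
# differs by ~1% falls at a slightly different resolution:
print(resolution(10, 0, 0, 50.0, 60.0, 70.0))  # 5.0 Angstrom
print(resolution(10, 0, 0, 50.5, 60.0, 70.0))  # ~5.05 Angstrom
```

A reflection sitting right at a resolution cutoff in one dataset can therefore fall on the other side of the cutoff in the second dataset, which is the argument for cutting on Miller indices instead.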

Watch for Missing Reflections

3) Remove any missing reflections from both datasets.

Reflections can be missing from datasets for a number of reasons, such as falling into gaps/dead pixels on the detector. However, this isn’t going to happen systematically with all crystals, as different crystals will be mounted in different orientations. When a reflection is missed in one dataset, it’s best to remove it from the dataset you’re comparing it to as well. This can have an important effect when the completeness of low- or high-resolution shells is low, whatever the reason.
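In code, matching the two datasets is just an intersection over the observed Miller indices. A sketch, with each dataset held as a {(h, k, l): amplitude} dictionary (the amplitude values are made up):

```python
def common_reflections(data1, data2):
    """Keep only reflections observed in both datasets (keys are (h,k,l) tuples)."""
    shared = set(data1) & set(data2)
    return ({hkl: data1[hkl] for hkl in shared},
            {hkl: data2[hkl] for hkl in shared})

# Dataset 2 is missing (1,0,0), e.g. it fell into a detector gap:
d1 = {(1, 0, 0): 120.5, (0, 1, 0): 88.2, (0, 0, 1): 45.1}
d2 = {(0, 1, 0): 90.0, (0, 0, 1): 44.8}
f1, f2 = common_reflections(d1, d2)
print(sorted(f1))  # [(0, 0, 1), (0, 1, 0)]
```

After this step both maps are calculated from exactly the same set of reflections, so a missing observation in one crystal cannot masquerade as a real difference.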

Not All Crystal Errors are Created Equal…

4) Different Crystals have different measurement errors.

Observation uncertainties of reflections vary from crystal to crystal. This may be due to a poor-quality crystal, or a crystal that has suffered more radiation damage than another. These errors lead to uncertainty and error in the electron density maps. Therefore, if you’re looking for a reference crystal, you probably want to choose one with reflection uncertainties, σ(F), as small as possible.

Proteins are Flexible

5) Even though the crystals are similar, the protein may adopt slightly different conformations.

In real space, the protein structure varies from crystal to crystal. For the same crystal form, there will be the same number of protein copies in the unit cell, and they will be largely in the same conformation. However, the structures are not identical, and the inherent flexibility of the protein means that the conformation seen in the crystal can change slightly from crystal to crystal. This effect is largest in the most flexible regions of the protein, such as unconstrained C- and N-termini, flexible loops and crystal contacts.

From 300 superpositions/sec to 6,000,000 superpositions/sec. A Python to C story

Part of the work I do requires me to identify unique structural shapes in specific proteins. People have done this in many ways, some more complex than others. I like simpler things, so I decided to use clustering on top of a superposition algorithm. I will not go into too much detail about the clustering part of this algorithm, as it is not the focus of this article, but let’s assume that we need to make millions, possibly billions, of superpositions. In the following I will go through the steps I took to make a superposition library faster and faster as the requirements grew. It includes information on how to profile your code, connect Python to C, and the idiosyncrasies of parallelising C code.

Version 1.0 Python 100%

In PyFread I found a superposition algorithm, coupled with a PDB parser. I extracted the code and created a wrapping function. The signature of it was:

def superpose(file1, file2):

The files had to contain only ATOM lines with the same number of residues. The function returned the RMSD of the best superposition, which was needed for the clustering. This could do a couple of hundred superpositions/second. Using the built-in Python profiler I found that both the file-reading part of the algorithm and the superposition itself were slowing down the program, so I decided to move the contents of the wrapping function to C.

You can easily profile your code by running the command below. It will tell you, for each method, how many times it is called and the total time spent in it. More information here: https://docs.python.org/2/library/profile.html

python -m cProfile your_script.py

Version 2.0 C 99% – Python 1%

My script still needed to run from Python, but its functionality would be implemented in C. So I rewrote the whole thing in C (a PDB parser and superposition algorithm) and interfaced it to Python. The way to achieve this is using ctypes, which provides an interface for calling C/C++ methods from Python. The way you set this up is:

C code rmsd.cpp

extern "C" {
    void superpose_c(const char* file1, const char* file2, double* rmsd){
        // perform superposition
        *rmsd = compute_rmsd();
    }
}

You compile this code with: g++ -shared -Wl,-soname,rmsdlib -o rmsdlib.so -fPIC rmsd.cpp . This creates a shared object which can be loaded in Python

Python Code

# load the ctypes library with C-like objects
from ctypes import *
# load the C++ shared object
lib = cdll.LoadLibrary('./rmsdlib.so')

# convert a python string to a NUL-terminated C char array
def to_c_string(py_string):
    c_string = (c_char * (len(py_string) + 1))()
    for i in range(len(py_string)):
        c_string[i] = py_string[i]
    c_string[len(py_string)] = '\0'
    return c_string

rmsd = c_double()
lib.superpose_c(to_c_string("file1.pdb"), to_c_string("file2.pdb"), byref(rmsd))
# byref is necessary so that the value set inside the C function is visible in Python

There are other ways of coding the Python-C bridge, I’m sure, and if you think you have something better and easier, please comment.

Version 2.1 Armadillo Library

After running another batch of profiling, this time in C, I found that the bottleneck was the superposition code.

An easy way to profile C code is to compile it with the -pg flag (gcc -pg …). You then run the executable, which will produce a file called gmon.out. Using that, you run gprof executable_name gmon.out > analysis.txt, which stores the profiling info in analysis.txt. You can read more on this here: http://www.thegeekstuff.com/2012/08/gprof-tutorial/

The superposition algorithm involved doing a Singular Value Decomposition, which in my implementation was slow. Jaroslaw Nowak found the Armadillo library, which does a lot of cool linear algebra stuff very fast, and it is written in C++. He rewrote the superposition using Armadillo, which made it much faster. You can read about Armadillo here http://arma.sourceforge.net/

Version 3.0 C 100%

So what is the bottleneck now? Every time I do a superposition, the C function gets called from Python. Suppose I have the following files: file1.pdb, file2.pdb, file3.pdb, file4.pdb. I have to compare file1.pdb to every other file, which means I will be reading file1.pdb 3 times (or 1 million times, depending on what I have to do) in my original Python-C implementation. My clustering method is smarter than this, but similar redundancies did exist and they slowed down my program. I therefore decided to move everything to C, so that once I read a file I could keep it stored in memory.

Version 3.1 We have multiple cores, let’s use them

The superpositions can be parallelised, but how straightforward is doing this in C++? First of all you need to include the library: #include <thread>. A simple scenario would be that I have a vector of N filenames and the program has to compare everything with everything. The results are stored in the upper triangle of an NxN array. The way I usually approach such a situation is to send each thread a list of comparison indices (e.g. thread 1 compares 1-2, 1-3, 1-4; thread 2 compares 2-3, 2-4; …) and the address of the results matrix (where the RMSD values are stored). Because each thread writes to a different set of indices in the matrix, it is not a problem that all the threads see the same results matrix (if you code it properly).

The way to start a thread is:

thread(par_method, …. , ref(results))

par_method is the method you are calling. When you pass an argument like results to a thread, it is passed by value unless you wrap it with ref(...) (it does not matter that in a normal unthreaded call it would be passed by reference). ref is a reference wrapper and makes sure that all the threads see the same results instance.

Other problems you can come across are writing to the same file from multiple threads, or modifying vector objects which are passed by reference (with ref). These do not have thread-safe operations, and the program may crash if more than one thread calls a function on them at the same time. To make sure this does not happen you can use a lock. The way you achieve this in C++ is with a mutex.

#include <mutex>
// Declare a global mutex that every thread sees
mutex output_mtx;

void parallel_method(/* ... */, vector<string>& files){
    output_mtx.lock();
    // unsafe operation, now protected by the lock
    files.push_back(filename);
    output_mtx.unlock();
}

If one thread has locked output_mtx, another thread will not return from output_mtx.lock() until the first one unlocks it.

Version 3.2 12 million files in a folder is not a good idea

My original use case was that the method would receive two file names, the files would be read and the superposition done. This original use case stayed with the program, and even in version 3.1 you required one file for each fragment. Because the library was becoming faster and faster, the things we wanted to try became more ambitious. You could hear me saying ‘Yeah, sure, we can cluster 12 million PDB fragments’. To do this I had to extract the 12 million fragments and store them in as many files. It wasn’t long until I received an angry/condescending email from the IT department. So what to do?

I decided to take each PDB file, remove everything except the ATOM fields and store the result as a binary file. In this way, instead of providing two filenames to the superposition algorithm, you provide two byte ranges which are read from the pre-parsed PDB files. This is fast, and you also have only ~100k files (one for each PDB file), which you can use over and over again.
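My binary format is specific to PDB ATOM records, but the idea is generic. Here is a hypothetical sketch in Python (my actual implementation is in C): pack many coordinate fragments into one file, keep an (offset, length) index, and read any fragment back with a seek instead of opening a file per fragment.

```python
import os
import struct
import tempfile

def write_fragments(path, fragments):
    """Pack fragments (lists of (x,y,z) atom coordinates) into one binary file.
    Returns an index of (offset, length) byte ranges, one per fragment."""
    index = []
    with open(path, "wb") as fh:
        for coords in fragments:
            blob = b"".join(struct.pack("<3d", *xyz) for xyz in coords)
            index.append((fh.tell(), len(blob)))
            fh.write(blob)
    return index

def read_fragment(path, offset, length):
    """Recover one fragment by seeking to its byte range (24 bytes per atom)."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        blob = fh.read(length)
    return [struct.unpack_from("<3d", blob, i) for i in range(0, length, 24)]

frags = [[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], [(2.0, 2.0, 2.0)]]
path = os.path.join(tempfile.mkdtemp(), "fragments.bin")
index = write_fragments(path, frags)
print(read_fragment(path, *index[1]))  # [(2.0, 2.0, 2.0)]
```

Two fragments can then be handed to the superposition routine as byte ranges into the same file, so twelve million fragments never need twelve million files.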

Version 3.2 is the latest version as of now, but it is worth pointing out that whatever you do to make your code faster, it will never be enough! The requirements become more ambitious as your code gets faster. I can now perform 6 million superpositions/second on fragments of 4 residues, and 2 million superpositions/second on fragments of 5 residues. The rate declines steeply with fragment length, and I can foresee a requirement to cluster all the 10-residue fragments in the PDB appearing sometime in the future.


Ten Simple Rules for a Successful Cross-Disciplinary Collaboration

The name of our research group (Oxford Protein Informatics Group) already indicates its cross-disciplinary character. In doing research of this type, we can acquire a lot of experience in working across the boundaries of research fields. Recently, one of our group members, Bernhard Knapp, became lead author of an article about guidelines for cross-disciplinary research. This article describes ten simple rules which you should consider if working across several disciplines. They include going to the other lab in person, understanding different rewards models, having patience with the pace of other disciplines, and recognising the importance of synergy.

The ten rules article was even picked up by a journalist of the “Times Higher Education” and further discussed in the newspaper.

Happy further interdisciplinary work!

 

Clustering Algorithms

Clustering is a task of organizing data into groups (called clusters), such that members of each group are more similar to each other than to members of other groups. This is a brief description of three popular clustering algorithms – K-Means, UPGMA and DBSCAN.

Cluster analysis

K-Means is arguably the simplest and most popular clustering algorithm. It takes one parameter – the expected number of clusters k. At the initialization step a set of k means m1, m2,…, mk is generated (hence the name). At each iteration step the objects in the data are assigned to the cluster whose mean yields the least within-cluster sum of squares. After the assignments, the means are updated to be centroids of the new clusters. The procedure is repeated until convergence (the convergence occurs when the means no longer change between iterations).

The main strength of K-Means is that it is simple and easy to implement. Its largest drawback is that one needs to know in advance how many clusters the data contain. Another problem is that with an unlucky initialization the algorithm can easily converge to a local minimum, which may result in a suboptimal partitioning of the data.
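Purely as an illustration of the procedure just described, here is a minimal K-Means sketch in plain Python. The naive initialisation from the first k points is my own simplification, and it is exactly the kind of choice that can trap the algorithm in a local minimum.

```python
def kmeans(points, k, iters=100):
    """Plain k-means: assign each point to the nearest mean, recompute centroids."""
    means = [tuple(float(c) for c in p) for p in points[:k]]  # naive initialisation
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the cluster whose mean is nearest (squared distance)
            j = min(range(k),
                    key=lambda m: sum((a - b) ** 2 for a, b in zip(p, means[m])))
            clusters[j].append(p)
        # update each mean to the centroid of its cluster (keep old mean if empty)
        new_means = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[j]
                     for j, cl in enumerate(clusters)]
        if new_means == means:   # convergence: the means no longer change
            break
        means = new_means
    return means, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(pts, 2)
print(sorted(clusters[0]))  # [(0, 0), (0, 1), (1, 0)]
```

On this toy data the two obvious groups are recovered; with a worse initialisation or overlapping groups, the same loop can settle on a suboptimal partition.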

UPGMA is a simple hierarchical clustering method, where the distance between two clusters is taken to be the average of distances between individual objects in the clusters. In each step the closest clusters are combined, until all objects are in clusters where the average distance between objects is lower than a specified cut-off.

The UPGMA algorithm is often used for the construction of phenetic trees. The major issue with the algorithm is that the tree it constructs is ultrametric, meaning that the distance from the root to any leaf is the same. In the context of evolution, this implies that UPGMA assumes a constant rate of accumulation of mutations, an assumption which is often incorrect.
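Average linkage as described above is available off the shelf: SciPy's hierarchical clustering with `method="average"` implements the UPGMA criterion. A small sketch, where the toy data and the distance cut-off of 1.0 are purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy data: two well-separated groups of 2-D points (illustrative only)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# "average" linkage on a condensed distance matrix is the UPGMA criterion:
# the distance between two clusters is the average of all pairwise
# object-to-object distances between them
Z = linkage(pdist(X), method="average")

# Cut the tree at a distance cut-off: groups whose average inter-object
# distance stays below the threshold remain merged into one cluster
labels = fcluster(Z, t=1.0, criterion="distance")
```

The linkage matrix `Z` encodes the full hierarchy, so the same tree can be cut at several thresholds without re-clustering.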

DBSCAN is a density-based algorithm which tries to separate the data into regions of high density, labelling points that lie in low-density areas as outliers. The algorithm takes two parameters, ε and minPts, and looks for points that are density-connected with respect to ε. A point p is said to be density-reachable from a point q if there are points between q and p such that one can traverse the path from q to p without ever moving further than ε in a single step. Because density-reachability is not symmetric, the concept of density-connectivity is introduced: two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. A set of points is considered a cluster if all points in the set are mutually density-connected and the number of points in the set is at least minPts. Points that cannot be put into any cluster are classified as noise.


a) Illustrates the concept of density-reachability.
b) Illustrates the concept of density-connectivity

The DBSCAN algorithm can efficiently detect clusters with non-globular shapes, since it is sensitive only to changes in density. Because of that, the clustering reflects the real structure present in the data. The main difficulty is the choice of the parameter ε, which controls how large the difference in density needs to be to separate two clusters.
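The cluster-growing procedure described above can be sketched directly. This is a deliberately unoptimised illustration under assumed names (the function `dbscan` and the O(n²) distance matrix are choices of this sketch, suitable only for small data sets):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point, -1 marks noise."""
    n = len(X)
    labels = np.full(n, -1)                 # -1 = noise / not yet clustered
    visited = np.zeros(n, dtype=bool)
    # Full pairwise distance matrix; fine for small illustrative data
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbours = list(np.flatnonzero(dist[i] <= eps))
        if len(neighbours) < min_pts:
            continue                        # not a core point; stays noise
        # Grow a new cluster outwards from this core point
        labels[i] = cluster
        queue = neighbours
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # density-reachable from the core
            if not visited[j]:
                visited[j] = True
                j_neigh = np.flatnonzero(dist[j] <= eps)
                if len(j_neigh) >= min_pts: # j is itself a core point: expand
                    queue.extend(j_neigh)
        cluster += 1
    return labels
```

On two dense blobs plus one isolated point, the blobs come out as two clusters and the isolated point is labelled -1, i.e. noise, without k ever being specified.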

Investigating structural mechanisms with customised Natural Moves

This is a brief overview of my current work on a protocol for studying molecular mechanisms in biomolecules. It is based on Natural Move Monte Carlo (NMMC), a technique pioneered by one of my supervisors, Peter Minary.

NMMC allows the user to decompose a protein or nucleic acid structure into segments of atoms/residues. Each segment is moved collectively and treated as a single body. This gives rise to a sampling strategy that considers the structure as a fluid of segments and probes the different arrangements in a Monte Carlo fashion.
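The real NMMC software samples biomolecular degrees of freedom under a proper force field; the toy sketch below only illustrates the core idea of moving whole segments as single rigid bodies and accepting moves with the Metropolis criterion. The two-segment system, the harmonic "energy" and the move sizes are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy system: two rigid "segments", each a small group of 2-D points.
# In the real protocol the segments come from decomposing a biomolecule
# and the energy is a molecular force field; both are stand-ins here.
segments = [rng.normal(0.0, 1.0, size=(4, 2)),
            rng.normal(5.0, 1.0, size=(4, 2))]

def energy(segs):
    # Invented energy: harmonic restraint pulling the segment centroids together
    c0, c1 = segs[0].mean(axis=0), segs[1].mean(axis=0)
    return 0.5 * np.sum((c0 - c1) ** 2)

def propose(segs):
    # Move one whole segment as a single body: small random rotation + shift
    new = [s.copy() for s in segs]
    i = rng.integers(len(new))
    theta = rng.normal(0.0, 0.2)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    centroid = new[i].mean(axis=0)
    new[i] = (new[i] - centroid) @ rot.T + centroid + rng.normal(0.0, 0.3, 2)
    return new

kT = 1.0
E0 = energy(segments)
E = E0
for step in range(2000):
    trial = propose(segments)
    E_trial = energy(trial)
    # Metropolis criterion: always accept downhill moves, sometimes uphill
    if E_trial < E or rng.random() < np.exp(-(E_trial - E) / kT):
        segments, E = trial, E_trial
```

Changing the decomposition, e.g. merging the two groups into one segment or splitting them further, changes the move set while leaving the acceptance rule untouched, which is what makes it cheap to compare different decompositions.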

Traditionally, the initial decomposition would be the only one that is sampled. However, this decomposition may not always be optimal: critical degrees of freedom might have been missed out or chosen sub-optimally. Additionally, if we want to test the causality of a structural mechanism, it can be informative to perform NMMC simulations for a variety of decompositions. Here I show an example of how customised Natural Moves may be applied to a DNA system.


Investigating the effect of epigenetic marks on the structure of a DNA toy model


Figure 1. Three levels at which epigenetic activity takes place. Figure adapted from Charles C. Matouk and Philip A. Marsden, Circulation Research 2008;102:873-887.

Epigenetic marks on DNA nucleotides are involved in regulating gene expression (point 1 in Figure 1). We have a limited understanding of the underlying molecular mechanism of this process. Two mechanisms are thought to be involved: 1) direct recognition of the epigenetic mark by DNA-binding proteins, and 2) indirect recognition of changes in the local DNA structure caused by the epigenetic mark. Using customised Natural Moves, we are currently trying to gain insight into these mechanisms.

One type of epigenetic mark is the 5-hydroxymethylation (5hm) on cytosines. Lercher et al. have recently solved a crystal structure of the Drew-Dickerson Dodecamer with two of these epigenetic marks attached. Given the right sequence context (e.g. CpG) they have found that this epigenetic mark can form a hydrogen bond between two neighbouring bases on the same strand. This raises the question: Can an intra-strand CpG hydrogen bond alter the DNA helical parameters? We used this as a toy model to test our technology.

Figure 2a

Figure 2b

Figure 2a shows the Drew-Dickerson Dodecamer schematically. Figure 2b shows the three sets of degrees of freedom that we used as test cases.

Top (11): Default sampling – Serves as a reference simulation.
Middle (01): Fixed 5hm torsions – By forcing the hydroxyl group of 5hmC towards the guanine we significantly increase the chance of the hydrogen bond forming.
Bottom (00): Collective movement of the neighbouring bases 5hmC and G – By grouping the two bases into a segment we aim to emulate the dampening effect that a hydrogen bond may have on their movements relative to each other.

By modifying these degrees of freedom (1=active/ 0=inactive) we attempted to amplify any effects that the CpG hydrogen bond may have on the DNA structure.

Below you can see animations of the three test cases 11, 01, 00 during simulation, zoomed in on the 5hm modification and the neighbouring base pair:



It appeared that the default degrees of freedom were not sufficient to detect a change in the DNA structure when comparing simulations of unmodified and modified structures. The other two test cases with customised Natural Moves, however, showed alterations in some of the helical parameters.

Lercher et al. saw no differences between their modified and unmodified crystal structures. It seems that single 5hm epigenetic marks are not sufficient to significantly alter DNA structure. Rather, clusters of these modifications with accumulated structural effects may be required to cause significant changes in the DNA helical parameters. CpG islands may be promising candidates.

@samdemharter

SAS-5 assists in building centrioles of the nematode worm Caenorhabditis elegans

We have recently published a paper in eLife describing the structural basis for the role of protein SAS-5 in initiating the formation of a new centriole, called a daughter centriole. But why do we care and why is this discovery important?

We humans, like all multicellular organisms, are in constant need of new cells: to grow from an early embryo into an adult, and to replace dead or damaged cells. Cells do not appear from nowhere but arise through a tightly controlled process called the cell cycle. At the core of the cell cycle lies the segregation of duplicated genetic material into two daughter cells. Pairs of chromosomes need to be pulled apart many millions of times a day, and errors can lead to cancer. To avoid this apocalyptic scenario, evolution supplied us with centrioles. These large molecular machines sprout microtubules radially to form characteristic asters, which then bind to individual chromosomes and pull them apart. To achieve continuity, centrioles duplicate once per cell cycle.

Like many large macromolecular assemblies, centrioles exhibit symmetry. A few unique proteins, present in multiple copies, build this gigantic cylindrical molecular structure: 250 nm wide and 500 nm long (the size of a centriole in humans). The very core of the centriole looks like a 9-fold symmetric stack of cartwheels, at whose periphery microtubules are installed vertically. We study the protein composition of this fascinating structure in an effort to understand the process of assembling a new centriole.

Molecular architecture of centrioles.

SAS-5 is an indispensable component of C. elegans centriole biogenesis. SAS-5 physically associates with another centriolar protein, SAS-6, forming a complex which is required to build new centrioles. This process is regulated by phosphorylation events, allowing for the subsequent recruitment of SAS-4 and microtubules. In most other systems SAS-6 forms a cartwheel (a central tube in C. elegans), which forms the basis for the 9-fold symmetry of centrioles. Unlike SAS-6, SAS-5 exhibits strong spatial dynamics, shuttling between the cytoplasm and centrioles throughout the cell cycle. Although SAS-5 is an essential protein, depletion of which completely terminates centrosome-dependent cell division, its exact mechanistic role in this process remains obscure.

IN BRIEF: WHAT WE DID
Using X-ray crystallography and a range of biophysical techniques, we have determined the molecular architecture of SAS-5. We show that SAS-5 forms a complex oligomeric structure, mediated by two self-associating domains: a trimeric coiled coil and a novel globular dimeric Implico domain. Disruption of either domain leads to centriole duplication failure in worm embryos, indicating that large SAS-5 assemblies are necessary for function. We propose that SAS-5 provides multivalent attachment sites that are critical for promoting assembly of SAS-6 into a cartwheel, and thus centriole formation.

For details, check out our latest paper 10.7554/eLife.07410!

@kbrogala

Top panel: a cartoon overview of the proposed mechanism of centriole formation. In the cytoplasm, SAS-5 exists at low concentration as a dimer, and each of those dimers can stochastically bind two molecules of SAS-6. Once the SAS-5 / SAS-6 complex is targeted to the centrioles, it starts to self-oligomerise. This self-oligomerisation of SAS-5 allows the attached molecules of SAS-6 to form a cartwheel. Bottom panel: a detailed overview of the proposed process of centriole formation. In the cytoplasm, where the concentration of SAS-5 is low, the strong Implico domain of SAS-5 (SAS-5 Imp, ZZ shape) holds the molecule in a dimeric form. Each SAS-5 protomer can bind (through the disordered linker) to the coiled coil of dimeric SAS-6. Once the SAS-5 / SAS-6 complex is targeted to the site where a daughter centriole is to be created, SAS-5 forms higher-order oligomers through self-oligomerisation of its coiled-coil domain (SAS-5 CC, triple horizontal bar). Such a large oligomer of SAS-5 provides multiple attachment sites for SAS-6 dimers in a very confined space. This results in a burst of local concentration of SAS-6 through the avidity effect, allowing an otherwise weak oligomer of SAS-6 to also form larger species. Effectively, this seeds the growth of a cartwheel (or a spiral in C. elegans), which in turn serves as a template for a new centriole.