
A new method to improve network topological similarity search: applied to fold recognition

Last week I discussed the recent paper by Lhota et al. proposing a network-based similarity search method, applied to the problem of protein fold prediction. Similarity search is a foundation of bioinformatics: it plays a key role in establishing structural, functional and evolutionary relationships between biological sequences. Although the power of similarity search methods has increased steadily in recent years, a high percentage of sequences in the protein universe remain uncharacterized. Cumulative evidence suggests that the protein universe is continuous, so conventional sequence homology search methods may not be able to detect novel structural, functional and evolutionary relationships between proteins from weak and noisy sequence signals. To overcome these limitations, the authors propose a new algorithmic framework, Enrichment of Network Topological Similarity (ENTS). While the framework is general in scope, the paper focuses exclusively on the protein fold recognition problem.

Fig 1: ENTS pipeline for protein fold prediction.

To initialize ENTS for structure prediction, ENTS builds a structural similarity graph of protein domains (Fig 1). The structural similarity graph is a weighted graph with one node for each structural domain and an edge between two nodes only if their pairwise similarity exceeds a certain threshold; here, the structural similarity score is computed with TM-align, using a threshold of 0.4. Next, some or all of the structural domains in the database are labeled with their SCOP classification. Given a query domain sequence whose structure is to be predicted, ENTS first links the query to all nodes in the structural similarity graph, weighting these new edges only by the sequence profile-profile similarity computed with HHsearch. A random walk with restart (RWR) is then used to perform a probabilistic traversal of the graph across all paths leading away from the query, where the probability of choosing an edge is proportional to its weight. The algorithm outputs a ranked list of probabilities of reaching each node in the structural graph from the query, thus potentially uncovering relationships missed by pairwise comparison methods. ENTS also uses an enrichment analysis step to assess the reliability of the detected relationships, by comparing the mean relationship strength between the query and a SCOP cluster in the structural graph to that of random clusters.
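Random walk with restart is simple to state concretely. The sketch below is a minimal illustration, not the authors' implementation; the restart probability of 0.5 is a hypothetical value. It computes steady-state visit probabilities on a weighted adjacency matrix that already includes the query node:

```python
import numpy as np

def random_walk_with_restart(W, query_idx, restart_prob=0.5, tol=1e-8):
    """Minimal RWR sketch: W is a non-negative weighted adjacency
    matrix including the query node; restart_prob is a hypothetical
    value, not taken from the paper."""
    n = W.shape[0]
    # Column-normalise so each column is a probability distribution
    # over outgoing edges (probability proportional to edge weight).
    col_sums = W.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero
    P = W / col_sums
    # Restart vector: all probability mass on the query node.
    r = np.zeros(n)
    r[query_idx] = 1.0
    p = r.copy()
    while True:                            # power iteration to convergence
        p_next = (1 - restart_prob) * P @ p + restart_prob * r
        if np.abs(p_next - p).sum() < tol:
            return p_next                  # steady-state visit probabilities
        p = p_next
```

Sorting the non-query entries of the returned vector gives the ranked list of candidate structural neighbours described above.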

For testing the method, the authors first constructed a structural graph using 36,003 non-redundant protein domains from the PDB. The query benchmark set consisted of 885 SCOP domains, constructed by randomly selecting domains from folds spanning at least two super-families. An additional step before prediction on the query set was to remove from the structural graph all domains in the same super-family as the query. The method was compared to existing methods such as CNFpred and HHsearch, and its network approach and enrichment analysis step were both found to contribute significantly to the accuracy of fold prediction. While the method seems to be an improvement on existing methods and is a novel use of network-based approaches for fold prediction, the false positive rate is still very high. One way of overcoming this, suggested by the authors, is to use energy-based scoring functions to further prune the list of potential hits returned by the method.

A topology-based distance measure for network data

In last week’s group meeting, I introduced our network comparison method (Netdis) and presented some new results that enable the method to be applied to larger networks.

The most tractable methods for network comparison are those that compare entire networks using statistics describing global properties, but such statistics are not sensitive enough to reconstruct phylogeny or shed light on evolutionary processes. In contrast, there are several network alignment-based methods that compare networks using the properties of individual proteins (nodes), e.g. local network similarity and/or protein functional or sequence similarity. The aim of these methods is to identify matching proteins/nodes between networks and use these to identify exact or close sub-network matches. These methods are usually computationally intensive and tend to yield an alignment containing only a relatively small proportion of the network, although more recent methods have alleviated this to some extent.

Thus, we do not follow the network alignment paradigm, but instead we take our lead from alignment-free sequence comparison methods that have been used to identify evolutionary relationships. Alignment-free methods based on k-tuple counts (also called k-grams or k-words) have been applied to construct trees from sequence data. A key feature is the standardisation of the counts to separate the signal from the background noise. Inspired by alignment-free sequence comparison we use subgraph counts instead of sequence homology or functional one-to-one matches to compare networks. Our proposed method, Netdis, compares the subgraph content not of the networks themselves but instead of the ensemble of all protein neighbourhoods (ego-networks) in each network, through an averaging many-to-many approach. The comparison between these ensembles is summarised in a Netdis value, which in turn is used as input for phylogenetic tree reconstruction.
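As a toy illustration of the ego-network idea (this is not the Netdis statistic itself, which standardises subgraph counts against a background model), the following sketch uses networkx to extract each node's ego-network and count one small subgraph, the triangle, in each:

```python
import networkx as nx

def ego_subgraph_counts(G, radius=1):
    """For each node, extract its ego-network (the node, its
    neighbours within `radius`, and the edges among them) and
    count triangles in it. A toy stand-in for the full spectrum
    of small-subgraph counts that Netdis uses."""
    counts = {}
    for node in G.nodes():
        ego = nx.ego_graph(G, node, radius=radius)
        # nx.triangles counts triangles per node; each triangle is
        # seen at three nodes, hence the division by 3.
        counts[node] = sum(nx.triangles(ego).values()) // 3
    return counts
```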

Fig 1: Effect of sub-sampling egos on the resulting grouping of networks generated by Netdis. Higher Rand index values indicate better fit to non-sampling results.

Extensive tests on simulated and empirical data-sets show that it is not necessary to analyze all possible ego-networks within a network for Netdis to work. Our results indicate that, in general, randomly sampling around 10% of egos in each network results on average in a very similar clustering of networks to the tree obtained with 100% sampling (Fig 1). This result has important implications for use-cases where extremely large graphs are to be compared (e.g. > 100,000 nodes). Related to the ego-network sub-sampling idea is the notion of size-limiting the ego-networks analyzed by the algorithm. Our tests show that the vast majority of ego-networks in most networks have a relatively low coverage of the overall network. Moreover, by introducing a lower size threshold on the egos, we observe better results on average. Together, this means only a limited range of ego-network sizes needs to be analyzed for each network, which should lead to better statistical properties, as the sub-sampling scheme is inspired by bootstrapping. A minimal sketch of such a sampling scheme is given below.
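The sampling itself is straightforward; this sketch uses illustrative parameter values (min_size=5 is hypothetical), not those used in our experiments:

```python
import random
import networkx as nx

def sample_egos(G, fraction=0.1, min_size=5, seed=0):
    """Randomly sample a fraction of nodes and keep only those egos
    meeting a lower size threshold. The 10% fraction matches the text
    above; min_size=5 is an illustrative value."""
    rng = random.Random(seed)
    k = max(1, int(fraction * G.number_of_nodes()))
    nodes = rng.sample(list(G.nodes()), k)
    egos = [nx.ego_graph(G, n) for n in nodes]
    return [e for e in egos if e.number_of_nodes() >= min_size]
```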

ECCB 2014 (Strasbourg)

The European Conference on Computational Biology was held in Strasbourg, France this year. The conference was attended by several members of OPIG including Charlotte Deane, Waqar Ali, Eleanor Law, Jaroslaw Nowak and Cristian Regep. Waqar gave a talk on his paper proposing a new distance measure for network comparison (Netdis). There were many interesting talks and posters at the conference, and brief summaries of the ones we found most relevant are given below.

The impact of incomplete knowledge on the evaluation of protein function prediction: a structured output learning perspective.

Authors: Yuxiang Jiang, Wyatt T. Clark, Iddo Friedberg and Predrag Radivojac

Chosen by: Waqar Ali

The automated functional annotation of biological macromolecules is the problem of computationally assigning biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as the Gene Ontology (GO). One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether, and to what degree, currently available data can reliably be used to train computational models and estimate their accuracy.

In the paper, the authors studied the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, they provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates for different types of methods. They also analyzed real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. They find a complex interplay between the prediction algorithm, the performance metric and the underlying ontology. However, using the available experimental data and under realistic assumptions, their results also suggest that current large-scale evaluations are meaningful and surprisingly reliable. The choice of function prediction methods evaluated by the authors is not exhaustive, however, and it is quite possible that other methods are much more sensitive to incomplete annotations.

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA.

Authors: Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos & Ewan Birney

Chosen by: Jaroslaw Nowak

This was one of the keynote talks. One of the authors, Ewan Birney, discussed the viability of storing digital information in DNA. The paper describes storing 739 kilobytes of hard-disk data in genetic code. The data stored included all 154 of Shakespeare's sonnets in ASCII text, a scientific paper in PDF format, a medium-resolution colour photograph of the European Bioinformatics Institute in JPEG format, a 26-second excerpt from Martin Luther King's 1963 'I have a dream' speech in MP3 format, and the Huffman code used in the study to convert bytes to base-3 digits (ASCII text). The authors accomplished this by first converting the binary data into base-3 numbers using Huffman coding (which represents the most common pieces of information using the fewest digits). The base-3 numbers were then converted into genetic code in such a way that no homopolymers (runs of the same nucleotide) were produced. The authors also showed that they could read back the encoded information and reconstruct the data with 100% accuracy.
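The homopolymer-avoiding step is worth spelling out: each base-3 digit selects the next nucleotide from the three bases that differ from the previous one, so no base can ever repeat. A minimal sketch of this idea (the rotation table below is illustrative; the paper's exact table may differ):

```python
# For each previous nucleotide, the three bases available for the
# next position (all different from the previous one), indexed by
# the trit value 0, 1 or 2. The ordering here is illustrative.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}

def trits_to_dna(trits, start="A"):
    """Encode a sequence of base-3 digits as DNA with no homopolymers:
    the next base is always chosen from the three bases differing from
    the previous one, so no base ever repeats."""
    prev, out = start, []
    for t in trits:
        base = NEXT_BASE[prev][t]
        out.append(base)
        prev = base
    return "".join(out)

# Even a run of identical trits yields no repeated bases.
print(trits_to_dna([0, 0, 0, 1, 2]))  # -> "CGTCA"
```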

Using DNA for information storage could very soon become cost-effective for sub-50-year archiving. The current costs of the process are $12,400/MB for writing and $220/MB for reading, with negligible copying costs. Nevertheless, if current trends persist, we could see a 100-fold drop in costs in less than a decade. The main advantages of storing information in DNA are the low maintenance and durability of the medium (intact DNA fragments have been recovered from samples tens of thousands of years old), as well as the very little physical space required to store the information (~2.2 PB/g).


PconsFold: Improved contact predictions improve protein models.

Authors: Michel M, Hayat S, Skwark MJ, Sander C, Marks DS and Elofsson A.

Chosen by: Eleanor Law

De novo structure prediction by fragment assembly is a very difficult task, but can be aided by contact prediction in cases where there is plenty of sequence data. Contact prediction has also significantly improved recently, using statistical methods to separate direct from indirect contact information.

PSICOV and plmDCA are two such methods, providing contacts that can be used by software such as Rosetta as an additional energy term. PconsFold combines 16 different sets of contact predictions from these programs, built from different sequence alignments, with secondary structure and solvent accessibility predictions. The output of the deep learning process on these inputs is more reliable than the individual contact predictions alone, and produces more accurate models. The authors found that using only 2,000 decoys rather than 20,000 did not greatly harm their results, which is encouraging as decoy generation is the most resource-intensive stage. Balancing the Rosetta energy function against the weight of the contact predictions, the optimal number of constraints to include was around 2 per residue, compared to 0.5 per residue for PSICOV or plmDCA alone.

The PconsFold pipeline is not always able to make full use of the contact predictions, as the predicted contacts can agree with the true structure better than with the final model. This is a case where the conformational search is not effective enough to reach the correct answer, even though the correct answer would be scored well if it were obtained. All-beta proteins are the most difficult to predict, but PconsFold compares favourably to EVfold-PLM in each of the mainly-alpha, mainly-beta, and alpha & beta classes.


Modeling of Protein Interaction Networks

In the group meeting on the 20th of August, I presented the paper by Vazquez et al. (2002). This was one of the first papers proposing the duplication-divergence model of evolution for protein interaction networks, and it has thus had a significant impact on the field, inspiring many variants of the basic model. The paper starts out by noting that the yeast protein network has the 'small world' property: following links in the network, it takes only a handful of steps to go from any one protein to any other. Another property is the manner in which links are shared out among the various proteins: empirically, the probability that a protein interacts with k other proteins follows a power-law distribution.

Vazquez et al. show how evolution can produce scale-free networks. They explore a model for the evolution of protein networks that accurately reproduces the topological features seen in the yeast S. cerevisiae. As the authors point out, proteins fall into families according to similarities in their amino-acid sequences and functions, and it is natural to suppose that such proteins have all evolved from a common ancestor. A favoured hypothesis views such evolution as taking place through a sequence of gene duplications, a relatively frequent occurrence during cell reproduction. Following each duplication, the two resulting genes are identical for the moment, and naturally lead to the production of identical proteins. But between duplications, random genetic mutations lead to a slow divergence of the genes and their associated proteins. This repetitive, two-stage process can be captured in a relatively simple model for the growth of a protein interaction network, sketched below.
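A minimal simulation of this process might look as follows; this is a sketch of one common parameterisation of the duplication-divergence model (conventions for p and q vary between papers), not the authors' own code:

```python
import random
import networkx as nx

def duplication_divergence(n_final, p=0.1, q=0.7, seed=0):
    """Grow a network by duplication and divergence. In this
    parameterisation, p is the probability of linking the duplicate
    to its parent and q the probability that one of each pair of
    redundant links is lost. p=0.1, q=0.7 are the values quoted in
    the figure caption below."""
    rng = random.Random(seed)
    G = nx.Graph([(0, 1)])                     # seed network: one edge
    while G.number_of_nodes() < n_final:
        parent = rng.choice(list(G.nodes()))
        new = G.number_of_nodes()
        neighbours = list(G.neighbors(parent))
        G.add_node(new)
        for j in neighbours:                   # duplication step
            G.add_edge(new, j)
        if rng.random() < p:                   # link between duplicates
            G.add_edge(new, parent)
        for j in neighbours:                   # divergence step
            if rng.random() < q:
                # remove one of the two redundant links at random
                G.remove_edge(rng.choice([parent, new]), j)
    return G
```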

Simulations by the authors show that the model, starting from a small seed, captures the degree distribution of the yeast network with high fidelity (Fig 1) and also reproduces the high tolerance to random node removal seen in biological networks. While the results are more qualitative in nature, the model, with minor improvements, still serves as the basis of most biologically rooted explanations of protein network evolution. Some of these additions have been the use of asymmetric divergence, whole-genome duplication events, as well as interaction site modelling. As the jury is still out on which model (if any) best fits current interaction data, the basic model remains relevant as a benchmark for newer models.

Fig. 1. Zipf plot for the PIN and the DD model with p = 0.1, q = 0.7 and N = 1,825. k is the connectivity of a node and r is its rank in decreasing order of k. Error bars represent standard deviation on a single network realization.

Efficient discovery of overlapping communities in massive networks

Detecting overlapping communities is essential to analyzing and exploring natural networks such as social networks, biological networks, and citation networks. However, most existing approaches do not scale to the size of networks that we regularly observe in the real world. In the paper by Gopalan et al. discussed in this week’s group meeting, a scalable approach to community detection is developed that discovers overlapping communities in massive real-world networks. The approach is based on a Bayesian model of networks that allows nodes to participate in multiple communities, and a corresponding algorithm that naturally interleaves subsampling from the network and updating an estimate of its communities.

The model assumes there are K communities and that each node i is associated with a vector of community memberships \theta_i. To generate a network, the model considers each pair of nodes. For each pair (i,j), it chooses a community indicator z_{i \rightarrow j} from the i^{th} node’s community memberships \theta_i and then chooses a community indicator z_{i \leftarrow j} from \theta_j. A connection is then drawn between the nodes with probability \beta if the indicators point to the same community or with a smaller probability \epsilon if they do not.
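As a toy sketch of this generative process (the values for \beta and \epsilon are illustrative, and the dense loop over all pairs would not be used at scale):

```python
import numpy as np

def generate_network(theta, beta=0.3, eps=1e-3, seed=0):
    """Toy sketch of the generative process: theta is an (N, K) matrix
    of per-node community memberships (rows sum to 1). beta and eps
    are illustrative values. Returns a symmetric adjacency matrix."""
    rng = np.random.default_rng(seed)
    n, k = theta.shape
    y = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            z_ij = rng.choice(k, p=theta[i])   # indicator drawn from theta_i
            z_ji = rng.choice(k, p=theta[j])   # indicator drawn from theta_j
            prob = beta if z_ij == z_ji else eps
            y[i, j] = y[j, i] = rng.random() < prob
    return y
```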

This model defines a joint probability p(\theta,\beta,z,y), where y is the observed data. To estimate the posterior p(\theta,\beta,z | y), the method uses a stochastic variational inference algorithm. This enables posterior estimation using only a sample of all possible node pairs at each step of the variational inference, making the method applicable to very large graphs (e.g. analyzing the large citation network of physics papers shown in the figure below identifies important papers impacting several sub-disciplines).

Community detection in a physics citation network.
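Stripped of model-specific detail, the stochastic variational idea is to replace a full pass over all node pairs with a scaled, noisy update computed from a random subsample. A generic sketch follows (the function names and the suff_stat callback are hypothetical, not the paper's API):

```python
def svi_step(lam, pairs_sample, n_total_pairs, suff_stat, t, prior=1.0, kappa=0.7):
    """One generic stochastic variational update (model details omitted):
    sufficient statistics from a subsample of pairs are scaled up as if
    the whole network had been seen, then blended into the global
    variational parameter lam with a decreasing step size. kappa in
    (0.5, 1] gives a valid step-size schedule."""
    scale = n_total_pairs / len(pairs_sample)
    stats = sum(suff_stat(i, j) for i, j in pairs_sample)
    lam_hat = prior + scale * stats            # noisy full-data estimate
    rho = (t + 1) ** (-kappa)                  # step size, decays with t
    return (1 - rho) * lam + rho * lam_hat
```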

One limitation of the method is that it does not incorporate automatic estimation of the number of communities, which is a general problem with clustering algorithms. Still, enabling sophisticated probabilistic analysis of structure in massive graphs is a significant step forward.