Efficient discovery of overlapping communities in massive networks

Detecting overlapping communities is essential to analyzing and exploring natural networks such as social networks, biological networks, and citation networks. However, most existing approaches do not scale to the size of networks that we regularly observe in the real world. In the paper by Gopalan et al. discussed in this week’s group meeting, a scalable approach to community detection is developed that discovers overlapping communities in massive real-world networks. The approach is based on a Bayesian model of networks that allows nodes to participate in multiple communities, and a corresponding algorithm that naturally interleaves subsampling from the network and updating an estimate of its communities.

The model assumes there are $latex K$ communities and that each node $latex i$ is associated with a vector of community memberships $latex \theta_i$. To generate a network, the model considers each pair of nodes. For each pair $latex (i,j)$, it chooses a community indicator $latex z_{i \rightarrow j}$ from the $latex i^{th}$ node’s community memberships $latex \theta_i$ and then chooses a community indicator $latex z_{i \leftarrow j}$ from $latex \theta_j$. A connection is then drawn between the nodes with probability $latex \beta$ if the indicators point to the same community or with a smaller probability $latex \epsilon$ if they do not.

This model defines a joint probability $latex p(\theta,\beta,z,y)$ where $latex y$ is the observed data. To estimate the posterior $latex p(\theta,beta,z | y)$, the method uses a stochastic variational inference algorithm. This enables posterior estimation using only a sample of all-possible node pairs at each step of the variational inference, making the method applicable to very large graphs (e.g analyzing a large citation network of physics papers shown in the figure below identifies important papers impacting several sub-disciplines).

Community detection in a physics citation network.

One limitation of the method is that it does not incorporate automatic estimation of the number of communities, which is a general problem with clustering algorithms. Still, enabling sophisticated probabilistic analysis of structure in massive graphs is a significant step forward.

How many bins?

As it’s known in non-parametric kernel density estimation the effect of the bandwidth on the estimated density is large and it is usually the parameter who makes the tradeoff between bias and roughness of the estimation (Jones et.al 1996). An analogous problem for histograms is the choice of the bin length and in cases of equal bin lengths the problem can be seen as finding the number of bins to use.  A data-base methodology for building equal bin-length histograms proposed by (Knuth 2013) based on the marginal of the joint posterior of the number of bins and heights of the bins. To build the histogram first the number of bins has to be selected as the the value ($latex \hat{M}$) that maximises the following posterior distribution for the number of bins:
$latex P(M|d,I)\, \alpha \,(M/V)^N \frac{\Gamma(M/2) \prod_{k=1}^M \Gamma(n_k+1/2)}{\Gamma(1/2)^M \Gamma(N+M/2)}$

where $latex M$ is the number of bins, $latex d$ is the data, $latex I$ is prior knowledge about the problem, i.e. in particular the use of equal length bins and the range of data $latex V$, which has the relation $latex V=Mw$ where $latex w$ is the width of bins, $latex N$ is the number of data points and $latex n_k$ is the number of observations that fall in the $latex k$th bin.

Now, the height ($latex h_k$) of the bins of the histogram is given by:
$latex h_k=\frac{M}{V} \frac{n_k+1/2}{N+M/2}$.

In the case of a normal distribution the authors suggest a sample of 150 data points to “accurately and consistently estimate the shape of the distribution”.

The following figure shows the relative log-posterior of the number of bins (left) and the estimated histogram for a mixture of three normal samples and a uniform [0,50] (right).

Knuth, K. H. (2013). Optimal data-based binning for histograms. arXiv preprint physics/0605197. The first version of this paper was published on 2006.

Jones, M. C., Marron, J. S., and Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association,91(433), 401–407.

Annotate Antibody CDR and Framework Residues

Intro

Antibodies have very well conserved structures and their binding site is chiefly comprised of the six CDRs. The great similarity between the 1700+ antibody structures that can be found in SAbDab/PDB prompted the introduction of numbering schemes which act as coordinates with respect to the sequence/structural features of antibodies. The earliest such numbering scheme was introduced by Wu and Kabat, followed by the structurally informed Chothia-scheme which was eventually amended by Abhinandan and Martin. Even though there are several of those schemes, the one currently endorsed by the World Health Organization (WHO) is this of IMGT.

It annotates the framework and CDR residues according to three definitions: Kabat, Chothia or Contact. You can download it here.

Possible issues?

You need internet connection for this program to work since it calls the Abnum service, thus you should cite the following if you use this code:

Abhinandan, K.R. and Martin, A.C.R. (2008) Analysis and improvements to Kabat and structurally correct numbering of antibody variable domains Molecular Immunology, 45, 3832-3839.

How to use it?

As an example test case, type the following in the Framer directory:

python Framer.py --f 1A2Y.pdb --c AB --o my_first_output --d chothia

This should get the the heavy and light chains of 1A2Y (A and B) and leave the output in a folder called my_first_outupt.

Options:

–f: Antibody file
–c: Antibody chains (you can submit just one or several)
–o: Output folder name – NB this is going to be created in the directory you call Framer from!
–d: CDR definition to be used, possible options are: chothia, kabat and contact.

Output files:

There are four output files:

red_blue.pdb: The pdb with b-factor colored CDRs. The CDRs have B-factor of 100.00 and the framework 0.00.
paratope.txt: The CDR residues, given in the format [id][whitespace][chain]
framework.txt: The Framework residues, given in the format [id][whitespace][chain]
full_info.txt: Full breakdown of the annotation given in the format:

Original ID Original Chain AA Chothia ID CDR(FR=frame,or CDR id)

Journal club: Half a century of Ramachandran plots

In last week’s journal club we delved into the history of Ramachandran plots (Half a century of Ramachandran plots; Carugo & Djinovic-Carugo, 2013).

Polypeptide backbone dihedral angles. Source: Wikimedia Commons, Bensaccount

50 years ago Gopalasamudram Narayana Ramachandran et al. predicted the theoretically possible conformations of a polypeptide backbone. The backbone confirmations can be described using three dihedral angles: ω, φ and ψ (shown to the right).

The first angle, ω, is restrained to either about 0° (cis) or about 180° (trans) due to the partial double bond character of the C-N bond. The φ and ψ angles are more interesting, and the Ramachandran plot of a protein is obtained by plotting φ/ψ angles of all residues in a scatter plot.

The original Ramachandran plot showed the allowed conformations of the model compound N-acetyl-L-alanine-methylamide using a hard-sphere atomic model to keep calculations simple. By using two different van der Waals radii for each element positions on the Ramachandran plot could be classified into either allowed regions, regions with moderate clashes and disallowed regions (see Figure 3 (a) in the paper).

The model compound does not take side chains into account, but it does assume that there is a side chain. The resulting Ramachandran plot therefore does not describe the possible φ/ψ angles for Glycine residues, where many more conformations are plausible. On the other end of the spectrum are Proline residues. These have a much more restricted range of possible φ/ψ angles. The φ/ψ distributions of GLY and PRO residues are therefore best described in their own Ramachandran plots (Figure 4 in the paper).

Over time the Ramachandran plot was improved in a number of ways. Instead of relying on theoretical calculations using a model compound, we can now rely on experimental observations by using high quality, hand picked data from the PDB. The way how the Ramachandran plot is calculated has also changed: It can now be seen as a two-dimensional, continuous probability distribution, and can be estimated using a full range of smoothing functions, kernel functions, Fourier series and other models.
The modern Ramachandran plot is much more resolved than the original plot. We now distinguish between a number of well-defined, different regions which correlate with secondary protein structure motifs.

Ramachandran plots are routinely used for structure validation. The inherent circular argument (A good structure does not violate the Ramachandran plot; The plot is obtained by looking at the dihedral angles of good structures) sounds more daring than it actually is. The plot has changed over time, so it is not as self-reinforcing as one might fear. The Ramachandran plot is also not the ultimate guideline. If a new structure is found that claims to violate the Ramachandran plot (which is based on a huge body of cumulative evidence), then this claim needs to be backed up by very good evidence. A low number of violations of the plot can usually be justified. The Ramachandran plot is a local measure. It therefore does not take into account that domains of a protein can exert a force on a few residues and just ‘crunch’ it into an unusual conformation.

The paper closes with a discussion of possible future applications and extensions, such as the distribution of a protein average φ/ψ and an appreciation of modern web-based software and databases that make use of or provide insightful analyses of Ramachandran plots.

Journal club (Bernhard Knapp): NetMHCIIpan-3.0

This week’s topic of the Journalclub was about the prediction of potential T cell epitopes (=prediction of the binding affinity between peptides and MHC). In this context I presented the latest paper in the NetMHCpan series:

Karosiene E, Rasmussen M, Blicher T, Lund O, Buus S, Nielsen M. NetMHCIIpan-3.0, a common pan-specific MHC class II prediction method including all three human MHC class II isotypes, HLA-DR, HLA-DP and HLA-DQ. Immunogenetics. 2013 Oct;65(10):711-24

The reliable prediction of the peptide/MHC binding affinity is already a scientific aim for several years: Early approaches were based on the concept of anchor residues i.e. those peptide side-chain which are pointing in the direction of the MHC. The next “generation” of methods was matrix based i.e. it was simply counted how likely it is that a specific residue is at peptide position 1, 2, 3, … for binding and non-binding peptides of a specific MHC allele. In the next step methods started to incorporate higher order approaches i.e. positions within the peptide influence each other (e.g. SVM, ANN, …). While such methods are already quite reliable their major limitation is that they can only be trained on the basis of sufficient experimental binding data per allele. Therefore a prediction for alleles with only few experimental peptide binding affinities is not possible or rather poor in quality. This is especially import because there are several thousand HLA alleles: The latest version of the IPD – IMGT/HLA lists for example 1740 HLA Class I A and 2329 HLA Class I B proteins (http://www.ebi.ac.uk/ipd/imgt/hla/stats.html). Some of them are very frequent and others are not but this the aim would be a 100% coverage.

Pan specific approaches go one step further and allow the binding prediction between peptide and an arbitrary MHC allele for which the sequence is known. The NetMHCpan approach is one of them and is based on the extraction of a pseudo sequence (those MHC residues which are in close proximity to the peptide) per MHC allele. Subsequently this MHC pseudo sequence as well as the peptide sequences are encoded using some algorithm e.g. a BLOSUM matrix. Finally the encoded sequences and the experimental binding affinities are fed into an Artificial Neural Network (ANN). For an illustration of the workflow see:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2981877/bin/1745-7580-6-S2-S3-2.jpg

This approach turned out to be quite successful and (not too much) depending on the allele the area under ROC reaches in average a healthy 0.8x number which is already quite competitive with experimental approaches.

The netMHCpan series has progressed over the last years starting to cover MHC class I in humans (2007) and beyond (2009). Then MHC class II DR (2010) and in the latest version also DP and DQ (2013). With the next paper expected to be about non human MHC class II alleles the prediction of the binding affinity between peptides and pretty much every MHC should be possible.

AH^AH – Alpha Helices Assessed by Humans.

We have created a game-like web-application to classify protein helices. If you participate it will take you about 15 minutes and afterwards we will show you the relation between your results and all other participants. You do not need any specific training to participate – your natural human puzzle solving skills are sufficient. Try to beat the others!

We are interested in kinked α-helices in proteins. Kinks are important for protein function, and understanding them will help to treat and design therapies for diseases. One hindrance to our work is that helix kinks are not well defined – there are many helices that are classed as kinked by some scientists, but not by others. Classifying helices as kinked or curved or straight is difficult computationally. There are lots of different way to do it, but they all give different answers. We think that humans can do it better.