Efficient discovery of overlapping communities in massive networks

Detecting overlapping communities is essential to analyzing and exploring natural networks such as social networks, biological networks, and citation networks. However, most existing approaches do not scale to the size of networks that we regularly observe in the real world. In the paper by Gopalan et al. discussed in this week’s group meeting, a scalable approach to community detection is developed that discovers overlapping communities in massive real-world networks. The approach is based on a Bayesian model of networks that allows nodes to participate in multiple communities, and a corresponding algorithm that naturally interleaves subsampling from the network and updating an estimate of its communities.

The model assumes there are K communities and that each node i is associated with a vector of community memberships \theta_i. To generate a network, the model considers each pair of nodes. For each pair (i,j), it chooses a community indicator z_{i \rightarrow j} from the i^{th} node’s community memberships \theta_i and then chooses a community indicator z_{i \leftarrow j} from \theta_j. A connection is then drawn between the nodes with probability \beta if the indicators point to the same community or with a smaller probability \epsilon if they do not.

This model defines a joint probability p(\theta,\beta,z,y) where y is the observed data. To estimate the posterior p(\theta,beta,z | y), the method uses a stochastic variational inference algorithm. This enables posterior estimation using only a sample of all-possible node pairs at each step of the variational inference, making the method applicable to very large graphs (e.g analyzing a large citation network of physics papers shown in the figure below identifies important papers impacting several sub-disciplines).

Community detection in a physics citation network.

Community detection in a physics citation network.

One limitation of the method is that it does not incorporate automatic estimation of the number of communities, which is a general problem with clustering algorithms. Still, enabling sophisticated probabilistic analysis of structure in massive graphs is a significant step forward.

Author