Learning dynamical information from static protein and sequencing data

I would like to highlight the research of Pearce et al. (https://doi.org/10.1101/401067), whose talk I attended at ISMB 2019. The talk was titled ‘Learning dynamical information from static protein and sequencing data’. It caught my interest because my field of research is structural biology, which deals with dynamical systems such as proteins, yet the data are often static, e.g. structures from X-ray crystallography. The authors presented a general protocol to infer transition rates between the states of a dynamical system that can be represented by an energy landscape.

First, a number C of Gaussians has to be chosen to fit the data, which live in the original D-dimensional space and are reduced with principal component (PC) analysis (Figure 1A). C should be at least the number of energy minima of the system that one wants to model. Then, a Gaussian mixture model (GMM) with d dimensions (d = C - 1) has to be fit, which means that D - d dimensions are neglected. From the GMM, the energy landscape is inferred and reduced to a formulation of minimum energy paths (MEPs) (Figure 1B). To preserve the correct landscape topology, the GMM components have to be rescaled. The speaker mentioned in the talk that this is only exact for a dimensionality reduction by 1 (hence d = C - 1). The MEPs are then scaled back to the initial D dimensions, where they describe the dynamical system in a simpler way. The authors mention in their paper that this is either exact, or the result can be used as initial conditions for calculations in the original dimensions to save computational time. The final step is the calculation of mean first passage times (MFPTs) by inferring transition rates (Figure 1C).
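To make the first steps more concrete for myself, here is a minimal sketch of the PCA and GMM part on synthetic data. It is not the authors' code: the function names (`pca_project`, `fit_gmm`, `energy`) are my own, the toy data are made up, the EM fit is a bare-bones version, and I assume a Boltzmann-like relation E(x) = -log p(x) (in units of kT, up to a constant) to turn the fitted density into an energy. The rescaling of the GMM components and the MEP search from the paper are not included.

```python
import numpy as np

def pca_project(X, d):
    """Project the rows of X onto the top-d principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def log_gauss(Y, mean, cov):
    """Log density of a multivariate normal at each row of Y."""
    d = Y.shape[1]
    diff = Y - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + d * np.log(2.0 * np.pi))

def log_joint(Y, weights, means, covs):
    """log(w_k * N(y | mu_k, Sigma_k)); rows are points, columns are components."""
    return np.stack([np.log(w) + log_gauss(Y, m, S)
                     for w, m, S in zip(weights, means, covs)], axis=1)

def logsumexp_rows(A):
    """Numerically stable log(sum(exp(A))) along each row."""
    m = A.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(A - m).sum(axis=1, keepdims=True)))[:, 0]

def fit_gmm(Y, C, n_iter=200, reg=1e-6):
    """Bare-bones EM for a full-covariance GMM (a sketch, not production code)."""
    n, d = Y.shape
    # Deterministic farthest-point initialisation of the component means.
    idx = [0]
    for _ in range(C - 1):
        dist = np.min(np.linalg.norm(Y[:, None, :] - Y[idx][None, :, :], axis=2), axis=1)
        idx.append(int(np.argmax(dist)))
    means = Y[idx].copy()
    covs = np.stack([np.atleast_2d(np.cov(Y.T)) + reg * np.eye(d)] * C)
    weights = np.full(C, 1.0 / C)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        lj = log_joint(Y, weights, means, covs)
        resp = np.exp(lj - logsumexp_rows(lj)[:, None])
        # M-step: update weights, means and covariances.
        Nk = resp.sum(axis=0)
        weights = Nk / n
        means = (resp.T @ Y) / Nk[:, None]
        for k in range(C):
            diff = Y - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    return weights, means, covs

def energy(Y, weights, means, covs):
    """Assumed Boltzmann relation: E(y) = -log p(y), in units of kT."""
    return -logsumexp_rows(log_joint(Y, weights, means, covs))

# Toy data: C = 3 "states" in D = 5 dimensions (synthetic, for illustration only).
rng = np.random.default_rng(0)
C, D = 3, 5
centers = np.zeros((C, D))
centers[1, 0] = 6.0
centers[2, 1] = 6.0
X = np.concatenate([c + 0.5 * rng.standard_normal((300, D)) for c in centers])

Y = pca_project(X, d=C - 1)          # keep d = C - 1 dimensions
weights, means, covs = fit_gmm(Y, C)
E = energy(Y, weights, means, covs)  # energy estimate at every data point
print("mixture weights:", weights)
print("lowest energy found:", E.min())
```

In this picture the fitted component means play the role of the energy minima, and the MEPs would be searched for between them on the `energy` surface.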

Figure 1: Learning dynamical information from static protein and sequencing data: graphic abstract. A) Raw data, PCA. B) MEPs that describe different energy minima states and their potential transition paths. C) MFPT calculated with reduced dimensions and scaling term match the original as well as the non-dimension-reduced prediction. Figure taken from Pearce et al. 2019, bioRxiv 401067.
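The last step of the protocol, going from transition rates to MFPTs, can also be sketched in a few lines. This is only my illustration under stated assumptions: I use a simple Arrhenius-type rate k_ij = exp(-(E_saddle_ij - E_i)) with energies in units of kT, which is my simplification and not necessarily the rate expression used by Pearce et al., and the energies below are invented numbers. The function names `arrhenius_rates` and `mfpt_to_target` are hypothetical. The MFPT itself follows from the standard linear system for a continuous-time Markov chain with an absorbing target state.

```python
import numpy as np

def arrhenius_rates(E_min, E_saddle, prefactor=1.0):
    """Rate matrix K with K[i, j] = prefactor * exp(-(E_saddle[i, j] - E_min[i])).

    Energies are in units of kT. Use np.inf in E_saddle where two states are
    not directly connected; exp(-inf) then gives a rate of zero.
    """
    K = prefactor * np.exp(-(E_saddle - E_min[:, None]))
    np.fill_diagonal(K, 0.0)
    return K

def mfpt_to_target(K, target):
    """Mean first passage time from every state to `target`.

    Solves sum_j Q[i, j] * tau[j] = -1 for all i != target with tau[target] = 0,
    where Q is the generator built from K (rows sum to zero).
    """
    n = K.shape[0]
    Q = K - np.diag(K.sum(axis=1))
    keep = [i for i in range(n) if i != target]
    tau = np.zeros(n)
    tau[keep] = np.linalg.solve(Q[np.ix_(keep, keep)], -np.ones(n - 1))
    return tau

# Invented three-state chain 0 <-> 1 <-> 2 (no direct path between 0 and 2).
inf = np.inf
E_min = np.array([0.0, 1.0, 0.5])
E_saddle = np.array([[inf, 3.0, inf],
                     [3.0, inf, 4.0],
                     [inf, 4.0, inf]])

K = arrhenius_rates(E_min, E_saddle)
tau = mfpt_to_target(K, target=2)
print("MFPT to state 2 from each state:", tau)
```

For a two-state system this reduces to the familiar result that the MFPT is the inverse of the forward rate, which is a quick sanity check for the sketch.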

In my understanding, the computational cost is not saved by the dimensionality reduction (by 1) when the GMM is turned into MEPs, but by approximating D with C. A colleague mentioned to me that there are computationally efficient algorithms to estimate C. The crucial question is therefore how many data points are needed to correctly fit each Gaussian to an energy minimum. In the talk, the speaker mentioned a number on the order of 100-1000, but I’m curious to see a more thorough investigation in a future publication.

For example studies on protein folding, gene regulatory networks and viral evolution, see their paper on bioRxiv, ‘Learning dynamical information from static protein and sequencing data’. As mentioned in the beginning, a protein is a dynamic system, but most structure determination methods yield a static model. Cryogenic electron microscopy (cryo-EM) also produces a very large number of (static) images of a protein sample. However, the sample is frozen very quickly, so the individual proteins are trapped in many different states of the dynamic ensemble. The methodology of Pearce et al. could therefore be very interesting for studying protein dynamics with cryo-EM. This was also mentioned briefly during the talk, but nothing has been published yet. Keep an eye out for their next publication.
