Current strategies to predict structures of multiple protein conformational states

Since the release of AlphaFold2 (AF2), protein structure prediction has been widely regarded as a solved problem. Current structure prediction tools, such as AF2, can model most proteins with high accuracy. These methods, however, have a major limitation: they have been trained to predict a single structure for a given protein. Proteins are highly dynamic molecules, and their function often depends on transitions between several conformational states. Although much research has focused on predicting the structures of multiple conformations of a protein, no accurate and reliable method is currently available. In this blog post, I will provide a short overview of the strategies developed for predicting protein conformations, which I have grouped into three sets of related approaches. To conclude, I will demonstrate how to run one of these strategies yourself.

Adaptations of AF2

AF2 architecture [1]

Many attempts have been made to optimize AF2 to predict alternative conformational states of proteins. These have mostly focused on manipulating the information AF2 uses for its prediction. The most common approach is to alter the multiple sequence alignment (MSA), which forms one of the AF2 inputs (see the figure above). This approach is motivated by findings that MSAs contain coevolutionary signals for several conformational states of a protein. If these signals could be deconvolved, AF2 should be able to predict structures of multiple conformations. The best-studied strategy for deconvolving these signals is to reduce the depth of MSAs. The default MSAs, typically several hundred sequences deep (exact numbers depend on the protein), are subsampled randomly to obtain shallow MSAs containing as few as five sequences. Shallow MSAs have been shown to increase the diversity of output models and, in many cases, generate structures similar to multiple conformations of case study proteins [2-4]. Alternative ways to manipulate MSAs have also been investigated, such as limiting the MSA to sequences with very high sequence identity to the protein of interest [5] and manually introducing point mutations of residues believed to stabilize one of the protein’s conformational states [6].
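To make the subsampling idea concrete, here is a minimal Python sketch of how a deep MSA could be randomly reduced to a shallow one before it is handed to a predictor. It assumes the MSA is stored as A3M/FASTA-style text with the query as the first record; the function names and the toy alignment are mine, not part of AF2 or of the cited studies, and AF2 pipelines normally perform their own subsampling internally.

```python
import random


def parse_a3m(text):
    """Parse A3M/FASTA-style text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsample_msa(records, depth, seed=0):
    """Keep the query (first record) plus a random sample of homologues.

    `depth` is the total number of sequences in the shallow MSA,
    including the query.
    """
    query, homologues = records[0], records[1:]
    rng = random.Random(seed)
    n = max(0, min(depth - 1, len(homologues)))
    return [query] + rng.sample(homologues, n)


def to_a3m(records):
    """Serialise records back to A3M/FASTA-style text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Toy alignment (hypothetical sequences, for illustration only).
example = ">query\nMKT\n>h1\nMRT\n>h2\nMKS\n>h3\nM-T\n"
shallow = subsample_msa(parse_a3m(example), depth=3, seed=1)
print(to_a3m(shallow))
```

Running this with several different seeds yields several different shallow MSAs, each of which can then be used for a separate prediction.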

Attempts have also been made to alter the structural databases, which form the second AF2 input. Typically, customized databases are generated in an attempt to steer AF2 toward preferentially predicting a specific conformational state for members of a given protein family [3, 7]. For example, Heo et al. [7] predicted the structures of G-protein-coupled receptors (GPCRs) by exclusively providing AF2 with template structures of GPCRs in their active state. In this way, they obtained a higher fraction of GPCRs modeled in their active state than when the default databases were used.

Improved exploration of contact and distance maps

Contact and distance maps, which protein structure prediction methods predict from MSAs, contain information about alternative protein conformations even when the methods are run in default mode (i.e., with unmodified MSAs and structural databases). Predicted inter-residue distance distributions have been shown to be bimodal for specific residue pairs that undergo conformational changes [8], and predicted contact maps reveal contacts unique to distinct conformational states [9]. Moreover, individual structure prediction methods yield slightly different contacts for the same protein due to differences in their training data [10]. Algorithms specifically designed to enhance the exploration of predicted contact and distance maps have been developed to generate structures representing various potential conformations [9-11].
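As a rough illustration of what "bimodal" means here, the following Python sketch looks for two well-separated peaks in a binned distance distribution for a single residue pair. The bin layout, the thresholds, and the example probabilities are all assumptions made for this toy case; this is not the analysis used in [8].

```python
def find_modes(probs, min_height=0.1):
    """Return indices of local maxima whose probability is >= min_height."""
    modes = []
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else float("-inf")
        right = probs[i + 1] if i < len(probs) - 1 else float("-inf")
        # Strictly higher than the left bin, so a flat plateau counts once.
        if p >= min_height and p > left and p >= right:
            modes.append(i)
    return modes


def is_bimodal(probs, min_height=0.1, min_separation=2):
    """Flag a distribution with at least two peaks that are far enough apart."""
    modes = find_modes(probs, min_height)
    return len(modes) >= 2 and modes[-1] - modes[0] >= min_separation


# Hypothetical distributions over eight distance bins (short to long).
unimodal = [0.0, 0.05, 0.2, 0.5, 0.2, 0.05, 0.0, 0.0]
bimodal = [0.05, 0.35, 0.1, 0.0, 0.0, 0.1, 0.35, 0.05]

print(find_modes(unimodal), is_bimodal(unimodal))  # [3] False
print(find_modes(bimodal), is_bimodal(bimodal))    # [1, 6] True
```

A residue pair flagged this way would be a candidate for being close in one conformation and distant in another, though peaks can also arise from prediction uncertainty.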

In their study, Hou et al. [10] detailed a method that takes as input distance maps predicted by AF2, RoseTTAFold, and other structure prediction tools. These maps are used to construct multiple energy landscapes, which are subsequently merged into a single energy landscape with competing constraints. An evolutionary algorithm then traverses this landscape, identifying several low-energy solutions that represent potential conformations.
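The general idea of searching a landscape with competing constraints can be sketched with a deliberately tiny toy problem. In the Python example below, a "conformation" is reduced to a single inter-residue distance, two hypothetical predicted maps prefer 8 Å and 15 Å respectively, and the merged energy is taken as the lower of the two restraint energies. All of these choices, including the simple mutate-and-select loop, are my assumptions for illustration; this is not the algorithm of Hou et al. [10].

```python
import random


def energy_a(d):
    """Restraint from hypothetical map A, which prefers a distance of 8 Å."""
    return (d - 8.0) ** 2


def energy_b(d):
    """Restraint from hypothetical map B, which prefers a distance of 15 Å."""
    return (d - 15.0) ** 2


def merged_energy(d):
    """Assumed merge rule: satisfying either restraint gives a low energy."""
    return min(energy_a(d), energy_b(d))


def evolve(pop_size=20, generations=300, sigma=1.0, seed=0):
    """Mutate each individual and keep the child only if its energy is lower.

    Each individual competes only with its own offspring, so the population
    can settle into several minima instead of collapsing into one.
    """
    rng = random.Random(seed)
    # Start from evenly spaced distances between 0 and 30 Å.
    population = [30.0 * i / (pop_size - 1) for i in range(pop_size)]
    for _ in range(generations):
        for i, parent in enumerate(population):
            child = parent + rng.gauss(0.0, sigma)
            if merged_energy(child) < merged_energy(parent):
                population[i] = child
    return population


solutions = evolve()
near_a = [d for d in solutions if abs(d - 8.0) < 1.0]
near_b = [d for d in solutions if abs(d - 15.0) < 1.0]
print(len(near_a), "solutions near 8 Å;", len(near_b), "solutions near 15 Å")
```

The population ends up split between the two minima, which is the toy analogue of recovering several low-energy solutions that correspond to different conformations.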

Overview of the Hou et al. method [10]

Generative models for conformation prediction

Generative models can sample distributions of outputs instead of providing a deterministic mapping from an input to a single output. In theory, such models are well-suited for conformation prediction tasks where we aim to generate multiple related output structures for a given input sequence. Methods based on diffusion models [12] and variational autoencoders [13] have been described.

For example, Jing et al. [12] developed the EigenFold method, a diffusion model trained for protein structure prediction, and investigated its potential to sample structures of multiple conformations. They found that sampled structures are generally poor models of the different protein conformations; however, the diversity of sampled structures provides an indication of the true flexibility.

Overview of the EigenFold method [12]

Limitations of current methods

Recently, many strategies to predict the structure of multiple conformational states have been investigated. As there is currently no large and widely used benchmark, it is unclear how these methods generalize to proteins other than the limited number of case studies they were tested on and which of these methods performs best. Another limitation of nearly all discussed methods is that they tend to generate a large number of structural models. Some of these resemble the actual conformations a protein adopts, but most tend to be noise. Without prior knowledge of a particular protein, it is hard to differentiate between ‘correct’ models and noise. In conclusion, most strategies focus on adapting a protein structure prediction tool for the task of conformation prediction, and there are very few methods specifically trained for the task. Manipulation of AF2 MSAs has received most of the attention, but for reasons stated above, it remains to be seen whether this is the optimal approach.

Try it yourself

AF2 with shallow MSAs is the simplest of the methods described above to run yourself. All you need to do is open ColabFold and change some of the default parameters. Using adenylate kinase, a two-state protein with apo (PDB 4AKE) and holo (PDB 2ECK) conformations, I will briefly demonstrate how to do this.

First, enter the amino acid sequence in the ‘input protein sequence(s)’ block, then go to the ‘advanced settings’ block and choose a value for the max_msa field (see figure below). This field sets two AF2 parameters in the format max_seqs:extra_seqs, which together determine how many sequences are subsampled from the MSA: max_seqs sets the number of sequences passed to the main Evoformer stack (the row/column attention track), and extra_seqs the number of additional sequences processed by the separate extra MSA stack. Optimal values depend on the protein, so you may have to experiment with these parameters to get the best output. Generally, lower values encourage more diverse predictions but increase the number of misfolded models. If necessary, you can also increase num_seeds or reduce num_recycles to produce more diverse outputs. For the adenylate kinase example, I will use max_msa: 16:32, num_seeds: 4, and num_recycles: 3.
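If you prefer a local installation to the notebook, the same settings can in principle be passed to the colabfold_batch command-line tool. The Python sketch below only assembles and prints such a command; it does not run it. The flag names are written as I recall them from colabfold_batch and should be checked against `colabfold_batch --help` for your version, and the file and directory names are hypothetical. The sketch also shows where the number of output models comes from, assuming the ColabFold default of five models per seed.

```python
import shlex


def parse_max_msa(value):
    """Split a 'max_seqs:extra_seqs' string into two integers."""
    max_seqs, extra_seqs = (int(part) for part in value.split(":"))
    return max_seqs, extra_seqs


def build_command(fasta, out_dir, max_msa="16:32", num_seeds=4, num_recycles=3):
    """Assemble a colabfold_batch call (flag names assumed, see text)."""
    max_seqs, extra_seqs = parse_max_msa(max_msa)
    return [
        "colabfold_batch",
        "--max-seq", str(max_seqs),
        "--max-extra-seq", str(extra_seqs),
        "--num-seeds", str(num_seeds),
        "--num-recycle", str(num_recycles),
        fasta,      # hypothetical input FASTA file
        out_dir,    # hypothetical output directory
    ]


def expected_models(num_seeds, num_models=5):
    """Each seed is run once for each of the model parameter sets."""
    return num_seeds * num_models


cmd = build_command("adk.fasta", "adk_shallow")
print(shlex.join(cmd))
# colabfold_batch --max-seq 16 --max-extra-seq 32 --num-seeds 4 --num-recycle 3 adk.fasta adk_shallow
print(expected_models(4))  # 20
```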

Using these parameters, 20 models of adenylate kinase are produced. Some of the models are more similar to the apo structure (4AKE) and others to the holo structure (2ECK), but none fall within 2 Å of either conformation. To obtain more accurate models of either state, it may be worth rerunning AF2 with the MSA depth reduced further.
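To decide which reference state a model resembles, you need some structural distance measure. The Python sketch below uses the distance RMSD (dRMSD), which compares all pairwise intra-structure distances and therefore needs no superposition. Note that this is a stand-in chosen to keep the example dependency-free: it is not the same quantity as a superposition-based RMSD, so its values are not directly comparable to a 2 Å cut-off defined on the latter. The coordinates are toy three-point "structures"; in practice you would use matched Cα coordinates extracted from the model and reference PDB files.

```python
import math


def pairwise_distances(coords):
    """All unique pairwise distances within one structure."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]


def drmsd(coords_a, coords_b):
    """Root-mean-square difference between two sets of pairwise distances."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must contain the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("need at least two atoms")
    dist_a = pairwise_distances(coords_a)
    dist_b = pairwise_distances(coords_b)
    squared = sum((x - y) ** 2 for x, y in zip(dist_a, dist_b))
    return math.sqrt(squared / len(dist_a))


def closest_state(model, references):
    """Return the name and dRMSD of the reference closest to the model."""
    scores = {name: drmsd(model, ref) for name, ref in references.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Toy coordinates in Å (hypothetical, three atoms each).
references = {
    "open": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    "closed": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)],
}
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.0, 1.5, 0.0)]

name, score = closest_state(model, references)
print(name, round(score, 2))
```

Applying a function like closest_state to each of the 20 models gives a quick overview of how the predictions are distributed between the two states.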

References

  1. Jumper, John, et al. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature, vol. 596, no. 7873, Aug. 2021, pp. 583–89, https://doi.org/10.1038/s41586-021-03819-2.
  2. del Alamo, Diego, et al. “Sampling Alternative Conformational States of Transporters and Receptors with AlphaFold2.” eLife, edited by Janice L Robertson et al., vol. 11, Mar. 2022, p. e75751, https://doi.org/10.7554/eLife.75751.
  3. Faezov, Bulat, and Roland L. Dunbrack. AlphaFold2 Models of the Active Form of All 437 Catalytically-Competent Typical Human Kinase Domains. bioRxiv, 25 July 2023, https://doi.org/10.1101/2023.07.21.550125.
  4. Silva, Gabriel Monteiro da, et al. Predicting Relative Populations of Protein Conformations without a Physics Engine Using AlphaFold2. bioRxiv, 27 July 2023, https://doi.org/10.1101/2023.07.25.550545.
  5. Wayment-Steele, Hannah K., et al. Prediction of Multiple Conformational States by Combining Sequence Clustering with AlphaFold2. bioRxiv, 17 Oct. 2022, https://doi.org/10.1101/2022.10.17.512570.
  6. Stein, Richard A., and Hassane S. Mchaourab. “SPEACH_AF: Sampling Protein Ensembles and Conformational Heterogeneity with Alphafold2.” PLOS Computational Biology, vol. 18, no. 8, Aug. 2022, p. e1010483, https://doi.org/10.1371/journal.pcbi.1010483.
  7. Heo, Lim, and Michael Feig. “Multi-State Modeling of G-Protein Coupled Receptors at Experimental Accuracy.” Proteins, vol. 90, no. 11, Nov. 2022, pp. 1873–85, https://doi.org/10.1002/prot.26382.
  8. Schwarz, Dominik, et al. “Co-Evolutionary Distance Predictions Contain Flexibility Information.” Bioinformatics, edited by Alfonso Valencia, vol. 38, no. 1, Dec. 2021, pp. 65–72, https://doi.org/10.1093/bioinformatics/btab562.
  9. Li, Jiaxuan, et al. Exploring the Alternative Conformation of a Known Protein Structure Based on Contact Map Prediction. bioRxiv, 9 June 2022, https://doi.org/10.1101/2022.06.07.495232.
  10. Hou, Ming-Hua, et al. Protein Multiple Conformations Prediction Using Multi-Objective Evolution Algorithm. bioRxiv, 21 Apr. 2023, https://doi.org/10.1101/2023.04.21.537776.
  11. Peng, Chunxiang, et al. Multiple Conformational States Assembly of Multidomain Proteins Using Evolutionary Algorithm Based on Structural Analogues and Sequential Homologues. bioRxiv, 18 Jan. 2023, https://doi.org/10.1101/2023.01.15.524086.
  12. Jing, Bowen, et al. EigenFold: Generative Protein Structure Prediction with Diffusion Models. arXiv:2304.02198, arXiv, 4 Apr. 2023, https://doi.org/10.48550/arXiv.2304.02198.
  13. Mansoor, Sanaa, et al. Protein Ensemble Generation through Variational Autoencoder Latent Space Sampling. bioRxiv, 1 Aug. 2023, https://doi.org/10.1101/2023.08.01.551540.

Author