3 Key Questions to Think About When Designing Proteins Computationally

We have reached the era of design, not just ‘hunting’. Particularly exciting to me is the de novo design of proteins, which have a wide and ever increasing range of applications from therapeutics to consumer products, biomanufacturing to biomaterials. Protein design has been a) enabled by decades of research that contributed to our understanding of protein sequence, structure & function and b) accelerated by computational advances – capturing the information we have learned from proteins and representing it for computers and machine learning algorithms.

In this blog post, I will discuss three key methodological considerations for computational protein design:

  1. Sequence- vs structure-based design
  2. ML- vs physics-based design
  3. Target-agnostic vs target-aware design


1. Sequence- vs structure-based design

Protein sequence (top) and structure (bottom) design.
Example shown is of the certolizumab CDRH3 loop, PDB: 5WUX.

Proteins are chains of amino acids [sequence], which (generally) fold into 3D shapes [structure]. In the design process, approaches centered on sequence or structure have been developed, each with advantages and drawbacks.

Sequence-based design

  • Motivations: there is a greater abundance of protein sequence than structural data (as solving protein structures is typically much more expensive and time-consuming than sequencing); it is more straightforward to represent sequences than structures for machine learning (ML)
  • Methods: given the data availability, ML has been extensively applied to sequence-based protein design
    • E.g., fitness landscape generation and sampling using the recurrent neural network UniRep protein encodings (Alley et al., 2019, Biswas et al., 2021)
    • See also a recent review on deep generative models for protein sequence design (Wu et al., 2021)
  • Applications:
    • Sequence optimization: models are trained to select for or increase the probability of a desirable feature
    • Sequence sampling: a representation of protein sequences (with desired properties) is generated by models and can be sampled from to generate new sequences
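
To make the idea of sequence sampling concrete, here is a minimal sketch in which a position-wise residue frequency profile stands in for a learned model. This is my own toy illustration, not the method of any of the papers cited above; the sequences and the function names `build_profile` and `sample_sequence` are hypothetical.

```python
import random
from collections import Counter

def build_profile(sequences):
    """Count residues at each position of a set of equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    return [Counter(seq[i] for seq in sequences) for i in range(length)]

def sample_sequence(profile, rng):
    """Draw one residue per position, weighted by the observed counts."""
    residues = []
    for counts in profile:
        letters = list(counts.keys())
        weights = list(counts.values())
        residues.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(residues)

# Hypothetical toy "training set" (not real data).
toy_sequences = ["ACDKG", "ACEKG", "SCDRG", "ACDRG"]
profile = build_profile(toy_sequences)

rng = random.Random(0)  # seeded so repeated runs give the same samples
samples = [sample_sequence(profile, rng) for _ in range(5)]
print(samples)
```

A real generative model would capture dependencies between positions, which this independent-position profile deliberately ignores.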


Structure-based design

  • Motivations: depending on the application (especially eg when designing proteins to bind a target), structural information may be required (NB structure can be predicted from sequence, especially given the breakthrough in structure modeling by AlphaFold2 (Jumper et al., 2021), but it is not explicitly captured by the sequence alone)
    • Structure-based design has become more feasible due to recent advances in structure prediction (eg Jumper et al., 2021; Abanades et al., 2022) as it is now possible to obtain accurate structures of most proteins (with limitations such as the modeling of post-translational modifications and ligand binding sites)
  • Methods: a range of methods have been developed for structure-based protein design, with an increasing focus on deep learning-based approaches. Examples include:
    • Fragment-based design – has been applied to antibodies, building from complementarity-determining loop (CDR) and framework (FR) region fragments (Aguilar Rangel et al., 2021)
    • Deep neural network model for designing sequences onto protein backbones (Anand et al., 2022)
    • Graph-based design (Ingraham et al., 2019)
    • See also a recent review on structure-based design using deep learning (Ovchinnikov and Huang, 2021)
  • Applications: structure-based design methods have been used for de novo protein design of specific architectures or binders

Outlook: Although it remains to be seen, I expect design methods in the future will incorporate – and evaluate – both sequence and structure (perhaps simultaneously). Existing approaches are already moving in that direction, such as a first attempt at antibody sequence-structure co-design using a generative model (Jin et al., 2021) and deep network hallucination, in which residue-residue distance maps (which have been used widely in structure prediction methods) are optimized (Anishchenko et al., 2021).
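
For readers unfamiliar with residue-residue distance maps, the sketch below computes one from a list of 3D coordinates (one point per residue, eg the C-alpha atom). The coordinates are hypothetical, and this only illustrates the representation itself, not how hallucination optimizes it.

```python
import math

def distance_map(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical coordinates in angstroms, one point per residue.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
dmap = distance_map(ca_coords)

for row in dmap:
    print([round(d, 2) for d in row])
```

Note that `math.dist` requires Python 3.8 or later.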


2. ML- vs physics-based design

Physics- (left) and ML- (right) based design.
(Left) The total Rosetta energy function (Alford et al., 2017). (Right) Schematic of simple neural network.

Traditionally, protein modeling and design were accomplished using physics-based methods, which are built on physical equations and measurements (and approximations thereof). For certain applications, there has been a shift in recent years towards ML-based methods, which are trained on large amounts of data. The ML methods can be more accurate and faster (eg AlphaFold2 (Jumper et al., 2021)) but do not necessarily learn the physics (Outeiral et al., 2022).

Physics-based design

  • Challenges
    • May require a larger amount of computational resources (ML models are likely computationally expensive to train, but typically require fewer resources post-training)
    • Approximations need to be made
  • Methods: examples include Rosetta-based design methodologies incorporating the Rosetta force field (Alford et al., 2017) (eg RosettaRemodel (Huang et al., 2011))
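
As a reminder of the general form such an energy function takes, Alford et al. (2017) describe the Rosetta total energy as a weighted sum of individual energy terms. The notation below follows my reading of that paper; the individual terms and their weights are not listed here.

```latex
\Delta E_{\text{total}} = \sum_{i} w_i \, E_i(\Theta_i, aa_i)
```

Here each term E_i is a function of geometric degrees of freedom Θ_i and chemical identities aa_i, and is scaled by a weight w_i. The approximations mentioned above enter through both the functional form of each term and the fitted weights.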

ML-based design

  • Challenges
    • Need to represent data appropriately – sequences (eg one-hot encoding); structures (eg featurisation from structure, voxel grids, graphs)
    • Need sufficient and sufficiently diverse data – challenge in biology (especially for structural data)
  • Methods: see sequence- vs structure-based design section above, where a number of ML methods are included
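
As a small example of the representation question, the sketch below one-hot encodes a sequence over the 20 standard amino acids. The alphabet ordering and the name `one_hot_encode` are my own choices for illustration; real pipelines often add tokens for gaps or non-standard residues.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, arbitrary order
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Encode a sequence as a list of 20-element rows, one per residue."""
    encoded = []
    for residue in sequence:
        if residue not in AA_INDEX:
            raise ValueError(f"unsupported residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1  # a single 1 marks the residue identity
        encoded.append(row)
    return encoded

encoding = one_hot_encode("ACDY")
print(len(encoding), len(encoding[0]))
```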

Outlook: Again, and perhaps unsurprisingly, the ideal path forward may involve incorporating both ML- and physics-based approaches. This has already, for example, been implemented successfully for structure prediction (eg in AlphaFold2 (Jumper et al., 2021)).


3. Target-agnostic vs target-aware design

Target-agnostic (left) and target-aware (right) design.
(Left) Numerous candidates are generated and evaluated (PDB 5WUX).
(Right) Target-aware design example in which a known interaction is used as the basis for designing a novel binder (Liu et al., 2017) (PDBs 5F72, 7ECA).

Many of the applications of engineered or designed proteins involve binding to a specific target, with the most prominent example being antibodies. When designing a protein binder, the structure of the target can be excluded from or explicitly included in the design process.

Target-agnostic design

  • Method: in target-agnostic design, a large number of candidates are typically generated without consideration for the binding target (for example using methods described in previous sections of this post) and subsequently evaluated. Computational methods can be used to narrow down the candidates (in future, as the evaluation methods improve, they may even be used to narrow down to only one final candidate). Evaluation will likely involve predicting the binding conformation (docking) and binding affinity.
  • Assessment
    • Advantage: greater available design space, which may allow a more stable or better binder to be sampled
    • Disadvantages: downstream evaluation methods are still very limited in accuracy (eg docking, affinity prediction); less control over specific binding site or involved interactions
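
The generate-then-evaluate workflow described above can be sketched as follows. The candidates here are random sequences and `placeholder_score` is a hypothetical stand-in (fraction of aromatic residues) for a real evaluation step such as docking or affinity prediction; it has no biological meaning and is only there to make the ranking step runnable.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_candidates(n, length, rng):
    """Generate n random sequences without any reference to a target."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(n)]

def placeholder_score(sequence):
    """HYPOTHETICAL evaluator: fraction of aromatic residues (F, W, Y)."""
    return sum(sequence.count(aa) for aa in "FWY") / len(sequence)

def top_candidates(candidates, score_fn, k):
    """Rank candidates by score (highest first) and keep the best k."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

rng = random.Random(42)  # seeded for reproducibility
candidates = generate_candidates(n=1000, length=12, rng=rng)
shortlist = top_candidates(candidates, placeholder_score, k=10)
print(shortlist)
```

The quality of the shortlist depends entirely on the evaluator, which is why the limited accuracy of current docking and affinity prediction methods is the main bottleneck for this approach.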

Target-aware design

  • Methods:
    • Many existing target-aware methods make use of known protein-protein interactions, co-opting these in the design of a novel binder, for example
      • Hotspot grafting (Liu et al., 2017)
      • Rosetta (Cao et al., 2020)
      • Deep network hallucination (Anishchenko et al., 2021) – while it has not been implemented to date, the authors suggest that an interaction site can be fixed based on a known complex and the remaining protein can be ‘hallucinated’ by the network
    • These approaches are however restricted to cases with a known and structurally-resolved/modeled interaction. Recent methods aim to overcome this, for example (Cao et al., 2022), which involves docking disembodied amino acids against the target and subsequently docking protein scaffolds against the resulting ‘rotamer interaction fields’ to identify those that can accommodate interactions.
  • Assessment
    • Advantages: can incorporate existing knowledge of interaction; more control over specific binding site
    • Disadvantage: more limited design space (so may miss out on higher-affinity structure or interactions)

The considerations of target-agnostic vs -aware design were neatly summarized in an example given by Prof David Baker after a talk to the Imperial Bioengineering Department. The Baker group has been applying computational approaches for the design of SARS-CoV-2 binders. In one of these cases, they used both target-agnostic and -aware approaches, with the former yielding a higher-affinity binder (likely due to the ability to explore a wider search space and form interactions not found in the natural/canonical binding complex) (Cao et al., 2020). However, Prof Baker noted that prioritizing the specific target site and interactions could outweigh an increase in affinity, for example if it results in a decreased likelihood of viral escape when considering SARS-CoV-2 variants.

Outlook: I believe the future of either approach will depend on methodological advances – for target-agnostic design, increased accuracy of downstream computational evaluation methods (eg docking, affinity prediction) will be essential; target-aware design would be accelerated by more progress in methods which do not require a known interaction interface.



Computational protein design is a growing and exciting field, with advances being fed by concurrent improvements in computational (especially ML) methods and data availability!



References

Abanades, B., Georges, G., Bujotzek, A., and Deane, C.M. (2022). ABlooper: Fast accurate antibody CDR loop structure prediction with accuracy estimation. Bioinformatics btac016.

Aguilar Rangel, M., Bedwell, A., Costanzi, E., Ricagno, S., Frydman, J., Vendruscolo, M., and Sormanni, P. (2021). Fragment-based computational design of antibodies targeting structured epitopes. BioRxiv 2021.03.02.433360.

Alford, R.F., Leaver-Fay, A., Jeliazkov, J.R., O’Meara, M.J., DiMaio, F.P., Park, H., Shapovalov, M. V., Renfrew, P.D., Mulligan, V.K., Kappel, K., et al. (2017). The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design. J. Chem. Theory Comput. 13, 3031–3048.

Alley, E.C., Khimulya, G., Biswas, S., AlQuraishi, M., and Church, G.M. (2019). Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322.

Anand, N., Eguchi, R., Mathews, I.I., Perez, C.P., Derry, A., Altman, R.B., and Huang, P.S. (2022). Protein sequence design with a learned potential. Nat. Commun. 13, 1–11.

Anishchenko, I., Pellock, S.J., Chidyausiku, T.M., Ramelot, T.A., Ovchinnikov, S., Hao, J., Bafna, K., Norn, C., Kang, A., Bera, A.K., et al. (2021). De novo protein design by deep network hallucination. Nature 600, 547–552.

Biswas, S., Khimulya, G., Alley, E.C., Esvelt, K.M., and Church, G.M. (2021). Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396.

Cao, L., Goreshnik, I., Coventry, B., Case, J.B., Miller, L., Kozodoy, L., Chen, R.E., Carter, L., Walls, A.C., Park, Y.J., et al. (2020). De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 370, 426–431.

Cao, L., Coventry, B., Goreshnik, I., Huang, B., Park, J.S., Jude, K.M., Marković, I., Kadam, R.U., Verschueren, K.H.G., Verstraete, K., et al. (2022). Design of protein binding proteins from target structure alone. Nature.

Huang, P.S., Ban, Y.E.A., Richter, F., Andre, I., Vernon, R., Schief, W.R., and Baker, D. (2011). RosettaRemodel: A generalized framework for flexible backbone protein design. PLoS One 6.

Ingraham, J., Garg, V.K., Barzilay, R., and Jaakkola, T. (2019). Generative models for graph-based protein design. Adv. Neural Inf. Process. Syst. 32.

Jin, W., Wohlwend, J., Barzilay, R., and Jaakkola, T. (2021). Iterative Refinement Graph Neural Network for Antibody Sequence-Structure Co-design. ArXiv.

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589.

Liu, X., Taylor, R.D., Griffin, L., Coker, S.-F., Adams, R., Ceska, T., Shi, J., Lawson, A.D.G., and Baker, T. (2017). Computational design of an epitope-specific Keap1 binding antibody using hotspot residues grafting and CDR loop swapping. Sci. Rep. 7, 41306.

Outeiral, C., Nissley, D.A., and Deane, C.M. (2022). Current structure predictors are not learning the physics of protein folding. Bioinformatics 1–7.

Ovchinnikov, S., and Huang, P.S. (2021). Structure-based protein design with deep learning. Curr. Opin. Chem. Biol. 65, 136–144.

Wu, Z., Johnston, K.E., Arnold, F.H., and Yang, K.K. (2021). Protein sequence design with deep generative models. Curr. Opin. Chem. Biol. 65, 18–27.
