Pose Prediction: Does Your Model Generalize? The Role of Data Similarity

In our recent work on the PoseBusters benchmark, we made a deliberate choice: to include both receptors similar to those seen during training and completely novel ones. Why? To explore an often-overlooked question: how much does receptor similarity to the training data influence model performance?

As shown in Figure 4 of our paper [1], we split the PoseBusters dataset into three groups based on sequence identity to receptors in PDBBind2020 (the dataset the models were trained on): high, medium, and low similarity. The results were clear: the more similar a test receptor is to the training set, the better the model tends to perform.
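If you want to try this kind of stratification on your own benchmark, here is a minimal Python sketch using Biopython. It assumes chain sequences are already extracted as strings and requires a reasonably recent Biopython; the 30% and 95% cut-offs and the gap penalties are illustrative placeholders, not the exact settings used in our paper, and at scale a dedicated tool such as MMseqs2 is a better fit than exhaustive pairwise alignment.

```python
# Illustrative sketch only: bin a test receptor by its maximum sequence
# identity to any training-set chain. Thresholds and penalties are placeholders.
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10    # common protein gap penalties (assumption)
aligner.extend_gap_score = -0.5

def sequence_identity(a: str, b: str) -> float:
    """Identical residues divided by the shorter sequence length
    (one of several common identity conventions)."""
    alignment = aligner.align(a, b)[0]
    row_a, row_b = alignment[0], alignment[1]  # aligned strings with gaps
    matches = sum(x == y and x != "-" for x, y in zip(row_a, row_b))
    return matches / min(len(a), len(b))

def similarity_bucket(test_chains, train_chains, low=0.30, high=0.95):
    """Assign low/medium/high based on the best hit against the training set."""
    best = max(sequence_identity(t, r)
               for t in test_chains for r in train_chains)
    if best < low:
        return "low"
    return "medium" if best < high else "high"
```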

This type of analysis is crucial. Reporting a single overall metric can mask important differences in model generalization. By stratifying results based on similarity, we gain a better understanding of where models succeed and where they struggle.
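To make concrete how a single headline number can mislead, here is a toy example; the table values are invented purely for illustration, with the bucket column coming from a binning step like the sketch above.

```python
import pandas as pd

# Toy data: one row per prediction, with its similarity bucket and pose RMSD.
results = pd.DataFrame({
    "bucket": ["low", "low", "low", "medium", "medium", "high", "high", "high"],
    "rmsd":   [5.2,   3.8,   1.9,   1.6,      2.4,      0.9,    1.1,    1.4],
})

results["success"] = results["rmsd"] <= 2.0  # the usual 2 Å criterion

print(f"overall: {results['success'].mean():.0%}")   # one flattering number
print(results.groupby("bucket")["success"].mean())   # where it actually holds up
```

Here the overall rate is 62%, yet the low-similarity bucket sits at 33%: exactly the kind of gap a single aggregate metric hides.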

Two points are important to keep in mind. First, this stratification must be relative to the specific training data used. If a new model were trained on all PDB complexes up to 2021, the PoseBusters splits (which are based on PDBBind2020) would no longer be directly applicable, because PDBBind2020 covers only a subset of that data. Second, while our paper stratified results by receptor sequence identity, similarities in binding pockets, ligands, and interaction patterns also play significant roles.

So, next time you see a headline number for pose prediction accuracy, it’s worth asking: how different was the test data from the training data? How similar were the receptors, ligands, and their interactions? As our Figure 4 demonstrates, these factors matter significantly.

Fig. 4: Comparative performance of docking methods on the PoseBusters Benchmark set, stratified by sequence identity relative to the PDBBind General Set v2020. The sequence identity is the maximum sequence identity between all chains in the PoseBusters test protein and all chains in the PDBBind General Set v2020. The striped bars show the share of each method's predictions with an RMSD within 2 Å, and the solid bars show those predictions which in addition pass all PoseBusters tests and are therefore PB-valid. The DL-based methods perform far better on proteins that are similar to those they were trained on.
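Checking PB-validity on your own predictions is possible with the posebusters Python package. The sketch below follows the package's re-docking interface as I understand it from its documentation; the file paths are placeholders for your own structures.

```python
# Minimal PB-validity check with the posebusters package (pip install posebusters).
# File paths below are placeholders, not real data.
from posebusters import PoseBusters

buster = PoseBusters(config="redock")  # "redock": compare against a known true pose
df = buster.bust(
    mol_pred="predicted_ligand.sdf",   # pose to test
    mol_true="crystal_ligand.sdf",     # reference ligand
    mol_cond="receptor.pdb",           # receptor, for clash and volume checks
)
print(df)  # one row of pass/fail test columns per prediction
```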

References:
[1] Buttenschoen, M., Morris, G.M. and Deane, C.M. (2024) ‘PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences’, Chemical Science, 15(9), pp. 3130–3139. Available at: https://doi.org/10.1039/D3SC04185A.
