In our recent work with the PoseBusters benchmark, we made a deliberate choice: to include both receptors seen during training and completely novel ones. Why? To explore an often-overlooked question: how much does receptor similarity to training data influence model performance?
As shown in Figure 4 of our paper [1], we split the PoseBusters dataset into three groups based on sequence identity to receptors in PDBBind2020 (the dataset the models were trained on): high, medium, and low similarity. The results were clear—the more similar a test receptor is to the training set, the better the model tends to perform.
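If you want to run this kind of stratified evaluation on your own benchmark, the sketch below shows the bucketing step: each test complex is assigned to a similarity group based on its maximum sequence identity to any training receptor, and the docking success rate (RMSD ≤ 2 Å) is reported per group. The thresholds, field names, and toy numbers here are illustrative assumptions, not the exact cutoffs or data behind Figure 4; the identity values themselves would come from a sequence-comparison tool such as MMseqs2 or BLAST.

```python
from dataclasses import dataclass

# Illustrative cutoffs (fraction sequence identity); the paper's exact values may differ.
HIGH_CUTOFF = 0.95
LOW_CUTOFF = 0.30

@dataclass
class PoseResult:
    pdb_id: str                # test complex identifier
    max_train_identity: float  # highest sequence identity to any training receptor (0-1)
    rmsd: float                # RMSD of predicted pose to the crystal pose, in Angstrom

def similarity_group(identity: float) -> str:
    """Assign a test receptor to a similarity group relative to the training set."""
    if identity >= HIGH_CUTOFF:
        return "high"
    if identity >= LOW_CUTOFF:
        return "medium"
    return "low"

def success_rate_by_group(results, rmsd_cutoff=2.0):
    """Fraction of poses within rmsd_cutoff, reported separately per similarity group."""
    groups = {"high": [], "medium": [], "low": []}
    for r in results:
        groups[similarity_group(r.max_train_identity)].append(r.rmsd <= rmsd_cutoff)
    return {g: (sum(v) / len(v) if v else float("nan")) for g, v in groups.items()}

# Toy example with made-up numbers:
demo = [
    PoseResult("7AAA", 0.98, 1.1),  # near-identical to a training receptor, accurate pose
    PoseResult("7BBB", 0.55, 3.4),  # moderately similar, pose misses the 2 A cutoff
    PoseResult("7CCC", 0.12, 5.0),  # novel receptor, poor pose
]
print(success_rate_by_group(demo))  # e.g. {'high': 1.0, 'medium': 0.0, 'low': 0.0}
```

A single headline success rate would average these buckets together; reporting them separately is what exposes the generalization gap.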
This type of analysis is crucial. Reporting a single overall metric can mask important differences in model generalization. By stratifying results based on similarity, we gain a better understanding of where models succeed and where they struggle.
Two points are important to keep in mind. First, this stratification must be relative to the specific training data used. If a new model were trained on all PDB complexes deposited up to 2021, the PoseBusters splits (which are computed against PDBBind2020) would not apply directly, because PDBBind2020 covers only a subset of that data. Second, while our paper included results stratified by receptor sequence identity, similarities in binding pockets, ligands, and interaction patterns also play significant roles.
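The same idea carries over to the ligand side. As a rough illustration only (the molecules, fingerprint type, and radius below are placeholders, not the analysis from the paper), fingerprint Tanimoto similarity is one common way to ask how close a test ligand is to the ligands a model was trained on:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Placeholder molecules standing in for a test ligand and a training-set ligand.
test_ligand = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
train_ligand = Chem.MolFromSmiles("OC(=O)c1ccccc1O")

# Morgan fingerprints (radius 2, 2048 bits) and their Tanimoto similarity.
fp_test = AllChem.GetMorganFingerprintAsBitVect(test_ligand, radius=2, nBits=2048)
fp_train = AllChem.GetMorganFingerprintAsBitVect(train_ligand, radius=2, nBits=2048)
print(DataStructs.TanimotoSimilarity(fp_test, fp_train))
```

Taking the maximum of such a similarity over the training ligands, and bucketing the test set on it, mirrors the receptor-side stratification above.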
So, next time you see a headline number for pose prediction accuracy, it’s worth asking: how different was the test data from the training data? How similar were the receptors, ligands, and their interactions? As our Figure 4 demonstrates, these factors matter significantly.

References:
[1] Buttenschoen, M., Morris, G.M. and Deane, C.M. (2024) ‘PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences’, Chemical Science, 15(9), pp. 3130–3139. Available at: https://doi.org/10.1039/D3SC04185A.