A more robust way to split data for protein-ligand tasks?

While reading the PLINDER dataset paper in preparation for my next project, one aspect that caught my attention was how the dataset splits were designed to minimise leakage across the various protein-ligand tasks PLINDER could be used for. The splits are task-specific because the notion of data leakage differs from task to task. For instance, in rigid body docking, having a similar protein in the train and test sets may not count as leakage if the binding pocket location, conformation, or pocket interactions with a ligand are significantly different. In co-folding, on the other hand, having similar proteins in the train and test sets would count as leakage, because the predicted protein structure plays a significant role in accuracy scoring. The effort that went into creating task-specific splits resonates strongly with OPIG’s view on ensuring minimal data leakage when validating the generalisability of protein-ligand models. However, creating a task-specific split for every task becomes tedious when dealing with a large suite of protein-ligand tasks. This got me thinking about ways to streamline the splitting process across tasks, and one option is to use protein-ligand interaction fingerprints, or PLIFs.

There are several types and implementations of PLIFs. In this short blog post, we will take a quick look at SPLIF, or structural PLIF, a three-dimensional molecular interaction fingerprint. SPLIF encodes the 3D structures of interacting ligand and protein fragments along with their interaction modes, and it captures various non-covalent interactions such as π-π stacking. As with other fingerprinting techniques, a radius is defined, and only the atoms and interactions within that radius are considered. The original paper includes a diagram with a step-by-step breakdown of SPLIF.
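
The core idea can also be sketched in a few lines of Python. This is a toy illustration, not the SPLIF algorithm itself: the `Atom` type, the `fragment_id` labels, the 4.5 Å cutoff, and the 1024-bit length are all assumptions made for the example, and the sketch ignores the 3D structure of the fragments that the real fingerprint also encodes. It only shows the general pattern of finding ligand-protein atom pairs within a distance cutoff and hashing each pair of fragment labels to a bit.

```python
import math
import zlib
from typing import List, NamedTuple, Set, Tuple


class Atom(NamedTuple):
    # Hypothetical label standing in for the circular fragment around the atom.
    fragment_id: str
    # Cartesian coordinates in angstroms.
    xyz: Tuple[float, float, float]


def toy_interaction_fingerprint(
    ligand: List[Atom],
    protein: List[Atom],
    cutoff: float = 4.5,  # assumed contact distance, in angstroms
    n_bits: int = 1024,   # assumed fingerprint length
) -> Set[int]:
    """Return the set of 'on' bits for ligand-protein contacts within the cutoff."""
    on_bits: Set[int] = set()
    for lig_atom in ligand:
        for prot_atom in protein:
            if math.dist(lig_atom.xyz, prot_atom.xyz) <= cutoff:
                # Hash the ordered pair of fragment labels to a bit position.
                # crc32 is used because it is stable across Python runs,
                # unlike the built-in hash() for strings.
                key = f"{lig_atom.fragment_id}|{prot_atom.fragment_id}".encode()
                on_bits.add(zlib.crc32(key) % n_bits)
    return on_bits


# Toy usage: one ligand atom, one nearby and one distant protein atom.
ligand_atoms = [Atom("lig_frag_1", (0.0, 0.0, 0.0))]
protein_atoms = [
    Atom("prot_frag_1", (3.0, 0.0, 0.0)),   # within the cutoff
    Atom("prot_frag_2", (10.0, 0.0, 0.0)),  # too far away to count
]
fingerprint = toy_interaction_fingerprint(ligand_atoms, protein_atoms)
print(len(fingerprint))
```

Representing the fingerprint as a set of on-bit positions keeps the later similarity calculation simple, since Tanimoto similarity reduces to set intersection over set union.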

Just as we would cluster molecules based on their Morgan fingerprints and Tanimoto similarity, we could use PLIFs to cluster the protein-ligand interactions found in a dataset, with whole clusters assigned to either the train or the test set to ensure diversity and minimal leakage in terms of interaction profiles. Using such fingerprints could potentially remove the need to manually design dataset splits for each and every protein-ligand interaction task.
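
A minimal sketch of such a cluster-based split is shown here, under several assumptions. Each fingerprint is represented as a set of on-bit positions, the clustering is plain single-linkage via union-find (chosen for simplicity, not because any of the datasets or libraries mentioned in this post use it), and the function names, the 0.5 similarity threshold, and the toy complexes are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two sets of on-bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def cluster_by_similarity(
    fps: Dict[str, Fingerprint], threshold: float
) -> List[List[str]]:
    """Single-linkage clustering: link any pair with similarity >= threshold."""
    names = sorted(fps)
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if tanimoto(fps[a], fps[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    # Largest clusters first, ties broken alphabetically.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


def split_clusters(
    clusters: List[List[str]], test_fraction: float
) -> Tuple[List[str], List[str]]:
    """Assign whole clusters to test (smallest first) until the budget is used."""
    budget = test_fraction * sum(len(c) for c in clusters)
    train: List[str] = []
    test: List[str] = []
    for cluster in sorted(clusters, key=lambda c: (len(c), c)):
        if len(test) + len(cluster) <= budget:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Toy fingerprints for five hypothetical complexes.
toy_fps = {
    "complex_a": frozenset({1, 2, 3, 4}),
    "complex_b": frozenset({1, 2, 3, 5}),
    "complex_c": frozenset({10, 11, 12}),
    "complex_d": frozenset({10, 11, 13}),
    "complex_e": frozenset({20, 21}),
}
toy_clusters = cluster_by_similarity(toy_fps, threshold=0.5)
train_ids, test_ids = split_clusters(toy_clusters, test_fraction=0.4)
print(train_ids, test_ids)
```

Because single-linkage merges every pair at or above the threshold into one cluster, no train complex can have a Tanimoto similarity of 0.5 or more to any test complex in this sketch. The trade-off is that single-linkage can chain many complexes into one very large cluster on real data, so a different clustering method may be preferable in practice.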

To explore and use PLIFs, you could take a look at the following Python libraries, which implement different kinds of PLIFs:

  1. ProLIF: https://github.com/chemosim-lab/ProLIF
  2. PyPLIF: https://pyplif-hippos.readthedocs.io/
  3. PLIP: https://github.com/pharmai/plip
