Tracking the change in ML performance for popular small molecule benchmarks

The power of machine learning (ML) techniques has captivated the field of small molecule drug discovery. Increasingly, researchers and organisations are employing ML to build more accurate models in the hope of improving the efficiency of the discovery process.

To be published, methods have to show that they improve upon existing ones. Methods within a field are often tested against the same benchmarks, which lets us track progress over time. To explore the rate of improvement, I curated reported performance on three popular benchmarks. The first is CASF 2016, used to test the accuracy of methods that predict the binding affinity of experimentally determined protein-ligand complexes. Accuracy was measured using the Pearson's R value between predicted and experimental affinity values.
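For reference, the CASF 2016 scoring metric is simply the Pearson correlation between predicted and measured affinities over the test complexes. A minimal sketch (the affinity values below are placeholders, not CASF data):

```python
# Minimal sketch of the CASF 2016 scoring metric: Pearson's r between
# predicted and experimentally determined affinities (typically pK values).
# The arrays below are illustrative placeholders, not real benchmark data.
import numpy as np
from scipy.stats import pearsonr

experimental_pk = np.array([6.2, 4.8, 7.5, 5.1, 8.0])  # hypothetical measured affinities
predicted_pk = np.array([5.9, 5.2, 7.1, 5.6, 7.4])      # hypothetical model predictions

r, p_value = pearsonr(experimental_pk, predicted_pk)
print(f"Pearson's r = {r:.3f}")
```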

The second benchmark is USPTO-50K, a random subset of chemical reactions scraped by Lowe et al. from the US Patent Office, used to test the accuracy of the single-step prediction methods behind retrosynthesis tools (LINK). Accuracy for this benchmark was measured as Top-1 accuracy on the test set of the 80/10/10 split.
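Top-1 accuracy here just asks whether the model's highest-ranked precursor set matches the recorded reactants, usually compared on canonicalised SMILES. A rough sketch, assuming predictions and ground truth are lists of SMILES strings (the helper names are mine, not from any particular retrosynthesis tool):

```python
# Sketch of Top-1 accuracy for single-step retrosynthesis, assuming each
# prediction is the model's top-ranked reactant set as a SMILES string.
from rdkit import Chem

def canonicalise(smiles):
    """Return RDKit-canonical SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def top1_accuracy(predicted, ground_truth):
    """Fraction of test reactions whose top prediction matches the recorded reactants."""
    hits = sum(
        canonicalise(p) is not None and canonicalise(p) == canonicalise(t)
        for p, t in zip(predicted, ground_truth)
    )
    return hits / len(ground_truth)
```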

The third and final benchmark is the popular HIV protease activity classification set, incorporated into the MoleculeNet collection of benchmarks (LINK). With a scaffold-based train/validation/test split (80/10/10), it has been used to benchmark a range of machine learning and quantitative structure-activity relationship (QSAR) methods, with AUROC as the metric.
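A scaffold split groups molecules by their Bemis-Murcko scaffold so that structurally related compounds end up in the same partition; AUROC (area under the ROC curve) is then computed on the held-out test set. Below is a simplified sketch of the grouping step using RDKit; the example molecules are placeholders, and a full splitter (as in MoleculeNet) would go on to assign whole scaffold groups to train/validation/test.

```python
# Simplified sketch of Bemis-Murcko scaffold grouping, the basis of a scaffold
# split; a full splitter assigns whole scaffold groups to train/validation/test
# (80/10/10) so that related compounds never straddle partitions.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def group_by_scaffold(smiles_list):
    """Map each Bemis-Murcko scaffold SMILES to the indices of molecules sharing it."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(i)
    return dict(groups)

# Aspirin and paracetamol both reduce to a plain benzene scaffold here,
# so they fall into the same group (and hence the same split partition).
print(group_by_scaffold(["CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1"]))

# Evaluation on the held-out test set is then plain AUROC, e.g. via
# sklearn.metrics.roc_auc_score(y_test, predicted_probabilities).
```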

To collect the data, I systematically went through the literature citing either the original source of each benchmark or the first significant paper that proposed it as a benchmark, using Scopus (scopus.com). To keep the amount of literature manageable, for each calendar year I took the ten most-cited papers that reported results on the benchmark. The data were plotted by publication date and classified by the broad type of ML architecture used, choosing (subjectively) the most important component where several were combined; for example, models that used a combination of GNN and transformer were classified as GNN. The data are available in the following repo: https://github.com/guydurant/tracking_small_molecules_benchmarks
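As an illustration of the kind of plot discussed below, the following sketch reads a per-benchmark CSV and plots score against publication date, coloured by architecture class. The file name and column names are hypothetical assumptions on my part; check the repo for the actual layout.

```python
# Hypothetical sketch of plotting benchmark performance over time; the CSV
# path and column names are assumptions, not the repo's actual schema.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("casf_2016_results.csv", parse_dates=["publication_date"])  # hypothetical file

fig, ax = plt.subplots()
for architecture, subset in df.groupby("architecture"):  # e.g. GNN, CNN, transformer
    ax.scatter(subset["publication_date"], subset["score"], label=architecture)

ax.set_xlabel("Publication date")
ax.set_ylabel("Pearson's r on CASF 2016")
ax.legend()
plt.show()
```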

The plots (figures above) revealed some interesting results: both CASF 2016 and the MoleculeNet HIV set showed no improvement over time, with their top performances recorded back in 2020. This shows that newly proposed methods are not improving on these benchmarks. However, we cannot conclude that the methods themselves are not improving, as both benchmarks are flawed. CASF 2016 is highly similar to its commonly used training data, the PDBBind General set: only one protein out of the 285 data points has less than 90% sequence similarity to any training point (LINK). For the HIV protease activity set, 70% of its 404 actives have substructure matches for potential PAINS (pan-assay interference) substructures (LINK). PAINS compounds appear to interact non-specifically with the target of interest or to interfere directly with the assay mechanism. Such a high percentage of potential PAINS among the actives suggests that this benchmark rewards recognising PAINS substructures rather than learning genuinely useful structure-activity relationships.
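RDKit ships PAINS substructure definitions as a built-in filter catalogue, so a check along these lines can be reproduced roughly as follows. The example SMILES is a placeholder, and the exact filter set used in the cited analysis may differ.

```python
# Sketch of flagging potential PAINS matches with RDKit's built-in filter
# catalogue; the cited analysis may have used a different PAINS definition.
from rdkit import Chem
from rdkit.Chem import FilterCatalog

params = FilterCatalog.FilterCatalogParams()
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog.FilterCatalog(params)

def is_potential_pains(smiles):
    """True if the molecule matches any PAINS substructure filter."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and catalog.HasMatch(mol)

print(is_potential_pains("O=C1C=CC(=O)C=C1"))  # placeholder molecule (a quinone)
```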

The trend for USPTO-50K was different (figure above), with accuracy increasing over time. Even so, we cannot definitively say that improvement on this benchmark means the methods are becoming more accurate. A recent study found that accuracy on single-step reactions does not translate into accuracy when the methods are combined with a path-finding algorithm, such as a Monte Carlo tree search (LINK). On top of this, it found that models built around the small USPTO-50K set were often not appropriate for, or incapable of being trained on, the larger datasets available for these methods.

So does tracking the improvement of methods on benchmarks tell us whether we are improving? Not really; if anything, it says more about the usefulness of the benchmark itself. If all methods perform about the same on a benchmark, it is not useful for discerning differences between them. Determining whether a method genuinely advances the field is hard, and efforts to define and measure that improvement require almost as much work as method development itself.
