Alvaro Prat | Oxford Protein Informatics Group

The Data Bottleneck in AI-Powered Drug Discovery

The pharmaceutical industry is undergoing a profound transformation, driven by the promise of Artificial Intelligence (AI) and Machine Learning (ML). These technologies offer the potential to escape the industry’s persistent challenges of high costs, protracted development timelines, and staggering failure rates. From accelerating the identification of novel biological targets to optimizing the properties of lead compounds, AI is poised to enhance the precision and efficiency of drug discovery at nearly every stage

Yet, this revolutionary potential is constrained by a fundamental dependency. The power of modern AI, particularly the deep learning (DL) models that excel at complex pattern recognition, is directly proportional to the volume, diversity, and quality of the data they are trained on. This creates a critical bottleneck: the high-quality experimental data required to train these models—specifically, the protein-ligand binding affinity values that quantify the strength of an interaction—are notoriously scarce, expensive to generate, and often of inconsistent quality or locked within proprietary databases.

Continue reading →

Introduction

Graphs provide a powerful mathematical framework for modelling complex systems, from molecular structures to social networks. In many physical and geometric problems, nodes represent particles, and edges encode interactions, often acting like springs. This perspective aligns naturally with Geometric Deep Learning, where learning algorithms leverage graph structures to capture spatial and relational patterns.

Understanding energy functions and the forces derived from them is fundamental to modelling such systems. In physics and computational chemistry, harmonic potentials, which penalise deviations from equilibrium positions, are widely used to describe elastic networks, protein structures, and even diffusion processes. The Laplacian matrix plays a key role in these formulations, linking energy minimisation to force computations in a clean and computationally efficient way.

By formalising these interactions using matrix notation, we gain not only a compact representation but also a foundation for more advanced techniques such as Langevin dynamics, normal mode analysis, and graph-based neural networks for physical simulations.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Author Archives: Alvaro Prat

How reliable are affinity datasets in practice?

The Data Bottleneck in AI-Powered Drug Discovery

Geometric Deep Learning meets Forces & Equilibrium

Introduction