Category Archives: Databases

Nanobodies® galore in Utrecht

At the end of September, I had the opportunity to present at the 4th Single-Domain Antibody (sdAb/VHH) Conference hosted in the city of Utrecht. The sdAb conference is a biennial event, and was held for the first time in Bonn (2019), then in Brussels (2021) and Paris (2023), before coming to the Netherlands this year.

This was the first time I’d attended a VHH-focused conference, and I was taken aback by just how large the community is; the Jaarbeurs ‘Supernova’ event hall was completely sold out, with over 400 researchers in attendance (pictures below courtesy of the organisers). The buzz reflects the ever-growing interest in sdAbs as tools to discover new fundamental biology, as vectors for diagnosing disease, and as prophylactic or curative therapeutics. Almost every disease indication was represented at the conference, from anticancer and antiviral sdAbs to antivenom sdAbs (both for use in lateral flow tests to diagnose the snake that bit you, and as quick ‘epipen’-like therapeutics accessible even in the most remote parts of the world).

Continue reading

Exploring the Protein Data Bank programmatically

The Worldwide Protein Data Bank (wwPDB, or just the PDB to its friends) is a key resource for structural biology, providing a single central repository of protein and nucleic acid structure data. Most researchers interact with the PDB either by downloading and parsing individual entries as mmCIF files (or as legacy PDB files), or by downloading aggregated data, such as the RCSB’s single FASTA file of all polymer entity sequences. All too often, researchers end up laboriously writing their own file parsers to digest these files. In recent years, though, more sophisticated tools have become available that make it much easier to access only the data that you need.
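As a flavour of what “programmatic access” looks like, here is a minimal sketch using the RCSB Data API’s per-entry REST endpoint. The sample JSON below is a trimmed, illustrative response (not a verbatim API payload), and in a real script you would fetch the URL over the network rather than parse a hard-coded string:

```python
import json

def entry_url(pdb_id: str) -> str:
    """Build the RCSB Data API URL for a single PDB entry."""
    return f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id.upper()}"

# In practice you would fetch the JSON over HTTP, e.g.:
#   import urllib.request
#   with urllib.request.urlopen(entry_url("4HHB")) as r:
#       entry = json.load(r)
# Here we parse a trimmed, illustrative response instead.
sample_response = """
{
  "rcsb_id": "4HHB",
  "struct": {"title": "The crystal structure of human deoxyhaemoglobin"},
  "rcsb_entry_info": {"resolution_combined": [1.74]}
}
"""

entry = json.loads(sample_response)
print(entry["rcsb_id"], entry["rcsb_entry_info"]["resolution_combined"][0])
```

The point of such APIs is exactly what the post describes: you request only the fields you need (title, resolution, entity sequences, …) as structured JSON, instead of parsing a whole mmCIF file yourself.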

Continue reading

How reliable are affinity datasets in practice?

The Data Bottleneck in AI-Powered Drug Discovery

The pharmaceutical industry is undergoing a profound transformation, driven by the promise of Artificial Intelligence (AI) and Machine Learning (ML). These technologies offer the potential to overcome the industry’s persistent challenges of high costs, protracted development timelines, and staggering failure rates. From accelerating the identification of novel biological targets to optimizing the properties of lead compounds, AI is poised to enhance the precision and efficiency of drug discovery at nearly every stage.

Yet, this revolutionary potential is constrained by a fundamental dependency. The power of modern AI, particularly the deep learning (DL) models that excel at complex pattern recognition, is directly proportional to the volume, diversity, and quality of the data they are trained on. This creates a critical bottleneck: the high-quality experimental data required to train these models—specifically, the protein-ligand binding affinity values that quantify the strength of an interaction—are notoriously scarce, expensive to generate, and often of inconsistent quality or locked within proprietary databases.

Continue reading

Antibody developability datasets

In addition to binding their antigen with high affinity, antibodies for therapeutic purposes need to be developable. Developability properties include high expression, high stability, low aggregation, low immunogenicity, and low non-specificity [1]. These properties are often linked, and therefore optimising for one property might come at the expense of another. Machine learning methods have been built to guide the optimisation of one or multiple developability properties.

The performance of these methods is often limited by the amount and type of data available for training. These datasets contain experimentally determined scores from biophysical assays related to developability. Some common experimental assays are described in a previous blog post by Matthew Raybould [2]. Here I will discuss some commonly used and newer datasets related to antibody developability. This list is not exhaustive, but it might help you start to understand more about antibody developability.

Continue reading

Diagnostics on the Cutting Edge, Software in the Stone Age: A Microbiology Story

The need to treat and control infectious diseases has challenged humanity for millennia, driving a series of remarkable advancements in diagnostic tools and techniques. One of the earliest known legal texts, the Code of Hammurabi, references the visual and tactile diagnosis of leprosy. For centuries, the distinct smell of infected wounds was used to identify gangrene, and in Ancient Greece and Rome, the balance of the four humors (blood, phlegm, black bile, and yellow bile) was a central theory in diagnosing infections.

The invention of the compound microscope in 1590 by Hans and Zacharias Janssen, and its refinements by Robert Hooke and Antonie van Leeuwenhoek, marked a turning point as it enabled the direct observation of microorganisms, thereby linking diseases to their microbial origins. Louis Pasteur’s introduction of liquid media aided Joseph Lister in identifying microbes as the source of surgical infections, whilst Robert Koch’s experiments with Bacillus anthracis firmly established the connection between specific microbes and diseases.

Continue reading

What can you do with the OPIG Immunoinformatics Suite? v3.0

OPIG’s growing immunoinformatics team continues to develop and openly distribute a wide variety of databases and software packages for antibody/nanobody/T-cell receptor analysis. Below is a summary of all the latest updates (follows on from v1.0 and v2.0).

Continue reading

9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modelling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

  • De Novo Design
  • Open Science
  • Chemical Space
  • Physics-based Modelling
  • Machine Learning
  • Property Prediction
  • Virtual Screening
  • Case Studies
  • Molecular Representations

The conference has traditionally taken place every three years, and despite the global pandemic it returns this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:

Continue reading

Machine learning strategies to overcome limited data availability

Machine learning (ML) for biological/biomedical applications is very challenging – in large part due to limitations in publicly available data (something we recently published about [1]). Generating the types of data required to train ML models (e.g. protein structures, protein–protein binding affinities, microscopy images, gene expression values) can demand substantial time and resources.

In cases where there is sufficient data available to provide signal, but not enough for the desired performance, ML strategies can be employed:
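As one illustrative example of such a strategy (my own choice here, not necessarily one the full post covers), data augmentation expands a small labelled set by adding perturbed copies of existing samples. A minimal sketch with Gaussian noise on feature vectors, using made-up toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(X, y, copies=5, sigma=0.05):
    """Return (X, y) enlarged with `copies` noisy replicas per sample.

    Each replica keeps the original label; only the features are
    perturbed with zero-mean Gaussian noise of scale `sigma`.
    """
    X_aug, y_aug = [X], [y]
    for _ in range(copies):
        X_aug.append(X + rng.normal(0.0, sigma, size=X.shape))
        y_aug.append(y)
    return np.concatenate(X_aug), np.concatenate(y_aug)

X = np.array([[0.1, 0.9], [0.8, 0.2]])  # two labelled toy examples
y = np.array([0, 1])
X_big, y_big = augment(X, y)
print(X_big.shape)  # (12, 2): 2 originals + 2 * 5 noisy copies
```

The noise scale has to be chosen so that perturbed samples remain plausible for their label – too large and the augmentation injects label noise rather than signal.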

Continue reading