Machine learning strategies to overcome limited data availability

Machine learning (ML) for biological/biomedical applications is very challenging – in large part due to limitations in publicly available data (a topic we recently published on [1]). Generating the types of data required to train ML models – e.g. protein structures, protein-protein binding affinities, microscopy images, gene expression values – can demand substantial time and resources.

In cases where there is enough data to provide signal, but not enough to reach the desired performance, several ML strategies can be employed:

1. Pre-training and transfer learning

One solution to limited data availability is to turn to a different but related problem with more data. Pre-training on the data-rich problem can be followed by transfer learning on the desired task, by initializing the weights from the pre-trained model (Figure 1). The underlying concept is that the pre-trained model will learn the fundamentals of the system – for example, the physics governing protein-protein interactions – so the pre-trained weights will start closer to the optimum for the desired task. As such, less data should be required to calibrate the weights for that task.

Figure 1. Schematic of transfer learning.

There are many considerations for pre-training. The initial pre-training task can be supervised or unsupervised, although the latter is more common due to the limited availability of supervised (labelled) data. The training hyperparameters for transfer learning – such as the learning rate, learning rate schedule, weight decay, and optimizer – can also be adjusted. Additionally, the weights of some layers can be frozen (left unchanged during further training), and layers can be added or removed.
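To illustrate the warm-start idea (a toy sketch, not any specific published model), the code below pre-trains a small linear model on a data-rich task and then fine-tunes it on a data-poor related task, comparing against training from scratch with the same budget:

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y, w, lr=0.05, steps=200):
    """Plain gradient descent on mean-squared error for a linear model."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

d = 16
w_source = rng.normal(size=d)                      # "physics" of the data-rich task
w_target = w_source + 0.05 * rng.normal(size=d)    # related, data-poor task

# Data-rich pre-training task
X_big = rng.normal(size=(2000, d))
y_big = X_big @ w_source

# Data-poor target task (fewer samples than features)
X_small = rng.normal(size=(8, d))
y_small = X_small @ w_target
X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_target

# Pre-train, then transfer by initialising from the pre-trained weights
w_pre = train(X_big, y_big, np.zeros(d), steps=500)
w_transfer = train(X_small, y_small, w_pre)

# Baseline: same training budget, naive (zero) initialisation
w_scratch = train(X_small, y_small, np.zeros(d))

err_transfer = mse(X_test, y_test, w_transfer)
err_scratch = mse(X_test, y_test, w_scratch)
# The warm-started model generalises far better on the small task
```

In a real setting the model would be a deep network and the choices above (which layers to freeze, which hyperparameters to adjust) matter far more; the toy merely shows why starting near a good optimum helps when data is scarce.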

A note about terminology – for tasks that are in the same dataset domain (have the same data type/format) as the original pre-training task, updating the model weights is typically referred to as fine-tuning (e.g. pre-training on general protein data followed by fine-tuning on antibody data).

Examples where pre-training followed by transfer learning/fine-tuning has been applied include:

  • Fine-tuning protein structure prediction networks for peptide binding specificity prediction [2]
  • Geometric encoder to reconstruct perturbed protein structure, followed by transfer learning to change in affinity prediction [3]
  • Protein family-specific scoring functions for small molecule virtual screening [4]
  • Transfer learning to make predictions about network biology [5]

Pre-trained unsupervised models (e.g. protein language models) can also be used for “zero-shot” prediction, where no further transfer learning/fine-tuning is done – e.g.:

  • Inverse folding model (ESM-IF1) applied to tasks such as protein stability prediction [6]
  • Codon language model (CaLM) applied to tasks such as transcript abundance prediction [7]
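A minimal sketch of zero-shot scoring, with a toy position-frequency "language model" standing in for a real pre-trained model such as ESM; the family sequences here are invented for illustration:

```python
import math
from collections import Counter

# Toy stand-in for a pre-trained protein language model: per-position amino-acid
# frequencies from a (hypothetical) family alignment. A real workflow would use
# log-likelihoods from a model like ESM instead.
family = ["MKTA", "MKSA", "MRTA", "MKTG", "MKTA"]

def position_log_probs(seqs, alpha=1.0):
    """Laplace-smoothed per-position log-probabilities over the 20 amino acids."""
    aa = "ACDEFGHIKLMNPQRSTVWY"
    table = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        total = len(seqs) + alpha * len(aa)
        table.append({a: math.log((counts[a] + alpha) / total) for a in aa})
    return table

def zero_shot_score(seq, table):
    """Log-likelihood under the model, used directly -- no fine-tuning step."""
    return sum(table[i][a] for i, a in enumerate(seq))

table = position_log_probs(family)
# Consensus-like variants score higher than disruptive ones
score_consensus = zero_shot_score("MKTA", table)
score_disruptive = zero_shot_score("WWWW", table)
```

The key point is that the scores come straight from the pre-trained model's likelihoods; the downstream task (e.g. stability or abundance prediction) is never trained on directly.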

2. Synthetic data

Synthetic data is generated computationally rather than from experiments. In order to be used successfully, the synthetic data must be representative of biological data, or at least contain sufficient signal for the model to learn.

Synthetic data has been used to augment experimental training data for ML model development. For example, ESM-IF1 [6], an inverse folding model, was trained on both experimental protein structures (thousands) and AlphaFold2 models of protein sequences (millions) (Figure 2). The predicted structures expanded the training dataset drastically: the authors employed a 1:80 experimental:predicted structure ratio in training.

Figure 2. ESM-IF1 training scheme with solved (CATH) and predicted (AlphaFold2) structures. Figure reproduced from [6].
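Mixing a small experimental pool with a much larger synthetic pool at a fixed ratio can be done with a simple weighted sampler. A sketch with hypothetical data pools (not the actual ESM-IF1 pipeline):

```python
import random

random.seed(0)

# Hypothetical pools: a small experimental set and a large synthetic/predicted set
experimental = [("exp", i) for i in range(1_000)]
synthetic = [("syn", i) for i in range(80_000)]

def sample_batch(batch_size, ratio_synthetic=80):
    """Draw a minibatch mixing synthetic and experimental examples ~ratio:1."""
    p_syn = ratio_synthetic / (ratio_synthetic + 1)
    return [
        random.choice(synthetic) if random.random() < p_syn
        else random.choice(experimental)
        for _ in range(batch_size)
    ]

batch = sample_batch(8_100)
frac_syn = sum(1 for src, _ in batch if src == "syn") / len(batch)  # ~80/81
```

Sampling by ratio rather than concatenating the datasets keeps the experimental examples from being drowned out entirely while still exploiting the synthetic volume.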

Synthetic datasets have also been created specifically for ML development: for example, Absolut! for antibody specificity prediction [8] and an antibody-antigen ddG dataset to investigate the amount and diversity of experimental data that will be required for robust prediction [1].

A synthetic dataset could also be used for pre-training followed by fine-tuning on experimental data (for example, as was presented by Sam Gelman at a recent ML4ProteinEngineering seminar).

3. Semi-supervised learning

In addition to supervised and unsupervised ML, there is also semi-supervised ML. Semi-supervised approaches involve training on a moderate amount of labelled data and a large amount of unlabelled data (Figure 3). A model trained on the labelled data is used to provide labels for the unlabelled data, which can then be combined with the original labelled dataset for further training.
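A minimal self-training (pseudo-labelling) version of this loop, using a toy nearest-centroid "model" on 1-D data:

```python
import random

random.seed(0)

# Toy 1-D data: two classes centred at 0 and 4
labelled = [(0.1, 0), (-0.2, 0), (3.9, 1), (4.2, 1)]
unlabelled = [random.gauss(c * 4, 0.5) for c in (0, 1) for _ in range(50)]

def centroids(data):
    """Nearest-centroid 'model': the mean of each class's points."""
    return {label: sum(x for x, y in data if y == label)
                   / sum(1 for _, y in data if y == label)
            for label in (0, 1)}

def predict(x, means):
    return min(means, key=lambda label: abs(x - means[label]))

# 1) Train on the small labelled set
model = centroids(labelled)
# 2) Pseudo-label the unlabelled pool with the model's predictions
pseudo = [(x, predict(x, model)) for x in unlabelled]
# 3) Retrain on the labelled + pseudo-labelled data combined
model2 = centroids(labelled + pseudo)
```

Real semi-supervised methods typically keep only high-confidence pseudo-labels and iterate; this sketch shows a single pass of the idea.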

Semi-supervised learning has been applied to, for example, single-cell multi-omics [9] and Hidden Markov Models for biological sequence analysis [10].

Figure 3. Schematic of semi-supervised learning. Figure reproduced from [11].

4. Meta-learning

In a similar vein, meta-learning is applicable to cases with a small amount of clean data (the “meta set”) and a large amount of noisy data. Meta-learning approaches include learning to reweight, in which training examples are assigned importance weights (for example, down-weighting suspected noisy data points), and meta label correction, which attempts to correct data labels.
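A highly simplified learning-to-reweight sketch: real methods learn the weights (e.g. via gradients through the meta set), whereas here suspected noisy points are simply down-weighted when a model fit on the clean meta set disputes their labels:

```python
import random

random.seed(0)

# Noisy training data: two 1-D classes at 0 and 4, with ~30% of labels flipped
def make_point(label):
    x = random.gauss(label * 4, 0.5)
    noisy_label = label if random.random() > 0.3 else 1 - label
    return x, noisy_label

train_noisy = [make_point(label) for label in (0, 1) for _ in range(100)]
meta_clean = [(0.1, 0), (-0.3, 0), (4.1, 1), (3.8, 1)]   # small trusted "meta set"

def weighted_centroids(data, weights):
    """Class means with per-example importance weights."""
    means = {}
    for label in (0, 1):
        num = sum(w * x for (x, y), w in zip(data, weights) if y == label)
        den = sum(w for (x, y), w in zip(data, weights) if y == label)
        means[label] = num / den
    return means

def predict(x, means):
    return min(means, key=lambda label: abs(x - means[label]))

# Reference model fit on the clean meta set
meta_model = weighted_centroids(meta_clean, [1.0] * len(meta_clean))

# Learning to reweight (simplified): down-weight points the meta model disputes
weights = [1.0 if predict(x, meta_model) == y else 0.05 for x, y in train_noisy]

naive = weighted_centroids(train_noisy, [1.0] * len(train_noisy))
reweighted = weighted_centroids(train_noisy, weights)
```

With uniform weights the flipped labels drag each class mean towards the other class; down-weighting the disputed points largely recovers the true centroids.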

The application of meta-learning to an antibody binder dataset (antibodies binding HER2 with varying affinities) greatly improved robustness to noise and performance on small amounts of training data [12] (Figure 4).

Figure 4. Meta-learning approach for antibody binder data. Figure reproduced from [12].

Extra 1: Active learning

If you do not have enough data for the model you are trying to train – but have the ability to conduct further experiments – active learning can be used to explore what data is needed to improve model performance. Active learning involves iterative cycles (Figure 5) of:

  1. Model training – on available labeled data
  2. Querying – identifying areas of the data space where the model has high uncertainty and/or low coverage
  3. Data collection/labelling – for data points in the high uncertainty and/or low coverage space
  4. Appending – adding the new labelled data to the training data

Figure 5. Schematic of the iterative steps in active learning. Figure reproduced from [13].
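The four steps above can be sketched as an uncertainty-sampling loop on toy 1-D data, with an `oracle` function standing in for running new experiments:

```python
import random

random.seed(0)

def oracle(x):
    """Stand-in for a new experiment: returns the true label."""
    return int(x > 2.0)

pool = [random.uniform(0, 4) for _ in range(200)]   # unlabelled candidate points
labelled = [(0.2, 0), (3.8, 1)]                     # tiny initial labelled set

def fit_threshold(data):
    """Toy 'model': decision boundary at the midpoint of the class means."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

for _ in range(5):
    threshold = fit_threshold(labelled)              # 1) train on available labels
    pool.sort(key=lambda x: abs(x - threshold))      # 2) query: most uncertain points
    queried = [pool.pop(0) for _ in range(5)]
    labelled += [(x, oracle(x)) for x in queried]    # 3) label, 4) append

final_threshold = fit_threshold(labelled)            # should land near the true boundary
```

Because each round labels the points the current model is least certain about, the labelling budget concentrates near the decision boundary instead of being spent uniformly.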

Extra 2: Machine learning-grade data

To overcome challenges in data availability … we will need more data! But not just any kind of data. It will be essential to consider the needs of ML model development when generating data (to produce “machine learning-grade data”). This will include using standardized generation processes, estimating uncertainty, and assessing bias and dataset diversity.

–––

References

  1. Hummer et al., bioRxiv, 2023 – https://www.biorxiv.org/content/10.1101/2023.05.17.541222v1
  2. Motmaen et al., bioRxiv, 2022 – https://www.biorxiv.org/content/10.1101/2022.07.12.499365v1
  3. Liu et al., PLoS Comp Bio, 2021 – https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009284
  4. Imrie et al., J Chem Inf Model, 2018 – https://pubs.acs.org/doi/10.1021/acs.jcim.8b00350
  5. Theodoris et al., Nature, 2023 – https://www.nature.com/articles/s41586-023-06139-9
  6. Hsu et al., bioRxiv, 2022 – https://www.biorxiv.org/content/10.1101/2022.04.10.487779v2
  7. Outeiral and Deane, bioRxiv, 2022 – https://www.biorxiv.org/content/10.1101/2022.12.15.519894v1
  8. Robert et al., Nat Comp Sci, 2022 – https://www.nature.com/articles/s43588-022-00372-4
  9. Wang et al., PNAS Nexus, 2022 – https://academic.oup.com/pnasnexus/article/1/4/pgac165/6672590
  10. Tamposis et al., Bioinformatics, 2019 – https://academic.oup.com/bioinformatics/article/35/13/2208/5184961
  11. https://teksands.ai/blog/semi-supervised-learning
  12. Minot and Reddy, bioRxiv, 2023 – https://www.biorxiv.org/content/10.1101/2023.01.30.526201v1
  13. https://blogs.nvidia.com/blog/2020/01/16/what-is-active-learning/
