Antibody developability datasets

Besides binding the antigen with high affinity, therapeutic antibodies need to be developable. Developability properties include high expression, high stability, low aggregation, low immunogenicity, and low non-specificity [1]. These properties are often linked, so optimising one property can come at the expense of another. Machine learning methods have been built to guide the optimisation of one or more developability properties.

The performance of these methods is often limited by the amount and type of data available for training. These datasets contain experimentally determined scores from biophysical assays related to developability. Some common experimental assays are described in a previous blog post by Matthew Raybould [2]. Here I will discuss some commonly used and some new datasets related to antibody developability. This list is not exhaustive, but it might help you start understanding more about antibody developability.

Multi-property datasets

Probably the best-known antibody developability dataset is the one described by Jain et al. 2017 [1]. It consists of 137 antibodies in an advanced clinical stage, with data on expression (HEK Titer), stability (Tm, SEC), aggregation (CSI-BLI, SMAC, SEC, AC-SINS), hydrophobicity/solubility (HIC), and specificity (CIC, ELISA, PSR).

A recently described dataset from Arsiwala et al. 2025 [3] builds on the Jain 2017 dataset. It consists of 246 antibodies in an approved, clinical-stage, or preregistration phase and contains data on expression (Titer), purity (rCE-SDS), stability (nanoDSF, DSF, SEC), aggregation (SEC, SMAC), hydrophobicity/solubility (HIC), polyreactivity (PR CHO, PR Ova), and self-association (AC-SINS, DLS-kD).

Rosace et al. 2023 [4] report thermostability (Tm), binding (BLI), and poly-reactivity (CIC RT) data for mutations in three nanobodies and three scFvs. These data were used to experimentally validate their co-optimisation pipeline for solubility and stability.

Shehata et al. 2019 [5] determined poly-reactivity (PSR), hydrophobicity (HIC), and stability (Tm) for 400 human mAbs and 42 clinically approved therapeutics. In their study they compared sequences derived from memory B cells (somatically hypermutated) with sequences from naive B cells (germline-like). They conclude that somatically hypermutated sequences show reduced poly-reactivity, hydrophobicity, and stability.

Two-property datasets

Koenig et al. 2017 [6] experimentally measured expression and affinity for all single-point mutations of the high-affinity anti-VEGF antibody G6.31. The dataset revealed optimal mutations both within the antigen-binding site and at positions far away from it.
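To give a feel for what "all single-point mutations" means in practice, here is a minimal sketch that enumerates every single-residue substitution of a sequence. The sequence used is a hypothetical placeholder, not the antibody from the study, and the function name and labelling scheme are my own choices for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_point_mutants(sequence):
    """Yield (label, mutant_sequence) for every single-residue substitution.

    Labels use the common <wild type><1-based position><mutant> form, e.g. "G1A".
    """
    for pos, wild_type in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue  # skip the unchanged residue
            label = f"{wild_type}{pos + 1}{aa}"
            yield label, sequence[:pos] + aa + sequence[pos + 1:]


# Hypothetical placeholder region, for illustration only.
example_region = "GFTFSSYA"
mutants = dict(single_point_mutants(example_region))

# Each position contributes 19 substitutions.
print(len(mutants), "single-point mutants for", len(example_region), "positions")
```

The library size grows linearly with sequence length (19 variants per position), which is why exhaustive single-point scans are experimentally feasible while exhaustive multi-point scans quickly are not.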

Hie et al. 2024 [7] measured affinity (BLI) and stability (Tm) for variants of seven antibodies generated by their language-model-guided affinity maturation workflow.

Single-property datasets

Expression

Szkodny and Lee 2024 [8] determined expression (CHO Titer) of 178 single-point variants of the therapeutic antibody trastuzumab. They showed that mutations decreasing the hydrophobicity of buried residues contributed to lower expression.

Immunogenicity

Marks et al. 2021 [9] collected anti-drug antibody responses from clinical papers for 217 therapeutics. This dataset was originally used to evaluate the computational humanisation method Hu-mAb.

Binding

Shanehsazzadeh et al. 2023 [10] measured binding (SPR) of machine-learning-designed antibodies against human epidermal growth factor receptor 2 (HER2). The antibodies contain mutations in the heavy-chain CDRs.

Sirin et al. 2016 [11] published AB-Bind, a database of 27 antibodies and 6 general proteins with experimentally determined binding free energy values for 1101 mutations in total. They used this dataset to benchmark how well computational methods predict changes in binding free energy.
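For readers less familiar with the quantity being benchmarked, the sketch below shows how a change in binding free energy relates to measured dissociation constants, using the standard relation ΔG = RT ln(Kd). The sign convention (ΔΔG = ΔG of the mutant minus ΔG of the wild type, so positive values mean weaker binding) and the function names are assumptions made for illustration; check the convention of any dataset before using its values.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L."""
    return R_KCAL * temperature_k * math.log(kd_molar)


def ddg(kd_mutant, kd_wild_type, temperature_k=298.15):
    """Change in binding free energy upon mutation (mutant minus wild type).

    Assumed convention: positive values indicate weaker binding.
    """
    return (binding_free_energy(kd_mutant, temperature_k)
            - binding_free_energy(kd_wild_type, temperature_k))


# Example: a mutation that weakens binding tenfold (1 nM -> 10 nM).
print(round(ddg(1e-8, 1e-9), 2), "kcal/mol")
```

As a rule of thumb, a tenfold change in Kd at 25 °C corresponds to roughly 1.4 kcal/mol.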

Mason et al. 2021 [12] performed single-site deep mutational scanning of the CDRH3 of the therapeutic antibody trastuzumab to guide their library design. They experimentally expressed 50,000 variants and evaluated binding to HER2, resulting in 36.4k classified mutants.

Chinery et al. 2024 [13] generated a library of 524,000 trastuzumab CDRH3 variants using various computational methods. After classifying these, the top 700 designs were experimentally validated for binding to HER2.

Cia et al. 2025 [14] collected data on antibody-antigen complexes and curated a database called AbAgym, which stores deep mutational scanning (DMS) information. It contains 335k mutations from 67 antibody-antigen DMS experiments.

Non-specificity

Warszawski et al. 2019 [15] determined affinity (SPR) for variants of two antibodies. The dataset contains 1048 data points. They used it to validate their AbLIFT model, which suggests multi-point core mutations to improve VH-VL contacts.

Boughter et al. 2020 [16] combined previously described datasets of antibodies whose poly-reactivity was determined by ELISA. The dataset contains 1053 antibodies in total, of which approximately half are poly-reactive. 445 of these are mouse antibodies; the rest are human.

Harvey et al. 2022 [17] determined poly-reactivity (PSR) of a diverse naïve synthetic camelid nanobody library. With this dataset they constructed machine learning models to predict poly-reactivity from sequence.

Summary

The datasets discussed are summarised in the table below. Some of these datasets were formatted for benchmarking deep learning models by Chungyoun et al. 2024 [18].

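| Dataset              | Expression | Stability | Aggregation | Hydrophobicity/solubility | Binding | Poly-reactivity | Anti-drug antibody |
|----------------------|------------|-----------|-------------|---------------------------|---------|-----------------|--------------------|
| Jain 2017            | X          | X         | X           | X                         |         | X               |                    |
| Arsiwala 2025        | X          | X         | X           | X                         |         | X               |                    |
| Rosace 2023          |            | X         |             |                           | X       | X               |                    |
| Shehata 2019         |            | X         |             | X                         |         | X               |                    |
| Koenig 2017          | X          |           |             |                           | X       |                 |                    |
| Hie 2024             |            | X         |             |                           | X       |                 |                    |
| Szkodny 2024         | X          |           |             |                           |         |                 |                    |
| Marks 2021           |            |           |             |                           |         |                 | X                  |
| Shanehsazzadeh 2023  |            |           |             |                           | X       |                 |                    |
| Sirin 2016           |            |           |             |                           | X       |                 |                    |
| Mason 2021           |            |           |             |                           | X       |                 |                    |
| Chinery 2024         |            |           |             |                           | X       |                 |                    |
| Cia 2025             |            |           |             |                           | X       |                 |                    |
| Warszawski 2019      |            |           |             |                           | X       |                 |                    |
| Boughter 2020        |            |           |             |                           |         | X               |                    |
| Harvey 2022          |            |           |             |                           |         | X               |                    |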
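If you want to pick datasets programmatically, the coverage summarised in this section can be written as a small lookup. This is only a sketch: the property assignments are my reading of the dataset descriptions in this post (for instance, Warszawski is filed under binding because the post describes SPR affinity measurements), and the names `COVERAGE` and `datasets_covering` are made up for illustration.

```python
# Property coverage per dataset, keyed by first author and reference number.
# Assignments follow the descriptions in this post and are an interpretation.
COVERAGE = {
    "Jain et al. [1]": {"expression", "stability", "aggregation",
                        "hydrophobicity/solubility", "poly-reactivity"},
    "Arsiwala et al. [3]": {"expression", "stability", "aggregation",
                            "hydrophobicity/solubility", "poly-reactivity"},
    "Rosace et al. [4]": {"stability", "binding", "poly-reactivity"},
    "Shehata et al. [5]": {"stability", "hydrophobicity/solubility",
                           "poly-reactivity"},
    "Koenig et al. [6]": {"expression", "binding"},
    "Hie et al. [7]": {"stability", "binding"},
    "Szkodny and Lee [8]": {"expression"},
    "Marks et al. [9]": {"anti-drug antibody"},
    "Shanehsazzadeh et al. [10]": {"binding"},
    "Sirin et al. [11]": {"binding"},
    "Mason et al. [12]": {"binding"},
    "Chinery et al. [13]": {"binding"},
    "Cia et al. [14]": {"binding"},
    "Warszawski et al. [15]": {"binding"},  # assumed from the SPR affinity data
    "Boughter et al. [16]": {"poly-reactivity"},
    "Harvey et al. [17]": {"poly-reactivity"},
}


def datasets_covering(*properties):
    """Return the datasets that report every requested property."""
    wanted = set(properties)
    return [name for name, covered in COVERAGE.items() if wanted <= covered]


# Example: which datasets report both stability and binding?
print(datasets_covering("stability", "binding"))
```

Using sets makes the "covers all of these properties" question a simple subset check, which is handy when looking for data to train multi-property models.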

References

[1] Jain, T., Sun, T., Durand, S., Hall, A., Houston, N. R., Nett, J. H., … & Wittrup, K. D. (2017). Biophysical properties of the clinical-stage antibody landscape. Proceedings of the National Academy of Sciences, 114(5), 944-949.

[2] https://www.blopig.com/blog/2017/07/antibody-developability-experimental-screening-assays/

[3] Arsiwala, A., Bhatt, R., Yang, Y., Quintero Cadena, P., Anderson, K. C., Ao, X., … & Borhani, D. (2025). A high-throughput platform for biophysical antibody developability assessment to enable AI/ML model training. bioRxiv, 2025-05.

[4] Rosace, A., Bennett, A., Oeller, M., Mortensen, M. M., Sakhnini, L., Lorenzen, N., … & Sormanni, P. (2023). Automated optimisation of solubility and conformational stability of antibodies and proteins. Nature Communications, 14(1), 1937.

[5] Shehata, L., Maurer, D. P., Wec, A. Z., Lilov, A., Champney, E., Sun, T., … & Walker, L. M. (2019). Affinity maturation enhances antibody specificity but compromises conformational stability. Cell Reports, 28(13), 3300-3308.

[6] Koenig, P., Lee, C. V., Walters, B. T., Janakiraman, V., Stinson, J., Patapoff, T. W., & Fuh, G. (2017). Mutational landscape of antibody variable domains reveals a switch modulating the interdomain conformational dynamics and antigen binding. Proceedings of the National Academy of Sciences, 114(4), E486-E495.

[7] Hie, B. L., Shanker, V. R., Xu, D., Bruun, T. U., Weidenbacher, P. A., Tang, S., … & Kim, P. S. (2024). Efficient evolution of human antibodies from general protein language models. Nature Biotechnology, 42(2), 275-283.

[8] Szkodny, A. C., & Lee, K. H. (2024). A systemic approach to identifying sequence frameworks that decrease mAb production in a transient Chinese hamster ovary cell expression system. Biotechnology Progress, 40(5), e3466.

[9] Marks, C., Hummer, A. M., Chin, M., & Deane, C. M. (2021). Humanization of antibodies using a machine learning approach on large-scale repertoire data. Bioinformatics, 37(22), 4041-4047.

[10] Shanehsazzadeh, A., Bachas, S., McPartlon, M., Kasun, G., Sutton, J. M., Steiger, A. K., … & Meier, J. (2023). Unlocking de novo antibody design with generative artificial intelligence. bioRxiv, 2023-01.

[11] Sirin, S., Apgar, J. R., Bennett, E. M., & Keating, A. E. (2016). AB-Bind: antibody binding mutational database for computational affinity predictions. Protein Science, 25(2), 393-409.

[12] Mason, D. M., Friedensohn, S., Weber, C. R., Jordi, C., Wagner, B., Meng, S. M., … & Reddy, S. T. (2021). Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nature Biomedical Engineering, 5(6), 600-612.

[13] Chinery, L., Hummer, A. M., Mehta, B. B., Akbar, R., Rawat, P., Slabodkin, A., … & Deane, C. M. (2024). Baselining the buzz Trastuzumab-HER2 affinity, and beyond. bioRxiv, 2024-03.

[14] Cia, G., Li, D., Poblete, S., Rooman, M., & Pucci, F. (2025). AbAgym: a well-curated dataset for the mutational analysis of antibody-antigen complexes. bioRxiv, 2025-07.

[15] Warszawski, S., Borenstein Katz, A., Lipsh, R., Khmelnitsky, L., Ben Nissan, G., Javitt, G., … & Fleishman, S. J. (2019). Optimizing antibody affinity and stability by the automated design of the variable light-heavy chain interfaces. PLoS Computational Biology, 15(8), e1007207.

[16] Boughter, C. T., Borowska, M. T., Guthmiller, J. J., Bendelac, A., Wilson, P. C., Roux, B., & Adams, E. J. (2020). Biochemical patterns of antibody polyreactivity revealed through a bioinformatics-based analysis of CDR loops. eLife, 9, e61393.

[17] Harvey, E. P., Shin, J. E., Skiba, M. A., Nemeth, G. R., Hurley, J. D., Wellner, A., … & Kruse, A. C. (2022). An in silico method to assess antibody fragment polyreactivity. Nature Communications, 13(1), 7554.

[18] Chungyoun, M., Ruffolo, J., & Gray, J. (2024). FLAb: Benchmarking deep learning methods for antibody fitness prediction. bioRxiv, 2024-01.
