Using Antibody Next Generation Sequencing data to aid antibody engineering

       I consider myself a wet lab scientist, and I had not used a dynamic programming language such as Python before starting my DPhil. My main interests lie in the development of improved antibody humanization campaigns, the rational construction of antibody phage display libraries, and antibody evolution. Having completed an industrial placement at MedImmune, I saw the biotechnology industry from the inside and realized that scientists who can bridge computer science and the wet lab are in high demand.

      The title of my DPhil is very broad, and the research itself is data-driven rather than hypothesis-driven. Our research group collaborates with UCB Pharma, which has sequenced whole antibody repertoires across a number of species. A single dataset may contain more than 10 million heavy and light variable chain sequences. Even so, these datasets cover no more than 1% of the theoretical repertoire, so looking at the entropies of sequences rather than at the sequences alone could provide insights into the differences between intra- and inter-species datasets.
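
As a rough illustration of what "entropy of sequences" can mean in practice, the sketch below computes the Shannon entropy of each column in a set of equal-length sequences. It is only a minimal example under stated assumptions: the function name `positional_entropy` and the toy sequences are hypothetical, the input is assumed to be already aligned (for example, by a common numbering scheme), and this is not necessarily the entropy measure used in the project.

```python
import math
from collections import Counter


def positional_entropy(aligned_seqs):
    """Return the Shannon entropy (in bits) of each column of aligned sequences."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    entropies = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        total = len(column)
        # H = sum over residues of p * log2(1 / p)
        entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
        entropies.append(entropy)
    return entropies


# Hypothetical toy data: four aligned four-residue fragments.
toy_seqs = ["ARDY", "ARGY", "AKDY", "ARDF"]
print(positional_entropy(toy_seqs))
```

A fully conserved column gives an entropy of 0 bits, while a column in which all 20 amino acids are equally frequent would give log2(20), roughly 4.32 bits. Gap characters, if present, are counted here as just another symbol.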

        NGS of antibody repertoires provides snapshots of repertoire diversity and entropy as well as the sequences themselves. Reddy et al. (2010) showed that this information can be used successfully to pull out target-specific variable chains. However, most research groups believe that the main application of NGS is immunodiagnostics (Greiff et al., 2015).

       My project involves applying software developed by our research group, namely ANARCI (Dunbar and Deane, 2016) and ABodyBuilder (Leem et al., 2016). Combining the two tools allows analysis of NGS datasets at an unprecedented rate (1 million sequences per 7 hours). A number of manipulations can be performed on datasets to standardize them and make the data reproducible, which is a major issue in science. It is possible to re-assign the germlines, numbering schemes and complementarity-determining region (CDR) definitions of a 10-million-sequence dataset in less than a day. For instance, the data provided by UCB required the variable chains to be renumbered according to the IMGT numbering scheme and CDR definition (Lefranc, 2011). The IMGT scheme was selected because it supports symmetrical numbering of CDR amino acids, which improves the assignment of positions to amino acids that occupy the same structural space in CDRs of different lengths (Figure 1).
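
To make the idea of a CDR definition concrete, here is a minimal sketch of how CDRs could be cut out of a chain once it has been numbered in the IMGT scheme, using the IMGT boundaries CDR1 27-38, CDR2 56-65 and CDR3 105-117. The input format, a list of `((position, insertion_code), amino_acid)` tuples with `-` for unoccupied positions, is an assumption modelled loosely on the kind of output a numbering tool produces; the function name `extract_cdrs` and the toy data are hypothetical.

```python
# IMGT CDR boundaries as inclusive IMGT positions.
IMGT_CDRS = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}


def extract_cdrs(numbered):
    """Collect CDR residues from an IMGT-numbered chain.

    `numbered` is assumed to be a list of ((position, insertion_code), aa)
    tuples in sequence order, with "-" marking unoccupied positions.
    """
    cdrs = {name: [] for name in IMGT_CDRS}
    for (pos, _insertion), aa in numbered:
        if aa == "-":
            continue  # skip gaps introduced by the numbering scheme
        for name, (start, end) in IMGT_CDRS.items():
            if start <= pos <= end:
                cdrs[name].append(aa)
    return {name: "".join(residues) for name, residues in cdrs.items()}


# Hypothetical fragment around a five-residue CDR3 (positions 104 to 118).
toy_numbered = (
    [((104, " "), "C"), ((105, " "), "A"), ((106, " "), "R"), ((107, " "), "G")]
    + [((pos, " "), "-") for pos in range(108, 116)]
    + [((116, " "), "D"), ((117, " "), "Y"), ((118, " "), "W")]
)
print(extract_cdrs(toy_numbered))
```

Because the boundaries are expressed as scheme positions rather than sequence offsets, the same lookup applies unchanged to every chain in a dataset once all chains share one numbering scheme.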

                Figure 1. IMGT numbering and CDR definition of CDR3. Symmetrical assignment of positions to amino acids in HCDR3 allows better localization of the V, D and J gene contributions: the V gene encodes the amino terminus of CDR3, the J gene encodes the carboxyl terminus, and the D gene encodes the middle portion.
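
The symmetrical assignment described in the caption can be sketched in a few lines. The example below labels a CDR3 of a given length from both ends inwards, so that gaps (for loops shorter than 13 residues) or insertions (for longer loops) fall at the apex between positions 111 and 112. This follows the IMGT convention as commonly described and is only an illustration: the function name `imgt_cdr3_positions` is hypothetical, and this is not the implementation used by ANARCI.

```python
def imgt_cdr3_positions(length):
    """Return IMGT position labels for a CDR3 of `length` residues.

    The N-terminal half is labelled 105, 106, ... up to 111 and then
    111.1, 111.2, ...; the C-terminal half is labelled backwards from 117
    down to 112 and then 112.1, 112.2, .... For odd lengths the extra
    residue goes to the N-terminal half.
    """
    n_left = (length + 1) // 2
    n_right = length // 2

    left_core = list(range(105, 112))       # 105 .. 111
    right_core = list(range(117, 111, -1))  # 117 .. 112

    left = [str(p) for p in left_core[:n_left]]
    left += [f"111.{i}" for i in range(1, n_left - len(left_core) + 1)]

    right = [str(p) for p in right_core[:n_right]]
    right += [f"112.{i}" for i in range(1, n_right - len(right_core) + 1)]

    # The C-terminal labels were generated from the end, so reverse them.
    return left + right[::-1]


print(imgt_cdr3_positions(5))
print(imgt_cdr3_positions(15))
```

For a five-residue loop this yields positions 105, 106, 107, 116 and 117, leaving 108 to 115 empty. The first and last residues of any CDR3 therefore always sit at 105 and 117, whatever the loop length, which is what makes the V-encoded and J-encoded ends comparable across sequences.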

       To sum up, analysis of CDR lengths and of CDR and framework amino acid compositions, together with the discovery of novel patterns in antibody repertoires, will open up new rational approaches to antibody humanization and affinity maturation. The key step will be to determine the amino acid scaffolds that define the humanness of an antibody or, in other words, scaffolds that are not immunogenic in humans.
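
Two of the summary statistics mentioned above, the CDR length distribution and the amino acid composition, are simple to compute once CDR sequences have been extracted. The following is a minimal sketch; the function names and the toy CDR3 sequences are hypothetical and stand in for a real repertoire.

```python
from collections import Counter


def cdr_length_distribution(cdr_seqs):
    """Count how many CDR sequences have each length."""
    return Counter(len(seq) for seq in cdr_seqs)


def amino_acid_composition(cdr_seqs):
    """Return the fraction of each amino acid across all CDR sequences."""
    counts = Counter("".join(cdr_seqs))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in counts.items()}


# Hypothetical toy CDR3 sequences.
toy_cdr3s = ["ARDY", "ARGYW", "AKDY"]
print(cdr_length_distribution(toy_cdr3s))
print(amino_acid_composition(toy_cdr3s))
```

Computing the same two summaries per species, or per germline, would give a first view of how repertoires differ before any structural modelling is involved.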


  1. Dunbar J., Deane CM. ANARCI: antigen receptor numbering and receptor classification. Bioinformatics (2016)
  2. Greiff V., et al. A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Medicine (2015)
  3. Leem J., et al. ABodyBuilder: automated antibody structure prediction with data-driven accuracy estimation. mAbs (2016)
  4. Lefranc M. IMGT, the International ImMunoGeneTics Information System. Cold Spring Harb Protoc (2011)
  5. Reddy ST., et al. Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells. Nat Biotech (2010)