Let your library design blosum

During the lead optimisation stage of the drug discovery pipeline, we might wish to make mutations to an initially identified binding antibody to improve properties such as developability, immunogenicity, and affinity.

There are many ways we could go about suggesting these mutations, including protein language models, e.g. ESM and AbLang, or inverse folding methods, e.g. ProteinMPNN and AntiFold. However, some of our recent work (soon to be pre-printed) has shown that classical non-Machine Learning approaches, such as BLOSUM, could also be worth considering at this stage.

BLOSUM matrices (BLOcks SUbstitution Matrices) describe how often each amino acid is substituted with every other amino acid in aligned blocks of related proteins. The number attached to each matrix is a sequence identity threshold: sequences more similar than this are clustered together before substitutions are counted. Common thresholds are 45%, 62%, and 80%, with each cut-off resulting in a different final matrix. These matrices are most often displayed as 20 x 20 arrays of integers, where positive and negative values indicate substitutions seen more and less often than expected by chance, respectively.

Though these matrices were generated from observations across all proteins, we can reverse engineer them with antibodies in mind; the goal here is to obtain substitution likelihoods (that sum to one) that could be used to guide mutations. To obtain these BLOSUM likelihoods, it is useful to examine how BLOSUM matrices are calculated:

$$s(a,b) = \frac{1}{\lambda} \log \frac{p_{ab}}{f_a f_b}$$

A full description of this formula can be found here but in brief, a and b are two dummy amino acids, s(a,b) are the integer BLOSUM scores, f_a and f_b are the background frequencies with which a and b occur, lambda is a scaling factor, and p_ab are the probabilities we wish to obtain: how often a is substituted with b, and vice versa.
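To make the inversion concrete: rearranging the log-odds relationship gives p_ab = f_a * f_b * exp(lambda * s(a,b)), and normalising over b for a fixed a yields likelihoods that sum to one. Below is a minimal sketch of that step, not the code from our notebook. It assumes natural logarithms, and the three-letter alphabet, scores, frequencies, and the function name `substitution_likelihoods` are all made up for illustration.

```python
import math


def substitution_likelihoods(score_row, background, lam):
    """Turn one row of log-odds scores s(a, b) into likelihoods over b.

    Uses p_ab = f_a * f_b * exp(lam * s(a, b)). Within a single row f_a is
    a constant, so it cancels when the row is normalised to sum to one.
    """
    weights = {b: background[b] * math.exp(lam * s) for b, s in score_row.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


# Toy inputs (hypothetical values, NOT taken from a real BLOSUM matrix).
toy_row_for_A = {"A": 4, "S": 1, "W": -3}       # scores s(A, b)
toy_background = {"A": 0.5, "S": 0.3, "W": 0.2}  # background frequencies f_b

# Assumed scaling: half-bit units, i.e. lam = ln(2) / 2. Treat as tunable.
lam = math.log(2) / 2

likelihoods = substitution_likelihoods(toy_row_for_A, toy_background, lam)
for aa, p in likelihoods.items():
    print(f"A -> {aa}: {p:.3f}")
```

With a real matrix, `score_row` would be the row for the residue being mutated and `background` would cover all 20 amino acids.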

When considering mutations to an antibody’s CDR loops, BLOSUM-45 is a good matrix to choose as our starting point due to the highly variable nature of these loops. If you are interested in mutating the framework region of an antibody, it may be worth considering using BLOSUM-62 or BLOSUM-80 matrices instead.

We can obtain antibody-specific amino acid background frequencies, f_a and f_b, from an antibody database such as SAbDab or OAS.
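As a rough sketch of how such background frequencies might be computed, the snippet below counts amino acids across a list of sequences. It assumes the sequences have already been exported from the database as plain strings; the function name `background_frequencies`, the pseudocount, and the second and third example sequences are hypothetical placeholders rather than real database entries.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def background_frequencies(sequences, pseudocount=1.0):
    """Estimate amino acid frequencies from a collection of sequences.

    A pseudocount keeps every frequency above zero, which matters because
    a zero background frequency would forbid that residue entirely.
    """
    counts = Counter()
    for seq in sequences:
        # Ignore gaps and non-standard characters.
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Placeholder input: in practice this would be many thousands of CDR
# sequences. Only the first string comes from this post.
example_cdrh3s = ["WGGDGFYAMD", "ARDYYGSSYFDY", "ARGGYDILTGYYFDY"]
freqs = background_frequencies(example_cdrh3s)

for aa in sorted(freqs, key=freqs.get, reverse=True)[:5]:
    print(f"{aa}: {freqs[aa]:.3f}")
```

Computing separate frequencies per region (for example, CDRH3 only versus framework only) is one option if you expect the composition to differ between them.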

Finally, we can combine the above, tweaking our value of lambda if we wish, to obtain substitution likelihoods, p_ab, for an example CDRH3, such as Trastuzumab’s – WGGDGFYAMD.
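Putting the pieces together might look something like the sketch below, which builds a per-position likelihood profile for the CDRH3 and samples a small library from it. This is an illustration under stated assumptions rather than the notebook's implementation: the score matrix is a simple match/mismatch placeholder (not BLOSUM-45), the background is uniform, the lambda values are arbitrary, and the function names are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_likelihoods(sequence, scores, background, lam):
    """For each position, return {amino acid: likelihood}, summing to one."""
    profile = []
    for a in sequence:
        weights = {
            b: background[b] * math.exp(lam * scores[a][b]) for b in AMINO_ACIDS
        }
        total = sum(weights.values())
        profile.append({b: w / total for b, w in weights.items()})
    return profile


def sample_library(profile, n_variants, seed=0):
    """Draw variants by sampling each position independently from its profile."""
    rng = random.Random(seed)
    library = []
    for _ in range(n_variants):
        variant = "".join(
            rng.choices(list(dist), weights=list(dist.values()))[0]
            for dist in profile
        )
        library.append(variant)
    return library


# Placeholder inputs (assumptions): swap in a real BLOSUM matrix and
# antibody-derived background frequencies before using this in earnest.
placeholder_scores = {
    a: {b: (5 if a == b else -1) for b in AMINO_ACIDS} for a in AMINO_ACIDS
}
uniform_background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

cdrh3 = "WGGDGFYAMD"
profile = position_likelihoods(cdrh3, placeholder_scores, uniform_background, lam=0.3)
sharper = position_likelihoods(cdrh3, placeholder_scores, uniform_background, lam=1.0)
library = sample_library(profile, n_variants=5)

print(f"P(W stays W), lam=0.3: {profile[0]['W']:.2f}")
print(f"P(W stays W), lam=1.0: {sharper[0]['W']:.2f}")
print(library)
```

A larger lambda concentrates each distribution on the highest-scoring residues, giving a more conservative library; a smaller lambda flattens it towards the background frequencies.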

All the code to generate the above plots and design your own libraries can be found in the following Colab Notebook.
