Author Archives: jaroslaw nowak

Computational immunogenicity reduction

In my last presentation, I talked about the article by King et al. describing a method for computationally removing T-cell epitopes from proteins. The work could have a significant impact on the field of designing protein therapeutics, where immunogenicity is a serious obstacle.

One of the major challenges when developing a protein therapeutic is the activation of the immune system by the drug and the subsequent production of antibodies against it, rendering the therapeutic ineffective. This process is known as immunogenicity. Immunogenicity is triggered by T-cell recognition of peptide epitopes displayed on the MHC (major histocompatibility complex). This recognition can be impeded by designing the protein therapeutic to remove the potential T-cell epitopes from its surface. There has been some success in experimental T-cell epitope removal, but the process remains resource- and time-consuming.

In this work, King et al. created a function which assigns to each residue a score that measures its propensity to be a part of a T-cell epitope. The score consists of three parts. The first part is based on an SVM (Support Vector Machine) score calculated over each 15-residue window, which attempts to predict how likely the corresponding peptide sequence is to bind the MHC. The SVM was trained on the available immunological data from the Immune Epitope Database (IEDB). The second part of the score is calculated on each 9-residue window and compares the frequency of the 9-mer in the host genomic data and in the known epitope data (a sequence occurring with a high frequency in the human genome is rewarded, while a sequence occurring in the known epitope data is penalized). The third part penalizes any deviation from the original charge of the protein. These three parts are combined with a standard Rosetta score that measures the stability of the protein. The weights assigned to each part were calibrated on existing protein structures. The combined score is then used to rank mutations in the sequence of the protein of interest by their propensity to reduce immunogenicity. The top-scoring mutations are then combined in a greedy fashion.
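The last two steps (scoring over sliding windows and greedily combining mutations) can be illustrated with a toy example. This is only a sketch under loud assumptions: the hydrophobic-residue count stands in for the SVM term, the charge penalty weight of 4.0 is arbitrary, lower scores are treated as better, and none of the function names come from the paper or from Rosetta.

```python
from typing import Callable, List, Tuple

Mutation = Tuple[int, str]  # (0-based position, new residue)

def apply_mutations(seq: str, muts: List[Mutation]) -> str:
    chars = list(seq)
    for pos, aa in muts:
        chars[pos] = aa
    return "".join(chars)

def window_score(seq: str, width: int, score_window: Callable[[str], float]) -> float:
    """Sum a per-window score over every window of the given width."""
    return sum(score_window(seq[i:i + width]) for i in range(len(seq) - width + 1))

def net_charge(seq: str) -> int:
    return sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")

def greedy_combine(seq: str, candidates: List[Mutation],
                   total_score: Callable[[str], float]):
    """Rank single mutations, then keep each one only if it lowers the combined score."""
    ranked = sorted(candidates, key=lambda m: total_score(apply_mutations(seq, [m])))
    accepted: List[Mutation] = []
    used_positions = set()
    best = total_score(seq)
    for mut in ranked:
        if mut[0] in used_positions:
            continue  # at most one mutation per position
        trial = total_score(apply_mutations(seq, accepted + [mut]))
        if trial < best:
            accepted.append(mut)
            used_positions.add(mut[0])
            best = trial
    return accepted, best

ORIGINAL = "AFLKVWDA"  # made-up sequence

def toy_score(seq: str) -> float:
    # Hypothetical stand-in for the epitope term: hydrophobic residues per 3-mer window.
    epitope_like = window_score(seq, 3, lambda w: sum(w.count(a) for a in "FILVWY"))
    # Penalize deviation from the original net charge (weight chosen arbitrarily).
    return epitope_like + 4.0 * abs(net_charge(seq) - net_charge(ORIGINAL))

accepted, best = greedy_combine(ORIGINAL, [(1, "S"), (2, "K"), (4, "T")], toy_score)
print(accepted, best, apply_mutations(ORIGINAL, accepted))
```

In this toy case the charge-changing mutation is rejected even though it removes a hydrophobic residue, which is the kind of trade-off the combined score is meant to capture.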

The authors tested their method on the fluorescent reporter protein superfolder GFP (sfGFP) and the toxin domain of the cancer therapeutic HA22. In the case of sfGFP the authors targeted the four top-scoring T-cell epitopes. They created eight different protein designs, all of which preserved the function of the original protein (fluorescence). The authors selected the top-scoring design for experimental immunogenicity testing. The experiments showed that the selected design had significantly reduced immunogenicity in comparison to the original protein. In the case study of HA22 the authors created five designs, three of which displayed cytotoxicity at the same level as or higher than the original protein. The two most cytotoxic designs were further characterized experimentally for their propensity to induce an immune response. The authors found that the two mutants elicited a significantly reduced T-cell response.

Figure 1: Reduction of immunogenicity without loss of function. A) Three of the five designs show cytotoxicity at the same level or higher than the original protein. B) Two of the three cytotoxicity-preserving designs show reduced immunogenicity

Overall, this very interesting study showed that computational methods can be successfully used for reducing immunogenicity of protein therapeutics, opening new avenues for computational protein design.


Transgenic Mosquitoes

At the meeting on November 15 I covered a paper by Gantz et al. describing a method for creating transgenic mosquitoes expressing antibodies that hinder the development of malaria parasites.

The immune system is commonly divided into two categories: innate and adaptive. The innate immune system consists of non-specific defence mechanisms such as epithelial barriers, macrophages etc. The innate system is present in virtually every living organism. The adaptive immune system is responsible for the invader-specific defence response. It consists of B and T lymphocytes and encompasses antibody production. As only vertebrates possess the adaptive immune system, mosquitoes do not naturally produce antibodies, which hinders their ability to defend themselves against pathogens such as malaria.

In the study by Gantz et al. the authors inserted transgenes expressing three single-chain Fvs (m4B7, m2A10 and m1C3) into previously characterised chromosomal docking sites.

Figure 1: The RT-PCR experiments showing the scFv expression in different mosquito strains

RT-PCR was used to detect scFv transcripts in RNA isolated from the transgenic mosquitoes (see Figure 1). The experiments showed that the attP 44-C recipient line allowed expression of the transgenes coding for the scFvs.

The authors evaluated the impact of the modifications on the fitness of the mosquitoes. It was shown that the transgene expression does not reduce the lifespan of the mosquitoes, or their ability to procreate.

Expression of the scFvs targeted the parasite at both the early and late development stages. The transgenic mosquitoes displayed a significant reduction in the number of malaria sporozoites per infected female, in most cases completely inhibiting the sporozoite development.

Overall, the study showed that it is possible to develop transgenic mosquitoes that are resistant to malaria. If this method were combined with a mechanism for spreading the genes through a population, the malaria-resistant mosquitoes could be released into the environment, helping to fight the spread of this disease.

Designing antibodies targeting disordered epitopes

At the meeting on February 10 I covered the article by Sormanni et al. describing a methodology for computationally designing antibodies against intrinsically disordered regions of proteins.

Antibodies are proteins that are a natural part of our immune system. For over 50 years lab-made antibodies have been used in a wide variety of therapeutic and diagnostic applications. Nowadays, we can design antibodies with high specificity and affinity for almost any target. Nevertheless, engineering antibodies against intrinsically disordered proteins remains costly and unreliable. Since about 33% of all eukaryotic proteins could be intrinsically disordered, and the disordered proteins are often implicated in various ailments and diseases, such a methodology could prove invaluable.

Cascade design

The initial step in the protocol involves searching the PDB for protein sequences that interact in a beta strand with segments of the target sequence. Next, such peptides are joined together using a so-called “cascade method”. The cascade method starts with the longest peptide found and grows it to the length of the target sequence by joining it with other, partially overlapping peptides coming from beta strands of the same type (parallel or antiparallel). In the cascade method, all fragments used must form the same hydrogen bond pattern. The resulting complementary peptide is expected to “freeze” part of the disordered protein by forcing it to locally form a beta sheet. After the complementary peptide is designed, it is grafted onto a single-domain antibody scaffold. The authors chose this because antibodies have a longer half-life and lower immunogenicity than free peptides.
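The growing step of the cascade method can be sketched as a greedy interval extension. The code below is a simplified illustration rather than the authors' implementation: fragments are reduced to the target positions they cover plus a strand type, the `Fragment` type and `min_overlap` parameter are hypothetical, and the hydrogen bond pattern check is not modelled beyond requiring the same strand type.

```python
from typing import List, NamedTuple, Tuple

class Fragment(NamedTuple):
    start: int   # first target position covered (0-based, inclusive)
    end: int     # one past the last target position covered
    strand: str  # "parallel" or "antiparallel"

def cascade(fragments: List[Fragment], target_len: int,
            min_overlap: int = 1) -> Tuple[List[Fragment], Tuple[int, int]]:
    """Start from the longest fragment and extend it with overlapping fragments."""
    seed = max(fragments, key=lambda f: f.end - f.start)
    lo, hi = seed.start, seed.end
    used = [seed]
    while (lo, hi) != (0, target_len):
        best, best_gain = None, 0
        for f in fragments:
            if f.strand != seed.strand:
                continue  # only join fragments from the same type of beta strand
            overlap = min(hi, f.end) - max(lo, f.start)
            gain = (lo - min(lo, f.start)) + (max(hi, f.end) - hi)
            if overlap >= min_overlap and gain > best_gain:
                best, best_gain = f, gain
        if best is None:
            break  # nothing extends the current peptide any further
        used.append(best)
        lo, hi = min(lo, best.start), max(hi, best.end)
    return used, (lo, hi)

# Made-up fragments for a 12-residue target.
frags = [
    Fragment(3, 8, "antiparallel"),
    Fragment(6, 11, "antiparallel"),
    Fragment(0, 4, "antiparallel"),
    Fragment(9, 12, "parallel"),
    Fragment(10, 12, "antiparallel"),
]
used, covered = cascade(frags, target_len=12)
print(used, covered)
```

Here the parallel fragment is never used because the seed is antiparallel, and the remaining fragments are chained until the whole target is covered.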

To test their method the authors initially assessed the robustness of their design protocol. First, they ran the cascade method on three targets – α-synuclein, Aβ42 and IAPP. They found that more than 95% of the residue positions in the three proteins could be targeted by their method. In addition, the mean number of available fragments per position was 570. They also estimated their coverage on a larger scale, using 1690 disordered protein sequences obtained from the DisProt database and from measured NMR chemical shifts. About 90% of residue positions from DisProt and 85% of positions from the chemical shift data could be covered by at least one designed peptide. The positions that were hard to target usually contained proline, in agreement with the known result that prolines tend to disrupt secondary structure formation.

To test the quality of their designs the authors created complementary peptides for α-synuclein, Aβ42 and IAPP and grafted them onto the CDR3 region of a human single-domain antibody scaffold. All designs were highly stable and bound their targets with high specificity. Following this encouraging result the authors measured the affinity of one of their designs (one of the anti-α-synuclein antibodies). The Kd was found to lie in the range 11-27 μM. Such affinity is too low for pharmaceutical purposes, but it is enough to prevent aggregation of the target protein.

As the last step in the project the authors attempted a two-peptide design, where a second peptide was grafted into the CDR2 region of the single-domain scaffold. Both peptides were designed to bind the same epitope. The two-peptide design managed to reach the affinity required for pharmaceutical viability (Kd below 185 nM with 95% confidence). Nevertheless, the two-loop design became very unstable, rendering it not viable for pharmaceutical purposes.

Overall, this study presents a very exciting step towards computationally designed antibodies targeting disordered epitopes and deepens our understanding of antibody functionality.

Next generation sequencing of paired heavy and light chain sequences

At the last meeting before Christmas I covered the article by DeKosky et al. describing a new methodology for sequencing the paired VH-VL repertoire.

In recent years there has been an exponential growth in the number of available antibody sequences, caused mainly by the development of cheap and high-throughput Next Generation Sequencing (NGS) technologies. This trend led to the creation of several publicly available antibody sequence databases such as the DIGIT database and the abYsis database, containing hundreds of thousands of unpaired light chain and heavy chain sequences from over 100 species. Nevertheless, the sequencing of the paired VH-VL repertoire remained a challenge, with the available techniques suffering from low throughput (<700 cells) and high cost. In contrast, the method developed by DeKosky et al. allows for relatively cheap paired sequencing of most of the 10^6 B cells contained within a typical 10-ml blood draw.

The workflow is as follows. First, the isolated cells, suspended in water, and magnetic poly(dT) beads mixed with cell lysis buffer are pushed through a narrow opening into a rapidly moving annular oil phase. This results in a thin jet that coalesces into droplets in such a way that each droplet has a very low chance of having a cell inside it. This ensures that the vast majority of droplets that do contain cells contain only one cell each. Next, cell lysis occurs within the droplets and the mRNA fragments coding for the antibody chains attach to the poly(dT) beads. Following that, the mRNA fragments are recovered and linkage PCR is used to generate 850 bp cDNA fragments for NGS.
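The link between sparse loading and single-cell droplets can be checked with a short calculation. The sketch below assumes that cells end up in droplets independently, so that the number of cells per droplet follows a Poisson distribution with mean λ. This is an assumption made for illustration and the λ values are arbitrary; neither comes from the article.

```python
import math

def single_cell_fraction(mean_cells_per_droplet: float) -> float:
    """Fraction of occupied droplets holding exactly one cell, under Poisson loading."""
    lam = mean_cells_per_droplet
    p_one = lam * math.exp(-lam)        # P(exactly one cell)
    p_occupied = 1.0 - math.exp(-lam)   # P(at least one cell)
    return p_one / p_occupied

for lam in (1.0, 0.1, 0.01):
    print(f"mean cells per droplet = {lam}: "
          f"{single_cell_fraction(lam):.1%} of occupied droplets hold a single cell")
```

Under this assumption, lowering the mean occupancy from one cell per droplet to one cell per ten droplets raises the single-cell fraction among occupied droplets from under 60% to roughly 95%, at the cost of leaving most droplets empty.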

To analyse the accuracy of their methodology the authors sequenced paired CDR-H3 – CDR-L3 sequences from blood samples obtained from three different human donors, filtering the results by 96% clustering and read quality, and removing sequences with fewer than two reads. Overall, this resulted in ~200,000 paired CDR-H3 – CDR-L3 sequences. The authors found that the pairing accuracy of their methodology was ~98%.

The article also contained some bioinformatics analysis of the data. The authors first analysed CDR-L3 sequences that tend to pair up with many diverse CDR-H3 sequences, and asked whether such “promiscuous” CDR-L3s are also “public”, i.e. common to all three donors. Their results show that out of the 50 most common promiscuous CDR-L3s, 49 are also public. The results also show that the promiscuous CDR-L3s show little to no modification, being very close to the germline sequence.

Illustration of the sequencing pipeline

The sequencing data also contained examples of allelic inclusion, where one B cell expresses two B-cell receptors (almost always one VH gene and two distinct VL genes). It was found that about 0.5% of all analysed B cells showed allelic inclusion.

Finally, the authors looked at the occurrence of traits commonly associated with broadly neutralizing antibodies (bNAbs), produced to fight rapidly mutating pathogens (such as the influenza virus). These traits were a short (<6 aa) CDR-L3 and a long (11 – 18 aa) CDR-H3. In total, the authors found 31 sequences with these features, suggesting that bNAbs can be found in the repertoire of healthy donors.

Overall, this article presents a very interesting and promising method that should allow for large-scale sequencing of paired VH-VL sequences.

Antibody binding site re-design

In this blog post I describe three successful studies on structure-based re-design of antibody binding sites, leading to significant improvements in binding affinity.

In their study Clark et al.[1] re-designed the binding site of antibody AQC2 to improve its binding affinity to the I domain of human integrin VLA1. The authors assessed the effects of the mutations on the binding energy using the CHARMM[2,3] potential with the electrostatic and desolvation energies calculated using the ICE software[4]. In total, 83 variants were identified for experimental validation, some of which included multiple mutations. The mutated antibodies were expressed in E. coli and the affinity to the antigen was measured. The best mutant included a total of four mutations which improved the affinity by approximately one order of magnitude, from 7 nM to 850 pM. The crystal structure of the best mutant was solved to further study the interaction of the mutant with the target.

Figure 1: Comparison of calculated and experimental binding free energies. (Lippow et al., 2007)

Lippow et al.[5] studied the interactions of three antibodies – the anti-epidermal growth factor receptor drug cetuximab[6], the anti-lysozyme antibody D44.1 and the anti-lysozyme antibody D1.3 – with their respective antigens. The energy calculations favoured mutations to large amino acids (such as Phe or Trp), most of which were found to be false positives. More accurate results were obtained using only the electrostatic term of the energy function. The authors improved the binding affinity of D44.1 by one order of magnitude and the affinity of cetuximab by two orders of magnitude. The antibody D1.3 didn’t show many opportunities for electrostatic improvement and the authors suggest it might be an anomalous antibody.

Computational methods have recently been used to successfully introduce non-canonical amino acids (NCAA) into the antibody binding site. Xu et al.[7] introduced L-DOPA (L-3,4-dihydroxyphenylalanine) into the CDRs of the anti-protective antigen scFv antibody M18 to crosslink it with its native antigen. The authors used the program Rosetta 3.4 to create models of the antibody-antigen complex with L-DOPA residues. The distance between L-DOPA and a lysine nucleophile was used as a predictor of crosslinking. The crosslinking efficiency was quantified as the fraction of antibodies that underwent a mass change, measured using Western blot assays. The measured average efficiency of the mutants was 10%, with a maximum efficiency of 52%.

[1]      Clark LA, Boriack-Sjodin PA, Eldredge J, Fitch C, Friedman B, Hanf KJM, et al. Affinity enhancement of an in vivo matured therapeutic antibody using structure-based computational design. Protein Sci 2006;15:949–60. doi:10.1110/ps.052030506.

[2]      Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M. CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J Comput Chem 1983;4:187–217.

[3]      MacKerell Jr. AD, Brooks III CL, Nilsson L, Roux B, Won Y, Karplus M. CHARMM: The Energy Function and Its Parameterization with an Overview of the Program. In: Schleyer PvR, et al., editors. vol. 1, John Wiley & Sons: Chichester; 1998, p. 271–7.

[4]      Kangas E, Tidor B. Optimizing electrostatic affinity in ligand–receptor binding: Theory, computation, and ligand properties. J Chem Phys 1998;109:7522. doi:10.1063/1.477375.

[5]      Lippow SM, Wittrup KD, Tidor B. Computational design of antibody-affinity improvement beyond in vivo maturation. Nat Biotechnol 2007;25:1171–6. doi:10.1038/nbt1336.

[6]      Sato JD, Kawamoto T, Le AD, Mendelsohn J, Polikoff J, Sato GH. Biological effects in vitro of monoclonal antibodies to human epidermal growth factor receptors. Mol Biol Med 1983;1:511–29.

[7]      Xu J, Tack D, Hughes RA, Ellington AD, Gray JJ. Structure-based non-canonical amino acid design to covalently crosslink an antibody-antigen complex. J Struct Biol 2014;185:215–22. doi:10.1016/j.jsb.2013.05.003.

Clustering Algorithms

Clustering is the task of organizing data into groups (called clusters), such that members of each group are more similar to each other than to members of other groups. This is a brief description of three popular clustering algorithms – K-Means, UPGMA and DBSCAN.

Cluster analysis

K-Means is arguably the simplest and most popular clustering algorithm. It takes one parameter – the expected number of clusters k. At the initialization step a set of k means m1, m2,…, mk is generated (hence the name). At each iteration step the objects in the data are assigned to the cluster whose mean yields the least within-cluster sum of squares. After the assignments, the means are updated to be centroids of the new clusters. The procedure is repeated until convergence (the convergence occurs when the means no longer change between iterations).
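The iteration described above can be written in a few lines of plain Python. This is a minimal sketch, assuming the initial means are drawn at random from the data points (the text does not say how they are generated); the function names, the `seed` parameter and the example data are made up for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points: Sequence[Point]) -> Point:
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points: Sequence[Point], k: int, seed: int = 0, max_iter: int = 100):
    rng = random.Random(seed)
    means: List[Point] = rng.sample(list(points), k)  # assumed initialization
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the nearest mean.
        labels = [min(range(k), key=lambda j: sq_dist(p, means[j])) for p in points]
        # Update step: each mean becomes the centroid of its cluster.
        new_means = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_means.append(centroid(members) if members else means[j])
        if new_means == means:  # converged: the means no longer change
            break
        means = new_means
    return labels, means

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means = k_means(data, k=2)
print(labels, means)
```

Which cluster receives label 0 depends on the random initialization, so only the grouping itself is meaningful.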

The main strength of K-means is that it’s simple and easy to implement. The largest drawback of the algorithm is that one needs to know in advance how many clusters the data contains. Another problem is that with wrong initialization the algorithm can easily converge to a local minimum, which may result in suboptimal partitioning of data.

UPGMA (unweighted pair group method with arithmetic mean) is a simple hierarchical clustering method, where the distance between two clusters is taken to be the average of distances between individual objects in the clusters. In each step the closest clusters are combined, until all objects are in clusters where the average distance between objects is lower than a specified cut-off.

The UPGMA algorithm is often used for construction of phenetic trees. The major issue with the algorithm is that the tree it constructs is ultrametric, which means that the distance from the root to any leaf is the same. In the context of evolution, this means that the UPGMA algorithm assumes a constant rate of accumulation of mutations, an assumption which is often incorrect.
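A minimal sketch of this average-linkage merging in plain Python is given below. It interprets the stopping rule as "stop once the two closest clusters are further apart than the cut-off", which is an assumption of this example, and it returns the flat clusters rather than the tree. The function names and the example data are illustrative.

```python
from itertools import combinations
from typing import List, Sequence

def average_linkage(a: List[int], b: List[int], dist: Sequence[Sequence[float]]) -> float:
    """Average distance over all pairs with one object in each cluster."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

def upgma_clusters(dist: Sequence[Sequence[float]], cutoff: float) -> List[List[int]]:
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        (x, y), d = min(
            (((x, y), average_linkage(clusters[x], clusters[y], dist))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break  # the closest clusters are too far apart to merge
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)] + [merged]
    return clusters

# Five points on a line; the distance matrix holds absolute differences.
values = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in values] for a in values]
clusters = upgma_clusters(dist, cutoff=3)
print(clusters)
```

Recomputing every pairwise average in each round keeps the code short but is slow for large data sets; practical implementations update a distance matrix instead.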

DBSCAN is a density-based algorithm which tries to separate the data into regions of high density, labelling points that lie in low-density areas as outliers. The algorithm takes two parameters, ε and minPts, and looks for points that are density-connected with respect to ε. A point p is said to be density-reachable from a point q if there is a chain of points between q and p such that one can traverse the path from q to p never moving further than ε in any step, and every point on the path except possibly p is a core point, i.e. has at least minPts points within distance ε. Because only the end point of the path may be a non-core point, the concept of density-reachability is not symmetric, so a concept of density-connectivity is introduced. Two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. A set of points is considered a cluster if all points in the set are mutually density-connected and the number of points in the set is equal to or greater than minPts. The points that cannot be put into clusters are classified as noise.
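A compact, unoptimized sketch of the procedure is shown below. It follows the usual formulation in which a point counts as a core point when at least minPts points (itself included) lie within ε, and clusters are grown outwards from core points. The function name, the brute-force neighbour search and the example data are choices made for this illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

NOISE = -1

def dbscan(points: Sequence[Tuple[float, ...]], eps: float, min_pts: int) -> List[int]:
    labels: List[Optional[int]] = [None] * len(points)  # None means not yet visited

    def neighbours(i: int) -> List[int]:
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # only core points extend the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels

data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (20, 20)]
labels = dbscan(data, eps=1.5, min_pts=3)
print(labels)  # -> [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (20, 20) has no neighbours within ε and is therefore reported as noise.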

a) Illustrates the concept of density-reachability.
b) Illustrates the concept of density-connectivity.

The DBSCAN algorithm can efficiently detect clusters with non-globular shapes, since it is sensitive only to changes in density. Because of that, the clustering reflects real structure present in the data. The main difficulty with the algorithm is choosing the parameter ε, which controls how large the difference in density needs to be to separate two clusters.