Computational immunogenicity reduction

In my last presentation, I talked about the article by King et al. describing a method for computationally removing T-cell receptor epitopes from proteins. The work could have significant impact on the field of designing protein therapeutics, where immunogenicity is a serious obstacle.

One of the major challenges when developing a protein therapeutic is the activation of the immune system by the drug and subsequent production of antibodies against it, rendering the therapeutic ineffective. This process is known as immunogenicity. Immunogenicity is triggered by T-cells recognition of peptide epitopes displayed on the MHC (major histocompatibility complex). This recognition can be impeded by designing the protein therapeutic to remove the potential T-cell epitopes from its surface. There has been some success in experimental T-cell epitope removal, but the process remains resource and time consuming.

In this work, King et al. created a function which assigns to each residue a score that measures its propensity to be a part of a T-cell epitope. The score consists of three parts. The first part is based on a SVM (Support Vector Machine) score calculated over each 15-residue long window, that attempts to predict how likely is the corresponding peptide sequence to bind the MHC. The SVM has been trained on the available immunological data from the Immune Epitope Database (IEDB). The second part of the score is calculated on each 9-residue window and compares the frequency of the 9-mer in the host genomic data and in the known epitope data (a sequence occurring with a high frequency in a human genome would be rewarded while the opposite is true for sequences occurring in the known epitope data). The third part penalizes any deviations from the original charge of the protein. These three parts are combined with a standard Rosetta score that measures the stability of the protein. The weights assigned to each segment were calibrated on existing protein structures. The combined score would be used to score the mutations in the sequence of the protein of interest, according to their propensity of reducing immunogenicity. The top scoring mutations would then be combined in a greedy fashion.

The authors tested their method on fluorescent reporter protein superfolder GFP (sfGFP) and the toxin domain of the cancer therapeutic HA22. In the case of sfGFP the authors targeted the four top-scoring T-cell epitopes. They created eight different proteins designs, out of which all preserved the function of the original protein (fluorescence). The authors selected the top scoring design for experimental immunogenicity testing. The experiments have shown that the selected design had a significantly reduced immunogenicity in comparison to the original protein. In the case study of HA22 the authors created five designs, out of which three displayed cytotoxicities at the same level or higher than the original protein. The two most cytotoxic designs were further characterized experimentally for their propensity to induce immune response. The authors have found that the two mutants elicited a significantly reduced T-cell response.

Figure 1: Reduction of immunogenicity without loss of function. A) Three of the five designs show cytotoxicity at the same level or higher than the original protein. B) Two of the three cytotoxicity-preserving designs show reduced immunogenicity

Overall, this very interesting study showed that computational methods can be successfully used for reducing immunogenicity of protein therapeutics, opening new avenues for computational protein design.

Computationally designing antibodies using a known binding motif

This blogpost is be about the “Computational design of an epitope-specific Keap1 binding antibody using hotspot residues grafting and CDR loop swapping” by Liu et. al. that I presented at group meeting in May.

Antibody design is a subject that I am closely interested in, especially methods that have an important computational step. So far the go-to methods for designing an antibody used by industry are animal model immunisation and/or phage display, with little or no use of computational methods. In the past few years, however, a few computational methods for rational design of antibodies have been making a showing. Firstly, there are the ones where a structure of the docked antibody-antigen already exists, and the antibody is further refined computationally to increase binding affinity. Then there are the ones where the paratope of the antibody is proposed by the designer against a specific target. The paper I am summarizing here by Liu et. al follows the latter idea in a neat way.

Liu et. al. show that if a specific motif is important for binding a certain target, i.e. there is a crystal structure which shows that the motif is buried in the target and/or you predict that its residues are important for binding, it is worthwhile trying to graft that that motif in the CDR area of antibody (the one which is responsible for antibody specificity and affinity). Grafting of entire CDR loops has been long used for antibody humanisation, with many examples where CDR loops maintaining conformation and binding specificity when being transferred from a non-human scaffold to a human scaffold. This is somewhat aided by the fact that the starting and end points of the area being grafted is stable (i.e. the anchors are conformationally the same in all the antibody structures that we observe), which is not the case in Liu et al where they graft a four residue motif. The cool thing they do which makes it more probable for the motif to maintain conformation is identify an antibody which has in one of its CDR loops a fragment with the same backbone conformation with the motif they are trying to graft. They then just replace the residue types to the ones that are known to bind the target. For the Nrf2 motifs (that binds Keap1) they managed to create 5 potential designs. These were further expanded, using rational point mutations on the rest of the antibody in order to increase possibility of binding, to 10. Out of the 10 two showed binding.

One of the potential issues in a real scenario however is the fact that not an entire binding site is copied on antibody, the motif being a subset of the whole, which means the possibility of a low affinity and/or low chances of competing with the original protein (i.e. Nrf2) from which the motif was copied. This actually turned out to be the case, with the initial designs showing low mM affinity. Liu et. al. further worked on improving the initial designs, and they did so by computationally swapping the H3 CDR of the initial designs to a set of other H3 structures that have been seen in other solved antibodies using the Rosetta design protocol. They retained the ones that had a predicted buried SASA of > 2000 A^2, a change in energy of more than 20 REU and a shape complementarity greater than 0.6. These were then tested experimentally with a few of them showing nM affinities, a result which at this time should make you very happy if your entire design phase was done computationally.

Using bare git repos

Git is a fantastic method of doing version control of your code. Whether it’s to share with collaborators, or just for your own reference, it almost acts as an absolute point of reference for a wide variety of applications and needs. The basic concept of git is that you have your own folder (in which you edit your code, etc.) and you commit/push those changes to a git repository. Note that Git is a version control SYSTEM and GitHub/BitBucket etc. are services that host repositories using Git as its backend!

The basic procedure of git can be summarised to:

1. Change/add/delete files in your current working directory as necessary. This is followed by a git add or git rm command.
2. “Commit” those changes; we usually put a message reflecting the change from step 1. e.g. git commit -m "I changed this file because it had a bug before."
3. You “push” those changes with git push to a git repository (e.g. hosted by BitBucket, GitHub, etc.); this is sort of like saying “save” that change.

Typically we use services like GitHub to HOST a repository. We then push our changes to that repository (or git pull from it) and all is good. However, a powerful concept to bear in mind is the ‘bare’ git repository. This is especially useful if you have code that’s private and should be strictly kept within your company/institution’s server, yet you don’t want people messing about too much with the master version of the code. The diagram below makes the bare git repository concept quite clear:

The bare repo acts as a “master” version of sorts, and every other “working”, or non-bare repo pushes/pulls changes out of it.

Let’s start with the easy stuff first. Every git repository (e.g. the one you’re working on in your machine) is a WORKING/NON-BARE git repository. This shows files in your code as you expect it, e.g. *.py or *.c files, etc. A BARE repository is a folder hosted by a server which only holds git OBJECTS. In it, you’ll never see a single .py or .c file, but a bunch of folders and text files that look nothing like your code. By the magic of git, this is easily translated as .py or .c files (basically a version of the working repo) when you git clone it. Since the bare repo doesn’t contain any of the actual code, you can safely assume that no one can really mess up with the master version without having gone through the process of git add/commit/push, making everything documented. To start a bare repo…

# Start up a bare repository in a server
user@server:$~  git init --bare name_to_repo.git

# Go back to your machine then clone it
user@machine:$~ git clone user@server:/path/to/repo/name_to_repo.git .

# This will clone a empty git repo in your machine
cd name_to_repo
ls
# Nothing should come up.

touch README
echo "Hello world" >> README
git add README
git commit -m "Adding a README to initialise the bare repo."
git push origin master # This pushes to your origin, which is user@server:/path/to/repo/name_to_repo.git

If we check our folders, we will see the following:

user@machine:$~ ls /path/to/repo
README # only the readme exists

user@server:$~ ls /path/to/repo/name_to_repo.git/
branches/ config description HEAD hooks/ info/ objects/ refs/

Magic! README isn’t in our git repo. Again, this is because the git repo is BARE and so the file we pushed won’t exist. But when we clone it in a different machine…

user@machine2:$~ git clone user@server:/path/to/repo/name_to_repo.git .
ls name_to_repo.git/
README
cat README
Hello world #magic!

This was a bit of a lightning tour but hopefully you can see that the purpose of a bare repo is to allow you to host code as a “master version” without having you worry that people will see the contents directly til they do a git clone. Once they clone, and push changes, everything will be documented via git, so you’ll know exactly what’s going on!

Experimental Binding Modes of Small Molecules in Protein-Ligand Docking

Protein-ligand docking tends to be very good at generating binding modes that resemble experimental binding modes from X-ray crystallography and other methods (assuming we have a high quality structure…); but it is also very good at generating plausible models for ligands that don’t bind. These so-called “false positives” lead to reduced accuracy in structure-based virtual screening campaigns.

Structure-based methods are not the only way of approaching virtual screening: when all we know is the chemical structure of an active molecule, but nothing about its target (or targets), we can use ligand-based virtual screening methods, which operate on the principle of molecular similarity (Maggiora et al., 2014).

But what if we combine both methods?

Continue reading →

A brief history of usage of the word “decoy” in protein structure prediction

Some concepts in science are counter-intuitive, like the Monty Hall problem or the Mpemba effect. Occasionally, this is also true for terminology, despite the best efforts of scientists to ensure that their work can be explained unambiguously to newcomers. Specifically, in our field of protein structure prediction, the word “decoy” has been used to mean one of many conformations generated by a de novo modelling protocol such as Rosetta, or alternative conformations of loops produced by an ab initio program e.g. Sphinx. Though slightly baffled by this usage when I started working in the field, I have now become so familiar with its strange new meaning that I have to remind myself to explain it in talks to a more general audience, or simply aim to avoid the term altogether. Nonetheless, following a heated discussion over the term in a recent group meeting, I thought it would be interesting to trace the roots of the new meaning.

Let’s begin with a definition from Google:

decoy

noun

noun: decoy; plural noun: decoys

/ˈdiːkɔɪ,dɪˈkɔɪ/

a bird or mammal, or an imitation of one, used by hunters to attract other birds or mammals.

“a decoy duck”

a person or thing used to mislead or lure someone into a trap.

“we need a decoy to distract their attention”

So we start with the idea of something distracting, resembling the true thing but with the intent to deceive. So how has this sense of the word evolved into what we use now? I attempted to dig out the earliest mention of decoy for a computationally generated protein conformation with a Google scholar search for “decoy protein”, which led to the work of Thomas and Dill published in 1996. Here the authors describe a method of distinguishing the native fold of a protein from the sequence threaded, without gaps, onto alternative structures from the PDB. This problem of discrimination between native and non-native had been carried out previously, but Thomas and Dill chose to describe the alternatives as “decoy conformations” or just “decoys”.

A similar problem was commonly attempted over the following years, of separating native structures from sets of computationally generated conformations. Due to the demands of conformer generation at this time, some sets were published themselves in online databases to be used as a resource for training scoring functions.

When it comes to the problem of de novo protein structure prediction, unfortunately it isn’t as simple as picking out the correct answer from a population of incorrect answers. Even among hundreds of thousands of conformations generated by the best methods, the exact native crystal structure will not be found (though a complication here that the protein is dynamic and will occupy an ensemble of native conformations). Therefore, the aim of any scoring function in structure prediction is instead to select which incorrect conformation is closest to the native structure, hoping to obtain at least the correct fold.

It is for this reason that we move towards the idea of choosing a model from a pool of decoys. Zhu et al. (2003) use “decoy” in precisely this way:

“One strategy for ab initio protein structure prediction is to generate a large number of possible structures (decoys) and select the most fitting ones based on a scoring or free energy function”

This seems to be where the idea of a decoy as incorrect and distracting is lost, and takes on its new meaning as one of a large and diverse set of protein-like conformations, which has continued until now.

So is it ever helpful to refer to “decoys” as opposed to “models”? What is communicated by “decoy” that is not achieved by using the word “model”? I think this may come down to the impression which is given by talking about a pool of decoys. People would not generally assume that each decoy on its own has any effective use for prediction of function. There is a sense that this is not the final result of the structure prediction pipeline, there is work yet to be done in refining, clustering, and making human judgments on the suitability of the output. Only after these stages would I feel more comfortable using the word “model”, to express the greater confidence we have in the structure (small though that may be in the de novo structure prediction world). However, the inadequacy of “model” does not alone justify this tenuous usage of “decoy”. Perhaps we could speak more often about populations of “conformations”. In any case, “decoy” is widespread in the community, and easily understood by those who are most likely to be reading, reviewing and editing the literature so I think we will be stuck with it for a while yet.

Conformational diversity analysis reveals three functional mechanisms in proteins

Conformational diversity analysis reveals three functional mechanisms in proteins

This paper was published recently in Plos Comp Bio and looks at the conformational diversity (flexibility) of protein structures by comparing solved structures of identical sequences.

The premise of the work is that different crystal structures of the same protein represent instances of the conformational space of the protein. These different instances are identical in amino acid sequence but often differ in other ways they could come from different crystal forms or the protein could have different co-factors bound or have undergone post translational modifications.

The data set used in the paper came from CoDNaS (conformational diversity of the native state) Database URL: http://ufq.unq.edu.ar/codnas.

Only structures solved using X-ray crystallography to a resolution better than 2.5A were used and only proteins for which at least 5 conformers were available (average of 15.53 conformers per protein). Just under 5000 different protein chains made up the set. In order to describe the protein chains the measure used was maximum conformational diversity (the maximum RMSD between any of the conformers of a given protein chain).

The authors describe a relationship between this maximum conformational diversity and the presence absence of intrinsically disordered regions (IDRs). An IDR was defined as a segment of at least 5 contiguous residues with missing electron density (the first and last 20 residues of the chain were not included).

The proteins were divided into three groups.

Rigid

No IDRS

Partially disordered

IDRs in at least one conformer
IDR in the maximum RMSD pair of conformational diversity

Malleable

IDRs in at least one conformer
No IDR in the maximum RMSD pair of conformational diversity

Rigid proteins have in general lower conformational diversity than partially disordered than Malleable. The authors describe how these differences are not due to crystallographic conditions, protein length, number of crystal contacts or number of conformers.

The authors then go on to compare other properties based on these three types of protein chains including amino acid composition, loop RMSD and cavities and tunnels.

They summarise their findings with the figure below.

Interesting Antibody Papers

Here we highlight two antibody papers, one from the past one more recent. The more recent one talks about developing an affinity maturation model. The older one is a refresher on the Developability Index — how to computationally harness hydrophobicity and accessible surface areas to predict aggregation.

Mouse antibody maturation model — the most expanded (common) clones might not be the ones with highest affinities here (van Kampen lab). The authors of the paper define a model of affinity maturation. The main take-home message of the paper is that the ‘most expanded’ clones might not be the ones with highest affinity — expanded clones are assumed to be the ones ‘responding’ to the antigenic challenge. The model is based on Ordinary Differential Equations, tracing cell fate in a germinal center. The model was compared to experimental expansion data from lymph nodes for accuracy. In each such model one needs to assume a lot of parameters, such as which day post-immunization do we start somatic hypermuatation? The paper is a very nice example of a model of maturation and a good starting point for tracing references citing germinal center biology and numbers for parameters used for models (also the general canon of construction of such models!).

Developability index here. (Trout lab at MIT). The authors touch on a very important subject of antibody developability: after you produced your ab binder, does it have physicochemical characteristics which are suitable to carry on with it as a therapeutic. Such characteristics include stability, expression yields and aggregation propensity. Aggregation propensity is one of the most important factors here as it affects the pharmacokinetics of the drug as well as shelf life. In this manuscript, authors address attempt to predict the aggregation propensity of antibodies. As background data, they use twelve antibodies whose long term stability has been measured over several years. To develop a computational method to predict antibody aggregation propensity, they use a score which combines hydrophobicity and electrostatic factors. The hydrophobicity is an adapted SAP score which the authors developed previously, and whose main parameters are the exposed residue area and hydrophobicity of the residue as defined by Black and Mould. The electrostatics are calculated using PROPKA. Since combining the scores into a predictive model involved parametrization, they use seven of the antibodies to adjust the coefficients. They use the rest to demonstrate that their model has predictive power. Calculation of their models requires a structure of an antibody which they obtain using WAM. Take home messages? It is a nice dataset to play with aggregation prediction and it demonstrates how to calculate electrostatics and hydrophobicity of a molecule.

Protein structure determination using metagenome sequence data

For this week’s journal club, I presented a recent paper from Ovchinnikov, and the David Baker group – Protein structure determination using metagenome sequencing data. This discussed how incorporating metagenome sequence data into multiple sequence alignments, can assist with and improve residue-residue contact prediction. The paper concludes with the prediction of over 600 structures from protein families that currently have no solved structures.

The Pfam database contains 14,849 protein families with 50 or more residues. However, only 4752 of these families have at least one member with an experimentally determined structure. 3984 of the remaining 10,097 families have reliable comparative models built on the basis of homologs of known structure. Less confident comparative models can be built for a further 902 families, however this leaves 5211 families with no structural information.

The recent technological advances in genome sequencing have provided an increasingly large number of amino acid sequences to work with. Large numbers of sequences allows the identification of compensatory mutations that have occurred in residues that are in contact with each other. This is called evolutionary co-variance and can allow the relatively accurate prediction of residues that are in contact in a structure. Rosetta utilises these co-evolutionary couplings, along with partial structural matches (found by combining the predicted contacts with contact patterns of known structures, using the map_align algorithm ) to predict structures from a number of families with fold-level accuracy ( TM-score > 0.5 ). However, it was unknown if this method could be used to accurately predict protein structures on a large-scale.

One challenge in using co-evolutionary couplings to predict residue-residue contacts is that a large number of sequences (hundreds to thousands) are needed. The accuracy of the predicted contacts is also dependent on the diversity of the sequences in a family, and the length of the protein. N_f is a measure that incorporates all of these factors :

Figure 1A shows the dependence of Rosetta structure prediction accuracy on the N_f. In general, where N_f≥64, accuracy typical of comparative modelling (TM-score > 0.7) can be achieved. For N_f≥32, fold-level accuracy (TM-score > 0.5) can be achieved, below this, accuracy falls off. Of the 5211 families with no structural information, only ~400 of these had N_f≥64; therefore accurate structural modelling could not be achieved for the remaining ~4800 of these families using the sequencing data available on UniRef100.

Fig 1. (a) Accuracy of predicted structures produced with and without refinement by Rosetta for families with different Nf values. (b) Number of protein families with Nf≥64 between 2009 and 2015 using UniRef100 database, and UniRef100 and Metagenome data. (c) Percentage of protein families with Nf scores 4, 8, 16, 32, and 64 including sequences from UniRef100 and metagenome data.

The addition of metagenome sequence data (from shotgun sequencing microbial DNA from environmental samples) increased the proportion of families with N_f≥64 from 0.08, to 0.25. The proportion of families with N_f≥32 also increased from 0.16, to 0.33. The difference in the fraction of protein families with N_f≥64 before and after the addition of metagenome sequence data can be seen in Figure 1B, and Figure 1C shows the percentage of families with N_f scores above 4, 8, 16, 32 and 64.

After running a set of benchmark calculations, this larger set of sequence data were used to generate models for 921 protein families, which now had N_f≥64 and also had number of long range contacts greater than half the number of residues in the protein. Of these 921 protein families, models with predicted TM scores > 0.65 were generated for 614 families. Although these were only predicted TM scores, crystal structures for members of 5 of the 614 families have since been published and had a TM-score > 0.7 when compared with the corresponding model.

Limitations with this using this data include the lack of eukaryotic genetic information currently, as well as the lack of explicit modeling of ligands, co-factors and lipids using the Rosetta workflow. However, the fast rate of increase in metagenome sequencing data (as compared to the rate of increase of sequencing data in UniRef100) means that while these new models fill roughly 12% of the unknown structural information for protein families, the potential for future structural prediction is bright.

Bitbucket and PyCharm – Tools to make a DPhil less problematic

I find Git a wonderful tool for my work, with version control providing much needed damage control to my projects. I also find Git incredibly powerful at making my working life easier, with the ability to use git push and git pull to synchronise my code between the various computers that I use for my DPhil. Via a BitBucket account, providing a remote Git repository, I am able to move my code around to wherever I am working and allow more room for either more procrastination or staring at my screen in confusion.

As simple as GIT is, it can be a fiddle entering the git commands in command line as well as remembering to do this as you rush to leave the building. This has all been made much easier with PyCharm, from JetBrains. This IDE (integrated development environment) has many tools including version control such as support for a variety of file types, PEP8 checks to ensure good quality code and its ability to work with ipython notebooks.

I’ve put the following mini-tutorial together for those who want to make or bring in an existing repository to PyCharm and get version control working:

Continue reading →

Colour page counter

So you’ve written the thesis, you’ve been examined, the corrections are done, and now you are left with just wearing the silly clown robes to get a piece of paper with your name on it. However, you’ve been informed that you aren’t allowed to don the silly robes until you print the damn thing (again) and submit it to the Bod to be ignored for generations to come. Oh, and the added bonus is that you have to pay for it. Naturally, you want the high-quality printing and paper to match for the final versions, but it’s all so expensive. At least you can save a few meagre pounds by specifying only the pages for colour printing. Naturally, I decided that I would spend far more time making a script than just counting them myself (which I did anyway to verify it works). Enjoy.

#!/usr/bin/env Rscript


library(data.table)

args <- commandArgs(trailingOnly=TRUE) x <- system(paste("gs -q -o - -sDEVICE=inkcov",args[1]," | awk '{print $1,$2,$3}'"),intern=TRUE) x <- as.data.table(tstrsplit(x,' ')) x[,c("V1","V2","V3"):=.(as.numeric(V1),as.numeric(V2),as.numeric(V3))] print(paste("Colour pages total:",sum(rowSums(x)!=0))) print(paste("Colour pages:", paste(which(rowSums(x) != 0),collapse=', ')))

Oxford Protein Informatics Group

or "OPIG" to friends