Tag Archives: journal club

How to write a review paper as a first year PhD student

As a first year PhD student, it is not uncommon to be asked to write a review paper on your subject area. It is a great way both to get acquainted with your research field and to get the background portion of your thesis completed early. However, it can seem a daunting task to go from knowing almost nothing about your research field to producing something of interest to experts who have spent years studying your subject matter.

In my first year, I was in exactly this position, and I found very little online to help guide the process. Thus, here is my reflective look at writing a review paper, which will hopefully help someone else in the future.

Continue reading

Thinking of going to a conference

As so many members of the group have never attended an in-person conference, I thought it might be worth answering the question “why do people attend conferences?”

First up, we should remember that flying around the world is not cost-free for the planet, so all of us lucky enough to be able to travel should think hard every time before we choose to do so.

This means it’s really important to make sure that we know why we are going to any conference and that we maximise the benefits of attendance. Below are a few things to think about in terms of why you attend a conference and what to do when you are there, but this is definitely not a complete list, more a starter for four.

Continue reading

Paper review: “EquiBind”

Molecular docking helps us understand how small molecules interact with proteins. This is especially useful in early stages of drug development, such as target identification and compound screening. Quick and accurate docking software allows researchers to focus their attention on a smaller set of lead molecules for further testing. Traditionally, docking software has employed first principles from physics and chemistry. Recently, deep learning has become all the rage for molecular docking, perhaps motivated by the successful application of deep learning to protein folding.

Method

EquiBind is a deep learning method for unconstrained (blind) docking, modelling a fixed receptor and a ligand with selected rotatable bonds. It predicts the binding pocket and the ligand’s conformation within the pocket in one go. Under the hood, EquiBind employs two great ideas from a recent ICLR 2022 paper: an SE(3)-equivariant graph neural network architecture, and the generation of fixed sets of matched key points to define a rotation and translation between receptor and ligand. In addition, the authors introduce a fast method to project a deformed ligand onto the space spanned by the rotatable bonds of a pre-generated ligand conformation.
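To make the key-point idea concrete: once the network has produced two matched sets of key points, the optimal rigid rotation and translation between them has a closed-form least-squares solution, the Kabsch algorithm. Here is a minimal sketch of that step (my own illustration, not the authors' code):

```python
import numpy as np

def kabsch(P, Q):
    """Best rigid transform mapping key points P onto key points Q.

    P, Q: (K, 3) arrays of matched key points (e.g. ligand-side and
    receptor-side key points predicted by the network). Returns R, t
    such that P @ R.T + t approximates Q in the least-squares sense.
    """
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)          # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

# The same transform is then applied to every ligand atom:
# R, t = kabsch(ligand_keypoints, receptor_keypoints)
# docked_coords = ligand_coords @ R.T + t
```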

Continue reading

Antibody Binding is Mediated by a Compact Vocabulary of Paratope-Epitope Interactions

While my own research focuses mainly on what happens in an antibody before it binds its antigen, I recently came across a paper by Akbar et al. [1] that examines antibody-antigen interactions, using an elegant approach to identify a set of structural motifs that antibodies use to interact with their epitopes. Since I am interested in the emergent properties that arise when a sequence is mapped onto an antibody structure, this paper was very exciting. I will also shamelessly admit that I’m a sucker for a pretty figure, and this paper has many! Regardless, on to the findings!

Example of identified interaction motifs. Figure from Akbar et al., 2021.
Continue reading

It’s been here all along: Analysis of the antibody DE loop

In my work, I mainly look at antigen-bound antibodies and this means a lot of analysing interfaces. Specifically, I spend a lot of my time examining the contributions of complementarity-determining regions (CDRs) to antigen binding, but what about antibodies where the framework (FW) region also contributes to binding? Such structures do exist, and these interactions are rarely trivial. As such, a recent preprint I came across where the authors examined the DE loops of antibodies was a great motivator to broaden my horizons!

Continue reading

Learning from Biased Datasets

Both the beauty and the downfall of learning-based methods is that the data used for training will largely determine the quality of any model or system.

While there have been numerous algorithmic advances in recent years, the most successful applications of machine learning have been in areas where either (i) you can generate your own data in a fully understood environment (e.g. AlphaGo/AlphaZero), or (ii) data is so abundant that you’re essentially training on “everything” (e.g. GPT-2/3, CNNs trained on ImageNet).

This covers only a narrow range of applications, and most data does not fall into either of these two categories. Unfortunately, when that is the case (and even sometimes when you are in one of those rare cases), your data is almost certainly biased – you just may or may not know it.

Continue reading

Journal club: Human enterovirus 71 protein interaction network prompts antiviral drug repositioning

Viruses are small infectious agents, which possess genetic code but have no independent metabolism. They propagate by infecting host cells and hijacking their machinery, often killing the cells in the process. One of the key challenges in developing effective antiviral therapies is the high mutation rate observed in viral genomes. A way to circumvent this issue is to target host proteins involved in virion assembly (also known as essential host factors, or EHFs), rather than the virion itself.

In their recent paper, Han et al. [1] consider human-virus protein-protein interactions in order to explore possible host drug targets, as well as drugs which could potentially be repurposed as antivirals. Their study focuses on enterovirus 71 (EV71), one of the leading causes of hand, foot, and mouth disease.

Human virus protein-protein interactions and target identification

EHFs are typically detected by knocking out genes in the host organism and determining which knockouts result in virus control. Low reproducibility and high costs make this technique unsuitable for large-scale studies. Instead, the authors use an extensive yeast two-hybrid screen to identify 37 unique protein-protein interactions between 7 of the 11 virus proteins and 29 human proteins. Pathway enrichment suggests that the human proteins interacting with EV71 are involved in a wide range of cellular functions. Despite this range in functionality, as many as 17 are also associated with other viruses, either through known physical interactions or as EHFs (Fig. 1).

Fig. 1. Interactions between viral and human proteins (denoted as EIPs), and their connection to different viruses.
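As an aside, pathway enrichment of this kind usually boils down to a hypergeometric test per annotated gene set: given how many of the 29 interactors fall in a pathway, how surprising is that overlap? A minimal sketch with made-up numbers (not the paper's):

```python
from scipy.stats import hypergeom

# Illustrative numbers only: is a pathway over-represented among
# the 29 EV71-interacting human proteins?
N = 20000   # background genes considered
K = 150     # background genes annotated to the pathway
n = 29      # EV71 interactors (the hit list)
k = 4       # interactors that fall in the pathway

# P(X >= k): chance of seeing at least k pathway genes in the hit list.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.3g}")
```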

One of these is ATP6V0C, a subunit of the vacuolar ATPase. It interacts with the EV71 3A protein and is a known essential host factor for five other viruses. The authors analyse this interaction further and show that downregulating ATP6V0C expression inhibits EV71 propagation, while overexpressing it enhances propagation. Moreover, treating cells with bafilomycin A1, a selective inhibitor of the vacuolar ATPase, inhibits EV71 infection in a dose-dependent manner. The paper therefore suggests that ATP6V0C may be a suitable drug target, not only against EV71 but perhaps even for a broad-spectrum antiviral. While this is encouraging, bafilomycin A1 is a toxic antibiotic used in research and is not suitable for clinical use. Rather than exploring other compounds targeting ATP6V0C, the paper shifts focus to repurposing known drugs as antivirals.

Drug prediction using CMap

A potential antiviral would ideally disturb most or all interactions between host cell and virion. One way to do this is to inhibit the proteins known to interact with EV71. To check whether any known compounds already do so, the authors apply gene set enrichment analysis (GSEA) to data from the Connectivity Map (CMap), a database of gene expression profiles representing cellular responses to a set of 1309 different compounds. Enrichment analysis of the database reveals 27 potential EV71 drugs, of which the authors focus on the top-ranking result, tanespimycin.
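At its core, GSEA asks whether a gene set is concentrated near the top (or bottom) of a ranked expression signature, using a Kolmogorov-Smirnov-like running sum. A bare-bones sketch of the classic unweighted score (illustrative only, not the authors' pipeline):

```python
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Classic (unweighted) GSEA enrichment score.

    ranked_genes: gene IDs ordered by differential expression under a
    compound (e.g. one CMap signature); gene_set: the genes of interest.
    Walking down the ranking, the running sum steps up on every hit and
    down on every miss; the score is its maximum deviation from zero.
    """
    hits = np.array([g in gene_set for g in ranked_genes], dtype=float)
    n, n_hits = len(ranked_genes), hits.sum()
    running = np.cumsum(hits / n_hits - (1 - hits) / (n - n_hits))
    return running[np.argmax(np.abs(running))]  # signed extreme value

# Compounds whose signatures score strongly against the query gene set
# (the sign convention depends on the setup) become repurposing candidates.
```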

Tanespimycin is an orphan cancer drug, originally designed to target tumor cells by inhibiting HSP90. Its complex effects on the cell, however, may make it an effective antiviral. Following their CMap analysis, the authors show that tanespimycin reduces viral count and virus-induced cytopathic effects in a dose-dependent manner, without evidence of cytotoxicity.

Overall, the paper presents two different ways to think about target identification and drug choice in antiviral therapeutics: integrating different types of known host-virus protein-protein interactions, and analysing cellular responses to known compounds. While extensive further study is needed to determine whether the results are directly clinically relevant to the treatment of EV71, the paper shows how interaction data analysis can be employed in drug discovery.

References:

[1] Han, Lu, et al. “Human enterovirus 71 protein interaction network prompts antiviral drug repositioning.” Scientific Reports 7 (2017).


Proteins evolve on the edge of supramolecular self-assembly

Inspired by Eoin’s interesting talks on prions and prion diseases, and by Nick’s discussion of how cryo-electron microscopy may bring an end to an era for crystallography, I thought I’d look at a paper that discusses aggregation of protein complexes, with some cryo-electron microscopy thrown in for good measure.

Supramolecular assembly

a, A molecule gaining a single self-interacting patch forms a finite dimer. A self-interacting patch repeated on opposite sides of a symmetric molecule can result in infinite assembly. b, A point mutation in a dihedral octamer creates a new self-interacting patch (red), triggering assembly into a fibre.

Supramolecular assemblies are folded protein complexes that associate into much larger units. This formation can be triggered by a mutation in the constituent homomers of the complex that acts as a self-interacting patch. If this patch forms on a non-symmetric complex, it will likely produce a finite assembly with a limited number of copies of the complex. However, if the complex has dihedral symmetry, so that the patch is exposed at multiple separated locations, then the complex can potentially form near-infinite supramolecular assemblies. Continue reading

Is contacts-based protein-protein affinity prediction the way forward?

The binding affinity of protein interactions is useful information for a range of protein engineering and protein-protein interaction (PPI) network challenges. Obvious applications include the development of therapeutic antibodies to given drug targets or the engineering of novel interfaces for synthetic protein complexes. An accurate model would furthermore allow us to predict a large proportion of affinities in existing PPI networks, and enable the identification of new PPIs, which is critical for our ability to model protein network dynamics effectively.


“The design of an ideal scoring function for protein-protein docking that would also predict the binding affinity of a complex is one of the challenges in structural proteomics.” Adapted from Kastritis, Panagiotis L., and Alexandre M. J. J. Bonvin. Journal of Proteome Research 9.5 (2010): 2216-2225.

In last week’s paper, a new binding-affinity prediction method based on interfacial contact information was described. Contacts have long been used in docking methods, but surprisingly this was the first time binding affinity was predicted with them. Largely, this was due to the lack of a suitable benchmark data set containing both structural and affinity data. In 2011, however, Kastritis et al. presented a curated database of 144 non-redundant protein–protein complexes with experimentally determined Kd (ΔG) as well as X-ray structures.

Using this data set, the authors trained and validated their method, compared it against others, and concluded that interfacial contacts “can be considered the best structural property to describe binding strength”. This claim may be true, but as we discussed in the meeting, there is still some work to do before we take this model and run with it (a minimal sketch of the core contact count appears after the list below). A number of flags were raised:

  • Classification of experimental methods into reliable and non-reliable is based on what gives the best results with their method. Given that different types of protein complexes are often measured with different methods, some protein classes for which contact-based predictions are less effective may be excluded.
  • The number of parameters in model 6 is problematic without exact AIC information. As Lyuba rightly pointed out, the intercept in model 6 “explodes”. It is no surprise that the correlation improves with more parameters, and despite the authors’ AIC analysis, overfitting remains a worry given the lack of detail presented in the paper.


  • Comparison against other methods is biased in their favour: their method was trained on this data set while the others were not. To ensure a fair comparison, all methods should be trained on the same data set. That is of course hard to do in practice, but the fact remains that a comparison of methods trained on different data sets will be flawed.
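Caveats aside, the quantity at the heart of the method is straightforward to compute. Below is a rough sketch of interfacial contact counting with Biopython; the 5.5 Å heavy-atom cutoff and the linear form at the end are illustrative assumptions, not the paper's exact recipe:

```python
from Bio.PDB import PDBParser, NeighborSearch

def interface_contacts(pdb_file, chain_a, chain_b, cutoff=5.5):
    """Count residue-residue contacts across a two-chain interface.

    Two residues are taken to be in contact if any pair of their heavy
    atoms lies within `cutoff` angstroms (5.5 A here, for illustration).
    """
    model = PDBParser(QUIET=True).get_structure("complex", pdb_file)[0]
    heavy_b = [a for a in model[chain_b].get_atoms() if a.element != "H"]
    search = NeighborSearch(heavy_b)
    pairs = set()
    for atom in model[chain_a].get_atoms():
        if atom.element == "H":
            continue
        for partner in search.search(atom.coord, cutoff):
            # record each (residue in A, residue in B) pair once
            pairs.add((atom.get_parent().id[1], partner.get_parent().id[1]))
    return len(pairs)

# A contacts-based predictor then fits something as simple as
# dG ~ a * n_contacts + b (the paper breaks contacts down further,
# e.g. by the chemical character of the residues involved).
```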

Paper: Vangone, A., & Bonvin, A. M. J. J. (2015). Contacts-based prediction of binding affinity in protein-protein complexes. eLife, 4, e07454. http://doi.org/10.7554/eLife.07454

Slow and steady improvements in the prediction of one-dimensional protein features

What do you do when you have a big, complex problem whose solution is not trivial? You break the problem into smaller, easier-to-solve parts, solve each of these sub-problems, and merge the results to find the solution of the original, bigger problem. This is an algorithm design paradigm known as the divide and conquer approach.
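The textbook example is merge sort: split the list in two, sort each half recursively, then merge the sorted halves.

```python
def merge_sort(xs):
    """Divide and conquer in its textbook form."""
    if len(xs) <= 1:               # base case: trivially solved
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid])    # divide: solve each half...
    right = merge_sort(xs[mid:])
    merged, i, j = [], 0, 0        # ...conquer: merge the sub-solutions
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```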

In protein informatics, we use divide and conquer strategies to deal with a plethora of large and complicated problems. From protein structure prediction to protein-protein interaction networks, we have a wide range of sub and sub-sub problems whose solutions are supposed to help us with the bigger picture.

In particular, prediction of the so-called one-dimensional protein features is a fundamental sub-problem with a wide range of applications, such as protein structure modelling, homology detection, and functional characterisation. Here, one-dimensional protein features refer to secondary structure, backbone dihedral and C-alpha angles, and solvent accessible surface area.

In this week’s group meeting, I discussed the latest advances in the prediction of one-dimensional features, as described in an article published by Heffernan and colleagues in Scientific Reports (2015):

“Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning.”

In this article, the authors describe the implementation of SPIDER2, a deep learning approach to predict secondary structure, solvent accessible surface area, and four backbone angles (the traditional dihedrals phi and psi, and the recently explored theta and tau).
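For reference, theta and tau are defined purely on the C-alpha trace: theta(i) is the planar angle formed by three consecutive C-alpha atoms, and tau(i) the dihedral formed by four. A small numpy sketch of those definitions (my own illustration, not the paper's code):

```python
import numpy as np

def theta_tau(ca):
    """theta (planar angles) and tau (dihedrals) along a C-alpha trace.

    ca: (N, 3) array of consecutive C-alpha coordinates. Returns theta
    for every internal residue and tau for every quadruple, in degrees.
    """
    b = ca[1:] - ca[:-1]                              # virtual bond vectors
    norms = np.linalg.norm(b, axis=1)
    # theta: angle at ca[i+1] between ca[i], ca[i+1] and ca[i+2]
    cos_t = np.einsum("ij,ij->i", -b[:-1], b[1:]) / (norms[:-1] * norms[1:])
    theta = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))
    # tau: dihedral about the ca[i+1] -> ca[i+2] virtual bond
    n1 = np.cross(b[:-2], b[1:-1])
    n2 = np.cross(b[1:-1], b[2:])
    m = np.cross(n1, b[1:-1] / norms[1:-1, None])
    tau = np.degrees(np.arctan2(np.einsum("ij,ij->i", m, n2),
                                np.einsum("ij,ij->i", n1, n2)))
    return theta, tau
```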

“Deep learning” is the buzzword (buzz-two-words or buzzsentence, maybe?) of the moment. For those of you who have no idea what I am talking about, deep learning is an umbrella term for a series of convoluted machine learning methods. The term deep comes from the multiple hidden layers of neurons used during learning.

Deep learning is a very fashionable term for a reason. These methods have been shown to produce state-of-the-art results for a wide range of applications in several fields, including bioinformatics. As a matter of fact, one of the leading methods for contact prediction (previously introduced in this blog post) uses a deep learning approach to improve the precision of predicted protein contacts.

Machine learning had already been explored for predicting one-dimensional protein features, with promising (and, more importantly, useful) results. With the emergence of new, more powerful techniques such as deep learning, previous software is now becoming obsolete.

Based on this premise, Heffernan and colleagues implemented and applied their deep learning approach to improve the prediction of one-dimensional protein features. Their training process was rigorous: they performed 10-fold cross-validation on their training set of ~4500 proteins and, on top of that, used two independent test sets (one of ~1200 proteins and one based on the CASP11 targets). Proteins in all sets shared no more than 25% sequence identity (30% for the CASP set) with any protein in the other sets.
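In outline, that protocol looks like the toy sketch below (random stand-in data and scikit-learn; in the real protocol the splits are made at the protein level after redundancy filtering, not per data point):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy stand-ins: per-residue feature windows and 3-state labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(4500, 20)), rng.integers(0, 3, size=4500)
X_test, y_test = rng.normal(size=(1200, 20)), rng.integers(0, 3, size=1200)

model = LogisticRegression(max_iter=1000)

# 10-fold cross-validation on the training set only.
scores = cross_val_score(model, X_train, y_train,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(f"10-fold CV accuracy: {scores.mean():.3f}")

# The independent test sets are touched exactly once, at the very end.
model.fit(X_train, y_train)
print(f"held-out test accuracy: {model.score(X_test, y_test):.3f}")
```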

The method described in the paper, SPIDER2, was thoroughly compared with state-of-the-art prediction software for each of the one-dimensional protein features it is capable of predicting. Results show that SPIDER2 achieves a small, yet significant, improvement over other methods.

It is just like they say: slow and steady wins the race, right? In this case, I am not so sure. It would be interesting to see how much the small increments in precision obtained by SPIDER2 can improve the bigger picture, whatever your bigger picture happens to be. The thing about divide and conquer is that becoming marginally better at solving one of the parts does not necessarily imply that you will improve the solution of the bigger, main problem.

If we think about it, during the “conquer” stage (that is, when you merge the solutions of the smaller parts to get to the bigger picture), you may make compromises that completely disregard any minor improvements in the sub-problems. For instance, in my bigger picture, de novo protein structure prediction, predicted local properties can be sacrificed to ensure a more globally consistent model. More than that, most methods that perform de novo structure prediction already account for a certain degree of error or uncertainty in, say, secondary structure prediction. This is particularly important for the border regions between secondary structure elements (i.e. where an alpha-helix ends and a loop begins). Therefore, even if you improve the precision of your predictions for those border regions, the best approach for structure prediction may still treat those slightly more precise border predictions as unreliable.

The other moral of this story is far more pessimistic. There have been significant advances in machine learning, leading to ever more complicated neural network architectures. Yet when we look at how much improvement these highly elaborate techniques brought to an old problem (the prediction of one-dimensional protein features), the pay-off does not seem as significant, at least not as much as I would expect. Maybe I am a glass-half-empty kind of guy, but given the buzz surrounding deep learning, such minor improvements are a bit of a letdown. Not to take any credit away from the authors: their work was rigorous and scientifically very sound. It is just that maybe we are reaching our limits when it comes to applying machine learning to secondary structure prediction. Maybe when the next generation of buzzword-worthy machine learning techniques appears, we will observe an even smaller improvement. Which leaves a very bitter unanswered question in all our minds: if machine learning is not the answer, what is?