Citing R packages in your Thesis/Paper/Assignments

If you need to cite R, there is a very useful function called citation().

> citation()

To cite R in publications use:

  R Core Team (2013). R: A language and environment for statistical
  computing. R Foundation for Statistical Computing, Vienna, Austria.
  URL http://www.R-project.org/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2013},
    url = {http://www.R-project.org/},
  }

We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also ‘citation("pkgname")’ for
citing R packages.

If you want to cite just a package, just pass the package name as a parameter, e.g.:

> citation(package = "cluster")

To cite the R package 'cluster' in publications use:

  Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik,
  K.(2013).  cluster: Cluster Analysis Basics and Extensions. R package
  version 1.14.4.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {cluster: Cluster Analysis Basics and Extensions},
    author = {Martin Maechler and Peter Rousseeuw and Anja Struyf and Mia Hubert and Kurt Hornik},
    year = {2013},
    note = {R package version 1.14.4 --- For new features, see the 'Changelog' file (in the package source)},
  }

This will give you BibTeX entries you can copy and paste in your BibTeX reference.  If you are using M$ Word, good luck to you.

[Publication] Cloud computing in Molecular Modelling – a topical perspective

cloud_computingtoc

My ex-InhibOx colleagues (Simone Fulle, Garrett Morris, Paul Finn) and myself have recently published a topical review on “The emerging role of cloud computing in molecular modelling” in the Journal of Molecular Graphics and Modelling.   This paper starts with a gentle and in-depth introduction to the field of cloud computing.  The second part of the paper is how it applies to molecular modelling (and the sort of tasks we can run in the cloud).  The third and last part presents two practical case studies of cloud computations, one of which describes how we built a virtual library to use in virtual screening on AWS.

We hope that after reading this article the cloud will become a less nebulous affair! *pun intended*

As an addendum, I recently came across this paper “Teaching cloud computing: A software engineering perspective” (2013) on how to teach cloud computing at a graduate level.  This work is relevant, because lots of universities are presently including cloud computing in their curricula.

 

Making Protein-Protein Interfaces Look (decently) Good

This is a little PyMOL script that I’ve used to draw antibody-antigen interfaces. If you’d like a commented version on what each and every line does, contact me! This is a slight modification of what has been done in PyMOL Wiki.

load FILENAME
set_name FILENAME, complex	

set bg_rgb, [1,1,1]  	

color white 	     		

hide lines
show cartoon

select antibody, chain a
select antigen, chain b

select paratopeAtoms, antibody within 4.5 of antigen 
select epitopeAtoms, antigen within 4.5 of antibody

select paratopeRes, byres paratopeAtoms
select epitopeRes, byres epitopeAtoms

distance interactions, paratopeAtoms, epitopeAtoms, 4.5, 0

color red, interactions
hide labels, interactions

show sticks, paratopeRes
show sticks, epitopeRes

set cartoon_side_chain_helper, on

set sphere_quality, 2
set sphere_scale, 0.3
show spheres, paratopeAtoms
show spheres, epitopeAtoms
color tv_blue, paratopeAtoms
color tv_yellow, epitopeAtoms

set ray_trace_mode, 3
unset depth_cue
set specular, 0.5

Once you orient it to where you’d like it and ray it, you should get something like this.
contacts

Building an Antibody Benchmark Set

In this so-called ‘big data’ age, the quest to find the signal amidst the noise is becoming more difficult than ever. Though we have sophisticated systems that can extract and parse data incredibly efficiently, the amount of noise has equally, if not more so, expanded, thus masking the signals that we crave for. Oddly enough, it sometimes seems that we are churning and gathering a vast amount data just for the sake of it, rather than looking for highly-relevant, high-quality data.

One such example is antibody (Ab) binding data. Even though there are several Ab-specific databases (e.g. AbySis, IMGT), none of these, to our knowledge, has any information on an Ab’s binding affinity to its antigen (Ag), despite the fact that an Ab’s affinity is one of the few quantitative metrics of its performance. Therefore, gathering Ab binding data would not only help us to create more accurate models of Ab binding, it would, in the long term, facilitate the in silico maturation and design/re-design of Abs. If this seems like a dream, have a read of this paper – they made an incredibly effective Ab from computationally-inspired methods.

Given the tools at our disposal, and the fact that several protein-protein binding databases are available in the public domain, this task may seem somewhat trivial. However, there’s the ever-present issue of gathering only the highest quality data points in order to perform some of the applications mentioned earlier.

Over the past few weeks, we have gathered the binding data for 228 Ab-Ag complexes across two major protein-protein binding databases; PDB-Bind and the structure-based benchmark from Kastritis et al. Ultimately, 36 entries were removed from further analyses as they had irrelevant data (e.g. IC50 instead of KD; IC50 relates to inhibition, which is not the same as the Ab’s affinity for its Ag). Given the dataset, we performed some initial tests on existing energy functions and docking programs to see if there is any correlation between the programs’ scores and protein binding affinities.

Blue = Abs binding to proteins, Red = Abs binding to peptides

Blue = Abs binding to proteins, Red = Abs binding to peptides

As the graphs show, there is no distinctive correlation between a program/function’s score and the affinity of an Ab. Having said this, these programs were trained on general protein-protein interfaces (though that does occasionally include Abs!) and we thus trained DCOMPLEX and RAPDF specifically for Ab structures (~130 structures). The end results were poor nonetheless (top-centre and top-right graphs, above), but the interatomic heatmaps show clear differences in the interaction patterns between Ab-Ag interfaces and general protein-protein interfaces.

Interatomic contact map between Ab-Ag or two general proteins. Warmer colours represent higher counts.

Interatomic contact map between Ab-Ag or two general proteins. Warmer colours represent higher counts.

Now, with this new information, the search for signals continues. It is evident that Ab binding has distinctive differences with respect to protein-protein interfaces. Therefore, the next step is to gather more high-quality data and see if there is any correlation between an Ab’s distinct binding mode and its affinity. However, we are not interested in just getting whatever affinity data is available. As we have done for the past few weeks, the rigorous standards we have used for building the current benchmark set must be maintained – otherwise we risk in masking the signal with unnecessary noise.

Currently, the results are disappointing, but if the past few weeks in OPIG has taught me anything, this is only the beginning of a long and difficult search for a good model. BUT – this is what makes research so exciting! We learn from the low Pearson correlation coefficients, the (almost) random distribution of data, and the not-so-pretty plots of our data in order to form useful models for practical applications like Ab design. I think a quote from The Great Gatsby accurately ‘models’ my optimism for making sense of the incoming stream of data:

Gatsby believed in the green light, the orgastic future that year by year recedes before us. It eluded us then, but that’s no matter — to-morrow we will run faster, stretch out our arms farther. . . . And one fine morning ——

So we beat on, boats against the current, borne back ceaselessly into the past.

Evolutionary fold space preferences

At group meeting last week I focussed, alongside some metaphysical speculation, on a project which has occupied the first half of my DPhil: namely exploring the preferences of both very old and very young protein structures. This work is currently in preparation for publication so I will give only a brief overview and hopefully update the juicy details later. Feel free to contact me for more information.

Proteins are the molecular machinery of the cell. Their evolution is one of the most fundamental processes which has delivered the diversity and complexity of life that we see around ourselves today. Despite this diversity, protein domains (independent folding units) of known structure fall into just over 1,000 unique SCOP folds.

This project has sought to identify how populations of proteins at different stages of evolution explore their possible structure space.

Superfamily ages

Structural domains are clustered at different levels of similarity within the SCOP classification. At the superfamily level this classification attempts to capture evolutionary relationships through structural and functional similarities even if sequence diversion has occurred.

Evolutionary ages for these superfamilies are then estimated from their phylogenetic profiles across the tree of life. These ages are an estimate of the structural ancestor for a superfamily.

 

how_old

 

The phylogenetic occurrence profiles are constructed using predictions of superfamilies on completely sequenced genomes using HMMs and taken from the SUPERFAMILY database. Given an occurrence profile and a phylogenetic tree (for robustness we consider several possible reconstructions of the tree of life) we use a maximum parsimony algorithm (proposed by Mirkin et. al) which estimates the simplest scenario of loss events (domain loss on a genome) and gain events (domain gain) at internal nodes on the tree which explains the occurrence profile. The age estimate is the height of the first gain event, normalised between 0 (at the leaves of the tree) and 1 (at the root).

We estimated ages for 1,962 SCOP superfamilies and compared several properties relating to their primary, secondary and tertiary structures, as well as their functions. In particular, we compared two populations of superfamilies: ancients, with an age of 1, and new-borns, with an age < 0.4. Full details of our results will hopefully be published shortly so watch this space!

Antimicrobial Drug Discovery Conference (Madrid)

I am a big fan of taking something, either a poster or a talk, to a conference, and getting something back – other than a €6 box of airport chocolates.  This blog post is in that spirit.

On the plane to the “Antimicrobial Drug Discovery” conference in Madrid I was reading the Cassandra Project (a novel on smallpox, how apt) instead of the stack overflow of scientific papers I planned to read.  Classic JP.

The conference had a mix of experienced, invited speakers and early stage researchers.  It was very “biological” for a computational scientist, so quite removed from what I normally do – but an opportunity to learn nonetheless.

The keynote lecture was by Julian Davies, a fantastic speaker who gave a general overview of antibiotics and antibiotic resistance.  Antibiotic resistance is a real concern (even those politicians in the G8 noticed a few hours ago!) and there is a fear we might return to pre-antibiotics era when you could not cure common diseases like bacterial pneumonia.  Pharmaceutical companies all got out of antibiotic research years ago, and there have been no new antibiotic scaffolds for more than a decade.  I found this surprising as you would think that there was a truckload of money to be made from finding the new penicillin.  Apparently, there is little return in anti-infectives because of rapid mutation of the pathogen and its short-term use (curing the infection, as opposed to having to take your medication for life, such as beta-blockers for hyperventilation).  Bacteria should not only be considered at an individual cell level but also as a population with complex signalling between the individuals (which may offer a way to stop bacterial infection).  In order to combat infections and increasing resistance sick patients are now supplied with combinations of drugs – this is still dangerous due to the possible (toxic) drug-drug interactions.

Natural products, e.g. some toxins, are good antibiotics but it is very hard to optimize such compounds to improve their drug profile (chemical synthesis of natural products is difficult).  Also a lot of people at the conference were talking of how antimicrobial peptides will save the day.  The attendees with drug discovery experience raised an eyebrow about this, knowing how hard it will be to make a 30 residue peptide into a drug.

Some antibiotics work by having a hydrophilic part (e.g. carboxyl) and a hydrophobic part (e.g. an alkane chain).  This hydrophobic part sits in the membrane wall disrupting it, which creates a “leak” from the bacteria which eventually kills the pathogen.  There are other mechanisms of action such as blocking transporter or signalling channels.

There was a brilliant, energetic talk by Bruno Gonzalez-Zorn with the audience paying rapt attention.  He showed how bacteria have these multiple, small plasmids offering antibiotic resistance.  He discovered there was a common two-part theme to antibiotic resistance, where a particular gene is always present.

Paul Finn gave a much needed talk on why drug discovery is hard (e.g. target selection, difficulty to get drugs in therapeutic area, potency, toxicity, have to optimize for different variables, etc.).  Unknowingly proving this point, there was this earlier talk of a whole optimization series which got a small molecule inhibitor of a viral infection from 150uM down to 1uM (IC50) – a great result in itself, and when the investigators tested this ligand in vivo rather than in vitro it simply did not have any affect on the virus.

Cele Abad Zapatero, one of the main investigators of AltasCBS, made the point that, today, we do not know where we are in drug discovery.  He argued we need to move to chemical-biology space instead of simply chemical space and recommended the use of ligand efficiency indices (e.g. BEI, SEI).

Having fun in Madrid

Madrid was way too much fun.  Zidane (and a few thousand others) kissed this Champions League Cup in exactly the same place. Talking about microbes.  (click to enlarge)

And what did I take to the conference?  I took a poster, the design of which is based on Dunbar’s stylish template.  Marta, Ana and myself won a “highly commendable” poster prize with the best poster going to Laura (Synthetic inhibitors of bacterial cell division targeting the GTP binding site of FtsZ, since you asked).  There were 24 posters in all, and mine was the only computational study in a room otherwise filled with phages, bacteria and plasmids (literally as well as metaphorically).  There is a sinister heart-warming joy in winning a bottle of wine, instead of a cheque or a certificate.  James deserves a sip or two.

 

Poster Prize Presentation

Cheekily asking for a corkscrew during the poster prize award

 

[Publication] Effect of Single Amino Acid Substitution Observed in Cancer on Pim-1 Kinase Thermodynamic Stability and Structure

In this study we selected point mutations resulting in Pim-1 variants that are expressed in cancer tissues and reported in SNP databases, such as FastSNP and COSMIC. These Pim-1 variants have been comprehensively characterized to investigate the effect of single amino acid substitution on Pim-1 thermal and thermodynamic stability and structure in solution. Our results indicate that the effects of the mutation observed in cancer tissues cause local changes of tertiary structure, but do not affect binding to type I kinase inhibitors.

This work has been pioneered by researches at the Department of Biochemical Sciences “A. Rossi Fanelli”, Sapienza University of Rome and served as an inspiration for one of my thesis chapters.

Standardize a PDB

In many applications you need to constrain PDB files to certain chains. You can do it using this program.

A. What does it do?

Given a pdb file, write out the ATOM and HETATM entries for the supplied chain(s), put them on a single chain with the name provided by the user.

PDB_standardize needs four arguments:

  1. PDB file to constrain.
  2. Chains from the pdb file to constrain.
  3. Output file.
  4. Name of the new chain

B. Requirements:

Biopython – should be installed on your machines but in case you want to use it locally, download the latest version into the PDB_constrain.py’s directory (don’t need to build).

C. Example use:

C.1 Constrain 1A2Y.pdb to chains A and B, placed on a single chain C – write results in constr.pdb

python PDB_standardize.py -f 1A2Y.pdb -c AB -o const.pdb -s C

 

C.2 Constrain 1ACY to chain L, with a new chain name F, write results in const.pdb – this example shows that the constrainer works well with ‘insertion’ residue numbering as in antibodies where you have 27A, 27B etc.

python PDB_standardize.py -f 1ACY.pdb -c L -o const.pdb -s F

 

And the Oscar goes to …

08:44

I have a bet I can write a blog post in under 20 mins.  Last week I asked Leila Tamara where her blog post was (we aim to have one up every week) and she replied that “It was ready but hadn’t been proof-read yet“.  I guess it goes to show the diligence of the students here at Oxford.  (Some of them, anyway).

So as a Marie Curie Stars ITN Fellow, my time is coming to an end (in mid-September).  Part of my D.Phil was about finding novel anti-malarial inhibitors using computational methods (virtual screening).  No, we haven’t cured it yet.

During a conference in Riga we made a video about the project – I find it funny to see myself on screen, (and to hear my heavily accented commentary).  Pity the Oscars have already been dished out this year!

And a “making of” still

JP make-up

08:52.

Warning: this post hasn’t been proof-read yet!

 

Protein kinases, the PIM story

Last week I was presenting my DPhil work. In one of my projects I address the reasons for inhibitor selectivity in PIM protein kinase family. PIM kinases play key roles in signalling pathways and have been identified as oncogenes long time ago. Slightly unusual for protein kinases ATP-binding sites and cancer roles have prompted the investigation of potential PIM-selective inhibitors for anticancer therapy. Due to overlapping functions of the three PIM isoforms, efficacious inhibitors should bind to all three isozymes. However, most reported inhibitors show considerable selectivity for PIM1 and PIM3 over PIM2 and the mechanisms leading to this selectivity remain unclear.

Figure 1. Workflow of the sequence and structure analysis of PIM kinases

Figure 1. Workflow of the sequence and structure analysis of PIM kinases

To establish the sequence determinants of inhibitor selectivity we investigated the phylogenetic relationships of PIM kinases and their structural conformations upon ligand binding (Figure 1). Together with my OPIG supervisor Charlotte Deane we predicted a set of candidates for site-directed mutagenesis as illustrated in Figure 2. The mutants were designed to convert PIM1 residues into analogous PIM2 residues at the same positions.

I then moved to the wetlab to test the hypotheses experimentally. Under guidance of Oleg Fedorov, I screened the SGC library of kinase inhibitors using differential scanning fluorimetry (DSF). After comparing melting temperature shift values across the PIM kinases and mutants, a set of potent inhibitors with different chemical scaffolds have been selected for quantitative binding analysis. I worked with Peter Drueker’s team at Novartis on PIMs enzymology, where I measured activities, Km values for ATP and IC50s using mobility shift assay. For my final set of measurements I performed isothermal titration calorimetry (ITC) experiments back at the SGC and determined binding constants and enthalpic/entropic contributions to the total free energy of ligand binding.

Figure 2. An overlay of PIM1 and PIM2 structures (P-loop and hinge regions), the mutated residues are shown as sticks

Figure 2. An overlay of PIM1 and PIM2 structures (P-loop and hinge regions), the mutated residues are shown as sticks

The data are yet to be published, I only briefly state the results here. The hinge mutant E124L demonstrated reduced thermal stability probably due to removal of E124-R122 salt bridge. The P-loop mutants had intermediate Km ATP values between PIM1 and PIM2, indicating that those residues could be responsible for stronger ATP binding in PIM2. As shown in Figure 2, the residues are located at the tip of the P-loop and might have involvement in the P-loop movement. Importantly, three mutants have shown reduced affinity to inhibitors validating my initial hypotheses.

Ideally having PIM1 and PIM2 co-crystal structures with the same inhibitors would allow direct comparison of the binding modes. So far I was able to solve apo-PIM2 structure in addition to the single PIM2 pdb, which will be deposited shortly.

I will update you soon about on my second project which involves more mutants, type II inhibitors, equilibrium shifts and speculations about conformational transitions. Keep visiting us!