Category Archives: Talks

A short account of the talks given by the OPIG group members and their highly esteemed guests.

GPGPUs for bioinformatics

As clock speeds in computer Central Processing Units (CPUs) began to plateau, their data and task parallelism was expanded to compensate. These days (2013) it is not uncommon to find upwards of a dozen processing cores on a single CPU, with each core capable of performing eight calculations in a single operation. Graphics Processing Units (GPUs) were originally intended to assist CPUs by providing hardware optimised to speed up the rendering of highly parallel graphical data into a frame buffer. As graphical models became more complex, it became difficult to provide a single piece of hardware with an optimised design for every model and every calculation the end user might desire. Instead, GPU designs evolved to be more readily programmable and to exhibit greater parallelism. Top-end GPUs are now equipped with over 2,500 simple cores and are programmed with dedicated languages such as CUDA or OpenCL. This new-found programmability gave users the freedom to take non-graphics tasks which would otherwise have saturated a CPU for days and run them on the highly parallel hardware of the GPU. The technique proved so effective for certain tasks that GPU manufacturers have since begun to tweak their architectures to suit not just graphics processing but also more general-purpose tasks, thus beginning the evolution of the General-Purpose Graphics Processing Unit (GPGPU).

Improvements in data capture and model generation have caused an explosion in the amount of bioinformatic data now available, and this data is increasing in volume faster than CPUs are increasing in either speed or parallelism. The year-on-year growth in the number of protein structures stored in the Protein Data Bank illustrates this trend. To process this vast volume of data, many of the common tools for structure prediction, sequence analysis, molecular dynamics and so forth have been ported to the GPGPU. The following tools are now GPGPU-enabled and offer significant speed-ups compared to their CPU-based counterparts:

| Application | Description | Expected Speed-up | Multi-GPU Support |
|---|---|---|---|
| Abalone | Models molecular dynamics of biopolymers for simulations of proteins, DNA and ligands | 4-29x | No |
| ACEMD | GPU simulation of molecular mechanics force fields, implicit and explicit solvent | 160 ns/day (GPU version only) | Yes |
| AMBER | Suite of programs to simulate molecular dynamics on biomolecules | 89.44 ns/day (JAC NVE) | Yes |
| BarraCUDA | Sequence mapping software | 6-10x | Yes |
| CUDASW++ | Open-source software for Smith-Waterman protein database searches on GPUs | 10-50x | Yes |
| CUDA-BLASTP | Accelerates NCBI BLAST for scanning protein sequence databases | 10x | Yes |
| CUSHAW | Parallelised short-read aligner | 10x | Yes |
| DL-POLY | Simulates macromolecules, polymers, ionic systems, etc. on a distributed-memory parallel computer | 4x | Yes |
| GPU-BLAST | Local search with fast k-tuple heuristic | 3-4x | No |
| GROMACS | Simulation of biochemical molecules with complicated bond interactions | 165 ns/day (DHFR) | No |
| GPU-HMMER | Parallelised local and global search with profile hidden Markov models | 60-100x | Yes |
| HOOMD-Blue | Particle dynamics package written from the ground up for GPUs | 2x | Yes |
| LAMMPS | Classical molecular dynamics package | 3-18x | Yes |
| mCUDA-MEME | Ultrafast scalable motif discovery algorithm based on MEME | 4-10x | Yes |
| MUMmerGPU | Open-source high-throughput parallel pairwise local sequence alignment program | 13x | No |
| NAMD | Designed for high-performance simulation of large molecular systems | 6.44 ns/day (STMV on 585x 2050s) | Yes |
| OpenMM | Library and application for molecular dynamics for HPC with GPUs | Implicit: 127-213 ns/day; explicit: 18-55 ns/day (DHFR) | Yes |
| SeqNFind | Commercial GPU-accelerated sequence analysis toolset | 400x | Yes |
| TeraChem | General-purpose quantum chemistry package | 7-50x | Yes |
| UGENE | Open-source Smith-Waterman for SSE/CUDA, suffix-array-based repeats finder and dotplot | 6-8x | Yes |
| WideLM | Fits numerous linear models to a fixed design and response | 150x | Yes |

It is important to note, however, that due to differences in how GPGPUs and CPUs handle floating point arithmetic, results can and will differ between architectures, making a direct bitwise comparison impossible. Instead, interval arithmetic may be useful to sanity-check that the results generated on the GPU are consistent with those from a CPU-based system.
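Short of full interval arithmetic, even a simple tolerance check against a higher-precision CPU reference catches most divergence. The following is a minimal NumPy sketch of that idea, not part of any of the tools above; `within_tolerance` is a hypothetical helper and the reduction is a stand-in for a real GPU kernel.

```python
import numpy as np

def within_tolerance(cpu_result, gpu_result, rtol=1e-4, atol=1e-8):
    """Check that GPU results agree with a CPU reference within a
    floating-point tolerance, rather than expecting bitwise equality."""
    return np.allclose(cpu_result, gpu_result, rtol=rtol, atol=atol)

# Reductions are a common source of divergence: a GPU sums the terms in a
# different order to a CPU, and in float32 the rounding errors differ.
x = np.random.rand(1_000_000).astype(np.float32)
cpu_sum = np.sum(x.astype(np.float64))  # high-precision CPU reference
gpu_sum = float(np.sum(x))              # stand-in for a float32 GPU reduction
print(within_tolerance(cpu_sum, gpu_sum))
```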

Journal club: Simultaneous Femtosecond X-ray Spectroscopy and Diffraction of Photosystem II at Room Temperature

In the last journal club we covered the paper Simultaneous Femtosecond X-ray Spectroscopy and Diffraction of Photosystem II at Room Temperature (Kern et al., 2013), currently still in Science Express.

[Figure: Structure of Photosystem II, PDB 2AXT. CC BY-SA 3.0 Curtis Neveu]

This paper describes an experiment on the Photosystem II (PSII) protein complex. PSII is a large protein complex consisting of about 20 subunits with a combined weight of ca. 350 kDa. As its name suggests, this complex plays a crucial role in photosynthesis: it is responsible for the oxidation (“splitting up”) of water.

In the actual experiment (see the top right corner of Figure 1 in the paper) three experimental methods are combined: PSII microcrystals (5-15 µm) are injected from the top into the path of an X-ray pulse (blue). Simultaneously, an emission spectrum is recorded (yellow, detector at the bottom). And finally, in a separate run, the injected crystals are treated (‘pumped’) with a visible laser (red) before they hit the X-ray pulse.

Let’s take a look at each of those three in a little more detail.

X-ray diffraction (XRD)

In a standard macromolecular X-ray crystallography experiment a crystal containing identical molecules of interest (protein) at regular, ordered lattice points is exposed to an X-ray beam. Some X-rays are elastically scattered and cause a diffraction pattern to form on a detector. By analysing the diffraction patterns of a rotating crystal it is possible to calculate the electron density distribution of the molecule in question, and thus determine its three dimensional structure.
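Concretely, the electron density is related to the structure factors (whose amplitudes come from the measured spot intensities, and whose phases must be recovered separately, e.g. by molecular replacement) by a Fourier transform. Here is a toy NumPy sketch of that final step, on a made-up grid of structure factors; all values are illustrative, not real data.

```python
import numpy as np

# Toy grid of complex structure factors F(hkl). In a real experiment the
# amplitudes |F| come from the measured intensities and the phases from a
# phasing method; here both are simply made up.
F = np.zeros((16, 16, 16), dtype=complex)
F[0, 0, 0] = 100.0          # F(000): total number of electrons in the cell
F[1, 0, 0] = 20.0 + 5.0j    # an arbitrary low-resolution reflection

# Up to a 1/V scaling, the electron density over the unit cell is the
# inverse Fourier transform of the structure factors. We take the real
# part because this toy F does not enforce Friedel symmetry exactly.
rho = np.fft.ifftn(F).real
print(rho.shape, rho.max())
```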

An intense X-ray beam, however, also systematically damages the sample (Garman, 2010). For experiments using in-house X-ray generators or synchrotrons it is therefore recommended not to exceed a total dose of 30 MGy on any one crystal (Owen et al., 2006).

[Figure: Aerial view of the LCLS site. The accelerator is 3.2 kilometres long.]

The experiment described in the current paper, however, was not conducted on a run-of-the-mill synchrotron, but at the Linac Coherent Light Source (LCLS), an X-ray free electron laser. Here each diffraction image results from an individual crystal exposed to a single X-ray pulse of about 50 femtoseconds, resulting in a peak dose of ~150 MGy. Delivering these extreme doses in very short X-ray pulses leads to the destruction of the sample via a Coulomb explosion.

As the sample is destroyed, only one diffraction image can be taken per crystal. This causes two complications. Firstly, a large number of sample crystals is required, which explains why the experiment took a total beam time of 6 hours and involved over 2.5 million X-ray shots and the processing of an equal number of diffraction images. Secondly, the resulting set of (usable) diffraction images is an unordered, random sampling of the crystal orientation space. This presents a computational challenge unique to XFEL setups: before the diffraction data can be integrated, the orientation of the crystal lattice needs to be determined for each individual diffraction image.
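Once each still image has been indexed and integrated, the partial observations are merged by averaging each reflection over the many randomly oriented crystals. The following is a toy Python illustration of that merging step only; the shots and intensities are made-up numbers, and real pipelines involve far more (partiality correction, scaling, outlier rejection).

```python
from collections import defaultdict

# Each still image yields intensities for whichever reflections (hkl) it
# happens to record; averaging over many crystals converges on estimates
# of the true reflection intensities.
observations = [
    {(1, 0, 0): 95.0, (1, 1, 0): 41.0},   # shot 1
    {(1, 0, 0): 104.0},                   # shot 2
    {(1, 1, 0): 39.0, (2, 0, 0): 12.5},   # shot 3
]

merged = defaultdict(list)
for shot in observations:
    for hkl, intensity in shot.items():
        merged[hkl].append(intensity)

means = {hkl: sum(v) / len(v) for hkl, v in merged.items()}
print(means)   # averaged intensity per reflection
```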

X-ray diffraction allows us to obtain an electron density map of the entire unit cell, and therefore the entire crystal. But we can only see the ordered areas of the crystal. For small molecules in solvent channels, or the conformations of side chains on the protein surface, to be visible, they need to occupy the same positions within each unit cell. For most small molecules this is not the case, which is why you will basically never see compounds like PEGs or glycerol in your electron density maps, even though you used them during sample preparation. For heavy metal compounds this is especially annoying: they disproportionately increase X-ray absorption (Holton, 2009; with a handy Table 1) and therefore shorten the crystal’s lifetime, yet they do not contribute to the diffraction pattern. (And that is why you should back-soak; Garman & Murray, 2003.)

X-ray emission spectroscopy (XES)

PSII contains a manganese cluster (Mn₄CaO₅). This cluster, like all protein metallocentres, is known to be highly radiation sensitive (Yano et al., 2005). Apparently, at a dose of about 0.75 MGy the cluster is reduced in 50% of the irradiated unit cells.

Diffraction patterns represent a space- and time-average of the electron density. It is very difficult to quantify the amount of reduction from the obtained diffraction patterns. There is however a better and more direct way of measuring the states of the metal atoms: X-ray emission spectroscopy.

Basically, the very same events that cause radiation damage also cause X-ray emissions at very specific energies, characteristic of the atoms involved and their oxidation states. An Mn(IV)O₂ emission spectrum looks markedly different from an Mn(II)O spectrum (bottom right corner of Figure 1).

The measurements of XRD and XES are taken simultaneously. But, unlike XRD, XES does not rely on crystalline order. It makes no difference whether the metallocentres move within the unit cell as a result of specific radiation damage. Even the destruction of the sample is not an issue: we can assume that XES will record the state of the clusters regardless of the state of the crystal lattice, up to the point where the whole sample blows up. But at that point we know that, due to the loss of long-range order, the X-ray diffraction pattern will no longer be affected.

The measurements of XES therefore give us a worst-case scenario for the state of the manganese cluster in the time frame in which we obtained the X-ray diffraction data.

Induced protein state transition with visible laser

Another neat technique used in this experiment is the activation of the protein by a visible laser in the flight path of the crystals as they approach the X-ray interaction zone. With a known crystal injection speed and a known distance between the visible laser and the X-ray pulse it becomes possible to observe time-dependent protein processes (top left corner of Figure 1).
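The pump-probe delay follows directly from that geometry: delay = distance / injection speed. A back-of-the-envelope Python example, using illustrative numbers rather than the values reported in the paper:

```python
# Pump-probe timing from the jet geometry. Both numbers below are
# illustrative assumptions, not the values used in the actual experiment.
jet_speed = 10.0      # crystal injection speed in m/s
distance = 4.5e-6     # pump laser to X-ray interaction point in m

delay_s = distance / jet_speed
print(f"pump-probe delay: {delay_s * 1e6:.2f} microseconds")  # 0.45 microseconds
```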

What makes this paper special? (And why is it important to my work?)

It has been theorised, and partially supported by evidence, that the collection of diffraction data using X-ray free electron lasers outruns traditional radiation damage processes. Yet there was no conclusive proof – until now.

This paper has shown that the XES spectra of the highly sensitive metallocentres do not indicate any measurable (let alone significant) change from the idealized intact state (Figure 3). Apparently the highly sensitive metallocentres of the protein stay intact up to the point of complete sample destruction, and certainly well beyond the point of sample diffraction.

Or to put it another way: even though the crystalline sample itself is destroyed under extreme conditions, the obtained diffraction pattern is – for all intents and purposes – that of an intact, pristine, zero-dose state of the crystal.

This is an intriguing result for people interested in macromolecular crystallography radiation damage. In a regime of absurdly high doses, where the entire concept of a ‘dose’ for a crystal breaks down (again), we can obtain structural data with no specific radiation damage present at the metallocentres. As metallocentres are more susceptible to specific radiation damage it stands to reason that the diffraction data may be free of all specific damage artefacts.

It is thought that most deposited PDB protein structures containing metallocentres are in a reduced state. XFELs are being constructed all over the world and will be the next big thing in macromolecular crystallography. And now it seems that the future of metalloprotein crystallography just got even brighter.

Journal Club: The complexity of Binding

Molecular recognition is the mechanism by which two or more molecules come together to form a specific complex. But how do molecules recognise and interact with each other?

In the TIBS Opinion article from Ruth Nussinov’s group, an extended conformational selection model is described. This model encompasses the classical lock-and-key, induced-fit and conformational-selection mechanisms, as well as their combinations.

The general concept of an equilibrium shift of the ensemble was proposed nearly 15 years ago, or perhaps earlier. The basic idea is that proteins in solution pre-exist in a number of conformational substates, including those with binding sites complementary to a ligand. The distribution of the substates can be visualised as a free energy landscape, which helps in understanding the dynamic nature of the conformational equilibrium.
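The ensemble picture can be made concrete with Boltzmann statistics: each substate’s population is set by its free energy, so stabilising one substate (for instance through ligand binding) shifts the whole distribution. Below is a minimal Python sketch of this idea; the free energies and the 2 kcal/mol stabilisation are made-up illustrative numbers.

```python
import numpy as np

kT = 0.593                        # kcal/mol at ~298 K
G = np.array([0.0, 0.5, 1.5])     # free energies of three substates (illustrative)

def populations(G, kT):
    """Boltzmann populations of substates with free energies G."""
    w = np.exp(-G / kT)
    return w / w.sum()

print(populations(G, kT))         # unbound ensemble: substate 1 dominates

# Ligand binding stabilises substate 3 by an assumed 2 kcal/mol;
# the equilibrium shifts towards the binding-competent conformation.
G_bound = G - np.array([0.0, 0.0, 2.0])
print(populations(G_bound, kT))   # substate 3 now dominates
```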

This equilibrium is not static: it is sensitive to the environment and many other factors. An equilibrium shift can be achieved by (a) sequence modifications of particular protein regions, termed protein segments, (b) post-translational modifications of the protein, (c) ligand binding, and so on.

So why are these concepts discussed and published again?

While the theory is straightforward, proving conformational selection is hard, and quantifying it computationally is harder still. Experimental techniques such as Nuclear Magnetic Resonance (NMR), single-molecule studies (e.g. protein yoga) and targeted mutagenesis with its effect on the energy landscape, together with molecular dynamics (MD) simulations, have been helping to conceptualise conformational transitions. Meanwhile, there is still a long way to go before a full understanding of atomic-scale pathways is achieved.

Talk: Membrane Protein 3D Structure Prediction & Loop Modelling in X-ray Crystallography

Seb gave a talk at the Oxford Structural Genomics Consortium on Wednesday 9 Jan 2013. The talk mentioned the work of several other OPIG members. Below is the gist of it.

Membrane protein modelling pipeline

[Figure: homology modelling pipeline with several membrane-protein-specific steps. The input is the target protein’s sequence; the output is the finished 3D model.]

Fragment-based loop modelling pipeline for X-ray crystallography

[Figure: fragment-based loop modelling pipeline. Given an incomplete model of a protein, as well as the current electron density map, our loop modelling method FREAD fills in a gap with many decoy structures; these decoys are then scored using electron density quality measures computed by EDSTATS. This process can be iterated to arrive at a complete model.]
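In outline, the FREAD/EDSTATS iteration described in the caption above looks something like the following Python sketch. The callables `fread_decoys` and `edstats_scores`, and the `with_loop` method on the model, are hypothetical wrappers for illustration only, not the real interfaces of those programs.

```python
# Schematic of the iterative loop-building process; the two callables are
# hypothetical stand-ins wrapping the real FREAD and EDSTATS programs.
def build_loops(model, density_map, gaps, fread_decoys, edstats_scores):
    """Fill each gap with the FREAD decoy that best fits the density."""
    for gap in gaps:
        decoys = fread_decoys(model, gap)               # candidate loop conformations
        scored = edstats_scores(decoys, density_map)    # (decoy, density-fit score) pairs
        if not scored:
            continue                                    # leave the gap for a later pass
        best_decoy, _ = max(scored, key=lambda pair: pair[1])
        model = model.with_loop(gap, best_decoy)        # accept the best-scoring decoy
    return model
```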

Over the past five years the Oxford Protein Informatics Group has produced several pieces of software to model various aspects of membrane protein structure. iMembrane predicts how a given protein structure sits in the lipid bilayer. MP-T aligns a target protein’s sequence to an iMembrane-annotated template structure. MEDELLER produces an accurate core model of the target, based on this target-template alignment. FREAD then fills in the remaining gaps through fragment-based loop modelling. We have assembled all these pieces of software into a single pipeline, which will be released to the public shortly. In the future, further refinements will be added to account for errors in the core model, such as helix kinks and twists.
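Chained together, the pipeline has a simple shape, sketched below in Python. The function names are hypothetical stand-ins for the four tools, not their real command-line or API interfaces, so they are passed in as plain callables.

```python
# Schematic of the membrane protein modelling pipeline described above.
# Each callable is a hypothetical stand-in for the corresponding OPIG tool.
def model_membrane_protein(target_sequence, template_structure,
                           imembrane_annotate, mpt_align,
                           medeller_build_core, fread_fill_gaps):
    annotated = imembrane_annotate(template_structure)   # position in the lipid bilayer
    alignment = mpt_align(target_sequence, annotated)    # target-template alignment
    core = medeller_build_core(alignment)                # accurate core model
    return fread_fill_gaps(core)                         # fragment-based loop modelling
```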

X-ray crystallography is the most prevalent way to obtain a protein’s 3D structure. In difficult cases, such as membrane proteins, often only low-resolution data can be obtained from such experiments, making the subsequent computational steps to arrive at a complete 3D model that much harder. This usually involves tedious manual building of individual residues and much trial and error. In addition, some regions of the protein (such as disordered loops) are simply not represented in the electron density at all, and it is difficult to distinguish these from areas that merely require a lot of work to build. To alleviate some of these problems, we are developing a scoring scheme to attach an absolute quality measure to each residue built by our loop modelling method FREAD, with a view towards automating protein structure solution at low resolution. This work is being carried out in collaboration with Frank von Delft’s Protein Crystallography group at the Oxford Structural Genomics Consortium.