Journal Club: AbDesign. An algorithm for combinatorial backbone design guided by natural conformations and sequences

Computational protein design methods often use a known molecule with a well-characterised structure as a template or scaffold. The chosen scaffold is modified so that its function (e.g. what it binds) is repurposed. Ideally, one wants to be confident that the expressed protein’s structure is going to be the same as the designed conformation. Therefore, successful designed proteins tend to be rigid, formed of collections of regular secondary structure (e.g. α-helices and β-sheets) and have active site shapes that do not perturb far from the scaffold’s backbone conformation (see this review).

A recent paper (Lapidoth et al 2015) from the Fleishman group proposes a new protocol to incorporate backbone variation (read loop conformations) into computational protein design (Figure 1). Using an antibody as the chosen scaffold, their approach aims to design a molecule that binds a specific patch (epitope) on a target molecule (antigen).

Figure 1 from Lapidoth et al 2015 shows an overview of the AbDesign protocol

Protein design works in the opposite direction to structure prediction. i.e. given a structure tell me what sequence will allow me to achieve that shape and to bind a particular patch in the way I have chosen. To do this one first needs to select a shape that could feasibly be achieved in vivo. We would hope that if a backbone conformation has previously been seen in the Protein Data Bank that it is one of such a set of feasible shapes.

Lapidoth et al sample conformations by constructing a backbone torsion angle database derived from known antibody structures from the PDB. From the work of North et al and others we also know that certain loop shapes can be achieved with multiple different sequences (see KK’s recent post). The authors therefore reduce the number of possible backbone conformations by clustering them by structural similarity. Each conformational cluster is represented by a representative and a position specific substitution matrix (PSSM). The PSSM represents how the sequence can vary whilst maintaining the shape.

The Rosetta design pipeline that follows uses the pre-computed torsion database to make a scaffold antibody structure (1x9q) adopt different backbone conformations. Proposed sequence mutations are sampled from the corresponding PSSM for the conformation. Shapes and the sequences that can adopt them, are ranked with respect to a docked pose with the antigen using several structure-based filters and Rosetta energy scores. A trade off is made between predicted binding and stability energies using a ‘fuzzy logic’ scheme.

After several rounds of optimisation the pipeline produces a predicted structure and sequence that should bind the chosen epitope patch and fold to form a stable protein when expressed. The benchmark results show promise in terms of structural similarity to known molecules that bind the same site (polar interactions, buried surface area). Sequence similarity between the predicted and known binders is perhaps lower than expected. However, as different natural antibody molecules can bind the same antigen, convergence between a ‘correct’ design and the known binder may not be guaranteed anyway.

In conclusion, my take home message from this paper is that to sensibly sample backbone conformations for protein design use the variation seen in known structures. The method presented demonstrates a way of predicting more structurally diverse designs and sampling the sequences that will allow the protein to adopt these shapes. Although, as the authors highlight, it is difficult to assess the performance of the protocol without experimental validation, important lessons can be learned for computational design of both antibodies and general proteins.

5 Thoughts For… Comparing Crystallographic Datasets

Most of the work I do involves comparing diffraction datasets from protein crystals. We often have two or more different crystals of the same crystal system, and want to spot differences between them. The crystals are nearly isomorphous, so that the structure of the protein (and crystal) is almost identical between the two datasets. However, it’s not just a case of overlaying the electron density maps, subtracting them and looking at the difference. Nor do we necessarily want to calculate Fo-Fo maps, where we calculate the difference by directly subtracting the diffraction data before calculating maps. By the nature of the crystallographic experiment, no two crystals are the same, and two (nearly identical) crystals can lead to two quite different datasets.

So, here’s a list of things I keep in mind when comparing crystallographic datasets…

Control the Resolution Limits

1) Ensure that the resolution limits in the datasets are the same, both at the high AND the low resolution limits.

The High resolution limit. The best known, and (usually) the most important statistic of a dataset. This is a measure of the amount of information that’s been collected about the dataset. Higher resolution data gives more detail for the electron density. Therefore, if you compare a 3A map to a 1A map, you’re comparing fundamentally different objects, and the differences between them will be predominantly from the different amount of information in each dataset. It’s then very difficult to ascertain what’s interesting, and what is an artefact of this difference. As a first step, truncate all datasets at the resolution you wish to compare them at.

The Low Resolution Limit. At the other end of the dataset, there can be differences in the low resolution data collected. Low resolution reflections correspond to much larger-scale features in the electron density. Therefore, it’s just as important to have the same low-resolution limit for both datasets, otherwise you get large “waves” of electron density (low-frequency fourier terms) in one dataset that are not present in the other. Because low-resolution terms are much stronger than high resolution reflections, these features stand out very strongly, and can also obscure “real” differences between the datasets you’re trying to compare. Truncate all datasets at the same low resolution limit as well.

Consider the Unit Cell

2) Even if the resolution limits are the same, the number of reflections in maps can be different.

The Unit Cell size and shape. Even if the crystals you’re using are the same crystal form, no two crystals are the same. The unit cell (the building block of the crystal) can be slightly different sizes and shapes between crystals, varying in size by a few percent. This can occur by a variety of reasons, from the unpredictable process of cooling the crystal to cryogenic temperatures to entirely stochastic differences from the process of crystallisation. Since the “resolution” of reflections depends on the size of the unit cell, two reflections with the same miller index can have different “resolutions” when it comes to selecting reflections for map calculation. Therefore, if you’re calculating maps from nearly-isomorphous but non-identical crystals, consider calculating maps based on an high and a low miller index cutoff, rather than a resolution cutoff. This ensures the same amount of information in each map (number of free parameters).

Watch for Missing Reflections

3) Remove any missing reflections from both datasets.

Reflections can be missing from datasets for a number of reasons, such as falling into gaps/dead pixels on the detector. However, this isn’t going to happen systematically with all crystals, as different crystals will be mounted in different orientations. When a reflection is missed in one dataset, it’s best to remove it from the dataset you’re comparing it to as well. This can have an important effect when the completeness of low- or high-resolution shells is low, whatever the reason.

Not All Crystal Errors are Created Equal…

4) Different Crystals have different measurement errors.

Observation uncertainties of reflections will vary from crystal to crystal. This may be due to a poor-quality crystal, or a crystal that has suffered from more radiation damage than another. These errors lead to uncertainty and error in the electron density maps. Therefore, if you’re looking for a reference crystal, you probably want to choose one with as small uncertainties, σ(F), in the reflections as possible.

Proteins are Flexible

5) Even though the crystals are similar, the protein may adopt slightly difference conformations.

In real-space, the protein structure varies from crystal to crystal. For the same crystal form, there will be the same number of protein copies in the unit cell, and they will be largely in the same conformation. However, the structures are not identical, and the inherent flexibility of the protein can mean that the conformation seen in the crystal can change slightly from crystal to crystal. This effect is largest in the most flexible regions of the protein, such as unconstrained C- and N- termini, as well as flexible loops and crystal contacts.

From 300 superpositions/sec to 6,000,000 superpositions/sec. A Python to C story

Part of the work I do requires me to identify uniqueness of structural shapes in specific proteins. People have done this in many ways, some more complex than others. I like more simple things so I decided to use clustering on top of a superposition algorithm. I will not detail too much the clustering part of this algorithm as this is not the focus of this article, but let’s assume that we need to make millions, possibly billions, of superpositions. In the following I will go through the steps I took in making a superposition library faster and faster, as the requirements demanded more and more. It will include information on how to profile your code, connect Python to C, and idiosyncrasies of parallelising C code.

Version 1.0 Python 100%

In PyFread I found a superposition algorithm, coupled with a PDB parser. I extracted the code and created a wrapping function. The signature of it was:

def superpose(file1, file2):

The files had to contain only ATOM lines with the same number of residues . It returned the RMSD of the best superposition, which was needed for the clustering. This could do a couple of a hundred superpositions/second. Using the inbuilt python profiler I found out that both the reading in part of the algorithm and the superposition are slowing down the program so I decided to move the contents of the wrapping function in C.

You can easily profile your code by running the command below. It will tell you for each method how many times it’s called and what is the total time spent in it. More information here https://docs.python.org/2/library/profile.html

python -m cProfile your_script.py

Version 2.0 C 99% – Python 1%

My script still needed to run from python but it’s functionality would be implemented in C. So I rewrote the whole thing in C (a PDB parser and superposition algorithm) and interfaced it to Python. The way to achieve this is using ctypes. This provides an interface for calling C/C++ methods from Python. The way you set this up is:

C code rmsd.cpp

extern “C” {
   void superpose_c(const char* file1, consta char* file2, double* rmsd){
       //perform superposition
       *rmsd = compute_rmsd();
   }
}

You compile this code with: g++ -shared -Wl,-soname,rmsdlib -o rmsdlib.so -fPIC rmsd.cpp . This creates a shared object which can be loaded in Python

Python Code

# load the ctypes library with C-like objects
from ctypes import *
# load the C++ shared objects
lib = cdll.LoadLibrary(‘./rmsdlib.so’)

#convert python string to c char array function
def to_c_string(py_string)
   c_string = c_char * (len(py_string)+1)()
   for i in range(len(py_string)):
       c_string[i] = py_string[i]
   c_string[i+1] = “”

rmsd = c_double()
lib.superpose_c(to_c_string(“file1.pdb”), to_c_string(“file2.pdb”), ref(rmsd))
# the use of ref is necessary so that the value set to rmsd inside the C function will be returned to Python

There are other ways of coding the Python – C Bridge I’m sure, and if you think you have something better and easier please comment.

Version 2.1 Armadillo Library

After running another batch of profiling, this time in C, I found that the bottleneck was the superposition code.

An easy way to profile C code is that when you compile it you use the -pg flag (gcc -pg …..). You then run the executable which will produce a file called gmon.out. Using that you run the command gprof executable_name gmon.out > analysis.txt . This will store the profiling info in analysis.txt. You can read more on this here http://www.thegeekstuff.com/2012/08/gprof-tutorial/

The superposition algorithm involved doing a Singular Value Decomposition, which in my implementation was slow. Jaroslaw Nowak found the Armadillo library which does a lot of cool linear algebra stuff very fast, and it is written in C. He rewrote the superposition using Armadillo which made it much faster. You can read about Armadillo here http://arma.sourceforge.net/

Version 3.0 C 100%

So what is the bottleneck now? Everytime I do a superposition the C function gets called from Python. Suppose I have the following files: file1.pdb, file2.pdb, file3.pdb, file4.pdb . I have to compare file1.pdb to every other file, which means I will be reading file1.pdb 3 times ( or 1 million times depending on what I had to do) in my original Python – C implementation. My clustering method is smarter than this, but similar redundancies did exists and they slowed down my program. I therefore decided to move everything in C, so that once I read a file I could keep it stored in memory.

Version 3.1 We have multiple cores, let’s use them

The superpositions can be parallelised, but how straightforward is doing this in C++? First of all you need to import the library #include <thread> . A simple scenario would be that I have a vector of N filenames and the program would have to compare everything with everything. The results would be stored in the upper triangle of an NxN array. The way I usually approach such a situation is that I send to each thread a list of comparison indeces (eg. Thread 1 compares 1-2, 1-3, 1-4; Thread 2-3, 2-4….), and the address of the results matrix (where the RMSD values would be stored). Because each thread will be writing to a different set of indeces in the matrix it should not be a problem that all the threads see the same results matrix (if you code it properly).

The way to start a thread is:

thread(par_method, …. , ref(results))

par_method is the method you are calling. When you pass an argument like results to a thread if it’s not specified with ref(..) it would be passed by value (it does not matter if normally in an unthreaded environment it would be passed by reference). ref is a reference wrapper and will make sure that all the threads see the same results instance.

Other problems you can come across is if you want to write to the same file from multiple threads, or modify vector objects which are passed by reference (with ref). They do not have thread safe operations, the program will crash if more than one thread will call a function on these objects. To make sure this does not happen you can use a lock. The way you achieve this in C is by using with mutex.

#include <mutex>
// Declare a global variable mutex that every thread sees
mutex output_mtx;

void parallel_method(….,vector<string>& files){
   output_mtx.lock();
   //Unsafe operation
   files.append(filename);
   output_mtx.unlock();
}

If one thread has locked output_mtx another one won’t finish the execution of output_mtx.lock() until the other first one unlocks it.

Version 3.2 12 million files in a folder is not a good idea

My original use case was that the method would receive two file names, the files would be read and the superposition done. This original use case stayed with the program, and even in version 3.1 you required one file for each fragment. Because the library was becoming faster and faster the things we wanted to try became more ambitious. You could hear me saying ‘Yeah sure we can cluster 12 million PDB fragments’. To do this I had to extract the 12 million fragments and store them in as many files. It wasn’t long until I received an angry/condecending email from the IT department. So what to do?

I decided to take each PDB file, remove everything except the ATOM fields and store them as binary files. In this way instead of providing two filenames to the superposition algorithm you provide two byte ranges which would be read from the pre-parsed PDB files. This is fast and you also have only 100k files(one for each PDB file), which you could use over and over again.

Version 3.2 is the latest version as of now but it is worth pointing out that whatever you do to make your code faster it will never be enough! The requirements become more ambitious as your code is getting faster. I can now perform 6 million superpositions/second on fragments of 4 residues, and 2 million superpositions/second on fragments of 5 residues. This declines exponentially and I can foresee a requirement to cluster all the 10 residue fragments from the PDB appearing sometime in the future.

Ten Simple Rules for a Successful Cross-Disciplinary Collaboration

The name of our research group (Oxford Protein Informatics Group) already indicates its cross-disciplinary character. In doing research of this type, we can acquire a lot of experience in working across the boundaries of research fields. Recently, one of our group members, Bernhard Knapp, became lead author of an article about guidelines for cross-disciplinary research. This article describes ten simple rules which you should consider if working across several disciplines. They include going to the other lab in person, understanding different rewards models, having patience with the pace of other disciplines, and recognising the importance of synergy.

The ten rules article was even picked up by a journalist of the “Times Higher Education” and further discussed in the newspaper.

Happy further interdisciplinary work!

Clustering Algorithms

Clustering is a task of organizing data into groups (called clusters), such that members of each group are more similar to each other than to members of other groups. This is a brief description of three popular clustering algorithms – K-Means, UPGMA and DBSCAN.

Cluster analysis

K-Means is arguably the simplest and most popular clustering algorithm. It takes one parameter – the expected number of clusters k. At the initialization step a set of k means m₁, m₂,…, m_kis generated (hence the name). At each iteration step the objects in the data are assigned to the cluster whose mean yields the least within-cluster sum of squares. After the assignments, the means are updated to be centroids of the new clusters. The procedure is repeated until convergence (the convergence occurs when the means no longer change between iterations).

The main strength of K-means is that it’s simple and easy to implement. The largest drawback of the algorithm is that one needs to know in advance how many clusters the data contains. Another problem is that with wrong initialization the algorithm can easily converge to a local minimum, which may result in suboptimal partitioning of data.

UPGMA is a simple hierarchical clustering method, where the distance between two clusters is taken to be the average of distances between individual objects in the clusters. In each step the closest clusters are combined, until all objects are in clusters where the average distance between objects is lower than a specified cut-off.

The UPGMA algorithm is often used for construction of phenetic trees. The major issue with the algorithm is that the tree it constructs is ultrametric, which means that the distance from root to any leaf is the same. In context of evolution, this means that the UPGMA algorithm assumes a constant rate of accumulation of mutations, an assumption which is often incorrect.

DBSCAN is a density-based algorithm which tries to separate the data into regions of high density, labelling points that lie in low-density areas as outliers. The algorithm takes two parameters – ε and minPts and looks for points that are density-connected with respect to ε. A point p is said to be density reachable from point q if there are points between p and q, such that one can traverse the path from p to q never moving further than ε in any step. Because the concept of density-reachability is not symmetric a concept of density-connectivity is introduced. Two points p and q are density-connected if there is a point o such that both p and q are density reachable from o. A set of points is considered a cluster if all points in the set are mutually density-connected and the number of points in the set is equal to or greater than minPts. The points that cannot be put into clusters are classified as noise.

a) Illustrates the concept of density-reachability.
b) Illustrates the concept of density-connectivity

The DBSCAN algorithm can efficiently detect clusters with non-globular shapes since it is sensitive to changes in density only. Because of that, the clustering reflects real structure present in the data. The problem with the algorithm is the choice of parameter ε which controls how large the difference in density needs to be to separate two clusters.

Investigating structural mechanisms with customised Natural Moves

This is a brief overview of my current work on a protocol for studying molecular mechanisms in biomolecules. It is based on Natural Move Monte Carlo (NMMC), a technique pioneered by one of my supervisors, Peter Minary.

NMMC allows the user to decompose a protein or nucleic acid structure into segments of atoms/residues. Each segment is moved collectively and treated as a single body. This gives rise to a sampling strategy that considers the structure as a fluid of segments and probes the different arrangements in a Monte Carlo fashion.

Traditionally the initial decomposition would be the only one that is sampled. However, this decomposition might not always be optimal. Critical degrees of freedom might have been missed out or chosen sub-optimally. Additionally, if we want to test the causality of a structural mechanism it can be informative to perform NMMC simulations for a variety of decompositions. Here I show an example of how customised Natural Moves may be applied on a DNA system.

Investigating the effect of epigenetic marks on the structure of a DNA toy model

Figure 1. Three levels at which epigenetic activity takes place. Figure adopted from Charles C. Matouk, and Philip A. Marsden Circulation Research. 2008;102:873-887

Epigenetic marks on DNA nucleotides are involved in regulating gene expression (Point 1 in figure 1). We have a limited understanding of the underlying molecular mechanism of this process. There are two mechanisms that are thought to be involved: 1) Direct recognition of the epigenetic mark by DNA binding proteins and 2) indirect recognition of changes in the local DNA structure caused by the epigenetic mark. Using customised Natural Moves we are currently trying to to gain insight into these mechanisms.

One type of epigenetic mark is the 5-hydroxymethylation (5hm) on cytosines. Lercher et al. have recently solved a crystal structure of the Drew-Dickerson Dodecamer with two of these epigenetic marks attached. Given the right sequence context (e.g. CpG) they have found that this epigenetic mark can form a hydrogen bond between two neighbouring bases on the same strand. This raises the question: Can an intra-strand CpG hydrogen bond alter the DNA helical parameters? We used this as a toy model to test our technology.

Figure 2b

Figure 2a

Figure 2a shows the Drew-Dickerson Dodecamer schematically. Figure 2b shows the three sets of degrees of freedom that we used as test cases.

Top (11): Default sampling – Serves as a reference simulation.
Middle (01): Fixed 5hm torsions – By forcing the hydroxyl group of 5hmC towards the guanine we significantly increase the chance of the hydrogen bond forming.
Bottom (00): Collective movement of the neighbouring bases 5hmC and G – By grouping the two bases into a segment we aim to emulate the dampening effect that a hydrogen bond may have on their movements relative to each other.

By modifying these degrees of freedom (1=active/ 0=inactive) we attempted to amplify any effects that the CpG hydrogen bond may have on the DNA structure.

Below you can see animations of the three test cases 11, 01, 00 during simulation, zoomed in on the 5hm modification and the neighbouring base pair:

It appeared that the default degrees of freedom were not sufficient to detect a change in the DNA structure when comparing simulations of unmodified and modified structures. The other two test cases with customised Natural Moves, however, showed alterations in some of the helical parameters.

Lercher et al. saw no differences in their modified and unmodified crystal structures. It seems that single 5hm epigenetic marks are not sufficient to significantly alter DNA structure. Rather, clusters of these modifications with accumulated structural effects may be required to cause significant changes in DNA helical parameters. CpG islands may be a promising candidate.

@samdemharter

SAS-5 assists in building centrioles of nematode worms Caenorhabditis elegans

We have recently published a paper in eLife describing the structural basis for the role of protein SAS-5 in initiating the formation of a new centriole, called a daughter centriole. But why do we care and why is this discovery important?

We, as humans – a branch of multi-cellular organisms, are in constant demand of new cells in our bodies. We need them to grow from an early embryo to adult, and also to replace dead or damaged cells. Cells don’t just appear from nowhere but undergo a tightly controlled process called cell cycle. At the core of cell cycle lies segregation of duplicated genetic material into two daughter cells. Pairs of chromosomes need to be pulled apart millions of millions times a day. Errors will lead to cancer. To avoid this apocalyptic scenario, evolution supplied us with centrioles. Those large molecular machines sprout microtubules radially to form characteristic asters which then bind to individual chromosomes and pull them apart. In order to achieve continuity, centrioles duplicate once per cell cycle.

Similarly to many large macromolecular assemblies, centrioles exhibit symmetry. A few unique proteins come in multiple copies to build this gigantic cylindrical molecular structure: 250 nm wide and 500 nm long (the size of a centriole in humans). The very core of the centriole looks like a 9-fold symmetrical stack of cartwheels, at which periphery microtubules are vertically installed. We study protein composition of this fascinating structure in the effort to understand the process of assembling a new centriole.

Molecular architecture of centrioles.

SAS-5 is an indispensable component in C. elegans centriole biogenesis. SAS-5 physically associates with another centriolar protein, called SAS-6, forming a complex which is required to build new centrioles. This process is regulated by phosphorylation events, allowing for subsequent recruitment of SAS-4 and microtubules. In most other systems SAS-6 forms a cartwheel (central tube in C. elegans), which forms the basis for the 9-fold symmetry of centrioles. Unlike SAS-6, SAS-5 exhibits strong spatial dynamics, shuttling between the cytoplasm and centrioles throughout the cell cycle. Although SAS-5 is an essential protein, depletion of which completely terminates centrosome-dependent cell division, its exact mechanistic role in this process remains obscure.

IN BRIEF: WHAT WE DID
Using X-ray crystallography and a range of biophysical techniques, we have determined the molecular architecture of SAS-5. We show that SAS-5 forms a complex oligomeric structure, mediated by two self-associating domains: a trimeric coiled coil and a novel globular dimeric Implico domain. Disruption of either domain leads to centriole duplication failure in worm embryos, indicating that large SAS-5 assemblies are necessary for function. We propose that SAS-5 provides multivalent attachment sites that are critical for promoting assembly of SAS-6 into a cartwheel, and thus centriole formation.

For details, check out our latest paper 10.7554/eLife.07410!

@kbrogala

Top panel: cartoon overview of the proposed mechanism of centriole formation. In cytoplasm, SAS-5 exists at low concentrations as a dimer, and each of those dimers can stochastically bind two molecules of SAS-6. Once SAS-5 / SAS-6 complex is targeted to the centrioles, it starts to self-oligomerise. Such self-oligomerisation of SAS-5 allows for the attached molecules of SAS-6 to form a cartwheel. Bottom panel: detailed overview of the proposed process of centriole formation. In cytoplasm, where concentration of SAS-5 is low, the strong Implico domain (SAS-5 Imp, ZZ shape) of SAS-5 holds the molecule in a dimeric form. Each SAS-5 protomer can bind (through the disordered linker) to the coiled coil of dimeric SAS-6. Once SAS-5 / SAS-6 complex is targeted to the site where a daughter centriole is to be created, SAS-5 forms higher-order oligomers through self-oligomerisation of its coiled coil domain (SAS-5 CC – triple horizontal bar). Such large oligomer of SAS-5 provides multiple attachments sites for SAS-6 dimers in a very confied space. This results in a burst of local concentration of SAS-6 through the avidity effect, allowing an otherwise weak oligomer of SAS-6 to also form larger species. Effectively, this seeds the growth of a cartwheel (or a spiral in C. elegans), which in turn serves as a template for a new centriole.

Investigating GPCR kink variation

G-protein coupled receptors (GPCRs) are the target of 50-60% of drugs, including many of those involved in the treatment of cancer and cardiovascular disease. Over 100 GPCR crystal structures are now available, but these are for only around 30 different receptors, and there are still hundreds more receptors for which no structure exists. There is huge diversity in the ligands which bind to GPCRs, so it may often be difficult to predict the shape of a binding pocket for a specific receptor of interest, especially if no close relatives have a structure solved.

Helix kinks (see previous blog posts) are a structural feature of GPCRs which are thought to be important for function. An ability to predict their presence and the magnitude of helix direction change is important for obtaining an accurate structure. A kink prediction method has already been used in the context of GPCR structure prediction, which scored the overall structures after replacing kink segments with others from a database. This made it possible to predict the change in a kink angle based on the stability of the whole GPCR structure.

To better inform this kind of modelling, we wanted to investigate specifically how much variation there is in kink angles between GPCRs. To do this we used the tool Kink Finder to measure angles in all of the transmembrane helices of the GPCRs in the GPCRDB, and estimate a confidence interval on those angles. Then we could state whether the variation that we see in GPCR kink angles is greater than what we would expect from measurement error alone.

Each helix appears to show different behaviour. Some helices were very well conserved, but others showed a huge amount of variation. For these helices with very variable angles, it would be interesting to know if this is a change related to sequence differences, or conformational flexibility between more than one preferred conformation. We found an example where significantly different angles were found even in the same receptor. In this case, the kink angle size is related to whether the structure has an agonist or an antagonist bound, so we propose that this is a functionally relevant and flexible kink.

We also carried out the same analysis on helices from other families of membrane and soluble proteins, and found many more highly variable kinks (one example shown below). This shows that they should be a very important consideration when carrying out homology modelling, and that their conformational flexibility could also be important for function in many other contexts.

Submitting your thesis!

Writing and submitting your thesis is (almost) the final stage of completing your PhD. It can be the most stressful and unpleasant part of the process… but it can also be rewarding to see the story of your last three years’ work fall into place.

All I want for christmas is… “Piled Higher and Deeper” by Jorge Cham (www.phdcomics.com)

This post is a miscellaneous collection of advice and resources about the submission process, most of which have been passed down from the very first members of OPIG. Hopefully it will be useful to have it all in the same place for present and future members. Feel free to comment here if you have any tips I have missed!

All information and links that I’ve included are correct at the time of writing (for Oxford University Statistics students) but you should always use the university’s guidelines as your primary resource.

The very beginning: the plan

Don’t spend too long on this! But you should have an idea of your planned chapter titles and an overall story for your book. Also useful is a timeline for when you will finish drafts of chapters by. Try to be realistic with this. If you decide to change your thesis title you should fill out an application for change of thesis title form (GSO.6). Make sure you look up any restrictions (word/page limits etc.) which may apply, and confirm your hand-in date.

Starting writing

It’s a good idea to decide what you will use to write your thesis. Most OPIG members use LaTeX. There are some great thesis templates out there but the one most people tend to use is one from Cambridge’s Engineering department. You can do a fair bit of customisation within that template… changing fonts, headers, titles and more, but it’s a great starting point.

When the finish line’s in sight: choosing examiners

A couple of months before you are planning to submit your thesis you should discuss with your supervisor(s) potential examiners. Your supervisor can informally check with them if they are happy to examine you and then you should fill out an appointment of examiners form (GSO.3). You can also change your thesis title on this form without filling in GSO.6.

Finishing writing

Your final document is likely to be over 100 pages with thousands of words (or potential typos as you might come to call them). A great LaTeX spell checker is aspell which should already be installed on your work machine. To spell check a .tex file (ignoring all TeX notation… apart from multiple citations I found!) using a British dictionary simply type:

aspell --lang=en_GB -t -c filename.tex

You’re absolutely guaranteed to still have typos floating around but it’s a decent start. You (and others if you can get them) should manually proof-read as well!

Final Formatting

Your thesis should be set out on numbered, portrait A4 pages. It should be double spaced and the inner (bound) margin should be 3-3.5cm. For more details on the formatting required check out the university’s regulations.

Printing and binding

When you’re happy with your proof-reading (you’re still almost guaranteed to have remaining typos) you’ll have to print and bind your finished book! To comply with university guidelines you will need to submit two copies, for each of your examiners, to the exam schools. You may also like to print a copy for yourself (you will need one to take with you into your viva). Before you start, if you are printing in colour at the department make sure you have enough printer credit by emailing IT (let them know the printer and your Bod card number and they will top you up if necessary).

If you are planning to print your copies double sided you may want to buy your own paper of higher quality than that provided by the department (at least 100gsm). As of October 2014, the Oxford Print Centre was selling the cheapest packs of 100gsm paper we could find but sold out close to deadline day! Also check out WHSmiths or Ryman’s.

You might want to do a test run of a few colour pages of your thesis before you send the whole file to be printed. Printing at 1200dpi (instead of the default 600dpi) can improve the appearance of your figures considerably. You may want to stay late at the office to print so you are not disturbed by other print jobs during office hours.

Your thesis should be securely bound in either hard or soft cover. Loose-leaf or spiral binding won’t be accepted. There are several binding facilities through Oxford but I used the Oxford Print Centre just down the road, which also guarantees a one hour service for soft binding even on submission days.

Submission

Submit your completed copies to the exam schools, noting their opening hours (08.30-17:00, Monday to Friday), take the traditional photo, and bask in your newly found FREEDOM (try to forget about the viva!).

Journal Club: The Origin of CDR H3 Structural Diversity

Antibody binding site is broadly composed of the six hypervariable loops, the CDRs. There are three loops on the antibody light chain (L1, L2 and L3) and three loops on the antibody heavy chain (H1, H2 and H3).

Out of the six loops, five appear to adopt a constrained set of structural conformations (L1, L2, L3, H1 and H2). The conformations of H3 appear much less constrained, which was suggested to be the result of its higher relative importance in antigen recognition (however it is not a necessary condition). The only observations to date about the shapes of CDR-H3 is the existence of the extended and kinked conformations of its anchor.

The function of the kink was investigated recently by Weitzner et al. Here, the authors contrasted the geometry found in the antibody CDR-H3 loops to a set of 15k non-antibody polypeptides. They found that even though the extended conformation appears to be more favorable, the kinked one can also be found in many cases, particularly in the PDZ domains.

Weitzner et al. find that the extended conformation is much more common in non-antibody loops. However, the kinked conformation, even though less frequent is not outright rare. The situation is the opposite in antibodies where the majority of H3 conformations are kinked rather than extended.

The authors contrasted the sequence patterns of kinked antibody loops and kinked non-antibody loops and did not find anything predictive of the kinked conformation — suggesting that the effect might be non-local. Nonetheless, the secondary structure pattern of the kinked H3 and the kinked non-antibody loops appears similar.

Even though there might be no sequence-kink link, the authors indicate how their findings might improve H3 structure prediction. They demonstrate that admixing the kinked non-antibody loops into a template dataset for an H3 modeling software might provide more relevant templates.

In conclusion, the main message of the paper (selon moi) is putting forward of the hypothesis as to the role of the H3 kink. Since the kink is much more prevalent in H3 than in non-antibody proteins, there is a strong suggestion that there might be a special role for it. The authors suggest that the kinked conformation allows for more structural diversity, that would otherwise be restricted in the more rigid beta-stranded extended conformation. Thus, antibodies might have opted for a system wherein, they do not need to add dramatic mutations to their H3 in order to get more structural flexibility.