Category Archives: How To

How to do things. doh.

From 300 superpositions/sec to 6,000,000 superpositions/sec. A Python to C story

Part of the work I do requires me to identify uniqueness of structural shapes in specific proteins. People have done this in many ways, some more complex than others. I like more simple things so I decided to use clustering on top of a superposition algorithm. I will not detail too much the clustering part of this algorithm as this is not the focus of this article, but let’s assume that we need to make millions, possibly billions, of superpositions. In the following I will go through the steps I took in making a superposition library faster and faster, as the requirements demanded more and more. It will include information on how to profile your code, connect Python to C, and idiosyncrasies of parallelising C code.

Version 1.0 Python 100%

In PyFread I found a superposition algorithm, coupled with a PDB parser. I extracted the code and created a wrapping function. The signature of it was:

def superpose(file1, file2):

The files had to contain only ATOM lines with the same number of residues . It returned the RMSD of the best superposition, which was needed for the clustering. This could do a couple of a hundred superpositions/second. Using the inbuilt python profiler I found out that both the reading in part of the algorithm and the superposition are slowing down the program so I decided to move the contents of the wrapping function in C.

You can easily profile your code by running the command below. It will tell you for each method how many times it’s called and what is the total time spent in it. More information here https://docs.python.org/2/library/profile.html

python -m cProfile your_script.py

Version 2.0 C 99% – Python 1%

My script still needed to run from python but it’s functionality would be implemented in C. So I rewrote the whole thing in C (a PDB parser and superposition algorithm) and interfaced it to Python. The way to achieve this is using ctypes. This provides an interface for calling C/C++ methods from Python. The way you set this up is:

C code rmsd.cpp

extern “C” {
   void superpose_c(const char* file1, consta char* file2, double* rmsd){
       //perform superposition
       *rmsd = compute_rmsd();
   }
}

You compile this code with: g++ -shared -Wl,-soname,rmsdlib -o rmsdlib.so -fPIC rmsd.cpp . This creates a shared object which can be loaded in Python

Python Code

# load the ctypes library with C-like objects
from ctypes import *
# load the C++ shared objects
lib = cdll.LoadLibrary(‘./rmsdlib.so’)

#convert python string to c char array function
def to_c_string(py_string)
   c_string = c_char * (len(py_string)+1)()
   for i in range(len(py_string)):
       c_string[i] = py_string[i]
   c_string[i+1] = “”

rmsd = c_double()
lib.superpose_c(to_c_string(“file1.pdb”), to_c_string(“file2.pdb”), ref(rmsd))
# the use of ref is necessary so that the value set to rmsd inside the C function will be returned to Python

There are other ways of coding the Python – C Bridge I’m sure, and if you think you have something better and easier please comment.

Version 2.1 Armadillo Library

After running another batch of profiling, this time in C, I found that the bottleneck was the superposition code.

An easy way to profile C code is that when you compile it you use the -pg flag (gcc -pg …..). You then run the executable which will produce a file called gmon.out. Using that you run the command gprof executable_name gmon.out > analysis.txt . This will store the profiling info in analysis.txt. You can read more on this here http://www.thegeekstuff.com/2012/08/gprof-tutorial/

The superposition algorithm involved doing a Singular Value Decomposition, which in my implementation was slow. Jaroslaw Nowak found the Armadillo library which does a lot of cool linear algebra stuff very fast, and it is written in C. He rewrote the superposition using Armadillo which made it much faster. You can read about Armadillo here http://arma.sourceforge.net/

Version 3.0 C 100%

So what is the bottleneck now? Everytime I do a superposition the C function gets called from Python. Suppose I have the following files: file1.pdb, file2.pdb, file3.pdb, file4.pdb . I have to compare file1.pdb to every other file, which means I will be reading file1.pdb 3 times ( or 1 million times depending on what I had to do) in my original Python – C implementation. My clustering method is smarter than this, but similar redundancies did exists and they slowed down my program. I therefore decided to move everything in C, so that once I read a file I could keep it stored in memory.

Version 3.1 We have multiple cores, let’s use them

The superpositions can be parallelised, but how straightforward is doing this in C++? First of all you need to import the library #include <thread> . A simple scenario would be that I have a vector of N filenames and the program would have to compare everything with everything. The results would be stored in the upper triangle of an NxN array. The way I usually approach such a situation is that I send to each thread a list of comparison indeces (eg. Thread 1 compares 1-2, 1-3, 1-4; Thread 2-3, 2-4….), and the address of the results matrix (where the RMSD values would be stored). Because each thread will be writing to a different set of indeces in the matrix it should not be a problem that all the threads see the same results matrix (if you code it properly).

The way to start a thread is:

thread(par_method, …. , ref(results))

par_method is the method you are calling. When you pass an argument like results to a thread if it’s not specified with ref(..) it would be passed by value (it does not matter if normally in an unthreaded environment it would be passed by reference). ref is a reference wrapper and will make sure that all the threads see the same results instance.

Other problems you can come across is if you want to write to the same file from multiple threads, or modify vector objects which are passed by reference (with ref). They do not have thread safe operations, the program will crash if more than one thread will call a function on these objects. To make sure this does not happen you can use a lock. The way you achieve this in C is by using with mutex.

#include <mutex>
// Declare a global variable mutex that every thread sees
mutex output_mtx;

void parallel_method(….,vector<string>& files){
   output_mtx.lock();
   //Unsafe operation
   files.append(filename);
   output_mtx.unlock();
}

If one thread has locked output_mtx another one won’t finish the execution of output_mtx.lock() until the other first one unlocks it.

Version 3.2 12 million files in a folder is not a good idea

My original use case was that the method would receive two file names, the files would be read and the superposition done. This original use case stayed with the program, and even in version 3.1 you required one file for each fragment. Because the library was becoming faster and faster the things we wanted to try became more ambitious. You could hear me saying ‘Yeah sure we can cluster 12 million PDB fragments’. To do this I had to extract the 12 million fragments and store them in as many files. It wasn’t long until I received an angry/condecending email from the IT department. So what to do?

I decided to take each PDB file, remove everything except the ATOM fields and store them as binary files. In this way instead of providing two filenames to the superposition algorithm you provide two byte ranges which would be read from the pre-parsed PDB files. This is fast and you also have only 100k files(one for each PDB file), which you could use over and over again.

Version 3.2 is the latest version as of now but it is worth pointing out that whatever you do to make your code faster it will never be enough! The requirements become more ambitious as your code is getting faster. I can now perform 6 million superpositions/second on fragments of 4 residues, and 2 million superpositions/second on fragments of 5 residues. This declines exponentially and I can foresee a requirement to cluster all the 10 residue fragments from the PDB appearing sometime in the future.

Submitting your thesis!

Writing and submitting your thesis is (almost) the final stage of completing your PhD. It can be the most stressful and unpleasant part of the process… but it can also be rewarding to see the story of your last three years’ work fall into place.

All I want for christmas is… “Piled Higher and Deeper” by Jorge Cham (www.phdcomics.com)

This post is a miscellaneous collection of advice and resources about the submission process, most of which have been passed down from the very first members of OPIG. Hopefully it will be useful to have it all in the same place for present and future members. Feel free to comment here if you have any tips I have missed!

All information and links that I’ve included are correct at the time of writing (for Oxford University Statistics students) but you should always use the university’s guidelines as your primary resource.

The very beginning: the plan

Don’t spend too long on this! But you should have an idea of your planned chapter titles and an overall story for your book. Also useful is a timeline for when you will finish drafts of chapters by. Try to be realistic with this. If you decide to change your thesis title you should fill out an application for change of thesis title form (GSO.6). Make sure you look up any restrictions (word/page limits etc.) which may apply, and confirm your hand-in date.

Starting writing

It’s a good idea to decide what you will use to write your thesis. Most OPIG members use LaTeX. There are some great thesis templates out there but the one most people tend to use is one from Cambridge’s Engineering department. You can do a fair bit of customisation within that template… changing fonts, headers, titles and more, but it’s a great starting point.

When the finish line’s in sight: choosing examiners

A couple of months before you are planning to submit your thesis you should discuss with your supervisor(s) potential examiners. Your supervisor can informally check with them if they are happy to examine you and then you should fill out an appointment of examiners form (GSO.3). You can also change your thesis title on this form without filling in GSO.6.

Finishing writing

Your final document is likely to be over 100 pages with thousands of words (or potential typos as you might come to call them). A great LaTeX spell checker is aspell which should already be installed on your work machine. To spell check a .tex file (ignoring all TeX notation… apart from multiple citations I found!) using a British dictionary simply type:

aspell --lang=en_GB -t -c filename.tex

You’re absolutely guaranteed to still have typos floating around but it’s a decent start. You (and others if you can get them) should manually proof-read as well!

Final Formatting

Your thesis should be set out on numbered, portrait A4 pages. It should be double spaced and the inner (bound) margin should be 3-3.5cm. For more details on the formatting required check out the university’s regulations.

Printing and binding

When you’re happy with your proof-reading (you’re still almost guaranteed to have remaining typos) you’ll have to print and bind your finished book! To comply with university guidelines you will need to submit two copies, for each of your examiners, to the exam schools. You may also like to print a copy for yourself (you will need one to take with you into your viva). Before you start, if you are printing in colour at the department make sure you have enough printer credit by emailing IT (let them know the printer and your Bod card number and they will top you up if necessary).

If you are planning to print your copies double sided you may want to buy your own paper of higher quality than that provided by the department (at least 100gsm). As of October 2014, the Oxford Print Centre was selling the cheapest packs of 100gsm paper we could find but sold out close to deadline day! Also check out WHSmiths or Ryman’s.

You might want to do a test run of a few colour pages of your thesis before you send the whole file to be printed. Printing at 1200dpi (instead of the default 600dpi) can improve the appearance of your figures considerably. You may want to stay late at the office to print so you are not disturbed by other print jobs during office hours.

Your thesis should be securely bound in either hard or soft cover. Loose-leaf or spiral binding won’t be accepted. There are several binding facilities through Oxford but I used the Oxford Print Centre just down the road, which also guarantees a one hour service for soft binding even on submission days.

Submission

Submit your completed copies to the exam schools, noting their opening hours (08.30-17:00, Monday to Friday), take the traditional photo, and bask in your newly found FREEDOM (try to forget about the viva!).

Quick Standalone BLAST Setup for Ubuntu Linux

Some people run into trouble trying to setup a standalone version of BLAST using the NCBI instructions. Here a stremalined process will be presented, targeted at Ubuntu.

I assume that you are aware of the paradigms of blast, meaning that there are several executables for searching nucleic acids or proteins and there are different databases you can blast against. Sinon, you should read up on the available search tools and databases before you attempt to install Blast. NB, throughout this document, I am using protein blast and protein input – changing to nucleotide sequences is trivial as you just change blastp to blastn and ‘prot’ to ‘nt’ in obvious places (and of course you use different queries and target databases).

Without further ado, Blast setup for UNIX.

There are two components for the installation:

Executables (bastn, blastp etc.)
Databases. (nr, nt etc.)

Both are described below with follow-up examples of usage.

Ad.1 The executables can be downloaded and compiled from here (download the source, run ./configure then make and finally make install in the directory of the untarred file). However a much easier way to do it under Ubuntu is:

sudo apt-get install ncbi-blast+

This automatically installs everything. In both cases to check if all went ok, type:

which blastp

If you get a directory such as /usr/local/bin than all went well and that’s where your executables are.

Ad.2 FIrst, you need to decide on where to store the databases. Do this by setting the environment variable:

export BLASTDB=/path/to/blastdbs/of/your/chosing

Now, we can either use one of the ncbi-curated databases or create our own. We will do both.

A) Downloading and using an ncbi-curated database.

The databases can be downloaded using the update_blastdb script. As an example I will download a non redundant protein database which is referred to as ‘nr’:

cd $BLASTDB
sudo update_blastdb --passive --timeout 300 --force --verbose nr
ls *.gz |xargs -n1 tar -xzvf
rm *.gz.*

The penultimate command extracts all the files you have downloaded and the last one removes the downloaded archives.

Now you should be able to use your new database by executing (where somesequence.fasta is your sample query):

blastp -db nr -query somesequence.fasta

Done.

B) Creating your own database.

Firstly, put a bunch of fasta protein sequences into a file called sample.fa

Next, execute the following

makeblastdb -in sample.fa -dbtype 'prot' -out NewDb
mv NewDB* $BLASTDB/

We have now created a blast protein database from your fasta file, called NewDB. The last line simply moves all the blast files to the database directory.

Now you should be able to use your new database by executing (where somesequence.fasta is your sample query):

blastp -db NewDb -query somesequence.fasta

Done.

Afterword

These instructions are the shortest way I could find to get a working stand-alone BLAST application. If you require more info, you can look here.

A Colourblind Guide to Colourful Presentations…

Like many people, I am colourblind.

Fortunately I am only ‘mildly’ red-green colourblind and it doesn’t have a huge detrimental effect on my life.

Firstly, to dispel a couple of misconceptions:

I can still see colour. ‘blindness’ here would be better called ‘deficiency’ or ‘desensitivity’. I am simply less sensitive to reds/greens than the ‘normal’ eye. Whilst I can discriminate between blues/yellows, it is harder to distinguish some reds from some greens.
Colour blindness is not the swapping of colours. I don’t accidentally call red and green the wrong things – I just can’t tell what a colour is in some cases.
I have no problem with traffic lights.
Colour blindness does not mean poor eyesight. My cornea, lens, etc work fine, thank-you-very-much.

Approximately 8% of men and 0.5% of women are colourblind to various extents. There is a wide range of types, and severities, of colourblindness. For more information, there are a number of websites with helpful descriptions – This, for example…

There’s even a nature paper about colour blindness awareness…

The standard tests for colour-blindness are the well-recognised Ishihara colour tests. Lets do a few (just for fun)…

An example Ishihara Colour Test. Most people can see the ’12’ in the image. Image: Wikipedia

Another Colour Test. You might be able to see a ‘6’ in the image (I can’t…). Image: Wikipedia

Another Colour Test. You should see nothing in this image. I see a ‘2’. Image: Wikipedia

The last one. You should see a ’42’. I can just about see a nondescript blur that might be a ‘4’ on the left hand side. Image: Wikipedia

To give an idea of what it’s like, this page gives a very good example. For a theatre booking system, they indicate the seats that offer a restricted view of the stage –

Restricted view seats are marked in orange - or are they?

Restricted view seats are clearly indicated – or are they? Image: www.digitalartsonline.co.uk

Whilst most people will be able to tell where the best seats are, for those with colour blindness it might not be so easy. The image below shows the same image from the point of view of someone with colour blindness – can you still be sure of which seat is which?

Still clear? Image: www.digitalartsonline.co.uk

Mostly, being colourblind doesn’t affect my life (when I’m not at the theatre). However, there is one area of my life where being colourblind is *really* annoying: presentations (and picking ties to match shirts, but I’ve got that figured out now).

So here’s the Nick-approved guide to making colour-blind friendly presentations.

Choose a colour scheme that is colour-blind friendly – these are readily available online. This is mainly for graphs. Just generally avoid pale green-pale red mixtures. Purples and pinks can also be pretty confusing.
Along with the above, high contrast colour schemes can be very hard to see. For instance, a presentation with a white background can make it difficult to see coloured things on the slide, as everything is drowned out by the white background – especially yellow/green text. It is also very tiring to the eye. Try dark-coloured fonts on a light-coloured background.
In graphs, don’t just use colours to match lines to the legend – matching colours from lines to the colours on the legend is hard – use shapes as well, or label the lines. An example.
If 3. is impossible, make the lines on graphs a decent thickness – small areas of colour are harder to determine.
When referring to slide, try not to refer to ‘the red box’. Refer instead to ‘the rounded red box in the top-right of the screen’.
Please don’t use red laser pointers – these are evil [citation needed]. The red light is not easily distinguishable on bright screens (or if it’s zipping around the screen). Use a green laser pointer instead. Not only are green laser pointers generally more powerful, and therefore brighter, but they are also easier to see. Why?

For a fairly comprehensive guide of how to make colour-friendly presentations, look at this page. And for checking how things might look, there are many colour-blind simulators for both images and webpages.

I hope this helps to create colour-friendly presentations.

Get PDB intermolecular protein contacts and interface residues

Very often in Struc Bio it is necessary to determine the contacts between two molecules. Most of us in the group have written a snippet of code to compute precisely that or they have adapted the Biopython functionality or one of the tools in pdbtools. The piece of code written in Python presented here is a Biopython variety that gives you the intermolecular contacts and it annotates the interface neighborhood. An example of the program output is given in the Figure below:

The complex between an antibody and an antigen is shown on the left without the annotation. On the right it is shown with intermolecular contacts annotated in red (4.5A distance) and the interface neighborhood shown in green (10A away from any contact residue).

Download

You can download it from here. There are three files inside:

GetInterfaces.py – the main source/runnable file
README.txt – instructions, very similar to this post (quite a lot copy/pasted)
1A2Y.pdb – the PDB file used in the example to practice on.

Requirements:

You need Biopython. (if you are from OPIG or any other Bioinformatics group, most likely it is already installed on your machine). You can download it from here.

How to use it?

As the bare minimum, you need to provide the structure of the pdb(s) and the chains that you want to examine contacts in.

Input options:

–f1 : first pdb file [Required]
–f2 : second pdb file (if the contacts are to be calculated in the same molecule, just submit the same pdb in both cases) [Required]
–c1 : Chains to be used for the first molecule [Required]
–c2 : Chains to be used for the second molecule [Required]
–c : contact cutoff for intermolecular contacts (Optional, set to 4.5A if not supplied on input)
–i : interface neighbor cutoff for intramolecular neighborhood of the contacting interface (Optional, set to 10.0A if not supplied on input). Set this option to zero (0.0) if you only want to get the intermolecular contacts in the interface, without the interface neighborhood.
–jobid : name for the output folder (Set to out_<random number> if not supplied on input)

An example which you can just copy paste and run when in the same directory as the python script:

python GetInterfaces.py --f1 1A2Y.pdb --f2 1A2Y.pdb --c1 AB --c2 C --c 4.5 --i 10.0 --jobid example_output

Above command will calculate the contacts between antibody in 1a2y (chains A and B) and the antigen (chain C). The contact distance was defined as 4.5A and the interface distance was defined as 10A. All the output files are saved in the folder out_example_output.

Output

Output folder is placed in the current directory. If you specify the output folder name (–jobid) it will be saved under the name ‘out_[whateveryoutyped]’, otherwise it will be ‘out_[randomgeneratednumber]’. The program tells you at the end where it saved all the files.

Input options:

molecule_1.pdb – the first supplied molecule with b-factor fields of contacts set to 100 and interface neighborhood set to 50
molecule_2.pdb – the second supplied molecule with b-factor fields of contacts set to 100 and interface neighborhood set to 50
molecule_1.txt – whitespace delimited file specifying the contacts and interface neighborhood in the second molecule in the format: [chain] [residue id] [contact ‘C’ or interface residues ‘I’]
molecule_2.txt – whitespace delimited file specifying the contacts and interface neighborhood in the second molecule in the format: [chain] [residue id] [contact ‘C’ or interface residues ‘I’]
molecule_1_constrained.pdb – the first molecule, which is constrained only to the residues in the interface.
molecule_2_constrained.pdb – the second molecule, which is constrained only to the residues in the interface.
parameters.txt – the contact distance and neighborhood distance used for the particular run.

Django for scientific applications

In my current work I am developing a cheminformatics tool using structural and activity data to investigate protein-ligand binding. I ~~have only ever properly used~~ love python and I listen to Saulo, so I decided to used Django to develop my application. I didn’t understand what it was and why it might be useful before I started using it but below I thought I’d discuss a few of the features that I think have been useful and might encourage others to use it.

Firstly I will outline how Django works. I wanted to download all the PDB structures for CDK2 and store the information in a data structure that is robust and easily used. We have a Target and a Protein. A Target is associated to a particular UniProt accession. Cyclin-dependent kinase 2 (CDK2) is a Target. A Protein is a set of 3D coordinates, so 1AQ1 is a Protein.

class Target(models.Model):
"""A Django model to define a given protein target"""
    UniProt = models.CharField(max_length=20,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    Title = models.CharField(max_length=10)

In the above Target model I have three different fields. The first field denotes the UniProt accession for the Target and is “unique”. This means that only one Target can have any given UniProt accession in my data structure. If I try to add another with the same value in the UniProt field it will throw an exception. The second field denotes the time and date that the model was created. This means I can check back to when the target was created. The third is the Title I would like to use for this, for example CDK2.

I can then make a new Target objects by:

new_target = Target()
new_target.Title = "CDK2"
new_target.UniProt = "P24941"

and save it to the database by:

new_target.save() # Django takes care of the required SQL

The next model is for the Protein molecules:

class Protein(models.Model):
    """A Django model to define a given protein"""
    Code = models.CharField(max_length=6,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    TargetID = models.ForeignKey(Target)
    Apo = models.BoolenField()
    PDBInfo = models.FileField(upload_to='pdb')

The model contains the PDB Code, e.g. 1AQ1, and the date it was added to the database. It also consists of a foreign key, relating it to its Target and a boolean indicating if the structure is apo or holo. Finally there is a file field relating this entry to the appropriate file path where the PDB information is stored.

Once the data has been added to the database, Django then deals with all SQL queries from the database:

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
CDK2_prots = Protein.objects.filter(TargetID__Title="CDK2") # All PDB entries associated to CDK2, as a query set, behaving similarily to a list
CDK2_list = [x for x in CDK2_prots] # Now exactly like a list

The “__” in the above query allows one to span the foreign key relationship, so it is searching for the Title of the Target not the Title of the Protein. Finally I can then access the PDB files for each of these proteins.

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
print my_prot.Code # prints "1aq1"
# my_prot.PDBInfo has the behaviour of a file handle
pdb_lines = my_prot.PDBInfo.readlines()# Reads the lines of the file

There, you’ve made a queryable database, where Django deals with all the hard stuff and everything is native to python. Obviously in this example it might not be so difficult to imagine alternative ways of creating the same thing using directory structures, but as the structure of your data becomes more complex, Django can be easily manipulated and as it grow it utilises the speed advantages of modern databases.

Making Protein-Protein Interfaces Look (decently) Good

This is a little PyMOL script that I’ve used to draw antibody-antigen interfaces. If you’d like a commented version on what each and every line does, contact me! This is a slight modification of what has been done in PyMOL Wiki.

load FILENAME
set_name FILENAME, complex	

set bg_rgb, [1,1,1]  	

color white 	     		

hide lines
show cartoon

select antibody, chain a
select antigen, chain b

select paratopeAtoms, antibody within 4.5 of antigen 
select epitopeAtoms, antigen within 4.5 of antibody

select paratopeRes, byres paratopeAtoms
select epitopeRes, byres epitopeAtoms

distance interactions, paratopeAtoms, epitopeAtoms, 4.5, 0

color red, interactions
hide labels, interactions

show sticks, paratopeRes
show sticks, epitopeRes

set cartoon_side_chain_helper, on

set sphere_quality, 2
set sphere_scale, 0.3
show spheres, paratopeAtoms
show spheres, epitopeAtoms
color tv_blue, paratopeAtoms
color tv_yellow, epitopeAtoms

set ray_trace_mode, 3
unset depth_cue
set specular, 0.5

Once you orient it to where you’d like it and ray it, you should get something like this.

Constrain a PDB to particular chains

In many applications you need to constrain PDB files to certain chains. You can do it using this program.

A. What does it do?

Given a pdb file, write out the ATOM and HETATM entries for the supplied chain(s).

PDB_constrain needs three arguments:

PDB file to constrain.
Chains from the pdb file to constrain.
Output file.

B. Requirements:

Biopython – should be installed on your machines but in case you want to use it locally, download the latest version into the PDB_constrain.py’s directory (don’t need to build).

C. Example use:

C.1 Constrain 1A2Y.pdb to chains A and B – write results in constr.pdb

python PDB_constrain.py -f 1A2Y.pdb -c AB -o const.pdb

C.2 Constrain 1ACY to chain L, write results in const.pdb – this example shows that the constrainer works well with ‘insertion’ residue numbering as in antibodies where you have 27A, 27B etc.

python PDB_constrain.py -f 1ACY.pdb -c L -o const.pdb

Making small molecules look good in PyMOL

Another largely plagiarized post for my “personal notes” (thanks Justin Lorieau!) and following on from the post about pretty-fication of macromolecules. For my slowly-progressing confirmation report I needed some beautiful small molecule representation. Here is some PyMOL code:

show sticks
set ray_opaque_background, off
set stick_radius, 0.1
show spheres
set sphere_scale, 0.15, all
set sphere_scale, 0.12, elem H
color gray40, elem C
set sphere_quality, 30
set stick_quality, 30
set sphere_transparency, 0.0
set stick_transparency, 0.0
set ray_shadow, off
set orthoscopic, 1
set antialias, 2
ray 1024,768

And the result:

Beautiful, no?

Viewing ligands in twilight electron density

In this week’s journal club we discussed an excellent review paper by E. Pozharski, C. X. Weichenberger and B. Rupp investigating crystallographic approaches to protein-ligand complex elucidation. The paper assessed and highlighted the shortcomings of deposited PDB structures containing ligand-protein complexes. It then made suggestions for the community as a whole and for researchers making use of ligand-protein complexes in their work.

The paper discussed:

The difficulties in protein ligand complex elucidation
The tools to assess the quality of protein-ligand structures both qualitative and quantitative
The methods used describing their analysis of certain PDB structures
Some case studies visually demonstrating these issues
Some practical conclusions for the crystallographic community
Some practical conclusions for non-crystallographer users of protein-ligand complex structures from the PDB

The basic difficulties of ligand-protein complex elucidation

Ligands have less than 100% occupancy – sometimes significantly less and thus will inherently show up less clearly in the overall electron density.
Ligands make small contributions to the overall structure and thus global quality measures , such as r-factors, will be affected only minutely by the ligand portion of the structure being wrong
The original basis model needs to be used appropriately. The r-free data from the original APO model should be used to avoid model bias

The following are the tools available to inspect the quality of agreement between protein structures and their associated data.

Visual inspection of the Fo-Fc and 2Fo-Fc maps,using software such as COOT, is essential to assess qualitatively whether a structure is justified by the evidence.
Use of local measures of quality for example real space correlation coefficients (RSCC)
Their own tool, making use of the above as well as global quality measure resolution

Methods and results

In a separate publication they had analysed the entirety of the PDB containing both ligands and published structure factors. In this sample they demonstrate 7.6% had RSCC values of less than 0.6 the arbitrary cut off they use to determine whether the experimental evidence supports the model coordinates.

Figure to show an incorrectly oriented ligand (a) and its correction (b)

An incorrectly oriented ligand (a) and its correction (b). In all of these figures Blue is the 2mF_oDF_c map contoured at 1σ and Green and Red are positive and negative conturing of the mF_oDF_c map at 3σ

In this publication they visually inspected a subset of structures to assess in more detail how effective that arbitrary cutoff is and ascertain the reason for poor correlation. They showed the following:

(i) Ligands incorrectly identiﬁed as questionable,false positives(7.4%)
(ii) Incorrectly modelled ligands (5.2%)
(iii) Ligands with partially missing density (29.2%).
(iv) Glycosylation sites (31.3%)
(v) Ligands placed into electron density that is likely to
originate from mother-liquor components
(vi) Incorrect ligand (4.7%)
(vii) Ligands that are entirely unjustiﬁed by the electron
density (11.9%).

The first point on the above data is that the false-positive rate using RSCC of 0.6 is 7.4%. This demonstrates that this value is not sufficient to accurately determine incorrect ligand coordinates. Within the other categories all errors can be attributed to one of or a combination of the following two factors:

The inexperience of the crystallographer being unable to understand the data in front of them
The wilful denial of the data in front of the crystallographer in order that they present the data they wanted to see

Figure to show a ligand placed in density for a sulphate ion from the mother liquor (a) and it's correction (b)

A ligand incorrectly placed in density for a sulphate ion from the mother liquor (a) and it’s correction (b)

The paper observed that a disproportionate amount of poor answers was derived from glycosylation sites. In some instances these observations were used to inform the biochemistry of the protein in question. Interestingly this follows observations from almost a decade ago, however many of the examples in the Twilight paper were taken from 2008 or later. This indicates the community as a whole is not reacting to this problem and needs further prodding.

Figure to show an incomplete glycosylation site inaccurately modeled

Conclusions and suggestions

For inexperienced users looking at ligand-protein complexes from the PDB:

Inspect the electron density map using COOT if is available to determine qualitatively is their evidence for the ligand being there
If using large numbers of ligand-protein complexes, use a script such as Twilight to find the RSCC value for the ligand to give some confidence a ligand is actually present as stated

For the crystallographic community:

Improved training of crystallographers to ensure errors due to genuine misinterpretation of the underlying data are minimised
More submission of electron-density maps, even if not publically available they should form part of initial structure validation
Software is easy to use but difficult to analyse the output

Oxford Protein Informatics Group

or "OPIG" to friends