# From 300 superpositions/sec to 6,000,000 superpositions/sec. A Python to C story

Part of the work I do requires me to identify uniqueness of structural shapes in specific proteins. People have done this in many ways, some more complex than others. I like more simple things so I decided to use clustering on top of a superposition algorithm. I will not detail too much the clustering part of this algorithm as this is not the focus of this article, but let’s assume that we need to make millions, possibly billions, of superpositions. In the following I will go through the steps I took in making a superposition library faster and faster, as the requirements demanded more and more. It will include information on how to profile your code, connect Python to C, and idiosyncrasies of parallelising C code.

Version 1.0 Python 100%

In PyFread I found a superposition algorithm, coupled with a PDB parser. I extracted the code and created a wrapping function. The signature of it was:

def superpose(file1, file2):

The files had to contain only ATOM lines with the same number of residues . It returned the RMSD of the best superposition, which was needed for the clustering. This could do a couple of a hundred  superpositions/second. Using the inbuilt python profiler I found out that both the reading in part of the algorithm and the superposition are slowing down the program so I decided to move the contents of the wrapping function in C.

You can easily profile your code by running the command below. It will tell you for each method how many times it’s called and what is the total time spent in it. More information here https://docs.python.org/2/library/profile.html

python -m cProfile your_script.py

Version 2.0 C 99% – Python 1%

My script still needed to run from python but it’s functionality would be implemented in C. So I rewrote the whole thing in C (a PDB parser and superposition algorithm)  and  interfaced it to Python. The way to achieve this is using ctypes. This provides an interface for calling C/C++ methods from Python.  The way you set this up is:

C code rmsd.cpp

extern “C” {
void superpose_c(const char* file1, consta char* file2, double* rmsd){
//perform superposition
*rmsd = compute_rmsd();
}
}

You compile this code with: g++ -shared -Wl,-soname,rmsdlib -o rmsdlib.so -fPIC rmsd.cpp . This creates a shared object which can be loaded in Python

Python Code

# load the ctypes library with C-like objects
from ctypes import *
# load the C++ shared objects

#convert python string to c char array function
def to_c_string(py_string)
c_string = c_char * (len(py_string)+1)()
for i in range(len(py_string)):
c_string[i] = py_string[i]
c_string[i+1] = “”

rmsd = c_double()
lib.superpose_c(to_c_string(“file1.pdb”), to_c_string(“file2.pdb”), ref(rmsd))
# the use of ref is necessary so that the value set to rmsd inside the C function will be returned to Python

There are other ways of coding the Python – C Bridge I’m sure, and if you think you have something better and easier please comment.

After running another batch of profiling, this time in C, I found that the bottleneck was the superposition code.

An easy way to profile C code is that when you compile it you use the -pg flag (gcc -pg …..). You then run the executable which will produce a file called gmon.out. Using that you run the command gprof executable_name gmon.out > analysis.txt . This will store the profiling info in analysis.txt. You can read more on this here http://www.thegeekstuff.com/2012/08/gprof-tutorial/

The superposition algorithm involved doing a Singular Value Decomposition, which in my implementation was slow. Jaroslaw Nowak found the Armadillo library which does a lot of cool linear algebra stuff very fast, and it is written in C. He rewrote the superposition using Armadillo which made it much faster. You can read about Armadillo here http://arma.sourceforge.net/

Version 3.0 C 100%

So what is the bottleneck now? Everytime I do a superposition the C function gets called from Python. Suppose I have the following files: file1.pdb, file2.pdb, file3.pdb, file4.pdb . I have to compare file1.pdb to every other file, which means I will be reading file1.pdb 3 times ( or 1 million times depending on what I had to do) in my original Python – C implementation. My clustering method is smarter than this, but similar redundancies did exists and they slowed down my program. I therefore decided to move everything in C, so that once I read a file I could keep it stored in memory.

Version 3.1 We have multiple cores, let’s use them

The superpositions can be parallelised, but how straightforward is doing this in C++? First of all you need to import the library #include <thread> . A simple scenario would be that I have a vector of N filenames and the program would have to compare everything with everything. The results would be stored in the upper triangle of an NxN array. The way I usually approach such a situation is that I send to each thread a list of comparison indeces (eg. Thread 1 compares 1-2, 1-3, 1-4; Thread 2-3, 2-4….), and the address of the results matrix (where the RMSD values would be stored). Because each thread will be writing to a different set of indeces in the matrix it should not be a problem that all the threads see the same results matrix (if you code it properly).

The way to start a thread is:

par_method is the method you are calling. When you pass an argument like results to a thread if it’s not specified with ref(..) it would be passed by value (it does not matter if normally in an unthreaded environment it would be passed by reference). ref is a reference wrapper and will make sure that all the threads see the same results instance.

Other problems you can come across is if you want to write to the same file from multiple threads, or modify vector objects which are passed by reference (with ref). They do not have thread safe operations, the program will crash if more than one thread will call a function on these objects. To make sure this does not happen you can use a lock. The way you achieve this in C is by using with mutex.

#include <mutex>
// Declare a global variable mutex that every thread sees
mutex output_mtx;

void parallel_method(….,vector<string>& files){
output_mtx.lock();
//Unsafe operation
files.append(filename);
output_mtx.unlock();
}

If one thread has locked output_mtx another one won’t finish the execution of output_mtx.lock() until the other first one unlocks it.

Version 3.2 12 million files in a folder is not a good idea

My original use case was that the method would receive two file names, the files would be read and the superposition done. This original use case stayed with the program, and even in version 3.1 you required one file for each fragment. Because the library was becoming faster and faster the things we wanted to try became more ambitious. You could hear me saying ‘Yeah sure we can cluster 12 million PDB fragments’. To do this I had to extract the 12 million fragments and store them in as many files. It wasn’t long until I received an angry/condecending email from the IT department. So what to do?

I decided to take each PDB file, remove everything except the ATOM fields and store them as binary files. In this way instead of providing two filenames to the superposition algorithm you provide two byte ranges which would be read from the pre-parsed PDB files. This is fast and you also have only 100k files(one for each PDB file), which you could use over and over again.

Version 3.2 is the latest version as of now but it is worth pointing out that whatever you do to make your code faster it will never be enough! The requirements become more ambitious as your code is getting faster. I can now perform 6 million superpositions/second on fragments of 4 residues, and 2 million superpositions/second on fragments of 5 residues. This declines exponentially and I can foresee a requirement to cluster all the 10 residue fragments from the PDB appearing sometime in the future.