Author Archives: cristian regep

Faster FREAD with Pandas

One of the things I like to do is to scale up things using the ridiculous amount of cores at my disposal (sometimes even for a good reason). One of these examples is when I had to model millions of CDRs (or loops) using FREAD.

The process through which you model a loop in Fread is:

  1. Pre-filtering step: Anchor Ca separation and ESST score in between your target and all the templates in the DB. The ones that pass a threshold are saved for step 2
  2. Anchor RMSD test

The major bottleneck for such an analysis is step 1, where most of the templates are filtered out so for step 2 you get a very reduced subset. The data for doing the Anchor Ca separation and ESST score is all stored for each possible template in one row of an sqlite database. So when you do step 1 you will go through each row of this table and calculate the score, with the database is stored on the hard drive so costly I/O. This is fine for the original purpose of Fread, where you filled in a missing loop for one structure, but when you are doing it for 100 million examples, going through a table stored on a hard drive 100 million times, sequentially, it is going to be SLOW. I say sequentially because for the python implementation using sqlite3 I had a lot of trouble trying to use a db handle on multiple threads, or load the same sql file on different instances on threads, it just crashes for no good reason. There has been chat about this on stackoverflow and I think this has moved on since I implemented it in 2015. Nevertheless, I wanted a simple and clean solution.

I decided to transform the sqlite3 database into a Pandas object. Pandas objects are basically a convenient way of storing tables with methods available that mimick conventional querying mechanisms for databases. These are stored in memory, easily dumped as pickle files, and can be easily duplicated between threads so no issues with thread safety. Obviously you need to have enough memory to store all of that, but for my application that was not a problem. Below is some sample code on how I used it to transform the template DB from FREAD.

import pandas as pd
import sqlite3 as sql

rows = []

# connect to your fread sql file
conn = sql.connect("fread_sql_file.sql")
    query = "SELECT dihedral, sequence, pdbcode, start, anchor, bound FROM loops"
    results = conn.execute(query)
    for row in results:
        # store the rows as a list of dictionaries
        rows.append(dict(zip(['dihedral', 'length', 'pdbcode', 'anchor', 'sequence', 'start', 'bound'], [row[0], len(row[1]), row[2], row[4] ,row[1], row[3], row[5]])))
except Exception as e:
    print "Error during query", str(e)

# create a pandas dataframe from the list of dictionaries 
df = pd.DataFrame(rows)
# store the table as a pickle file which you can reload later (this is very fast!)

After running this you will have your sql database as a pandas dataframe, and you can write methods which are thread safe to model loops as below:

import pandas as pd

cdr_db pd.read_pickle("fread_pandas_file.sqlite")

def model_loop(query_sequence, query_anchors_ca):
    # score_sequence_db_helper is your function that attaches a scores based on your query sequence and the row in the template db
    scores = cdr_db.apply(lambda row: score_sequence_db_helper(row, query_sequence, query_anchors_ca), axis=1)

    # attach the score
    results = zip(list(cdr_db['pdbcode']), scores, list(cdr_db['sequence']))

    # keep the ones that are over the threshold
    results = filter(lambda (pdb_code, score, sequence): score>=THRESHOLD, results)
    return results



Rational Design of Antibody Protease Inhibitors

On the theme of my research area and my last presentation I talked at group meeting about a another success story in structurally designing an antibody by replicating a general protein binding site using grafted fragments involved in the original complex. The paper by Liu et al is important to me for two major reasons . Firstly they used an unconventional antibody for protein design, namely a bovine antibody which is known to have an extended CDR H3. Secondly the fragment was not grafted at the anchor points of the CDR loop.

Screen Shot 2016-07-06 at 10.28.03

SFTI-1 is a cyclic peptide and a known trypsin inhibitor. It’s structure is stabilised by a disulphide bridge. The bovine antibody is known to have an extended H3 loop which is essentially a long beta strand stalk with a knob domain at the end. Liu et al removed the knob domain and a portion of the B strand and grafted the acyclic version of the SFTI-1 to it. As I said above this result is very important because it shows we can graft a fragment/loop at places different then the anchor points of the CDR. This opens up the possibility for more diverse fragments to be grafted because of new anchor points,  and also because the fragment will sit further away from the other CDRs and the framework allowing more conformational space. To see how the designed antibody compares to the original peptide they measured the Kd and found a 4 fold increase (9.57 vs 13.3). They hypothesise that this is probably due to the fact that the extended beta strand on the antibody keeps the acyclic SFTI-1 peptide in a more stable conformation.

The problem with the bovine antibody is that if inserted in a human subject it would probably elicit an immune response from the native immune system. To humanise this antibody they found the human framework which shares the greatest sequence identity to the bovine antibody and then grafted the fragment on it. The human antibody does not have an extended CDR H3 and to decide what is the best place of grafting they tried various combinations again showing again that the fragments do not need to grafted exactly at the anchor points. Some of the resulting  human antibodies showed even more impressive Kds.

Screen Shot 2016-07-06 at 10.54.24

The best designed human antibody had a 0.79nM Kd, another 10-fold gain . Liu et al hypothesised that this might be due to the fact that the cognate protein forms contacts with residues on the other CDRs even though there is no crystal structure to show this. In order to test this hypothesis they mutated surface residues on the H2 and L1 loop to Alanine which resulted in a 6.7 fold decrease in affinity. The only comment I would have to this is that the mutations to the other CDRs might have destabilized the other CDRs on the antibody which could be the reason for the decrease in affinity.

Inserting functional proteins in an antibody

At the group meeting on the 3rd of February I presented the results of the paper “A General Method for Insertion of Functional Proteins within Proteins via Combinatorial Selection of Permissive Junctions” by Peng et. al. This is interesting to our group, and especially to me, because this is a novel way of designing an antibody, although I suspect that the scope of their research is much more general, their use of antibodies being a proof of concept.

Their premise is that the structure of a protein is essentially secondary structures and tertiary structure interconnected through junctions. As such it should be possible to interconnect regions from different proteins through junctions, and these regions should take up their native secondary and tertiary structures, thus preserving their functionality. The question is what is a suitable junction? ThisScreen Shot 2016-02-03 at 14.37.34 is important because these junctions should be flexible enough to allow the proper folding of the different regions, but also not too flexible as to have a negative impact on stability. There has been previous work done on trying to design suitable junctions, however the workflow presented in this paper is based on trying a vast number of junctions and then identifying which of them work.

As I said above their proof concept is antibodies. They used an antibody scaffold (the host), out of which they removed the H3 loop and then fused to it, using  junctions, two different proteins: Leptin and FSH (the guests). To identify the correct junctions they generated a library of antibodies with random three residues sequences on either side of the inserted protein plus a generic linker (GGGGS) that can be repeated up to three times.Screen Shot 2016-02-03 at 15.11.41

They say that the theoretical size of the library is 10^9 (however I would say it is 9*20^6), and the actually achieved diversity of their library was of size 2.88*10^7 for Leptin and 1.09*10^7. Next step is to identify which junctions have allowed the guest protein to fold properly. For this they devised an autocrine-based selection method using engineered cells that have beta lactamase receptors which have either Leptin or FSH as agonists. A fluoroprobe in the cell responds to the presence of beta lactamase producing a blue color, instead of green and therefore this allows the cells with the active antibody-guest  designed protein (clone) to be identified using FRET-based fluorescence-activated cell sorting.

They managed to identify 6 clones that worked for Leptin and 3 that worked for FSH with the linkers being listed in the below table: Screen Shot 2016-02-03 at 15.49.03

There does not seem to be a pattern emerging from those linker sequences, although one of them repeats itself. For my research it would have been interesting if a pattern did emerge, and then that could be used as a generic linker for future designers. However, this is still another prime example of how Screen Shot 2016-02-03 at 16.05.38well an antibody scaffold can be used a starting point for protein engineering.

As a bonus they also tested in vivo how their designs work and they discovered that the antibody-leptin design (IgG-Leptin) has a longer lifetime. This is probably due to the fact that being a larger protein this is not filtered out by the kidneys.

Using B factors to assess flexibility

In my work of analysing antibody loops I have reached the point where I was interested in flexibility, more specifically challenging the somewhat popular belief that they have a high flexibility, especially the H3 loop. I wanted to use for this the B/Temperature/Debye-Waller factor which can be interpreted as a measure of the temperature dependent vibration of the atoms in the crystal, or in more gentler terms the flexibility at a certain position. I was keen to use the backbone atoms, and possibly the Cβ, but unfortunately the B factor shows some biases as it is used to mask other uncertainties due to high resolution, low electron density and as a result poor modelling. If we take a non redundant set of loops and split them in resolution shells of 0.2A we see how pronounced this bias is (Fig. 1 (a)).


Fig. 1(a) Comparison of average backbone B factors for loops found in structures at increasing resolution. A clear bias can be observed that correlates with the increase in resolution.


Fig. 1(b) Normalization using the average the Z-score of the B factor of backbone atoms shows no bias at different resolution shells.

Comparing loops in neighbouring shells is virtually uninformative, and can lead to quite interesting results. In one analysis it came up that loops that are directly present in the binding site of antibodies have a higher average B factor than loops in structures without antigen where the movement is less constrained.

The issue here is that a complex structure (antibody-antigen) is larger, and has a poorer resolution, and therefore more biased B factors. To solve this issue I decided to normalize the B factors using the Z-score of the PDB file, where the mean and the standard deviation are computed from all the backbone atoms of amino acids inside the PDB file. This method to my knowledge was first described by (Parthasarathy and Murthy, 1997) [1] , although I came to the result without reading their paper, the normalization being quite intuitive. Using this measure we can finally compare loops from different structures at different resolutions (Fig. 1 (b)) with each other and we see what is expected: loops found in bound structures are less flexible than loops in unbound structures (Fig. 2). We can also answer our original question: does the H3 loop present an increased flexibility? The answer from Fig, 2 is no, if we compare a non-redundant sets of loops from antibodies to general proteins.


Fig. 2 Flexibility comparison using the normalized B factor between a non-redundant set of non-IG like protein loops and different sets of H3 loops: bound to antigen (H3 bound), unbound (H3 unbound), both (H3). For each comparison ten samples with same number of examples and similar length distribution have been generated  and amassed (LMS) to correct for the possibility of length bias induced by the H3 loop which is known to have a propensity for longer loops than average.


[1] Parthasarathy, S. ; Murthy, M. R. N. (1997) Analysis of temperature factor distribution in high-resolution protein structures Protein Science, 6 (12). pp. 2561-2567. ISSN 0961-8368

From 300 superpositions/sec to 6,000,000 superpositions/sec. A Python to C story

Part of the work I do requires me to identify uniqueness of structural shapes in specific proteins. People have done this in many ways, some more complex than others. I like more simple things so I decided to use clustering on top of a superposition algorithm. I will not detail too much the clustering part of this algorithm as this is not the focus of this article, but let’s assume that we need to make millions, possibly billions, of superpositions. In the following I will go through the steps I took in making a superposition library faster and faster, as the requirements demanded more and more. It will include information on how to profile your code, connect Python to C, and idiosyncrasies of parallelising C code.

Version 1.0 Python 100%

In PyFread I found a superposition algorithm, coupled with a PDB parser. I extracted the code and created a wrapping function. The signature of it was:

def superpose(file1, file2):

The files had to contain only ATOM lines with the same number of residues . It returned the RMSD of the best superposition, which was needed for the clustering. This could do a couple of a hundred  superpositions/second. Using the inbuilt python profiler I found out that both the reading in part of the algorithm and the superposition are slowing down the program so I decided to move the contents of the wrapping function in C.

You can easily profile your code by running the command below. It will tell you for each method how many times it’s called and what is the total time spent in it. More information here

python -m cProfile

Version 2.0 C 99% – Python 1%

My script still needed to run from python but it’s functionality would be implemented in C. So I rewrote the whole thing in C (a PDB parser and superposition algorithm)  and  interfaced it to Python. The way to achieve this is using ctypes. This provides an interface for calling C/C++ methods from Python.  The way you set this up is:

C code rmsd.cpp

extern “C” {
    void superpose_c(const char* file1, consta char* file2, double* rmsd){
        //perform superposition
        *rmsd = compute_rmsd();

You compile this code with: g++ -shared -Wl,-soname,rmsdlib -o -fPIC rmsd.cpp . This creates a shared object which can be loaded in Python

Python Code

# load the ctypes library with C-like objects
from ctypes import *
# load the C++ shared objects
lib = cdll.LoadLibrary(‘./’)

#convert python string to c char array function
def to_c_string(py_string)
    c_string = c_char * (len(py_string)+1)()
    for i in range(len(py_string)):
        c_string[i] = py_string[i]
    c_string[i+1] = “”

rmsd = c_double()
lib.superpose_c(to_c_string(“file1.pdb”), to_c_string(“file2.pdb”), ref(rmsd))
# the use of ref is necessary so that the value set to rmsd inside the C function will be returned to Python

There are other ways of coding the Python – C Bridge I’m sure, and if you think you have something better and easier please comment.

Version 2.1 Armadillo Library

After running another batch of profiling, this time in C, I found that the bottleneck was the superposition code.

An easy way to profile C code is that when you compile it you use the -pg flag (gcc -pg …..). You then run the executable which will produce a file called gmon.out. Using that you run the command gprof executable_name gmon.out > analysis.txt . This will store the profiling info in analysis.txt. You can read more on this here

The superposition algorithm involved doing a Singular Value Decomposition, which in my implementation was slow. Jaroslaw Nowak found the Armadillo library which does a lot of cool linear algebra stuff very fast, and it is written in C. He rewrote the superposition using Armadillo which made it much faster. You can read about Armadillo here

Version 3.0 C 100%

So what is the bottleneck now? Everytime I do a superposition the C function gets called from Python. Suppose I have the following files: file1.pdb, file2.pdb, file3.pdb, file4.pdb . I have to compare file1.pdb to every other file, which means I will be reading file1.pdb 3 times ( or 1 million times depending on what I had to do) in my original Python – C implementation. My clustering method is smarter than this, but similar redundancies did exists and they slowed down my program. I therefore decided to move everything in C, so that once I read a file I could keep it stored in memory.

Version 3.1 We have multiple cores, let’s use them

The superpositions can be parallelised, but how straightforward is doing this in C++? First of all you need to import the library #include <thread> . A simple scenario would be that I have a vector of N filenames and the program would have to compare everything with everything. The results would be stored in the upper triangle of an NxN array. The way I usually approach such a situation is that I send to each thread a list of comparison indeces (eg. Thread 1 compares 1-2, 1-3, 1-4; Thread 2-3, 2-4….), and the address of the results matrix (where the RMSD values would be stored). Because each thread will be writing to a different set of indeces in the matrix it should not be a problem that all the threads see the same results matrix (if you code it properly).

The way to start a thread is:

thread(par_method, …. , ref(results))

par_method is the method you are calling. When you pass an argument like results to a thread if it’s not specified with ref(..) it would be passed by value (it does not matter if normally in an unthreaded environment it would be passed by reference). ref is a reference wrapper and will make sure that all the threads see the same results instance.

Other problems you can come across is if you want to write to the same file from multiple threads, or modify vector objects which are passed by reference (with ref). They do not have thread safe operations, the program will crash if more than one thread will call a function on these objects. To make sure this does not happen you can use a lock. The way you achieve this in C is by using with mutex.

#include <mutex>
// Declare a global variable mutex that every thread sees
mutex output_mtx;

void parallel_method(….,vector<string>& files){
    //Unsafe operation

If one thread has locked output_mtx another one won’t finish the execution of output_mtx.lock() until the other first one unlocks it.

Version 3.2 12 million files in a folder is not a good idea

My original use case was that the method would receive two file names, the files would be read and the superposition done. This original use case stayed with the program, and even in version 3.1 you required one file for each fragment. Because the library was becoming faster and faster the things we wanted to try became more ambitious. You could hear me saying ‘Yeah sure we can cluster 12 million PDB fragments’. To do this I had to extract the 12 million fragments and store them in as many files. It wasn’t long until I received an angry/condecending email from the IT department. So what to do?

I decided to take each PDB file, remove everything except the ATOM fields and store them as binary files. In this way instead of providing two filenames to the superposition algorithm you provide two byte ranges which would be read from the pre-parsed PDB files. This is fast and you also have only 100k files(one for each PDB file), which you could use over and over again.

Version 3.2 is the latest version as of now but it is worth pointing out that whatever you do to make your code faster it will never be enough! The requirements become more ambitious as your code is getting faster. I can now perform 6 million superpositions/second on fragments of 4 residues, and 2 million superpositions/second on fragments of 5 residues. This declines exponentially and I can foresee a requirement to cluster all the 10 residue fragments from the PDB appearing sometime in the future.