Faster FREAD with Pandas

One of the things I like to do is to scale up things using the ridiculous amount of cores at my disposal (sometimes even for a good reason). One of these examples is when I had to model millions of CDRs (or loops) using FREAD.

The process through which you model a loop in Fread is:

  1. Pre-filtering step: Anchor Ca separation and ESST score in between your target and all the templates in the DB. The ones that pass a threshold are saved for step 2
  2. Anchor RMSD test

The major bottleneck for such an analysis is step 1, where most of the templates are filtered out so for step 2 you get a very reduced subset. The data for doing the Anchor Ca separation and ESST score is all stored for each possible template in one row of an sqlite database. So when you do step 1 you will go through each row of this table and calculate the score, with the database is stored on the hard drive so costly I/O. This is fine for the original purpose of Fread, where you filled in a missing loop for one structure, but when you are doing it for 100 million examples, going through a table stored on a hard drive 100 million times, sequentially, it is going to be SLOW. I say sequentially because for the python implementation using sqlite3 I had a lot of trouble trying to use a db handle on multiple threads, or load the same sql file on different instances on threads, it just crashes for no good reason. There has been chat about this on stackoverflow and I think this has moved on since I implemented it in 2015. Nevertheless, I wanted a simple and clean solution.

I decided to transform the sqlite3 database into a Pandas object. Pandas objects are basically a convenient way of storing tables with methods available that mimick conventional querying mechanisms for databases. These are stored in memory, easily dumped as pickle files, and can be easily duplicated between threads so no issues with thread safety. Obviously you need to have enough memory to store all of that, but for my application that was not a problem. Below is some sample code on how I used it to transform the template DB from FREAD.

import pandas as pd
import sqlite3 as sql

rows = []

# connect to your fread sql file
conn = sql.connect("fread_sql_file.sql")
try:
    query = "SELECT dihedral, sequence, pdbcode, start, anchor, bound FROM loops"
    results = conn.execute(query)
    for row in results:
        # store the rows as a list of dictionaries
        rows.append(dict(zip(['dihedral', 'length', 'pdbcode', 'anchor', 'sequence', 'start', 'bound'], [row[0], len(row[1]), row[2], row[4] ,row[1], row[3], row[5]])))
        
except Exception as e:
    print "Error during query", str(e)
    conn.close()

# create a pandas dataframe from the list of dictionaries 
df = pd.DataFrame(rows)
# store the table as a pickle file which you can reload later (this is very fast!)
df.to_pickle("fread_pandas_file.pickle")

After running this you will have your sql database as a pandas dataframe, and you can write methods which are thread safe to model loops as below:

import pandas as pd

THRESHOLD = 25
cdr_db pd.read_pickle("fread_pandas_file.sqlite")


def model_loop(query_sequence, query_anchors_ca):
    # score_sequence_db_helper is your function that attaches a scores based on your query sequence and the row in the template db
    scores = cdr_db.apply(lambda row: score_sequence_db_helper(row, query_sequence, query_anchors_ca), axis=1)

    # attach the score
    results = zip(list(cdr_db['pdbcode']), scores, list(cdr_db['sequence']))

    # keep the ones that are over the threshold
    results = filter(lambda (pdb_code, score, sequence): score>=THRESHOLD, results)
    
    return results