Lightning-fast Python code

Scientific code is never fast enough. We need the results of that simulation before that pressing deadline, or that meeting with our advisor. Computational resources are scarce, and competition for a spot in the computing nodes (cough, cough) can be tiresome. We need to squeeze every ounce of performance. And we need to do it with as little effort as possible.

There are plenty of techniques to make code run faster. These range from identifying which parts of your code take the most time, to just-in-time compilation of functions, to exploiting special features of your microprocessor to minimise overheads. In the era of multicore processors, there are several frameworks for parallel computing. And, in recent years, there has been a growth in special-purpose hardware, such as the widely known graphics processing units (GPUs) and the less widely known field-programmable gate arrays (FPGAs).
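
To give a flavour of the first of these techniques, the cProfile module in the Python standard library reports where a script spends its time. Here is a minimal sketch, with a hypothetical slow_function standing in for your own code:

import cProfile
import pstats

def slow_function():
    # A hypothetical hotspot: summing in a plain Python loop
    total = 0.0
    for i in range(1000000):
        total += i * 0.5
    return total

# Run the function under the profiler and print the ten most expensive calls
cProfile.run("slow_function()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)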

Most of these techniques are complicated and sometimes have relatively high barriers to entry. Many people have built successful, well-remunerated careers in the field of high-performance computing, deploying these relatively uncommon skills. Unless your project really requires, for example, learning to write CUDA-optimised code, it is unlikely that you will reap many benefits from the initial time investment. Instead, in standard scientific software development, the most common solution to make a code run faster is to use a low-level programming language.

Low-level languages have plenty of inconveniences. If you have spent some time working with languages like C/C++ or FORTRAN, you might now be grimacing at the prospect of writing yet another endless parsing subroutine. Fortunately for you, there are alternatives: several Python libraries allow you to use low-level languages for the computational hotspots that take most of the time, while enjoying all the simplicity of Python for everything else.

Consider the following piece of Python code, where we sum all the values of an array of random numbers:

import time
import numpy as np

# Fifty million random numbers in [0, 1)
np.random.seed(42)
array = np.random.random(50000000)

# Time a plain Python loop over the array
start = time.time()
total = 0.0
for i in range(50000000):
    total += array[i]
end = time.time() - start

print("Sum: %f, in %f s" % (total, end))

When I run this piece of code on my workstation at the Department of Statistics, the output is:

Sum: 24999263.512460, in 17.380054 s

Seventeen seconds to add up fifty million numbers is not bad, but we can certainly do much better than this. Consider the following C function that emulates the behaviour of our loop:

double sum(double* array, int n_elements) {
    double total = 0.0;
    for (int i = 0; i < n_elements; i++) {
        total += array[i];
    }
    return total;
}

In order to import this function into our Python code, we need to compile it as a shared library:

cc -fPIC -shared -o sum_c.so sum.c

We will use the ctypes library from the Python Standard Library to load this function. This library provides functions and objects to call C code from Python, and to transform Python objects into C-compatible types. For example, we can load the shared library with CDLL and access the sum function from it:

c_lib = ctypes.CDLL("sum_c.so")
c_fun = c_lib.sum

An important caveat is that ctypes assumes, by default, that functions return integer values. We can change this with the restype attribute of the function object:

c_fun.restype = ctypes.c_double

If you try to run this function as is, you will likely get some kind of Segmentation fault (ah, is it not a pleasure to work with low-level programming languages?). The way ctypes is built requires you to transform your Python variables to C-compatible types. In our case, we can do this with the following line:

total = c_fun(ctypes.c_void_p(array.ctypes.data),
              ctypes.c_int(50000000))

We transform our integer 50000000 into a C integer using ctypes.c_int. The case of the NumPy array is slightly more involved: the ctypes.data attribute of the array gives the memory address of its underlying data buffer, and since the C function expects a pointer, we wrap that address in the ctypes.c_void_p data type.
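
As an aside, ctypes also lets you declare the expected argument types once, via the argtypes attribute of the function object, and NumPy provides numpy.ctypeslib.ndpointer for array arguments. A minimal sketch of this alternative, assuming the same sum_c.so library, could look like:

import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

c_fun = ctypes.CDLL("sum_c.so").sum
# Declare the argument and return types once, up front
c_fun.argtypes = [ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),
                  ctypes.c_int]
c_fun.restype = ctypes.c_double

array = np.random.random(1000)
# With argtypes set, the NumPy array can be passed directly
total = c_fun(array, array.size)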

Putting all of this together, we can write the following Python code that runs and times the C function we have written:

import time
import ctypes
import numpy as np

np.random.seed(42)
array = np.random.random(50000000)

# Load the compiled C function and declare its return type
c_fun = ctypes.CDLL("sum_c.so").sum
c_fun.restype = ctypes.c_double

start = time.time()
total = c_fun(ctypes.c_void_p(array.ctypes.data),
              ctypes.c_int(50000000))
end = time.time() - start

print("Sum: %f, in %f s" % (total, end))

After execution, we obtain:

Sum: 24999263.512460, in 0.165089 s

The C function is approximately 100 times faster than the original Python code. Of course, this is highly dependent on the optimisation options passed to the compiler. I observed that compiling the library with the -O2 flag reduces the execution time by about half.
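
For reference, enabling that optimisation level only means adding the flag to the compilation command from before:

cc -O2 -fPIC -shared -o sum_c.so sum.c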

Another greatly esteemed language in high-performance scientific programming is, unfortunately, FORTRAN. You might have heard of FORTRAN for many reasons. You might have learned to code a long, long time ago. You might be a fan of vintage programming languages. Or, more likely (many of the folks above already moved to C a while ago), you have some relation to computational chemistry. I understand, many of us have been there. And, because of that, I am also going to tell you how to import FORTRAN code into your Python script.

Consider the following FORTRAN subroutine that emulates our summation loop:

SUBROUTINE sum(array, n_elements, total)
    IMPLICIT NONE
    INTEGER, intent(in) :: n_elements
    DOUBLE PRECISION, intent(in) :: array(n_elements)
    DOUBLE PRECISION, intent(out) :: total
    INTEGER :: i
    total = 0.0d0
    DO i = 1, n_elements, 1
        total = total + array(i)
    END DO
    RETURN
END SUBROUTINE

We can transform this code into a Python function using the Fortran to Python interface generator (f2py) that comes with NumPy. This is even easier than ctypes, just requiring you to run:

f2py -c sum.f90 -m f_sum

The resulting .so module can be imported directly from Python and used as in the following piece of code:

import time
import numpy as np
from f_sum import sum as f_fun

np.random.seed(42)
array = np.random.random(50000000)

start = time.time()
# The f2py wrapper returns the intent(out) argument as the result
total = f_fun(array, 50000000)
end = time.time() - start

print("Sum: %f, in %f s" % (total, end))

The f2py wrapper already takes care of converting NumPy arrays into Fortran-compatible arrays, so no explicit type conversion is needed on our side. After execution, we obtain:

Sum: 24999263.512460, in 0.060562 s

Not bad at all. We have squeezed out nearly another half order of magnitude of performance: roughly three times faster than the C code, and almost 300 times faster than the original Python code.
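
As an aside, f2py records the Python signature it generates for each wrapped subroutine in the wrapper's docstring, which is a quick way to check how the array and n_elements arguments ended up being exposed:

from f_sum import sum as f_fun

# f2py stores the generated call signature in the wrapper's docstring
print(f_fun.__doc__)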

We are almost done. I have shown you how to import C and FORTRAN code into Python, using essentially “default” Python packages. However, there is one last question: how does all this compare to Python's own routines? Consider first the built-in sum function:

sum(array)
Sum: 24999263.512460, in 9.282920 s

More interestingly, consider NumPy's own np.sum function:

np.sum(array)
Sum: 24999263.512456, in 0.041070 s
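
For completeness, these two one-liners were timed in the same fashion as the earlier snippets; a minimal sketch of such a comparison could be:

import time
import numpy as np

np.random.seed(42)
array = np.random.random(50000000)

# Time the Python built-in sum and NumPy's np.sum on the same array
for fun in (sum, np.sum):
    start = time.time()
    total = fun(array)
    end = time.time() - start
    print("Sum: %f, in %f s" % (total, end))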

The one-liner np.sum cuts the execution time by about a third compared to the FORTRAN version. More importantly, it is faster in the developer's time. And it leads to one very important conclusion of scientific programming: if you want to make something fast and efficient, oftentimes you should just start by reading the instructions.
