Make your code do more, with less

When you wrangle data for a living, you start to wonder why everything takes so darn long. Through five years of introspection, I have come to the conclusion that two simple factors limit every computational project. One is, of course, your personal productivity: your hours of focused work, minus distractions (and yes, meetings figure here), times your energy and mental acuity. Those are all things you have little control over, unfortunately. The second is the productivity of your code and tools. And this, in principle, is a variable over which you have full control.

Even quick calculations, when applied to tens of millions of sequences, can take quite some time!

This is a post about how to increase your productivity, by helping you navigate all those instances when the progress bar does not seem to go fast enough. I want to discuss actionable tools to make your code run faster, and generate more results, with less effort, in less time. Instructions to tinker less and think more, so you can do the science that you truly want to be doing. And, above all, I want to give out advice that is so counter-intuitive that you should absolutely consider following it.

The golden rule

We need to start with a simple question that will underpin most of this blogpost: what is the objective of producing faster code?

Many problems require software that scales. Clustering hundreds of millions of DNA sequences, for example, demands a top-notch clustering program that combines clever algorithms with a lot of engineering. Other tasks require repeating a process many times. If you need to exhaustively optimise hyperparameters, implementing fast code that enables rapid prototyping is definitely a good idea. Finally, and I am saying this with a totally straight face, the claim “my program is faster than the incumbent” is a respectable argument for publishing your scientific paper (or selling your product).

Whatever your argument, the only valid metric for code optimization is whether you will answer your question more quickly. Any other metric is procrastination hiding under the guise of personal productivity. In fact, I will go further and argue that, in most cases, the best use of your time is to not optimize your code, or, as legendary computer scientist Donald Knuth put it:

Premature optimization is the root of all evil.

Optimization is hard. Optimization takes time. Optimization requires testing that the code still behaves correctly, that it is actually faster, and that it is as robust as the original. Even when tackling a time-consuming task, it is not uncommon for optimizing the code to take longer than solving the problem in the first place. So, before even considering optimization, make sure that you really need it.

Of course, there are many cases where optimization is the right, or the only, approach. Yet even in those cases you need to approach the problem with a cold and calculating eye, and be strict about what you optimize.

So, how do we do that?

Always measure before you optimize

Rarely, if ever, will you optimise a full piece of code. Collective experience suggests that most programs have one bottleneck, or a small set of bottlenecks, that determines the overall speed. So, your first job is to find this bottleneck. And, yes, I am using the singular, because after you’ve dealt with the first bottleneck you will need to carefully assess whether finding a second one is worth it at all.

And how does one find said bottleneck? There are many options, but only one that leads to a 100% success rate: measure the time taken by the code.

Lucky for us, there exist many nifty programs that can measure the time spent in different parts of your code. These programs are called profilers, and there are many alternatives depending on the language and the level of sophistication required. In Python, which is one of the most common languages among bioinformaticians these days, one particularly user-friendly alternative is line_profiler, which will pinpoint the amount of time spent on every line of your code.

Timer unit: 1e-06 s

Total time: 5.6224 s
File: a.py
Function: greyscale_python at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def greyscale_python(img):
     3         1          4.0      4.0      0.0      rows, cols = img.shape[:2]
     4         1        349.0    349.0      0.0      grey = np.zeros_like(img)
     5                                           
     6      1001        345.0      0.3      0.0      for i in range(rows):
     7   1001000     369725.0      0.4      6.6          for j in range(cols):
     8                                                       grey[i, j] = 0.299 * img[i, j, 0] + 0.587 * img[i, j, 1] \
     9   1000000    5251973.0      5.3     93.4                  + 0.114 * img[i, j, 2]
    10                                           
    11         1          0.0      0.0      0.0      return grey
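For reference, this is roughly how a report like the one above is produced (assuming line_profiler is installed, and that a.py actually calls the decorated function on an image): mark the function of interest with the @profile decorator, then run the script through kernprof:

$ pip install line_profiler
$ kernprof -l -v a.py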

Stop reinventing the wheel

You have identified the computational bottleneck. Excellent. Time to optimize your code!

Not so fast.

The second most common error in optimising code is spending a long time developing a solution that already exists. Few problems in programming are novel. These days, vast numbers of libraries exist that can solve all kinds of problems with a simple import. And even when you can’t just import a library or copy a snippet from Stack Overflow, knowing that someone has solved the problem before gives you a very good place to start. So, before you start trying things, look out there and see if you can save yourself some time.

We all make this mistake. I remember, at one stage in my PhD, I was working on code to generate high-quality protein fragments for fragment-based protein structure prediction. Without going into details, one of the key steps was to keep a list of the top 50 fragments generated for every position, an operation that can take a surprising amount of time when you are considering hundreds of millions of fragments for every protein.

How had I solved the problem? I had implemented a class that I called a “self-sorting” list. Suppose you sample a fragment C with some arbitrary score. If you want to keep the list sorted, you need to start at one of the ends and compare the new element with every other fragment until you find the right position. It is not difficult to convince yourself that, without making any assumptions about the fragments or their scores, this should take an average of ~N/2 comparisons. This is, grosso modo, what computer scientists would describe as an algorithm with computational complexity O(N).

Adding a new fragment to an ordered linked list. The naive approach, which I was using, compares the new element to every element in the list until it finds a position where the new element fits.

As a greenhorn PhD student, I spent days optimising this problem. For example, I quickly realised that most random samples had, well, quite low scores. Therefore, if you start comparing from the end of the list, in most cases the fragment will be rejected immediately. These and other tweaks resulted in significant speedups… at the cost of over a week of my life that I will never get back.

The truth is, I would have solved the problem much more quickly if I had just gone to check an algorithms book. Using a structure called a heap, it turns out that the problem can be solved with just O(log N) comparisons per insertion. And, what is a lot more important: efficient implementations are available in every standard library. So, instead of spending days testing implementations, I could just have imported a heap data structure and been done in a few minutes.

A heap is a tree-like structure that keeps its smallest (or largest) element on top, and restores this property cheaply whenever an element is added.
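In Python, for example, the built-in heapq module does all of the work. Below is a minimal sketch of the top-50 bookkeeping described above; the fragment generator and the scoring function are, of course, stand-ins for the real thing.

import heapq
import random

N_BEST = 50
top_fragments = []  # a min-heap of (score, fragment) tuples; the worst retained score sits at the root

for fragment in range(10_000_000):  # stand-in for the real fragment generator
    score = random.random()         # stand-in for the real scoring function (higher is better)
    if len(top_fragments) < N_BEST:
        heapq.heappush(top_fragments, (score, fragment))
    elif score > top_fragments[0][0]:
        # Evict the current worst element and insert the new one, in O(log N) time
        heapq.heappushpop(top_fragments, (score, fragment))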

How to optimize when you really have to

Suppose you have studied your problem in depth. There is a strong case to optimise a subroutine. There is also not an obvious way to optimise the problem, at least not through a library import or some code copied from Stack Overflow. What do we do?

Well, there is really only one way to speed up a computational task: reducing the number of instructions each CPU has to execute. This leaves you two well-defined options: either you execute fewer CPU instructions overall (which I like to call performance solutions), or you use more CPUs so that each CPU processes a smaller share of the instructions and thus finishes sooner (which I like to call throughput solutions).

Solutions that reduce the number of CPU instructions often involve rewriting parts of the code in a different language. One of the problems with Python, and with similar languages like Perl or R, is that it is interpreted. Simply put, when you run python my_script.py, the Python interpreter scans every line of your code and performs some operations. This comes at a cost: much like a simultaneous translator, it cannot anticipate what you are going to do next, and it cannot prevent you from doing something relatively stupid. In contrast, there is another variety of programming languages, so-called compiled languages, where a nifty compiler program scans the entirety of your program ahead of time and translates it into efficient machine code.

In practical terms, you can easily gain a couple of orders of magnitude just by using a compiled language. So you could rewrite the bottleneck subroutine in C++ and link it to Python using a special library, such as pybind11. Or you could do away with Python altogether and work with a fully compiled language, or one that, like Julia, uses black magic to compile code as it is interpreted. There is also the possibility of using Python libraries like NumPy, which have already implemented expensive operations in compiled code for you, or Numba, which, much like Julia, uses arcane tricks to compile your code on the go.

The other alternative is to extend your code to use multiple processors. In the general case, multiprocessor programming is an intricate discipline that I would not approach without proper study or training. However, there are many sweet spots for parallelization. The key is that many problems in data science are embarrassingly parallel, meaning, roughly, that you are repeating a procedure on many independent cases. In these cases it is quite easy (see a couple of examples below) to split the work across several cores and obtain a close-to-linear speedup.

One final possibility is to take these two approaches on steroids, and completely migrate to a different computing platform. Almost everyone who uses deep learning, for example, is using a GPU, a specialised processor that is quite good at performing the linear algebra calculations required for training neural networks. Extending code to GPUs is now fairly accessible to anyone (see discussion below).

Some tricks that tend to come up

Enough with the theory. Below, I discuss several tricks that you can use to significantly speed up your Python code.

Extend your code to multiple cores

Despite having taken the programming world by storm, Python is not free of shortcomings. One much-debated disadvantage is the Global Interpreter Lock (GIL), a feature which, roughly speaking, prevents Python from using more than one processor at a time. Despite community proposals to remove the GIL, it looks like it will stay around for a while yet. So, in order to use multiple cores, we need to sidestep the GIL.

The “native” way is to use the multiprocessing library, here through pathos, a drop-in variant that uses dill for more flexible pickling:

import pathos.multiprocessing as mp  # drop-in replacement for multiprocessing; pickles with dill


# Spread the independent tasks over 12 worker processes
pool = mp.Pool(12)
results = pool.map(function_that_solves_a_task, list_of_tasks)

I confess that, if possible, I tend to avoid the multiprocessing libraries in Python. They are bloated, their error traces are verbose and unhelpful, and overall they are a clunky workaround for the limitations imposed by Python’s GIL. I would recommend alternative packages, like joblib or Dask, but, quite frankly, I find myself using GNU Parallel for a variety of tasks, since it is independent of the language you are using.

In simple words, describe your job as a list of commands, one per line, in a file (say, list_of_tasks.sh):

python process_sequence.py seq1.fasta
python process_sequence.py seq2.fasta
python process_sequence.py seq3.fasta

And then pass this list of tasks to GNU Parallel:

$ parallel --jobs 12 < list_of_tasks.sh
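If you would rather stay inside Python, a joblib equivalent is just as short. This is only a sketch: function_that_solves_a_task and list_of_tasks are placeholders for whatever your actual work looks like.

from joblib import Parallel, delayed

# Run the task function over every input, spread across 12 worker processes
results = Parallel(n_jobs=12)(
    delayed(function_that_solves_a_task)(task) for task in list_of_tasks
)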

Modern processors can take advantage of vectorization

Something I find novice programmers doing all the time is writing array operations as explicit loops:

import numpy as np


def greyscale(img):
    rows, cols = img.shape[:2]
    grey = np.zeros_like(img)

    for i in range(rows):
        for j in range(cols):
            grey[i, j] = 0.299 * img[i, j, 0] \
                + 0.587 * img[i, j, 1] \
                + 0.114 * img[i, j, 2]
    return grey

If you time this code, it will take roughly 5.6 s on a 1,000×1,000 image, which is quite a long time to convert a color image to greyscale. The reason for this delay is that the interpreter executes the loop body a million times, once per pixel, paying its bookkeeping overhead on every single iteration.

If you rewrite it in the following, also much more readable form:

import numpy as np


def greyscale(img):
    # The weighted sum is applied to every pixel at once, in compiled NumPy code
    return 0.299 * img[:, :, 0] \
        + 0.587 * img[:, :, 1] \
        + 0.114 * img[:, :, 2]

You will get, for free, a 500x speedup (0.01 s on a 1,000×1,000 image), thanks to the hard work the NumPy developers put into exploiting vectorization.

Just In Time (JIT) is your friend

Rewriting your Python code in a compiled language is a great idea… if you already know said compiled language, and are willing to spend the time. If you are not, just-in-time (JIT) compilers like Numba can buy you much of the same speedup for the price of a single decorator:

import numba
import numpy as np

@numba.jit()  # compiles the function to machine code the first time it is called
def greyscale(img):
    rows, cols = img.shape[:2]
    grey = np.zeros_like(img)

    for i in range(rows):
        for j in range(cols):
            grey[i, j] = 0.299 * img[i, j, 0] \
                + 0.587 * img[i, j, 1] \
                + 0.114 * img[i, j, 2]
    return grey
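A quick usage sketch, with a hypothetical random image standing in for real data. Note that the first call includes the compilation itself, so time the second call if you want a fair comparison:

img = np.random.randint(0, 256, size=(1000, 1000, 3), dtype=np.uint8)

grey = greyscale(img)  # first call: Numba compiles the function, so this one is slow
grey = greyscale(img)  # subsequent calls run the compiled machine code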

Running on GPUs has never been easier

These days, GPUs are all the rage. These custom processors can offer incredible speedups for problems where massive parallelization is an advantage. Fortunately, you can now use them from Python without writing a line of CUDA, thanks to a growing family of array libraries with GPU backends.

My favourite? JAX, beyond doubt. It just looks like NumPy.

import jax.numpy as np


def greyscale(img):
    # Identical to the NumPy version; JAX dispatches it to the GPU if one is available
    return 0.299 * img[:, :, 0] \
        + 0.587 * img[:, :, 1] \
        + 0.114 * img[:, :, 2]
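To squeeze out a little more, you would typically also compile the function. A minimal sketch, assuming JAX is installed with GPU support and using a dummy image in place of real data:

import jax
import jax.numpy as np  # as in the snippet above

greyscale_jit = jax.jit(greyscale)  # compile with XLA for whatever accelerator is available

img = np.ones((1000, 1000, 3))                 # dummy image; it lives on the GPU if one is present
grey = greyscale_jit(img).block_until_ready()  # JAX dispatches asynchronously; wait for the result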

There exist other approaches, like Triton, that you might also want to have a look at.

Final thoughts

There is one true objective to code optimization: to do more, with less. In that spirit, if you take only one piece of advice from this post, let it be this: do not optimise your code unless you have verified that it is necessary.

This blogpost is based on a lecture I gave at OPIG Retreat last April. I would like to thank Dr. Scott Hale, from the Oxford Internet Institute, for inspiration and several of the examples. In 2021 and 2022 I collaborated with him on the teaching of an MSc-level course, Data Analytics at Scale, a delightful experience that helped mature the ideas that gave rise to this post.
