Making the most of your CPUs when using python

Over the last decade, single-threaded CPU performance has begun to plateau, whilst the number of logical cores has been increasing exponentially.

Like it or loathe it, for the last few years, python has featured as one of the top ten most popular languages [tiobe / PYPL].   That being said however, python has an issue which makes life harder for the user wanting to take advantage of this parallelism windfall.  That issue is called the GIL (Global Interpreter Lock).  The GIL can be thought of as the conch shell from Lord of the Flies.  You have to hold the conch (GIL) for your thread to be computed.  With only one conch, no matter how beautifully written and multithreaded your code, there will still only be one thread will be executed at any point in time.

Best case scenario for multithreaded python : If your code has a large number of wait states in it, for example, if you are writing a program which makes HTTP requests for websites, but where waiting for the website to download is the slowest bit, multithreading is for you.  A thread which is entering a wait state can voluntarily release the GIL so that another thread can take over.  Adapted from David Beazley’s excellent talk,  the image below gives an example of the cooperative multitasking which python can provide.

However do please note that whilst the I/O can over lap, the arrows representing processing never do, the system just switches between them.  The solution to this is multiprocessing.  Multiprocessing effectively side-steps the Global Interpreter Lock by using subprocesses instead of threads. A process has its own virtual address space, executable code, open handles to system objects, environment variables and at least one thread of execution.  Each process gets its own GIL and so the multiprocessing module allows you to fully avail of multiple processors on a given machine.

A simple serial program is shown below, which we will parallelise shortly.  The program simply takes a long list of numbers and using two functions squares and cubes those values, then prints the sum of the cubes and the length of time it took to execute.

#! /usr/bin/env python3
import time

def calc_square(numbers):
    for n in numbers:
            sq = n*n
            #print(n,"squared: ",sq)

def calc_cube(numbers):
    sum = 0
    for n in numbers:
        cu = n*n*n
        #print(n,"cubed: ",cu)
        sum += cu
    print("cube sum: ", sum)

arr = [2,4,8,16,32,64]*1000000

t=time.time()
calc_square(arr)
calc_cube(arr)

print("This took", time.time()-t, "seconds")

The above code gives the output:

cube sum: 299592000000
This took 5.7916646003723145 seconds

The multiprocessed implementation of this is remarkably similar:

#!/usr/bin/env python3
import time
from multiprocessing import Process, Value

threads = 2 # Number of threads to create
cube_sum = 0
arr = [2,4,8,16,32,64]*1000000

def calc_square(numbers):
    for n in numbers:
            sq = n*n
            #print(n,"squared: ",sq)

def calc_cube(numbers):

    global cube_sum
    local_sum = 0

    for n in numbers:
        cu = n*n*n
        #print(n,"cubed: ",cu)
        #use a vaiable local to this process
        local_sum += cu

        #If we didn't keep a local variable, we'd get bad performance due
        #to the loops fighting over the lock all the time.

        #with cube_sum.get_lock():
        #    cube_sum.value += local_sum

    #By using "with", the lock is automatically released
    #I bet nobody reads these bloody things
    with cube_sum.get_lock():
        cube_sum.value += local_sum

    print("cube sum:", cube_sum.value)

if __name__ == "__main__":
    #Create a shared ctype variable of type double
    cube_sum = Value('d', 0)

    #alternatively, we could import an array and have:
    #my_arr = Array('i', range(10))
    t=time.time()

    p0 = Process(target=calc_square,args=(arr,))
    p1 = Process(target=calc_cube,args=(arr[:len(arr)//2],))
    p2 = Process(target=calc_cube,args=(arr[len(arr)//2:],))

    #for better performance have one process (p1) for the first half of the list
    #and p2 for the second half of the list

    #start processing
    p0.start()
    p1.start()
    p2.start()
   
    #block until processing's completed
    p0.join()
    p1.join()
    p2.join()

    print("Multiprocessed took",time.time()-t, "seconds.")

On a dual-core device the above code gives the following output:

cube sum: 149796000000.0
cube sum: 299592000000.0
Multiprocessed took 2.8858437538146973 seconds.

Please note the two sum lines, one from each process.  Each process has calculated its own part of the sum and added it to the current sum value.  One important fact is the functions themselves have barely been touched.  The main difference between this parallel version and the serial version is the way the functions are started and locking “the critical section”.

In a multiprocessed program, each process has its own copy of the variables, so to coordinate results some form of IPC (Inter Process Communication) is required.  One thing that can bite you is that the increment operation is not atomic, it’s implemented as four operations. x += 1 breaks down to four python opcodes:

load x load 1 add store result in x.

In both multithreaded and multiprocessed code, an attempt to increment a shared variable can give unpredictable results.   Thread A may be about to write its results back into variable X just as thread B starts an increment operation of the same variable.  As far as thread B is concerned it’ll never see the new value A puts into the variable and as soon as B finishes, it’ll overwrite the value thread A produced, making it appear as if thread A never did anything at all.

An example of what NOT to do with shared variables is shown below:

#! /usr/bin/env python3
import time
from threading import Thread, Lock

inc_value = 0
n_iters = 50000
lock = Lock()

def increment_o_matic(n_iters):
    global inc_value
    for _ in range(n_iters):
        #lock.acquire()
        inc_value += 1
        #lock.release()

t=time.time()
t1 = Thread(target=increment_o_matic,args=(n_iters,))
t2 = Thread(target=increment_o_matic,args=(n_iters,))

t1.start()
t2.start()
t1.join()
t2.join()

print("after running, inc_value =", inc_value)
print("This took", time.time()-t, "seconds")

 

The above code has two threads, each of which run 50,000 times and which increment the shared variable inc_value.  You’d assume that at the end of this code, inc_value would equal 100,000.  However, if you run this code, inc_value can give some very different results:

after running, inc_value = 90310 This took 0.05303668975830078 seconds after running, inc_value = 86295 This took 0.057953596115112305 seconds

If you’re running the code on a relatively fast machine, you will probably need to tweak the n_iters value to be much higher than 50,000 to see results similar to those above.  Code which should only be updated by a single process at a time isn’t thread safe and is referred to as the “critical section”.  If the lock.acquire() and lock.release() lines have their comments removed, the code will behave as expected.

In short, from multiprocessing import Process, Value is your friend and can be used to trivially take advantage of multiprocessor systems.