Quickly (and lazily) scale your data processing in Python

Do you use pandas for your data processing/wrangling? If you do, and your code involves any data-heavy steps such as data generation, exploding operations, featurization, etc., then testing your code can quickly become inconvenient:

  • Inconvenient compute times (>tens of minutes). Perhaps fine for a one-off, but over repeated test iterations your efficiency and focus will take a hit.
  • Inconvenient memory usage. Perhaps your dataset is too large for memory, or loads in but then causes an OOM error during a mid-operation memory spike.

I write inconvenient because these issues are not a huge obstacle to deal with, nor do they require any ‘big guns’ such as Apache Spark, but they can be time-consuming enough to break your focus as you subsample your dataset or optimise your script. They are especially inconvenient if you are not working on a compute cluster. So, here I’m going to describe three quick and dirty ways to push back the creep of compute time and memory usage in your pandas-based pipeline:

#1: Use Modin if your scripts run for an inconveniently long time.

There are quite a few ‘new generation’ replacements for pandas out there. Modin is one of them, and its claim to fame is the ease with which it can be dropped into your experimental workflows to speed them up:

Just replace

import pandas as pd

with

import modin.pandas as pd

and you will automatically utilise multiple available cores. Depending on the quantity and type of operations you use, this could drastically reduce your compute time without actually requiring you to use your brain.
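To illustrate (with a hypothetical file path and column names), everything after the import is ordinary pandas syntax; Modin parallelises the work across your cores behind the scenes. Note that Modin also needs one of its execution engines installed, such as Ray or Dask.

import modin.pandas as pd  # drop-in replacement for pandas

df = pd.read_csv('./your_input.csv')  # hypothetical input file
summary = df.groupby('category')['value'].mean()  # hypothetical column names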

#2: Use Polars if your data takes up an inconvenient amount of memory.

Polars is my favourite pandas replacement, although it doesn’t always use *exactly* the same function names as pandas the way Modin does. I’m suggesting it here because of how efficiently it handles memory compared to pandas. First off, Polars generally exhibits smaller memory spikes than pandas: typically 2-4X your data size, compared to pandas’ 5-10X spikes. This is baked into Polars and requires nothing more than switching libraries and commands. However, if your data still exceeds your memory constraints, Polars also provides a fairly sophisticated streaming API with more functionality than pandas’ chunksize argument. You can stream your dataset through your script, performing operations on the fly without running out of memory.

Streaming with Polars will require a little more effort on your part, but, to get you started: begin a Polars query with pl.scan_csv(), chain your desired methods to do things (e.g. filter or replace values), and then execute it by calling collect() with streaming=True.

import polars as pl

query = pl.scan_csv('./your_input.csv').filter(pl.col('value') > 0)  # 'value' is a placeholder column; swap in your own operations
result = query.collect(streaming=True)  # executes the query in streaming batches rather than loading everything at once
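If even the collected result would be too big for memory, and your query is one the streaming engine supports, you can instead sink the result straight to disk (a sketch, with a hypothetical output path):

query.sink_parquet('./your_output.parquet')  # streams the result into a parquet file without materialising it in memory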

This is only effective, though, if you’re performing operations that don’t require the whole dataset to be loaded into memory at once. If your operations do need everything in memory at the same time, then you should first…

#3: Divide and hash if you need to work completely in memory

Some operations are exceedingly memory hungry. Duplicate removal is an example of an operation that requires all your data to be loaded into memory, unless you want to wait a long time for it to finish while it spills over to disk. Such operations can sometimes be made feasible if your data is hashable. Import xxhash, then split and hash your bulky column(s) down to an appropriate hash size, so that each value occupies a fixed, small number of bytes. Hopefully, you can now load all, or at least a lot more, of your data into memory. Don’t forget to first calculate how probable collisions are and whether you can tolerate that probability.
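As a minimal sketch (the input path and the 'payload' column are hypothetical stand-ins for whatever bulky column dominates your memory), you could replace that column with a 64-bit xxhash digest before deduplicating, so each row carries 8 bytes instead of a long string:

import pandas as pd
import xxhash

df = pd.read_csv('./your_input.csv')  # hypothetical input

# replace the bulky column with a compact 64-bit hash (8 bytes per row)
df['payload_hash'] = df['payload'].map(lambda s: xxhash.xxh64(str(s).encode()).intdigest()).astype('uint64')
df = df.drop(columns=['payload'])

# deduplicate on the hash rather than the original wide column
df = df.drop_duplicates(subset=['payload_hash'])

For a 64-bit hash, the chance of any collision among n rows is roughly n^2 / 2^65 (the birthday bound), so at 100 million rows it is on the order of 10^-4; whether that is tolerable is your call.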

There are a lot of ‘proper’ ways to make your pipeline faster and more memory efficient, but they’re more time-consuming than just slapping on Modin or Polars and calling it a day. Sadly, while these fixes take very little time to implement, they’re no substitute for writing reasonably efficient code in the first place. Probably.
