Handling OAS Scale Datasets Without The Drama | Oxford Protein Informatics Group

Working with the Observed Antibody Space (OAS) dataset sometimes feels a bit like trying to cook dinner with the contents of the whole fridge emptied into the pan. There are countless CSVs of all different sizes (some might not even fit in your RAM), and you just want a clean, fast pipeline so you can get back to modelling. The trick is to stop treating the data like a giant spreadsheet you fully load into memory and start treating it like a columnar, on-disk database you stream through. That’s exactly what the 🤗 Datasets library gives you.

At the heart of 🤗 Datasets is Apache Arrow, which stores columns in a memory-mapped format (if you are curious about what that means, there is a great explanation in another blog post). In plain terms: the data mostly lives on disk, and you pull in just the slices you need. It feels interactive even when the dataset is huge. Instead of a single monolithic script that does everything (and takes forever), you layer small, composable steps (standardize a few columns, filter out junk, compute a couple of derived fields), and each step is cached automatically. Change one piece, and only that piece and the steps after it recompute. Sounds great, right? But of course, the key question now is how to get OAS data into Datasets to begin with.

With OAS, the flow typically starts with getting your hands on the raw files you need; the OAS website is the place for that. In practice, it could look something like this:

raw_data_links = [
"https://opig.stats.ox.ac.uk/webapps/ngsdb/unpaired/Briney_2019/csv/SRR8283606_Heavy_IGHM.csv.gz",
"https://opig.stats.ox.ac.uk/webapps/ngsdb/unpaired/Briney_2019/csv/SRR8283603_1_Heavy_IGHM.csv.gz",
]

These files alone contain about 8 million unique sequences along with a bunch of metadata, so reading them directly with Pandas might cause out-of-memory issues. This is where we want to be smart and process the data step by step. First of all, we need functions to download and decompress these files:

def download_file(url: str, temp_dir: str) -> str:
    """Download file from URL with retry logic."""
    # Name the local file after the last component of the URL path
    path = urlsplit(url).path
    filename = PurePosixPath(path).name
    filepath = os.path.join(temp_dir, filename)

    for attempt in range(3):
        try:
            response = requests.get(url, stream=True, timeout=30)
            response.raise_for_status()

            # Stream to disk in small chunks so the file never sits in memory
            with open(filepath, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)

            return filepath
        except Exception:
            # Retry on the first two failures; re-raise on the third
            if attempt < 2:
                continue
            raise

def decompress_file(compressed_path: str) -> str:
    """Decompress gzipped file."""
    if not compressed_path.endswith('.gz'):
        return compressed_path

    # Strip the '.gz' suffix to get the output path
    decompressed_path = compressed_path[:-3]

    # 'gunzip -c' writes to stdout, which we redirect into the output file
    with open(decompressed_path, 'wb') as output_file:
        _ = subprocess.run(
            ['gunzip', '-c', compressed_path],
            stdout=output_file,
            stderr=subprocess.PIPE,
            check=True,
            timeout=300
        )

    return decompressed_path
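
If `gunzip` is not on your PATH (on Windows, say), the same step can be sketched with the standard library alone. This is an alternative to the subprocess call above rather than the approach used in this post, and the name `decompress_file_stdlib` is made up for illustration.

```python
import gzip
import shutil


def decompress_file_stdlib(compressed_path: str) -> str:
    """Decompress a .gz file next to the original, streaming block by block."""
    if not compressed_path.endswith(".gz"):
        return compressed_path

    decompressed_path = compressed_path[:-3]
    # copyfileobj copies in fixed-size blocks, so the whole file is never held in memory
    with gzip.open(compressed_path, "rb") as src, open(decompressed_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return decompressed_path
```

It returns the same path as `decompress_file` would, so it should work as a drop-in replacement in the pipeline below.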

Now we are ready to start building our 🤗 Datasets dataset 🙂.

def process_file(raw_data_path: str) -> Generator[Dict[str, Any], None, None]:
    # The temporary directory (and the files in it) is removed once the generator finishes
    with tempfile.TemporaryDirectory() as temp_dir:
        local_compressed_path = download_file(raw_data_path, temp_dir)
        csv_path = decompress_file(local_compressed_path)
        # We read the csv in chunks to avoid running out of memory;
        # header=1 tells pandas the column names are on the second line of the file
        for chunk in pd.read_csv(csv_path, header=1, chunksize=10_000):
            for _, row in chunk.iterrows():
                # you can do some additional processing here or not, you're the boss
                yield {'cdr3_aa': row.get('cdr3_aa', '')}

def process_files(urls: Iterable[str]) -> Generator[Dict[str, Any], None, None]:
    for url in urls:
        yield from process_file(url)

features = Features({
    "cdr3_aa": Value("string"),
})

dataset = Dataset.from_generator(process_files,
    gen_kwargs={"urls": raw_data_links},
    features=features,          # helps Arrow pick efficient types
)

dataset.save_to_disk("oas_min.arrow")
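
The same lazy, row-at-a-time idea also works without pandas or a decompressed copy on disk. The sketch below assumes, as `header=1` in `process_file` implies, that the first line of each file is not the column header and can simply be skipped; `iter_cdr3` is a hypothetical name, not part of the pipeline above.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_cdr3(gz_path: str) -> Iterator[Dict[str, str]]:
    """Lazily yield one {'cdr3_aa': ...} dict per data row of a gzipped CSV."""
    with gzip.open(gz_path, "rt", newline="") as handle:
        next(handle, None)  # skip the first line, mirroring header=1
        for row in csv.DictReader(handle):
            # Missing or empty values become an empty string
            yield {"cdr3_aa": row.get("cdr3_aa") or ""}
```

Because it is a generator, a function like this could be handed to `Dataset.from_generator` in the same way as `process_files`.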

Yielding rows into 🤗 Datasets means Arrow handles the storage and you get zero-copy slices later. Small, named transforms keep the pipeline understandable and make it trivial to change thresholds without rebuilding the world. The end result is a dataset you can share and reuse, with the heavy lifting already done. That’s the ingest step done: memory stays flat, and you already have an Arrow-backed dataset. From here, the fun parts are just small, composable transforms.

Figuring out the imports for the code to run is left as an exercise to the reader.

Author

Marius Urbonas
