Category Archives: macOS

Finding 250GB of Missing Storage On My Mac: A Warning For Large Dataset Users

I recently faced a puzzling issue: my 1TB MacBook Pro showed only 150GB free, but disk analyzers could only account for about 500GB of used space. After hours of troubleshooting, I discovered that Spotlight’s search index had balooned to 233GB, hundreds of times larger than normal.

The Problem

Standard disk analyzers showed that my mac had 330GB of “Inaccessible Disk Space” and 66GB of “Purgeable Disk Space” but no clear explanation for where my storage went. Removing the purgeable space was easy enough with sudo purge but none of the recommended fixes from ChatGPT like clearing Time Machine snapshots, clearing unused conda packages with pip cache purge and conda clean --all, and restarting the computer had any effect on the inaccessible disk space.

Continue reading →

Making your code pip installable

aka when to use a CutomBuildCommand or a CustomInstallCommand when building python packages with setup.py

Bioinformatics software is complicated, and often a little bit messy. Recently I found myself wading through a python package building quagmire and thought I could share something I learnt about when to use a custom build command and when to use a custom install command. I have also provided some information about how to copy executables to your package installation bin. **ChatGPT wrote the initial skeleton draft of this post, and I have corrected and edited.

Next time you need to create a pip installable package yourself, hopefully this can save you some time!

Continue reading →

Converting or renaming files, whilst still maintaining the directory structure

For various reasons we might need to convert files from one format to another, for instance from lossless FLAC to MP3. For example:

ffmpeg -i lossless-audio.flac -acodec libmp3lame -ab 128k compressed-audio.mp3

This could be any conversion, but it implies that the input file and the output file are in the same directory. What if we have a carefully curated directory structure and we want to convert (or rename) every file within that structure?

find . -name “*.whateveryouneed” -exec somecommand {} \; is the tool for you.

Continue reading →

Pairwise sequence identity and Tanimoto similarity in PDBbind

In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.

A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:

from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity
import numpy as np
import os

fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

sims = []
for i in range(len(fps)):
    sims.append(BulkTanimotoSimilarity(fps[i],fps))

arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)

Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:

Continue reading →

Unreasonably faster notes, with command-line fuzzy search

A good note system should act like a second brain:

Accessible in seconds
Adding information should be frictionless
Searching should be exhaustive – if it’s there, you must find it

The benefits of such a note system are immense – never forget anything again! Search, perform the magic ritual of Copy Paste, and rejoice in the wisdom of your tried and tested past.

But how? Through the unreasonable effectiveness of interactive fuzzy search. This is how I have used Fuz, a terminal-based file fuzzy finder, for about 4 years.

Briefly, Fuz extracts all text within a directory using ripgrep, enables interactive fuzzy search with FZF, and returns you the selected item. As you type, the search results get narrowed down to a few matches. Files are opened at the exact line you found. And it’s FAST – 100,000 lines in half a second fast.

Using *Fuz* to quickly add a code-snippet in our note directory – then retrieving it with fuzzy-search. Here, on how to read FASTA files with Biopython, conveniently added to a file called biopython.py.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: macOS

Finding 250GB of Missing Storage On My Mac: A Warning For Large Dataset Users

The Problem

Making your code pip installable

aka when to use a CutomBuildCommand or a CustomInstallCommand when building python packages with setup.py

Converting or renaming files, whilst still maintaining the directory structure

Pairwise sequence identity and Tanimoto similarity in PDBbind

Unreasonably faster notes, with command-line fuzzy search