Category Archives: Python

App-ready databases from Python with SQLModel and FastAPI

Python is ubiquitous for those of us coming into programming from academic research backgrounds, both for generating scientific data and for analysing it. Of course, modern data-driven research only thrives when data is FAIR (findable, accessible, interoperable, and reusable) and traditional relational databases and APIs for interacting with them are a tried-and-true method of ensuring FAIRness. (And in the age of machine learning, the machines need things to be FAIR arguably even more than we do…)

“But I’m a data scientist/computational chemist/bioinformatician!” you protest. “I just (ab)use databases using Python, I don’t make them for someone else to use!” Understandable, but the barrier to entry may be much lower than you think, thanks to two related Python libraries: SQLModel and FastAPI. These powerful tools by developer Sebastián Ramírez (@tiangolo on GitHub) work in concert to make relational databases and APIs much easier to implement for Python natives. Let’s see how they can help…

Continue reading →

Nice TCR processing libraries

As someone who works with T cell antigen receptor (TCR) and peptide-major histocompatibility complex (pMHC) data, I have found several Python packages to be very useful for eliminating tedious steps in data cleaning and feature engineering stages.

Continue reading →

Advanced PyMOL Visualization for Weighted Structural Ensembles (Part 1): Ensemble Comparison

When working with structural ensembles from molecular dynamics, AlphaFold2 subsampling, or ensemble reweighting against experimental data, you quickly run into visualization problems. Many of these problems standard PyMOL tutorials don’t address: what do you do when there’s no single reference structure?

In this two-part series, I’ll share the PyMOL techniques I’ve developed for visualizing weighted ensembles where multiple conformational states coexist. Part 1 covers reference state handling, RMSD-based coloring, and cluster visualization. Part 2 will tackle efficient SASA surface generation for large ensembles. To the best of my knowledge, this is the most advanced PyMOL guide EVER.

The code snippets here are extracted from full scripts attached at the end of this post. All examples use two systems: TeaA (a membrane transporter with distinct open/closed states) and MoPrP (mouse Prion Protein with partially unfolded forms).

Continue reading →

Design your very own drug: An introduction to structure-based small molecule drug design

Are you curious about how scientists design small molecules to treat disease using computational tools, but the words RDKit, docking, and QED mean nothing to you? Look no further than these tutorials for learning the fundamentals of computational small molecule drug design through interactive tutorials that introduce the key tools, concepts, and workflows. From generating compounds to evaluating their drug-likeness and binding potential, by the end you’ll be ready to explore how computational methods can result in the discovery of your very own (virtual) drug candidates to cure Zika!

Find the materials here: https://github.com/oxpig/dtc-struc-bio-smolecules/tree/main.

Continue reading →

Exploring the Protein Data Bank programmatically

The Worldwide Protein Data Bank (wwPDB or just the PDB to its friends) is a key resource for structural biology, providing a single central repository of protein and nucleic acid structure data. Most researchers interact with the PDB either by downloading and parsing individual entries as mmCIF files (or as legacy PDB files), or by downloading aggregated data, such as the RCSB‘s collection in a single FASTA file of all polymer entity sequences. All too often, researchers end up laboriously writing their own file parsers to digest these files. In recent years though, more sophisticated tools have been made available that make it much easier to access only the data that you need.

Continue reading →

Controlling PyMol from afar

Do you keep downloading .pdb and .sdf files and loading them into PyMol repeatedly?

If yes, then PyMol remote might be just for you. With PyMol remote, you can control a PyMol session running on your laptop from any other machine. For example, from a Jupyter Notebook running on your HPC cluster.

Continue reading →

Testing python (or any!) command line applications

Through our work in OPIG, many of our projects come in the form of code bases written in Python. These can be many different things like databases, machine learning models, and other software tools. Often, the user interface for these tools is developed as both a web app and a command line application. Here, I will discuss one of my favourite tools for testing command-line applications: prysk!

Continue reading →

Tanimoto similarity of ECFPs with RDKit: Common pitfalls

A common measure for the similarity of two molecules is the Tanimoto similarity of their ECFPs (Extended Connectivity FingerPrint). However, there is no clear standard in literature for what kind of ECFPs should be used when calculating the Tanimoto similarity, and that choice can lead to substantially different results. In this post I wish to shed light on some results you should know about before you jump into your calculations.

A blog post on how ECFPs are generated was written by Marcus Dablander in 2022 so please take a look at that. In short, ECFPs have a hyperparameter called the radius r, and sometimes a fingerprint length L. Each entry in the fingerprint indicates the presence or absence of a particular substructure in the molecule of interest, and the radius r defines how large the substructures that you consider are. If you have r=3 then you consider substructures made by going up to three hops away from each atom in your molecule. This is best explained by this figure from Marcus’ post:

Continue reading →

I really hope my compounds get the green light

As a cheminformatician in a drug discovery campaign or an algorithm developer making the perfect Figure 1, when one generates a list of compounds for a given target there is a deep desire that the compounds are well received by the reviewer, be it a med chemist on the team or a peer reviewer. This is despite scientific rigour and training and is due to the time invested. So to avoid the slightest shadow of med chem grey zone, here is a hopefully handy filter against common medchem grey-zone groups.

Continue reading →

Making your code pip installable

aka when to use a CutomBuildCommand or a CustomInstallCommand when building python packages with setup.py

Bioinformatics software is complicated, and often a little bit messy. Recently I found myself wading through a python package building quagmire and thought I could share something I learnt about when to use a custom build command and when to use a custom install command. I have also provided some information about how to copy executables to your package installation bin. **ChatGPT wrote the initial skeleton draft of this post, and I have corrected and edited.

Next time you need to create a pip installable package yourself, hopefully this can save you some time!

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: Python

App-ready databases from Python with SQLModel and FastAPI

Nice TCR processing libraries

Advanced PyMOL Visualization for Weighted Structural Ensembles (Part 1): Ensemble Comparison

Design your very own drug: An introduction to structure-based small molecule drug design

Exploring the Protein Data Bank programmatically

Controlling PyMol from afar

Testing python (or any!) command line applications

Tanimoto similarity of ECFPs with RDKit: Common pitfalls

I really hope my compounds get the green light

Making your code pip installable

aka when to use a CutomBuildCommand or a CustomInstallCommand when building python packages with setup.py