Category Archives: Hints and Tips

On the Joys of vim-like Browsing

Reflections on Pointlessness

One of the great delights in this life is pointless optimisation. Point-ful optimisation has its place of course; it is right and proper and sensible, and, well, useful, and it also does, when first achieved, yield considerable satisfaction. But I have found I soon adjust to the newly more efficient (and equally drab) normality, and so the spell fades quickly.

Not so with pointless optimisation. Pointless optimisation, once attained, is a preternaturally persistent source of joy that keeps on giving indefinitely. Particularly if it involves acquiring a skill of some description; if the task optimised is frequent; and if the time so saved could not possibly compensate for the time and effort sunk into the optimisation process. Words cannot convey the triumph of completing a common task with hard-earned skill and effortless efficiency, knowing full well it makes no difference whatsoever in the grand scheme of things.

Continue reading

Memory Efficient Clustering of Large Protein Trajectory Ensembles

Molecular dynamics simulations have grown increasingly ambitious, with researchers routinely generating trajectories containing hundreds of thousands or even millions of frames. While this wealth of data offers unprecedented insights into protein dynamics, it also presents a formidable computational challenge: how do you extract meaningful conformational clusters from datasets that can easily exceed available system memory?

Traditional approaches to trajectory clustering often stumble when faced with large ensembles. Loading all pairwise distances into memory simultaneously can quickly consume tens or hundreds of gigabytes of RAM, while conventional PCA implementations require the entire dataset to fit in memory before decomposition can begin. For many researchers, this means either downsampling their precious simulation data or investing in expensive high-memory computing resources.

The solution lies in recognizing that we don’t actually need to hold all our data in memory simultaneously. By leveraging incremental algorithms and smart memory management, we can perform sophisticated dimensionality reduction and clustering on arbitrarily large trajectory datasets using modest computational resources. Let’s explore how three key strategies—incremental PCA, mini-batch clustering, and intelligent memory management—can transform your approach to analyzing large protein ensembles.
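
To make this concrete, here is a minimal sketch (my own illustration, not the post’s code) of how the pieces might fit together, assuming a trajectory on disk that is read in chunks with mdtraj and fed to scikit-learn’s incremental estimators; the file names, chunk size, and the numbers of components and clusters are placeholders.

    # Minimal sketch: stream a large trajectory in chunks so that only one
    # chunk of frames is ever held in memory (file names and sizes are
    # illustrative placeholders, not values from the post).
    import mdtraj as md
    import numpy as np
    from sklearn.decomposition import IncrementalPCA
    from sklearn.cluster import MiniBatchKMeans

    TRAJ, TOP = "traj.xtc", "protein.pdb"   # hypothetical input files
    CHUNK = 5000                            # frames per in-memory batch

    ipca = IncrementalPCA(n_components=10)
    kmeans = MiniBatchKMeans(n_clusters=8, random_state=0)

    # Pass 1: fit the PCA incrementally, one chunk of frames at a time.
    for chunk in md.iterload(TRAJ, top=TOP, chunk=CHUNK):
        X = chunk.xyz.reshape(chunk.n_frames, -1)   # (frames, atoms * 3)
        ipca.partial_fit(X)

    # Pass 2: project each chunk into PCA space and update the clusters.
    for chunk in md.iterload(TRAJ, top=TOP, chunk=CHUNK):
        X = chunk.xyz.reshape(chunk.n_frames, -1)
        kmeans.partial_fit(ipca.transform(X))

    # Pass 3: assign every frame to a cluster, still chunk by chunk.
    labels = np.concatenate([
        kmeans.predict(ipca.transform(chunk.xyz.reshape(chunk.n_frames, -1)))
        for chunk in md.iterload(TRAJ, top=TOP, chunk=CHUNK)
    ])

In a real analysis you would also superpose each chunk onto a reference structure (and probably restrict the coordinates to backbone or Cα atoms) before flattening, so that the clusters reflect conformational change rather than rigid-body motion.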

Continue reading

Debugging code for science: Fantastic Bugs and Where to Find Them.

The simulation results make no sense … My proteins are moving through walls and this dihedral angle is negative; my neural network won’t learn anything; I’ve tried for days to install this software and I still get an error.

Sound familiar? Welcome to scientific programming. Bugs aren’t just annoying roadblocks – they’re mysterious phenomena that make you question your understanding of reality itself. If you’ve ever found yourself debugging scientific code, you know it’s a different beast compared to traditional software engineering. In the commercial software world, a bug might mean a button doesn’t work or data isn’t saved correctly. In scientific computing, a bug might mean your climate model predicts an ice age next Tuesday, or your protein folding algorithm creates molecular structures that couldn’t possibly exist in our universe (cough).

Continue reading

GUI Science

There comes a point in every software-inclined, lab-based grad student’s life when they think: now is the time to write a GUI for my software, to make it fast, easy to use, generalised so that others will use it too, the new paradigm for how to do research, etc. etc.

Of course, such delusions of grandeur are rarely indulged, but when executed they can certainly produce useful outputs, as a well-designed (or even just designed) GUI can improve an experimentalist’s life profoundly by simplifying, automating and standardising data acquisition, and by reducing the time to see results, allowing for shorter iteration cycles (known in engineering as the “Design, Build, Test, Learn” cycle; in software it’s just called “coding”).

Having written a few GUIs in my time, I thought it might be helpful to share some experience I have, though it is by no means broad.

Continue reading

AI in Academic Writing: Ethical Considerations and Best Practices

I don’t need to tell you how popular AI, and in particular LLMs, has become in recent years. Alongside this rapid growth comes uncharted territory, especially with respect to plagiarism and integrity. As we adapt to a rapidly changing technological climate, we become increasingly reliant on AI. Need some help phrasing an email? Ask ChatGPT. Need a packing list for an upcoming camping trip? Get an AI-based task manager. So naturally, when we’re faced with the daunting, and admittedly tedious, task of writing papers, we question whether we can offload some of that work to a machine. As with many things, the question is not simply whether you should use AI in your writing process, but how you choose to do so.

When To Use AI

  1. Grammar and readability
    Particularly useful for those who are writing in a second language, or for those who struggle to write in their first language (my high school English teachers would place me firmly in this category), LLMs can be used beneficially to flag awkward phrasing, point out excessively complex sentences, and identify and fix grammatical errors.
  2. Formatting and structure
    LLMs can take care of some of the tedious, repetitive work with respect to formatting and structure. They are particularly useful for generating LaTeX templates for figures, tables, equations, and general structure. You can also use them to check that you’re matching a specific journal’s standards. For example, you can give an LLM a sample of articles from a target publication, and ask it to note the structure of these papers. Then, give it your work and ask it to make general, larger-scale suggestions to ensure that your work aligns structurally with articles typical of that journal.
  3. Reference management
    Although the references themselves should be read and cited by an author, various management tasks, like creating properly formatted BibTeX entries, can be handled by LLMs. Additionally, you can use LLMs as a sanity check to ensure that your references accurately reflect the source material they refer to. However, they should not be used to summarise the source and create references on their own.
  4. Summarising large volumes of literature
    If you’re reviewing large volumes of literature, LLMs can help summarise papers efficiently and point you in the right direction. Although you should always cite and refer back to the original source, LLMs can distill key points from long, dense papers, organise notes, and extract important takeaways from datasets and figures.

Regardless of how you use AI, it is important to keep a record of all instances of AI use throughout your research, including during coding. Some journals will make you explicitly declare the use of AI tools, but even where this is not required, such record-keeping is considered good practice.

When Not to Use AI

  1. Big-picture thinking and narrative development
    Academic papers are not solely about presenting information; they are about constructing an argument, building a narrative flow, and presenting a compelling case. LLMs are not particularly good at replicating human creativity; that work is best left to the authors. Additionally, it is dishonest to claim these important aspects of writing as your own if they are not written directly by you.
  2. Direct copy-paste
    Although AI tools may suggest minor edits, you should never directly copy-and-paste larger selections of AI-generated text. If the ethical concerns described in (1) do not persuade you as they should, note that plenty of tools are now being used by academic institutions and journals to detect AI-generated text. Although some scholars do lean on AI as a more collaborative tool, transparency is key.
  3. Source of knowledge
    LLMs don’t actually “know” anything; they generate responses based on probability. As a result, they have a tendency to “hallucinate”, presenting false information as fact, to misrepresent or oversimplify complex concepts, and to lack precise technical accuracy. They may also be biased by the sources they were trained on. Peer-reviewed sources should be the go-to for actual information. If you use LLMs to summarise something, always refer back to the original text when using that information in your work.
  4. Full citation generation
    As discussed above, although AI can be used to summarise sources, it is not a reliable source of direct citations. All references should be created by hand and verified manually.
  5. General over-reliance
    From the research design process to the final stages of writing and editing, you should generally refrain from an over-reliance on AI. Although LLMs can be powerful tools that can be used to automate various lower-level tasks, they are not a substitute for critical thinking, originality, or domain expertise, and they are not a full-fledged co-author of your work. The intellectual contribution and ownership of ideas remains in the hands of the human authors. 

For further and more official guidance, check out the ethical framework for the use of AI in academic research published in Nature Machine Intelligence. This framework outlines three criteria for the responsible use of LLMs in scholarship, summarised as follows:

  1. Human vetting and guaranteeing the accuracy and integrity 
  2. Substantial human contribution across all areas of the work
  3. Acknowledgement and transparency of AI use

The Good (and limitations) of using a Local CoPilot with Ollama

Interactive code editors have been around for a while now, and tools like GitHub Copilot have woven their way into most development pipelines, and for good reason. They’re easy to use, exceptionally helpful (at certain tasks), and have undeniably made life as a developer smoother. Recently, I decided to switch away from relying on GitHub Copilot in favour of a local model for a few key reasons. While I don’t use it all the time, it has proven to be a useful option in many situations. In this blog post, I’ll go over why I made the switch, how I set it up, and share a bit about my experience so far.

Continue reading

Baby’s First NeurIPS: A Survival Guide for Conference Newbies

There’s something very surreal about stepping into your first major machine learning conference: suddenly, all those GitHub usernames, paper authors, and protagonists of heated twitter spats become real people, the hallways are buzzing with discussions of papers you’ve been meaning to read, and somehow there are 17,000 other people trying to navigate it all alongside you. That was my experience at NeurIPS this year, and despite feeling like a microplankton in an ocean of ML research, I had a grand time. While some of this success was pure luck, much of it came down to excellent advice from the group’s ML conference veterans and lessons learned through trial and error. So, before the details fade into a blur of posters and coffee breaks, here’s my guide to making the most of your first major ML conference.

Continue reading

Visualising and validating differences between machine learning models on small benchmark datasets

Author: Sam Money-Kyrle

Introduction

An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.

The terror of ugly tables extends to benchmark leaderboards, such as Therapeutic Data Commons (TDC). These leaderboard tables do not show:

  1. whether differences in metrics between methods are statistically significant,
  2. whether methods use ensembles or single models,
  3. whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
  4. whether methods are pre-trained or not,
  5. whether pre-trained models are supervised, self-supervised, or both,
  6. the data and tasks that pre-trained models are pre-trained on.

This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.

Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.
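
To give a flavour of what such a check can look like, below is a minimal sketch (my own illustration, not code from the post or from Ash et al.) that scores two off-the-shelf regressors over repeated train/test splits and applies a paired Wilcoxon signed-rank test to the per-split errors; the dataset, models, metric and number of repeats are arbitrary placeholders.

    # Minimal sketch: paired comparison of two regressors over repeated
    # random splits (dataset, models and repeat count are illustrative).
    import numpy as np
    from scipy.stats import wilcoxon
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    mae_ridge, mae_rf = [], []

    for seed in range(20):   # 20 repeated random splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        mae_ridge.append(mean_absolute_error(
            y_te, Ridge().fit(X_tr, y_tr).predict(X_te)))
        mae_rf.append(mean_absolute_error(
            y_te, RandomForestRegressor(random_state=seed).fit(X_tr, y_tr).predict(X_te)))

    # Paired test: are the per-split differences consistently in one direction?
    stat, p_value = wilcoxon(mae_ridge, mae_rf)
    print(f"MAE Ridge {np.mean(mae_ridge):.2f} vs RF {np.mean(mae_rf):.2f}, p = {p_value:.3g}")

Because both models are evaluated on exactly the same splits, the test operates on paired differences, which says far more than two averaged numbers sitting next to each other in a leaderboard table.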

Continue reading

Making Pretty Pictures in PyMOL v2

Throughout my PhD I’ve needed nice PyMOL visualizations, but struggled to quickly and easily make the pictures I wanted. I’ve used Claire Marks’ blopig post, Making Pretty Pictures in PyMOL, many times and wanted to expand it with what I’ve learned to make satisfying visualizations quickly!
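
As a taste of what is involved, here is a small sketch using PyMOL’s Python API (run it inside a PyMOL session, or headlessly with pymol -cq); the structure, colours and ray-tracing settings are just illustrative choices, not the post’s exact recipe.

    # Minimal sketch of a quick "pretty picture" setup via PyMOL's Python API
    # (the PDB code, colours and settings are illustrative choices).
    from pymol import cmd

    cmd.fetch("1ubq")                      # hypothetical example structure
    cmd.hide("everything")
    cmd.show("cartoon")
    cmd.color("skyblue", "1ubq")
    cmd.bg_color("white")
    cmd.set("ray_opaque_background", 0)    # transparent background
    cmd.set("antialias", 2)
    cmd.set("ray_trace_mode", 1)           # cartoon with outlines
    cmd.orient()
    cmd.ray(1600, 1200)
    cmd.png("1ubq_pretty.png", dpi=300)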

Continue reading

Controlling PyMol from afar

Do you keep downloading .pdb and .sdf files and loading them into PyMol repeatedly?

If yes, then PyMol remote might be just for you. With PyMol remote, you can control a PyMol session running on your laptop from any other machine. For example, from a Jupyter Notebook running on your HPC cluster.
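
The post covers the pymol-remote package itself; as a sketch of the general idea, PyMOL also has a built-in XML-RPC mode that works along similar lines: start the session on your laptop with pymol -R, make the port reachable from the remote machine (for example through an SSH tunnel), and drive it from Python. The host, port and forwarding below are assumptions about your setup, not instructions from the post.

    # Sketch of remote control via PyMOL's built-in XML-RPC server (start
    # PyMOL with "pymol -R"; localhost:9123 assumes the default port has
    # been tunnelled back from the machine running this code).
    import xmlrpc.client

    pymol = xmlrpc.client.ServerProxy("http://localhost:9123")
    pymol.do("fetch 1ubq")          # any PyMOL command-line string
    pymol.do("show cartoon")
    pymol.do("color salmon, 1ubq")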

Continue reading