AlphaGeometry: are computers taking over math?

Last week, Google DeepMind announced AlphaGeometry, a novel deep learning system that is able to solve geometry problems of the kind presented at the International Mathematical Olympiad (IMO). The work is described in a recent Nature paper, and is accompanied by a GitHub repo including full code and weights.

This paper has caused quite a stir in some circles. Well, at least the kind of circles that you tend to come into close contact with when you work at a Department of Statistics. Like folks in structural biology three years ago, those who earn a living by veering into the mathematical void and crafting proofs were wondering whether their jobs might also have a close-by expiration date. I found this quite interesting, so I decided to read the paper and try to understand it. To motivate myself, I set out to present this paper at an upcoming journal club, and also to write this blog post.

So, let’s ask, what has actually been achieved and how powerful is this model?

What has been achieved

The image that has been making the rounds this time is the following benchmark:

The stuff MDAnalysis didn’t implement: CPU Parallel HOLE conductance analysis

Some time ago, I needed to find a way to computationally estimate conductance values for every protein frame from several molecular dynamics (MD) trajectories.

In a previous post, I wrote about how to clean the resulting instant conductance timeseries from outliers. But I never described how I generated these timeseries.

In this post, I will show how you can parallelise the computation of instant conductance given an MD trajectory. I will touch on the difficulties of this process, and on why I had to implement a custom tool for it even though MDAnalysis seems to have already implemented a routine of this sort. Finally, I will provide two Python scripts that you can easily adapt to run your parallel calculations, along with some important notes you don’t wanna skip.
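As a rough illustration of the general idea (and not the scripts from the full post), a per-frame calculation can be split into contiguous chunks of frame indices that are handed to a pool of worker processes. In the sketch below, `estimate_conductance` is a hypothetical placeholder for the real per-frame calculation, and `chunk_frames`, `process_chunk` and `run_parallel` are names made up for this example.

```python
from concurrent.futures import ProcessPoolExecutor


def chunk_frames(n_frames, n_workers):
    """Split frame indices 0..n_frames-1 into at most n_workers contiguous chunks."""
    size, extra = divmod(n_frames, n_workers)
    chunks, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional frame each.
        stop = start + size + (1 if worker < extra else 0)
        if stop > start:
            chunks.append(range(start, stop))
        start = stop
    return chunks


def estimate_conductance(frame_index):
    """Hypothetical placeholder: the real version would analyse one frame."""
    return float(frame_index)  # dummy value, not a real conductance


def process_chunk(frames):
    """Run the per-frame calculation over one chunk, keeping frame order."""
    return [estimate_conductance(i) for i in frames]


def run_parallel(n_frames, n_workers):
    """Compute one value per frame using n_workers processes."""
    chunks = chunk_frames(n_frames, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in the same order as the input chunks.
        results = list(pool.map(process_chunk, chunks))
    return [value for chunk in results for value in chunk]
```

In a script, `run_parallel(1000, 4)` would normally be called from under an `if __name__ == "__main__":` guard, so that worker processes can import the module without re-running the calculation.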

Violin plots of conductance distributions from 64 molecular dynamics trajectories with 1000 frames each.

Taking Equivariance in deep learning for a spin?

I recently went to Sheh Zaidi’s brilliant introduction to Equivariance and Spherical Harmonics and I thought it would be useful to cement my understanding of it with a practical example. In this blog post I’m going to start with serotonin in two coordinate frames, and build a small equivariant neural network that featurises it.
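To make the "two coordinate frames" idea concrete before any neural network is involved, here is a minimal sketch of the two properties at play. A feature is invariant if it does not change when the molecule is rotated, and equivariant if it rotates along with the molecule. The coordinates below are toy values made up for illustration (they are not serotonin), and the function names are my own.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def pairwise_distances(points):
    """Invariant features: distances between all pairs of points."""
    return [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


def centroid(points):
    """Equivariant feature: the mean position rotates with the input."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


# Toy coordinates (made up for illustration, not serotonin).
frame_a = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.5)]
angle = math.pi / 3
frame_b = rotate_z(frame_a, angle)  # same points seen in a rotated frame

# Invariance: f(R x) == f(x)
invariant_ok = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(frame_a), pairwise_distances(frame_b))
)

# Equivariance: f(R x) == R f(x)
rotated_centroid = rotate_z([centroid(frame_a)], angle)[0]
equivariant_ok = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(centroid(frame_b), rotated_centroid)
)

print(invariant_ok, equivariant_ok)
```

Both checks should hold for any rotation angle, because rotations preserve distances and commute with taking the mean.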

PhD as a mother

As a mother currently pursuing my doctorate, I often encounter the belief that higher education is not the ideal time for parenthood. In this post, I want to share my personal experience, offering a different perspective.

A year ago, I began my doctorate with a two-and-a-half-month-old baby. When I received the acceptance email from Oxford, I was thrilled – a dream come true. However, this raised a question: could I pursue this dream while pregnant? I believed in balancing motherhood and academic aspirations, and my advisor’s encouragement reinforced this belief. We, as a family, moved from Israel to England, adjusting to this new chapter.

It hasn’t been easy. Physically, post-pregnancy recovery and sleepless nights were tough. Emotionally, I constantly struggle with guilt over balancing academic and maternal responsibilities. If I focus on my daughter, I worry about neglecting my research; if I concentrate on my studies, I feel like a bad mother. The logistics of managing a household, especially when being the primary caregiver, added another layer of complexity. Motherhood often feels isolating, as not everyone around me can relate to my situation.

Yet, doctoral studies offered unexpected advantages. The flexibility allows me to align my work with my daughter’s schedule, often during nights or weekends. This means I can compensate for lost time without impacting others, unlike in a regular job. Interestingly, this flexibility leads to more time spent with my daughter than if I had a typical job. Moreover, the challenges of motherhood put academic obstacles into perspective. The best part of my day is always the hug from my daughter after a day of work.

As I keep moving forward with my PhD, here are some key tips that have helped me so far:

Flexible Scheduling: Organize daily tasks, including household chores, within specific hours to enhance efficiency.
Creating a Supportive Environment: Having a support system, be it your partner or friends, is crucial. Address practical issues early on, like daycare and babysitters, and don’t be shy to ask for help.
Aligning Expectations with Your Supervisor: Communicate your limitations early to avoid misunderstandings.
Practice Compassion: Acknowledge that you can’t do everything and be kind to yourself.

In the race of life, there never seems to be a “right” time for children. Whether it’s career progression or personal aspirations, the timing is always challenging. However, if you feel ready, that is the right time for you.

On National AI strategies

Recently, I have become quite interested in how countries have been shaping their national AI strategies or frameworks. Since the launch of ChatGPT, several concerns have been raised about AI safety and how such groundbreaking AI technologies could augment or adversely affect our daily lives. To address the public’s concerns and set standards and practices for AI development, some countries have recently released their national AI frameworks. As a budding academic researcher in this space who is keen to make AI more useful for medicine and healthcare, I find two aspects of the few frameworks I have looked at (specifically those of the US, UK and Singapore) particularly interesting: the multi-stakeholder approach and the focus on AI education. I will delve further into both in this post.

Antibody Engineering & Therapeutics 2023

San Diego

And I wish I been out in California
When the lights on all the Christmas trees went out
Jagger & Richards

Festive conference blog here, dreaming of my December days in sunny California where I attended Antibody Engineering and Therapeutics 2023. Replete with Stones lyrics, Science and Sunset imagery. Hope y’all enjoy!

How to get more information from slurm?

So the servers you use have Slurm as their job scheduler? Blopig has very good resources for learning how to navigate a Slurm environment.

If you are new to SLURMing, I highly recommend Alissa Hummer’s post. There, she explains in detail what you will need to submit, check or cancel a job, and even how to run a job with more than one script in parallel by dividing it into tasks. She is so good that by reading her post you will learn how to move files across the servers, create and manage SSH keys, and set up Miniconda and GitHub on a Slurm server.

And Blopig has even more to offer: Maranga Mokaya’s and Oliver Turnbull’s posts are nice complements for more advanced use of Slurm. They help with the use of array jobs, more efficient file copying and creating aliases (shortcuts) for frequently used commands.

So… What could I possibly have to add to that?

Well, suppose you are concerned that you or one of your mates might flood the server (not that it has ever happened to me, but just in case).

Helga G. Patak — From heyarnold official https://giphy.com/ page

How would you go about figuring out how many cores are active? How much memory is left? Which GPU does that server use? Fear not, as I have some basic tricks that might help you.

Get information about the servers and nodes:

A pretty straightforward way of getting some information about Slurm servers is the command:

sinfo -M ALL

This will give you the partition names, whether each partition is available, how many nodes it has, its usage state and a list of those nodes.

CLUSTER: name_of_cluster
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
low         up  7-00:00.0    1  idle  node_name.server.address

The -M ALL argument is used to show every cluster. If you know the name of the cluster you can use:

sinfo -M name_of_cluster

But what if you want to know not only if it is up and being used, but how much of its resource is free? Fear not, much is there to learn.

You can use the same sinfo command followed by some arguments that will give you what you want. And the magic command is:

sinfo -o "%all" -M all

This will show you a lot of information about every partition of every cluster:

CLUSTER: name_of_cluster
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS

A light pink pudgy penguin angrily saying: That's too much information. — From heyarnold official https://giphy.com/ page

Which is a lot.
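If you do want to keep everything but handle it programmatically, the pipe-separated output is easy to turn into dictionaries. The function below is a minimal sketch, assuming the layout shown above: a "CLUSTER:" line, then a header line, then one line per row. The name `parse_pipe_table` and the sample text are made up for illustration, and the sample is a shortened stand-in rather than real output.

```python
def parse_pipe_table(text):
    """Parse '|'-separated sinfo output into a list of dicts, one per row."""
    rows = []
    header = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("CLUSTER:"):
            header = None  # each cluster section starts with its own header
            continue
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells
            continue
        row = {}
        for name, value in zip(header, cells):
            # Some column names appear twice; keep the first occurrence.
            row.setdefault(name, value)
        rows.append(row)
    return rows


# Shortened, made-up stand-in for the real output.
sample = """CLUSTER: name_of_cluster
AVAIL|CPUS|FREE_MEM|HOSTNAMES |PARTITION |PARTITION
up|16|12000|node_a |low |low
"""

for row in parse_pipe_table(sample):
    print(row["HOSTNAMES"], row["FREE_MEM"])
```

Stripping each cell matters because several of the column names and values are padded with trailing spaces.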

So, how can you make it more digestible and filter only the info that you want?

Always start with:

sinfo -M ALL -o "%n"

Inside the quotation marks, add the fields you would like to see. The %n argument shows the hostname of every node in each cluster. If you want to know how much free memory there is on each node, you can use:

sinfo -M ALL -o "%n %e"

In case you would like to know how the CPUs are being used (how many are allocated, idle, other and total), you should use:

sinfo -M ALL -o "%n %e %C"
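The "allocated/idle/other/total" field is handy for humans but awkward for scripts, so here is a small sketch that splits it into numbers. It assumes the three-column output of the command above, with a "CLUSTER:" line and a header line starting with HOSTNAMES. `NodeUsage`, `parse_node_usage` and the sample text are invented for this example, and a non-numeric free memory value is stored as None.

```python
from typing import NamedTuple, Optional


class NodeUsage(NamedTuple):
    hostname: str
    free_mem_mb: Optional[int]
    cpus_allocated: int
    cpus_idle: int
    cpus_other: int
    cpus_total: int


def parse_node_usage(text):
    """Parse the output of: sinfo -M ALL -o "%n %e %C"."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, "CLUSTER: name" lines and the header line.
        if len(parts) != 3 or parts[0] == "HOSTNAMES":
            continue
        hostname, free_mem, cpus = parts
        allocated, idle, other, total = (int(n) for n in cpus.split("/"))
        nodes.append(
            NodeUsage(
                hostname,
                int(free_mem) if free_mem.isdigit() else None,
                allocated,
                idle,
                other,
                total,
            )
        )
    return nodes


# Made-up sample output, for illustration only.
sample = """CLUSTER: name_of_cluster
HOSTNAMES FREE_MEM CPUS(A/I/O/T)
node_a 12000 4/12/0/16
node_b N/A 0/0/8/8
"""

for node in parse_node_usage(sample):
    print(f"{node.hostname}: {node.cpus_idle} of {node.cpus_total} CPUs idle")
```

From here it is a one-liner to, say, list only the nodes that still have idle CPUs before you submit a large job.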

Well, I could give more and more examples, but it is more efficient to just leave the table of possible arguments here. They come from the Slurm documentation.

Argument	What does it do?
%all	Print all fields available for this data type with a vertical bar separating each field.
%a	State/availability of a partition.
%A	Number of nodes by state in the format “allocated/idle”. Do not use this with a node state option (“%t” or “%T”) or the different node states will be placed on separate lines.
%b	Features currently active on the nodes, also see %f.
%B	The max number of CPUs per node available to jobs in the partition.
%c	Number of CPUs per node.
%C	Number of CPUs by state in the format “allocated/idle/other/total”. Do not use this with a node state option (“%t” or “%T”) or the different node states will be placed on separate lines.
%d	Size of temporary disk space per node in megabytes.
%D	Number of nodes.
%e	The total memory, in MB, currently free on the node as reported by the OS. This value is for informational use only and is not used for scheduling.
%E	The reason a node is unavailable (down, drained, or draining states).
%f	Features available on the nodes, also see %b.
%F	Number of nodes by state in the format “allocated/idle/other/total”. Note the use of this format option with a node state format option (“%t” or “%T”) will result in the different node states being reported on separate lines.
%g	Groups which may use the nodes.
%G	Generic resources (gres) associated with the nodes (for example, the graphics cards the node has).
%h	Print the OverSubscribe setting for the partition.
%H	Print the timestamp of the reason a node is unavailable.
%i	If a node is in an advanced reservation print the name of that reservation.
%I	Partition job priority weighting factor.
%l	Maximum time for any job in the format “days-hours:minutes:seconds”
%L	Default time for any job in the format “days-hours:minutes:seconds”
%m	Size of memory per node in megabytes.
%M	PreemptionMode.
%n	List of node hostnames.
%N	List of node names.
%o	List of node communication addresses.
%O	CPU load of a node as reported by the OS.
%p	Partition scheduling tier priority.
%P	Partition name followed by “*” for the default partition, also see %R.
%r	Only user root may initiate jobs, “yes” or “no”.
%R	Partition name, also see %P.
%s	Maximum job size in nodes.
%S	Allowed allocating nodes.
%t	State of nodes, compact form.
%T	State of nodes, extended form.
%u	Print the user name of who set the reason a node is unavailable.
%U	Print the user name and uid of who set the reason a node is unavailable.
%v	Print the version of the running slurmd daemon.
%V	Print the cluster name if running in a federation.
%w	Scheduling weight of the nodes.
%X	Number of sockets per node.
%Y	Number of cores per socket.
%z	Extended processor information: number of sockets, cores, threads (S:C:T) per node.
%Z	Number of threads per core.
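If you find yourself retyping the same format strings, you could wrap the table above in a small helper. This is only a sketch under a few assumptions: the readable field names in `SINFO_FIELDS` are my own labels for a handful of the specifiers listed above, the "|" between specifiers is there purely to make the output easy to split, and `-h` is used to suppress the header line. `query_sinfo` needs a machine with Slurm installed, so only `build_sinfo_command` is exercised here.

```python
import subprocess

# Readable labels (made up for this example) for a few specifiers from the table.
SINFO_FIELDS = {
    "hostname": "%n",
    "free_mem_mb": "%e",
    "cpus_by_state": "%C",
    "cpu_load": "%O",
    "gres": "%G",
    "partition": "%R",
}


def build_sinfo_command(fields, cluster="ALL"):
    """Build the sinfo argument list for the requested fields."""
    fmt = "|".join(SINFO_FIELDS[name] for name in fields)
    return ["sinfo", "-M", cluster, "-h", "-o", fmt]


def query_sinfo(fields, cluster="ALL"):
    """Run sinfo (requires Slurm) and return one dict per output line."""
    cmd = build_sinfo_command(fields, cluster)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    rows = []
    for line in out.splitlines():
        # Skip blank lines and the per-cluster "CLUSTER: name" lines.
        if not line.strip() or line.startswith("CLUSTER:"):
            continue
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(fields, cells)))
    return rows


print(" ".join(build_sinfo_command(["hostname", "free_mem_mb", "cpus_by_state"])))
```

On a Slurm login node, `query_sinfo(["hostname", "free_mem_mb"])` would then give you a list of dictionaries that you can filter or sort however you like.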

And there you have it! Now you can know what is going on in your Slurm clusters and avoid job-blocking your peers.

If you want to know more about Slurm, keep an eye on Blopig!

Finding and testing a reaction SMARTS pattern for any reaction

Have you ever needed to find a reaction SMARTS pattern for a certain reaction but don’t have it already written out? Do you have a reaction SMARTS pattern but need to test it on a set of reactants and products to make sure it transforms them correctly and doesn’t allow for odd reactants to work? I recently did and I spent some time developing functions that can:

Generate a reaction SMARTS for a reaction given two reactants, a product, and a reaction name.
Check the reaction SMARTS on a list of reactants and products that have the same reaction name.

Online tools for drawing and visualizing molecules

I recently came across a nice tool for depicting multiple molecules called CDK Depict (thanks to Ruben for sending it to me), so I decided to explore what other web-based molecule visualization and drawing tools are available.

Oxford Protein Informatics Group

or "OPIG" to friends