How to get more information from Slurm?

So the servers you use have Slurm as their job scheduler? Blopig has very good resources on how to navigate a Slurm environment.

If you are new to SLURMing, I highly recommend Alissa Hummer’s post. There, she explains in detail what you need to submit, check or cancel a job, and even how to run a job with more than one script in parallel by dividing it into tasks. She is so good that by reading her post you will also learn how to move files across servers, create and manage SSH keys, and set up Miniconda and GitHub on a Slurm server.

And Blopig has even more to offer: Maranga Mokaya’s and Oliver Turnbull’s posts are nice complements for more advanced use of Slurm. They help with array jobs, more efficient file copying and creating aliases (shortcuts) for frequently used commands.

So… What could I possibly have to add to that?

Well, suppose you are concerned that you or one of your mates might flood the server (not that it has ever happened to me, but just in case).

[GIF: Helga G. Pataki, from the heyarnold official https://giphy.com/ page]

How would you go about figuring out how many cores are active? How much memory is left? Which GPU does that server use? Fear not, as I have some basic tricks that might help you.

Get information about the servers and nodes:

A pretty straightforward way of getting some information about Slurm servers is the command:

sinfo -M ALL

This gives you the partition names, whether each partition is available, how many nodes it has, its usage state and a list of those nodes.

CLUSTER: name_of_cluster
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
low         up  7-00:00.0    1  idle  node_name.server.address 

The -M ALL argument is used to show every cluster. If you know the name of the cluster you can use:

sinfo -M name_of_cluster
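
Since this listing is plain whitespace-separated text, it is easy to filter. Here is a minimal sketch that keeps only the idle partitions, assuming the default column order shown above (STATE as the fifth column). The function name idle_partitions and the layout of its output are my own choices for illustration, not something Slurm provides.

```shell
# Keep only idle partitions from a default sinfo listing read on standard input.
# Intended use on a cluster: sinfo -M ALL | idle_partitions
idle_partitions() {
    awk '
        BEGIN { cluster = "-" }                  # placeholder when no CLUSTER line is present
        /^CLUSTER:/ { cluster = $2; next }       # remember which cluster the rows belong to
        $5 == "idle" { print cluster, $1, $4, $6 }   # cluster, partition, node count, node list
    '
}
```

Each line of the result holds the cluster, the partition, the number of idle nodes and their names.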

But what if you want to know not only whether it is up and being used, but also how many of its resources are free? Fear not, much is there to learn.

You can use the same sinfo command followed by some arguments that will give you what you want. And the magic command is:

sinfo -o "%all" -M all

This will show you a lot of information about every partition of every cluster:

CLUSTER: name_of_cluster
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS 
[GIF: a light pink pudgy penguin angrily saying “That's too much information.” From the heyarnold official https://giphy.com/ page]

Which is a lot.

So, how can you make it more digestible and filter only the info that you want?

Always start with:

sinfo -M ALL -o "%n" 

Inside the quotation marks, add the fields you would like to see. The %n argument shows the hostname of every node in each cluster. If you want to know how much free memory there is on each node, you can use:

sinfo -M ALL -o "%n %e"

In case you would like to know how the CPUs are being used (how many are allocated, idle, other and total), you should use:

sinfo -M ALL -o "%n %e %C"
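
As a rough sketch of how that output could be post-processed, the shell function below reads the "%n %e %C" listing and prints the idle CPUs and free memory for each node. The function name summarise_nodes and the output layout are hypothetical, chosen only for illustration, and the code assumes the three whitespace-separated columns arrive in exactly that order.

```shell
# Summarise idle CPUs and free memory per node.
# Reads the output of: sinfo -M ALL -o "%n %e %C" on standard input.
summarise_nodes() {
    awk '
        BEGIN { cluster = "-" }                  # placeholder when no CLUSTER line is present
        /^CLUSTER:/ { cluster = $2; next }       # remember the current cluster
        $1 == "HOSTNAMES" { next }               # skip the header row of each cluster
        NF == 3 {
            split($3, cpu, "/")                  # allocated/idle/other/total
            printf "%s %s idle_cpus=%s/%s free_mem_mb=%s\n", cluster, $1, cpu[2], cpu[4], $2
        }
    '
}
```

On a cluster you would pipe the real command into it: sinfo -M ALL -o "%n %e %C" | summarise_nodes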

Well, I could give more and more examples, but it is more efficient to just leave the table of possible arguments here. They come from the Slurm documentation.

Argument  What does it do?
%all      Print all fields available for this data type with a vertical bar separating each field.
%a        State/availability of a partition.
%A        Number of nodes by state in the format “allocated/idle”. Do not use this with a node state option (“%t” or “%T”) or the different node states will be placed on separate lines.
%b        Features currently active on the nodes, also see %f.
%B        The max number of CPUs per node available to jobs in the partition.
%c        Number of CPUs per node.
%C        Number of CPUs by state in the format “allocated/idle/other/total”. Do not use this with a node state option (“%t” or “%T”) or the different node states will be placed on separate lines.
%d        Size of temporary disk space per node in megabytes.
%D        Number of nodes.
%e        The total memory, in MB, currently free on the node as reported by the OS. This value is for informational use only and is not used for scheduling.
%E        The reason a node is unavailable (down, drained, or draining states).
%f        Features available on the nodes, also see %b.
%F        Number of nodes by state in the format “allocated/idle/other/total”. Note the use of this format option with a node state format option (“%t” or “%T”) will result in the different node states being reported on separate lines.
%g        Groups which may use the nodes.
%G        Generic resources (gres) associated with the nodes (for example, the graphics card that the node uses).
%h        Print the OverSubscribe setting for the partition.
%H        Print the timestamp of the reason a node is unavailable.
%i        If a node is in an advanced reservation print the name of that reservation.
%I        Partition job priority weighting factor.
%l        Maximum time for any job in the format “days-hours:minutes:seconds”.
%L        Default time for any job in the format “days-hours:minutes:seconds”.
%m        Size of memory per node in megabytes.
%M        PreemptionMode.
%n        List of node hostnames.
%N        List of node names.
%o        List of node communication addresses.
%O        CPU load of a node as reported by the OS.
%p        Partition scheduling tier priority.
%P        Partition name followed by “*” for the default partition, also see %R.
%r        Only user root may initiate jobs, “yes” or “no”.
%R        Partition name, also see %P.
%s        Maximum job size in nodes.
%S        Allowed allocating nodes.
%t        State of nodes, compact form.
%T        State of nodes, extended form.
%u        Print the user name of who set the reason a node is unavailable.
%U        Print the user name and uid of who set the reason a node is unavailable.
%v        Print the version of the running slurmd daemon.
%V        Print the cluster name if running in a federation.
%w        Scheduling weight of the nodes.
%X        Number of sockets per node.
%Y        Number of cores per socket.
%z        Extended processor information: number of sockets, cores, threads (S:C:T) per node.
%Z        Number of threads per core.
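
To come back to the question of which GPU a server uses, the %G field from the table can be combined with %n. The sketch below lists only the nodes whose GRES entry mentions a GPU. It assumes your site names that resource "gpu", which is a common convention but is set by the cluster administrators, and the function name list_gpu_nodes is hypothetical.

```shell
# List nodes whose generic resources (GRES) include a GPU.
# Reads the output of: sinfo -M ALL -o "%n %G" on standard input.
list_gpu_nodes() {
    awk '
        BEGIN { cluster = "-" }                  # placeholder when no CLUSTER line is present
        /^CLUSTER:/ { cluster = $2; next }       # remember the current cluster
        $1 == "HOSTNAMES" { next }               # skip the header row of each cluster
        $2 ~ /gpu/ { print cluster, $1, $2 }     # assumes the GRES name contains "gpu"
    '
}
```

On a cluster you would run: sinfo -M ALL -o "%n %G" | list_gpu_nodes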

And there you have it! Now you can see what is going on in your Slurm clusters and avoid job-blocking your peers.

If you want to know more about Slurm, keep an eye on Blopig!
