Additional Slurm commands
Viewing the job queue
The squeue command can be used to display information about the jobs in the queue. By default, it will print out (left to right):
- job ID
- partition
- username
- job status - the most common states are R for running and PD for pending
- number of nodes
- name of nodes - the actual nodes the job is running on

For example:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
60149 nodes nil-neur nsengupt R 1-19:38:33 15 n[07-21]
60187 osg bl_3jsGg osg01 PD 0:00 1 (Priority)
60186 osg bl_lUjLu osg01 PD 0:00 1 (Priority)
60184 osg bl_rqGUg osg01 PD 0:00 1 (Priority)
60185 osg bl_Sixws osg01 PD 0:00 1 (Priority)
60179 osg bl_8Afn2 osg01 PD 0:00 1 (Priority)
60178 osg bl_WWNqJ osg01 PD 0:00 1 (Priority)
60177 osg bl_DqhRR osg01 PD 0:00 1 (Priority)
60176 osg bl_5dXSj osg01 PD 0:00 1 (Resources)
60175 osg bl_fthxT osg01 R 6:23:21 1 n36
60174 osg bl_0OkIh osg01 R 6:25:19 1 n32
60173 osg bl_9pmce osg01 R 8:31:26 1 n35
60172 osg bl_effaH osg01 R 12:24:54 1 n29
60167 osg bl_cdVsx osg01 R 12:25:00 1 n30
60166 osg bl_87MDq osg01 R 13:10:32 1 n33
60165 osg bl_4DTvm osg01 R 13:32:17 1 n34
60164 osg bl_Lvpox osg01 R 13:49:47 1 n31
In this example there are 17 jobs in the queue; 9 of them are running and 8 are waiting to run. There are many other job states, including ST (stopped), OOM (out of memory), and F (failed), but most of the time you will see running and pending.
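If you want a quick tally of jobs by state, you can count the state column yourself. The sketch below uses a hard-coded sample list standing in for real output; on a cluster you would instead pipe in `squeue --noheader --format=%T`, which prints one full state name per job.

```shell
# Count jobs by state. The sample list below stands in for the output of:
#   squeue --noheader --format=%T
sample_states="RUNNING
PENDING
PENDING
RUNNING
RUNNING"

# uniq -c prefixes each distinct state with its count;
# sort -rn puts the most common state first.
printf '%s\n' "$sample_states" | sort | uniq -c | sort -rn
```

With the sample data this prints 3 RUNNING jobs and 2 PENDING jobs.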
If you wish to see the state of only your jobs, you can use the --user flag:
$ squeue --user=osg01
You can also output additional information with the --long flag. This flag will print out the non-abbreviated default information with the addition of a time limit field:
$ squeue --user=username --long
Finally, for pending jobs, you can use the squeue command to display the estimated start time for jobs by adding the --start flag:
$ squeue --user=username --start
Using --start only provides an estimate, and that time could be greatly impacted by other factors such as jobs with a higher priority or a lack of computational resources when your job reaches the top of the job queue.
For more information on squeue, visit the Slurm page for squeue.
Managing your jobs
Sometimes you may need to stop a job entirely, either while it is running or before it starts. This can be done with the scancel command. The general form of the command is:
$ scancel job-id
To cancel multiple jobs, you can pass a space-separated list of job IDs:
$ scancel job-id1 job-id2 job-id3
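A common pattern is to combine squeue and scancel to cancel every pending job you own. The sketch below uses a hard-coded sample list of job IDs and prefixes the command with `echo` so nothing is actually cancelled; on a real cluster the IDs would come from `squeue --user=$USER --states=PD --noheader --format=%i`.

```shell
# Cancel all of your pending jobs (dry run). On a cluster, replace the
# sample list with:
#   squeue --user=$USER --states=PD --noheader --format=%i
pending_ids="60187
60186
60184"

# xargs collects the IDs onto one command line; drop the `echo`
# to actually cancel the jobs.
printf '%s\n' "$pending_ids" | xargs echo scancel
```

With the sample IDs this prints `scancel 60187 60186 60184` rather than running it.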
For more information, visit the Slurm page for scancel.
Reporting on running jobs
The sstat command allows users to see detailed information for currently running jobs. This includes information such as CPU usage, task information, node information, resident set size (RSS) (i.e. memory usage), and much more. The basic form of the sstat command is:
$ sstat --jobs=job-id
By default, it will display a large amount of information, likely more than you want or need to see. You can limit the information you see by using the --format option. This option takes a comma-separated list of stats to report:
$ sstat --jobs=job-id --format=stat1,stat2,stat3,...
For example, to print out a job's ID, average CPU time, and number of tasks, the command would be:
$ sstat --jobs=job-id --format=jobid,avecpu,ntasks
A full list of statistics that can be used can be printed to the screen with the command sstat --helpformat, or you can find it (and more) by visiting the Slurm page for sstat.
Reporting on completed jobs
seff
The seff command provides information on how efficiently a job used CPU and memory resources. This command can be especially helpful in deciding how many resources you need to request for a job. Below is an example command and associated output:
[auser@rcs-scsn neuron_mpi]$ seff 57957
Job ID: 57957
Cluster: rcs-sc
User/Group: auser/auser
State: COMPLETED (exit code 0)
Nodes: 4
Cores per node: 24
CPU Utilized: 2-13:11:03
CPU Efficiency: 95.24% of 2-13:39:12 core-walltime
Job Wall-clock time: 00:38:32
Memory Utilized: 2.32 GB
Memory Efficiency: 0.91% of 256.00 GB
Based on this output, the job used 96 CPUs (Nodes * Cores per node), and CPU utilization was very good at 95.24%. This means that nearly all the CPUs were in use for the entire time the job ran. Memory utilization, however, was poor: the job requested 256 GB of memory but used only 2.32 GB (0.91% utilization).
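The figures above can be reproduced directly from the raw numbers in the seff output. This small awk sketch, using the values from the example, shows the arithmetic behind the CPU count and the memory efficiency percentage:

```shell
# Reproduce two seff figures from the example above:
#   total CPUs       = nodes * cores per node
#   memory efficiency = memory used / memory requested, as a percentage
awk 'BEGIN {
    nodes = 4; cores_per_node = 24
    mem_used_gb = 2.32; mem_req_gb = 256.00
    printf "CPUs: %d\n", nodes * cores_per_node
    printf "Memory Efficiency: %.2f%%\n", mem_used_gb / mem_req_gb * 100
}'
```

This prints `CPUs: 96` and `Memory Efficiency: 0.91%`, matching the seff report.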
sacct
The sacct command provides a much greater amount of information for completed jobs. It is very similar to the sstat command in terms of what it shows and the options you can use with the command. Below are two very basic examples. The first will print information for one specific job, where the job ID is 12345. The second prints job information for all jobs run by auser.
$ sacct --jobs=12345
$ sacct --user=auser
As with sstat, the information that is displayed can be adjusted using the --format option.
By default, sacct will retrieve jobs that were run during the current day. You can use the --starttime and --endtime options to adjust the time frame for searching. For example, this will print information on all of auser's jobs that started on or after April 15th, 2019, up to the current time:
$ sacct --user=auser --starttime=2019-04-15
As another example of using sacct, suppose you want to see information about your jobs that were run on March 12, 2018. For each job you want to see the job name, the number of nodes used in the job, the number of CPUs, and the elapsed time. The command would look like this:
$ sacct --user=username --starttime=2018-03-12 --format=jobname,nnodes,ncpus,elapsed
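When running sacct reports regularly, it can be convenient to compute the time window rather than typing dates by hand. The sketch below builds a one-week window with GNU date (an assumption; BSD/macOS date uses different flags) and only echoes the resulting sacct command, since running it requires a live Slurm cluster:

```shell
# Build a one-week reporting window for sacct (GNU date assumed).
# The sacct invocation is echoed rather than run, since it needs a
# live Slurm cluster.
start=$(date -d '7 days ago' +%Y-%m-%d)
end=$(date +%Y-%m-%d)

echo sacct --user=username --starttime="$start" --endtime="$end" \
    --format=jobname,nnodes,ncpus,elapsed
```

Dropping the `echo` would run the report for the last seven days each time the script is invoked.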
More details regarding sacct can be found by visiting the Slurm page for sacct.