The first (top) section contains the information the job scheduler needs in order to know what resources are required to run the job. The first line in the job script file must be #!/bin/bash.
After that come the job scheduler directives, which all begin with #SBATCH. These provide information to the scheduler such as the name of the job and the computational resources needed. These directives are covered in more detail in the next section.
After these directives is the second section of the job script. This section is where you put commands to load and run your software and may also contain any other Linux commands you might need in order to perform calculations.
```bash
#!/bin/bash
#SBATCH --job-name=sample_chemistry-job
#SBATCH --output=chem.out
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4
#SBATCH --mem=12G
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=auser@fandm.edu
# Now that resources and such are specified actually setup and run software
module purge
module load mychemistrysoftware
computeChemistry inputFile.txt
```
The job script file should contain plain text only. Programs such as Microsoft Word add formatting characters that often can't be seen when looking at the file but cause issues with Slurm, so they should not be used to write job script files.
In the second section of the job script, where you run your software, any line starting with a # is treated as a comment and will be ignored by the system. So in the above example, the line

# Now that resources and such are specified actually setup and run software

will be ignored by the system.
Once you have created your script, you can submit your job using the sbatch command. For example, if your job script file is called chem.job, then you would submit your job as:
$ sbatch chem.job
Your job may or may not begin to run immediately; this will depend on whether the requested resources (e.g., CPUs, memory, nodes) are available and how many other jobs are already waiting in the job queue.
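If the submission is accepted, sbatch prints the job ID it assigned to your job. As a rough illustration (the job ID shown here is made up), a successful submission looks something like:

```
$ sbatch chem.job
Submitted batch job 57957
```

You will want to note this job ID; it is used later, for example, when checking resource usage with seff (covered below).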
These directives specify resource requirements and other job information (e.g., job name). In the job script, these directives must come after #!/bin/bash and before any commands to run software. Each directive takes the general form:
#SBATCH --<flag>=<value>
For example, if you wanted to use only 2 nodes to run your job, you would include the following directive:
#SBATCH --nodes=2
Below is a table of some of the more commonly used directives. However, you don't need to specify all of these directives in your job script. A full list of directives and other options can be found in Slurm's documentation for sbatch.
Flag | Description | Example |
---|---|---|
--job-name | Name your job so you can identify it in the queue | --job-name=neuron-job |
--output | Specify a file where messages that would normally be printed to the terminal are written | --output=myjob.out |
--mail-type | Get email when job starts/completes | --mail-type=END,FAIL |
--mail-user | Email address to receive the email | --mail-user=auser@fandm.edu |
--nodes | Request a certain number of nodes to run the job | --nodes=4 |
--ntasks | The total number of CPUs needed to run the job | --ntasks=96 |
--ntasks-per-node | The number of processes you wish to assign to each node | --ntasks-per-node=24 |
--mem-per-cpu | Amount of memory needed on a per CPU basis | --mem-per-cpu=16G |
--mem | Amount of memory needed on a per node basis | --mem=16G |
--partition | Specify a partition. Currently only needed if using a GPU | --partition=gpus |
--gres | Specify a GRES. Currently only needed if using a GPU | --gres=gpu:1 |
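As a sketch only (the values and job name are arbitrary illustrations, not recommendations), several of these flags might be combined in a job script header like this:

```bash
#!/bin/bash
#SBATCH --job-name=neuron-job        # unique, descriptive name
#SBATCH --output=myjob.out           # file for messages Slurm captures
#SBATCH --nodes=2                    # two nodes...
#SBATCH --ntasks-per-node=24         # ...with 24 CPUs on each (48 CPUs total)
#SBATCH --mem-per-cpu=2G             # 2 GB per CPU (48 GB per node)
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=auser@fandm.edu
```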
We recommend always specifying the following directives:

- --job-name (be sure to make your job name unique to make it easier to troubleshoot any issues)
- --output
- --mail-user and --mail-type
- --ntasks and/or --nodes (if --ntasks > 16, then you should request at least two nodes)
- --mem or --mem-per-cpu (these two directives are mutually exclusive); we recommend the --mem directive over the --mem-per-cpu directive

For software that runs on a GPU you must also specify:

- --partition=gpus
- --gres, which will usually be in the form --gres=gpu:1. The number after the colon is the number of GPUs needed; in almost all cases the value should be 1 (see the sketch below).
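For reference, here is a minimal sketch of a GPU job script. The module name and command (mygpusoftware, runGPUAnalysis) are placeholders, not actual software on the cluster:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-sample-job
#SBATCH --output=gpu_sample.out
#SBATCH --ntasks=1
#SBATCH --mem=12G
#SBATCH --partition=gpus       # the GPU partition
#SBATCH --gres=gpu:1           # one GPU

module purge
module load mygpusoftware      # placeholder module name
runGPUAnalysis inputFile.txt   # placeholder command
```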
Software packages sometimes output messages and errors directly to the terminal as they run. The --output directive specifies a file where such messages will be written instead of being displayed to the terminal. It does not specify the name of the output file(s) which your particular program may use to capture other output. Those files still need to be specified as you normally would when running the software.
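As a sketch of the distinction, assuming a hypothetical command-line syntax where the program's own results file (resultsFile.txt) is given as a second argument:

```bash
#SBATCH --output=chem_messages.out              # Slurm captures terminal messages here

computeChemistry inputFile.txt resultsFile.txt  # the program's own output file is still
                                                # given the way the software normally expects
```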
For the --output directive we recommend using one of the following two formats, depending on your specific circumstances:
--output=myfilename_%j.out
Here the %j will be replaced by the job-id that Slurm generates automatically. So if the job-id is 439, then the output would go to myfilename_439.out. This makes it easier to differentiate output files if you run the same job script over and over again.
--output=myfilename_%A_%a.out
Sometimes a job will have sub-jobs that get run (for example, running the same simulation multiple times from the same job script, where each run uses different parameters). In this case, %A refers to the job-id and %a refers to the sub-job-id. For example, if you ran 3 simulations, their sub-job-ids might be 1, 2, and 3, producing the output files myfilename_439_1.out, myfilename_439_2.out, and myfilename_439_3.out.
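One common way to produce sub-jobs like this is a Slurm job array, requested with the --array directive. Below is a hedged sketch assuming a program that reads a different parameter file per run; the parameter file names are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=param-sweep
#SBATCH --output=myfilename_%A_%a.out   # one message file per sub-job
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --array=1-3                     # sub-job IDs 1, 2, and 3

module purge
module load mychemistrysoftware
# SLURM_ARRAY_TASK_ID is set by Slurm to the sub-job ID (1, 2, or 3 here)
computeChemistry params_${SLURM_ARRAY_TASK_ID}.txt
```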
If your job will be requesting resources on more than one node, then there are some things to consider. The --nodes directive allows you to specify either an exact value (e.g., --nodes=4) or a range of values (e.g., --nodes=2-4). In the first case, the scheduler will start your job when you are at the top of the job queue and that exact number of nodes is available (also taking into consideration CPU and memory requests). In the second case, the scheduler will start your job when you are at the top of the job queue and at least the minimum number of nodes in the range is available (again taking into consideration CPU and memory requests).
In addition, using one versus the other has implications when you also use the --ntasks-per-node directive. Whether you use --nodes with an exact value or a range, the total number of tasks (that is, the number of CPUs) you get will be the number of nodes assigned to your job times the number of tasks per node. For example, if you specify --ntasks-per-node=8 and --nodes=4, you will get a total of 32 CPUs, and your job will run when 4 nodes, each with 8 free CPUs, are available.
On the other hand, if you specify --ntasks-per-node=8 and --nodes=2-4, then your job might run with a total of 16, 24, or 32 CPUs depending on how many nodes with 8 available CPUs are free when your job reaches the top of the job queue. The tradeoff is that your job may start sooner but possibly with fewer CPUs and less memory, which may also mean that the computation takes longer. It is up to you to decide how you'd like to request nodes.
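As a sketch of the two request styles just described (these are alternative sets of directives, not meant to appear in the same script):

```bash
# Exact request: starts only when 4 nodes, each with 8 free CPUs, are available (32 CPUs total)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8

# Range request: may start with 2, 3, or 4 nodes (16, 24, or 32 CPUs total)
#SBATCH --nodes=2-4
#SBATCH --ntasks-per-node=8
```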
Requesting an appropriate amount of resources is extremely important because it has a direct impact on how many jobs can be running at a time. When a job starts running on the system, the system reserves the resources requested in the job script for as long as that job is running, whether the job actually uses them or not. For example, if your job requests #SBATCH --ntasks=24 (essentially 24 CPUs) but in the course of processing only uses 8 CPUs, then 16 CPUs did nothing but could not be used for another job because they were reserved for your job. So, over-requesting resources can have a very negative impact on the scheduler's ability to run jobs, affecting not only you but all other users on the cluster as well.
To be clear, we aren't asking you to find the exact “right” values, because there aren't necessarily exact right values. What we are asking is that you not grossly over-request resources for a job, so the scheduler can run as many jobs as possible at one time.
The steps that follow for determining how many resources to request are a rough guideline. There is no guarantee that any particular job won't exceed resource limits, because the amount of resources a particular job actually needs depends on many factors, including (but not limited to) the size of your input data and the specific software and parameters you use.
From our perspective, it is better to have a job fail due to insufficient resources and re-run it requesting more resources than it is to grossly over-request resources that never get used and that probably prevent other jobs from running.
Before discussing the guidelines for determining resource amounts, you should first know what resource amounts are available on the cluster, because these are physical upper limits. There are 3 key computational resources you will need to consider: CPUs, memory, and nodes. If your software uses a GPU, then you will additionally have to account for that. Requesting amounts in excess of these physical limits will cause your job to fail immediately.
Currently, the cluster's capacity (number of nodes, CPUs per node, memory per node, and GPUs) is reflected in the limits below. Again, no single job should ever use all of any of these resources, but just for clarity, the computational upper limits, specified as sbatch directives, are as follows:
- #SBATCH --nodes cannot exceed 28 (effectively 28, since the GPU node should only be used for GPU-based software)
- #SBATCH --ntasks, if used without #SBATCH --nodes, cannot exceed 40; if used with #SBATCH --nodes, then the number of tasks divided by the number of nodes cannot exceed 40
- #SBATCH --ntasks-per-node cannot exceed 40
- #SBATCH --mem cannot exceed 192G, or #SBATCH --mem-per-cpu times ntasks cannot exceed 192G

We currently do not set any resource limits, with the exception of using no more than two GPUs at a time. However, we do recommend that you follow these suggestions:
- Request #SBATCH --mem=6G (or the equivalent if specifying #SBATCH --mem-per-cpu).
- In addition, if you are going to request a certain number of nodes (using #SBATCH --nodes, which we recommend doing), then we suggest requesting them in the general form #SBATCH --nodes=min-max, where min is the minimum number of nodes you'd like to use and max is the maximum (as described above). Requesting nodes this way gives the scheduler more flexibility when it comes to actually starting your job.
By workflow, we mean the entire process you use to analyze data using a computer. This includes steps such as preparing your data and input files, running the main computation(s), and analyzing or post-processing the results.
Depending on the specifics of your workflow, not all of these steps need to, or should, be done on the cluster. We will assume that the bulk of the computation, which probably should be done on the cluster, is the second step.
The starting point for determining resource amounts should be determining whether your workflow can use more than one CPU to perform computations (i.e., run in parallel), which will also impact how many nodes you might need. Below are some questions to help determine if your situation allows running parallel computations. If the answer to any of these questions is YES, then your job script should request more than just one CPU, either with #SBATCH --ntasks or #SBATCH --ntasks-per-node. Further guidance on the number of CPUs to request is provided in the next few sections.
Just because you may not be able to run software in parallel does not mean you should not use the cluster. If you have a computation or series of computations which may take a long time or require a lot of memory, running on the cluster may still make sense. It just means that you should specify #SBATCH --ntasks=1 and you do not need to use the #SBATCH --nodes directive.
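As a minimal sketch of a serial (single-CPU) job script, reusing the placeholder software names from the first example:

```bash
#!/bin/bash
#SBATCH --job-name=serial-chemistry-job
#SBATCH --output=serial_chem_%j.out
#SBATCH --ntasks=1          # a single CPU; no --nodes directive is needed
#SBATCH --mem=6G

module purge
module load mychemistrysoftware
computeChemistry inputFile.txt
```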
Below is a sample set of resource requests to use as a baseline for testing purposes. If you already know your software/workflow cannot be run in parallel, then remove the line #SBATCH --nodes=2-4 and change the first line to #SBATCH --ntasks=1. It is still worth doing some sample runs to get a sense of how much memory you may need.

This snippet only includes the requests for things like CPUs and memory; you should also include other directives for things like job name, output, etc., as well as all the commands you need to actually run the software.
```bash
#SBATCH --ntasks-per-node=6
#SBATCH --nodes=2-4
#SBATCH --mem=16G
#SBATCH --time=01:30:00
```
This script will request a minimum of 2 nodes and a maximum of 4 nodes, anywhere from 12 to 24 CPUs depending on how many nodes are assigned to your job, and 16 GB of memory per node. The final line sets a run time limit of 1.5 hours. This sample script is just meant to give a rough idea of your computational requirements, so if your computation would normally take hours or more, you don't want to wait that long just for a test run.
We strongly suggest you do at least 3 test runs with different inputs in order to get a sense of the resource requirements.
Assuming the job runs without error, then when it completes you should use the seff command to look at resource utilization. This command requires the job ID that Slurm assigns. For example, if your job ID was 57957 you would run seff 57957, which will produce output similar to:
```
Job ID: 57957
Cluster: rcs-sc
User/Group: auser/auser
State: COMPLETED (exit code 0)
Nodes: 4
Cores per node: 6
CPU Utilized: 2-13:11:03
CPU Efficiency: 99.24% of 2-13:39:12 core-walltime
Job Wall-clock time: 00:38:32
Memory Utilized: .57 GB
Memory Efficiency: 0.91% of 64.00 GB
```
There are three lines you should focus on: CPU Efficiency, Memory Utilized, and Memory Efficiency.

In this example the job used only about 0.6 GB of the 64 GB of memory requested, so for future runs you could lower the request to something like #SBATCH --mem=2G, which is still probably more than you need but allows for some wiggle room in terms of memory usage while still leaving a good amount of memory available for other jobs.

In terms of using multiple CPUs in parallel, more is not always better. There is eventually a point where having more CPUs does not produce a computational benefit, especially in relation to how long it takes the software to run.
You may also want to try test runs with different values of --ntasks-per-node, and then run seff after the jobs complete to assess your resource usage and adjust your job scripts as necessary.
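For example, based on the seff output above (roughly 0.6 GB of memory actually used and very high CPU efficiency), an adjusted set of requests for future runs might look like the following sketch; treat the exact numbers as illustrative rather than prescriptive:

```bash
#SBATCH --ntasks-per-node=6
#SBATCH --nodes=2-4
#SBATCH --mem=2G            # down from 16G per node; still well above the ~0.6 GB observed
```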