All About SLURM Jobs¶
Well, maybe not ALL.
Here we describe some basic facts about jobs. We point to this document from other places on the web site where we mention these core facts. There are links to some of those other web pages in this document, but see the navigation bar for a complete list of SLURM topics/pages.
SLURM is popular because it is flexible and supports many different computational needs. Users do complex things on various clusters, and the way you normally use SLURM may not at all match the way someone else does.
WHAT IS A SLURM JOB?¶
A job can span multiple compute nodes and is the sum of all task resources.
A job consists of
- an optional set of directives
- a command to run, either bash (for an interactive job) or a shell script1 (for a batch job)
- one or more steps, numbered starting from 0.2
Interactive jobs provide a shell interface on a single compute node. There you can run programs interactively, including ones with GUI interfaces that display via the X11 protocol. For more information, see our page here.
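For instance (a minimal sketch; the exact invocation and any site-specific wrappers are described on the interactive-jobs page), an interactive session with X11 forwarding can be requested along these lines:

```bash
# Ask for an interactive shell on one compute node, forwarding X11
# so that GUI programs can display back on your screen.
srun --pty --x11 bash
```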
Batch jobs are handled by a job scheduler within SLURM, and start when they can get access to one or more compute nodes providing sufficient resources. These jobs do their work unattended. For more information, see our page here.
Array jobs are one type of batch job. They consist of 2 or more jobs spawned by a single batch script. For more information, see this section of our batch job page here.
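As a brief sketch (the range and the command are placeholders), an array job is requested with the --array directive, and each spawned job sees its own index in SLURM_ARRAY_TASK_ID:

```bash
#!/bin/bash
#SBATCH --array=1-10    # spawn 10 jobs from this one batch script

# Each element receives its own SLURM_ARRAY_TASK_ID (1 through 10 here),
# typically used to select an input file or parameter set.
echo "Processing element ${SLURM_ARRAY_TASK_ID}"
```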
DIRECTIVES¶
Directives can come from a variety of sources, including, in order of preference: on the command line, from environment variables, embedded in a batch script on #SBATCH lines, or from a personal ~/.slurm/defaults file. If you don't specify a directive, then some default value will be used. Often that default is one, as in one task, one CPU core, one compute node, ... The default partition is shared, and the default amount of RAM per job is 5 gigabytes.
Directives consist of:
- requests for resources (time, partition, memory, CPU)
- instructions to SLURM
You will find directives described in the `sbatch` manual page.
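As a sketch (the values are placeholders, not recommendations for any particular workload), directives embedded in a batch script sit on #SBATCH lines near the top of the file:

```bash
#!/bin/bash
#SBATCH --job-name=example       # instruction to SLURM: a name for the job
#SBATCH --partition=shared       # resource request: partition (shared is the default)
#SBATCH --time=01:00:00          # resource request: wall-clock limit
#SBATCH --mem=5G                 # resource request: memory
#SBATCH --ntasks=1               # resource request: one task
#SBATCH --cpus-per-task=1        # resource request: one CPU core per task

./my_program                     # placeholder for the real work
```

Because command-line directives take precedence, submitting this script with, say, `sbatch --time=02:00:00 myscript.sh` would override the in-script time request.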
TASKS¶
A job aims to complete one or more tasks.
Tasks are requested at the job level with `--ntasks` or `--ntasks-per-node`, or at the step level with `--ntasks`. CPUs are requested per task with `--cpus-per-task`.
A task is executed on a single compute node in a job step, using one or more CPU cores and a non-zero amount of memory (5 gigabytes if you don't specify -- you can request less than 5G!! -- the less memory your job needs, the more likely it will be able to run).
If there are multiple tasks, then each uses a subset of the resources of the overall job. Multiple tasks can run on a single compute node or on multiple nodes. The resources used by any one task cannot exceed what is found on a single node.
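Putting those options together, here is a minimal sketch (the counts and the program name are placeholders) of a job-level request with a step-level override:

```bash
#!/bin/bash
#SBATCH --ntasks=4            # job-level request: 4 tasks in total
#SBATCH --cpus-per-task=2     # 2 CPU cores for each task (8 cores overall)

# Step-level request: this particular step launches only 2 of the job's 4 tasks.
srun --ntasks=2 ./my_parallel_program
```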
NODES and PARTITIONS¶
Jobs execute on a set of one or more compute nodes called a nodelist. The first node in that set is called the 0th node.3
Compute nodes are grouped together into partitions. For more information about partitions, see our page here.
Warning
If a job exceeds its time limit or memory, it will be killed before it finishes doing its work.
JOB NOTATION¶
Once jobs are submitted and accepted, they are given a job id number.
When specifying a job to a SLURM program like `scontrol`, or reading output from a SLURM program like `sacct`, you will see a variety of forms of identification numbers.
In our documentation and email, we often say simply "job" or "jobid" with the expectation that you will determine the right value to use in the situation.
Underscores (_) divide job arrays from job ids.
Periods (.) divide job ids from step ids.
Sometimes you can use a job array number where the documentation says job id, as when cancelling the whole job array with `scancel` as opposed to cancelling just one of its job elements.
<job id>
<job id>.<step id>
<job array>_<job id>
<job array>_<job id>.<step id>
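For example (the job ids below are made up), those forms let you act on a whole array, a single array element, or one step:

```bash
scancel 1234567        # cancel job 1234567, or every element if it is a job array
scancel 1234567_3      # cancel only element 3 of array job 1234567
sacct -j 1234567.0     # accounting information for step 0 of job 1234567
```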
JOBS HAVE MULTIPLE STEPS¶
The average job will have two or more steps of these types:
- external step - the connection between the node on which you submitted the job and the leading node of your nodelist. This step normally succeeds whether or not your overall job does.
- batch step - created for jobs submitted with `sbatch`. The exit code of your batch script impacts the final STATE of this step.
- interactive step - created for jobs submitted with `srun` (outside of a batch job).
- normal step - a batch job can have multiple normal steps, which will appear in accounting output as `<job_id>.<step_id>`. Each such step is created by an `srun` command, and step numbers start at 0. Interactive jobs do not have any normal steps. (A sketch follows this list.)
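For instance (a sketch; the commands are placeholders), this batch script produces two normal steps in addition to the batch and external steps that SLURM creates on its own:

```bash
#!/bin/bash
#SBATCH --ntasks=1

srun hostname     # becomes normal step <job_id>.0
srun date         # becomes normal step <job_id>.1
```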
You will notice in the output of `sacct` commands that each job has multiple entries, including one for the overall job. Each entry has a STATE code (e.g. COMPLETED, FAILED, CANCELLED) and an EXITCODE (e.g. 0:0). Keep in mind when viewing state information whether you are looking at the overall job state or that of a component step.
JOB NAMES¶
You can give a job a job name. This explicit name can be used with some commands instead of jobids.
- The name of an external step will always be "extern"
- The default overall name of a batch job will be "bash"
- The name of a batch job's normal step(s) will always be "bash"
- The default overall name of an interactive job will be "bash"
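As a sketch (the name is a placeholder), a job name is set with a directive in the batch script and can then be used to look the job up:

```bash
# In the batch script:
#SBATCH --job-name=align_sample42      # an explicit job name

# Later, at the shell prompt:
squeue -u $USER --name=align_sample42  # find your job by name instead of jobid
```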
The `sacct` command is an important one to know how to use. We have a document explaining how to use it here. Please read it, because you need to know how to look up information about your jobs. By default it prints fields only 20 characters wide, and it displays a + when there is more information beyond 20 characters, as seen below for the external steps. (See our document for instructions on changing the output format.)
```
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1922948            bash        cee      jhpce         24    RUNNING      0:0
1922948.ext+     extern                 jhpce         24    RUNNING      0:0
1922948.0          bash                 jhpce         24    RUNNING      0:0
```

```
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1542874            bash     shared      jhpce          2    RUNNING      0:0
1542874.ext+     extern                 jhpce          2    RUNNING      0:0
```
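If a field you care about is being truncated (shown by the +), one workaround (a sketch; the field names and widths are just examples, and our sacct document covers output formatting in detail) is to request specific fields with explicit widths:

```bash
# Ask sacct for wider columns so long values are not cut off at 20 characters.
sacct -j 1922948 --format=JobID%20,JobName%20,Partition,State,ExitCode
```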
JOB STATES¶
The overall job and each of its steps have their own job state codes, and they often differ!
Job states have short names consisting of one or two letters, and a full name. You can use either form when working with SLURM commands. They are shown here capitalized for emphasis but can be specified as lower-case.
The main job states you will see:
Short Name | Long Name | Explanation |
---|---|---|
PD | PENDING | Job is waiting to start |
R | RUNNING | Job is currently running |
CG | COMPLETING | Job has ended, clean-up has begun |
CD | COMPLETED | Job finished normally, with exit code 0 |
F | FAILED | Job finished abnormally, with a non-zero exit code |
CA | CANCELLED | Job was cancelled by the user or a sysadmin |
OOM | OUT_OF_MEMORY | Job was killed for exceeding its memory allocation |
TO | TIMEOUT | Job was killed for exceeding its time limit |
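For example, either the short or the long form of a state name (upper- or lower-case) works when filtering `squeue`:

```bash
squeue -u $USER -t PD          # show only your pending jobs, using the short name
squeue -u $USER -t pending     # the same query, using the long lower-case name
```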
Click for a complete list of job states
From https://slurm.schedmd.com/squeue.html#lbAG
- BF BOOT_FAIL
- Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
- CA CANCELLED
- Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
- CD COMPLETED
- Job has terminated all processes on all nodes with an exit code of zero.
- CF CONFIGURING
- Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
- CG COMPLETING
- Job is in the process of completing. Some processes on some nodes may still be active.
- DL DEADLINE
- Job terminated on deadline.
- F FAILED
- Job terminated with non-zero exit code or other failure condition.
- NF NODE_FAIL
- Job terminated due to failure of one or more allocated nodes.
- OOM OUT_OF_MEMORY
- Job experienced out of memory error.
- PD PENDING
- Job is awaiting resource allocation.
- PR PREEMPTED
- Job terminated due to preemption.
- R RUNNING
- Job currently has an allocation.
- RD RESV_DEL_HOLD
- Job is being held after requested reservation was deleted.
- RF REQUEUE_FED
- Job is being requeued by a federation.
- RH REQUEUE_HOLD
- Held job is being requeued.
- RQ REQUEUED
- Completing job is being requeued.
- RS RESIZING
- Job is about to change size.
- RV REVOKED
- Sibling was removed from cluster due to other cluster starting the job.
- SI SIGNALING
- Job is being signaled.
- SE SPECIAL_EXIT
- The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value.
- SO STAGE_OUT
- Job is staging out files.
- ST STOPPED
- Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job.
- S SUSPENDED
- Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
- TO TIMEOUT
- Job terminated upon reaching its time limit.
That complete list comes from this section of the `sacct` manual page, which has also been saved to a text file you can copy for your own reference: /jhpce/shared/jhpce/slurm/docs/job-states.txt
Warning
The overall job and each of its steps have their own state codes, and they often differ! The `-X` flag to `sacct` will show you only the overall job state, such as FAILED, which is useful for most cases. However, sometimes you need to check the state of all of a job's steps in order to see that a "batch" step ran OUT_OF_MEMORY.
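As a sketch (the jobid is made up), compare the two views:

```bash
sacct -X -j 1234567    # one entry: the overall job state only
sacct -j 1234567       # every entry: the overall job plus its batch, extern, and numbered steps
```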
LIFE OF A JOB AND ASSOCIATED STATES¶
Until we draw a nice diagram, a written description will have to suffice. Short job state names are specified in parentheses below.
- User submits job
- SLURM evaluates syntax and resource requests.
- If problems found, then job is rejected immediately.
- Otherwise it is accepted and becomes PENDING (PD).
- SLURM's scheduling algorithms begin testing where your job will "fit" into their plan for the future. Smaller jobs (in RAM, CPU, and duration) have more slots where they will fit than larger ones.
- PENDING jobs can remain pending (see reasons below), be CANCELLED (CA), or be dispatched to compute node(s) and start RUNNING (R).
- RUNNING jobs can immediately run into a problem due to coding errors and become FAILED (F).
- RUNNING jobs can run correctly but then exceed their memory allocation and become OUT_OF_MEMORY (OOM).
- RUNNING jobs can run correctly but then exceed their wall-clock time limit and become TIMEOUT (TO) or DEADLINE (DL).
- RUNNING jobs can run correctly, switch to COMPLETING (CG) as processes are quitting, and then COMPLETED (CD).
If you are very curious, the SLURM vendor describes how jobs are started and terminated in this document.
PENDING JOB REASONS¶
These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed. These reasons appear in the output of `squeue` and `scontrol show job <jobid>`.
The main pending reasons you will see:
Name | Explanation | Notes |
---|---|---|
BeginTime | The job's earliest start time has not yet been reached | |
Dependency | Job waiting on a user-defined dependency | See why with showjob <jobid>|grep -i depend |
DependencyNeverSatisfied | Something went wrong | You should investigate why. Incorrect dependency specified? |
JobArrayTaskLimit | Array job configured to run only so many tasks at a time | Good way to control resource use |
JobHeldAdmin | The job is held by a system administrator | |
JobHeldUser | The job is held by the user | |
Priority | One or more higher priority jobs exist for this partition or advanced reservation | |
QOSJobLimit | The job's QOS has reached its maximum job count | |
QOSMaxCpuPerUserLimit | Your other running jobs have consumed your CPU quota | slurmuser --me will show your used/pending resources |
QOSMaxMemoryPerUser | Your other running jobs have consumed your RAM quota | slurmuser --me will show your used/pending resources |
Reservation | Job waiting for its advanced reservation to become available | |
Resources | Needed resources not currently available in partition | slurmpic -p <partition> can show you what's used & consumed in that partition |
ReqNodeNotAvail | Some node specifically required by the job is not currently available | Usually seen when all nodes in a partition are unavailable |
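For example (the jobid is made up), two ways to see why a job is still pending:

```bash
squeue -u $USER -t PD -o "%.12i %.9P %.8T %r"    # jobid, partition, state, and pending reason
scontrol show job 1234567 | grep -i reason        # the Reason= field for one specific job
```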
Click for a mostly-complete list of pending reasons
AssocGrp*Limit The job's association has reached an aggregate limit on some resource.
AssociationJobLimit The job's association has reached its maximum job count.
AssocMax*Limit The job requests a resource that violates a per-job limit on the requested association.
AssociationResourceLimit The job's association has reached some resource limit.
AssociationTimeLimit The job's association has reached its time limit.
BadConstraints The job's constraints can not be satisfied.
BeginTime The job's earliest start time has not yet been reached.
Cleaning The job is being requeued and still cleaning up from its previous execution.
Dependency This job has a dependency on another job that has not been satisfied.
DependencyNeverSatisfied This job has a dependency on another job that will never be satisfied.
FrontEndDown No front end node is available to execute this job.
InactiveLimit The job reached the system InactiveLimit.
InvalidAccount The job's account is invalid.
InvalidQOS The job's QOS is invalid.
JobHeldAdmin The job is held by a system administrator.
JobHeldUser The job is held by the user.
JobLaunchFailure The job could not be launched. This may be due to a file system problem, invalid program name, etc.
Licenses The job is waiting for a license.
NodeDown A node required by the job is down.
NonZeroExitCode The job terminated with a non-zero exit code.
PartitionDown The partition required by this job is in a DOWN state.
PartitionInactive The partition required by this job is in an Inactive state and not able to start jobs.
PartitionNodeLimit The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit The job's time limit exceeds its partition's current time limit.
Priority One or more higher priority jobs exist for this partition or advanced reservation.
Prolog Its PrologSlurmctld program is still running.
QOSGrp*Limit The job's QOS has reached an aggregate limit on some resource.
QOSJobLimit The job's QOS has reached its maximum job count.
QOSMax*Limit The job requests a resource that violates a per-job limit on the requested QOS.
QOSResourceLimit The job's QOS has reached some resource limit.
QOSTimeLimit The job's QOS has reached its time limit.
QOSUsageThreshold Required QOS threshold has been breached.
ReqNodeNotAvail Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.
Reservation The job is waiting its advanced reservation to become available.
Resources The job is waiting for resources to become available.
SystemFailure Failure of the Slurm system, a file system, the network, etc.
TimeLimit The job exhausted its time limit.
WaitingForScheduling No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason.
That mostly-complete list comes from this section, which refers to yet another page for the complete list here.
-
You do not have to use the bash shell for interactive sessions or to execute your batch script; you can specify other shells or interpreters. ↩
-
(It is unclear to the author whether an interactive job officially has a step. No steps are listed for them in sacct output, but he suspects that, per the vendor's documentation, it technically does, because tasks are implemented by job steps.) ↩
-
If there is more than one node, look on the 0th node for information about the job in the /var/log/slurm/slurmd.log file when troubleshooting. ↩