sacct useful command examples

Caution

If you have not already read about basic job facts like job notation, steps, names and states, please first visit the "About SLURM Jobs" document here.

Sacct Overview

Example

Show my failed jobs between noon and now
sacct -s F -o "user,jobid,state,nodelist,start,end,exitcode" -S noon -E now

More examples can be found throughout this document, as well as at the end.

sacct is a command used to display information about jobs. It has a number of subtleties, such as the time window reported on and the formatting of output. We hope that this page will help you get the information you need.

sacct can be used to investigate jobs' resource usage, nodes used, and exit codes. It can point to important information, such as jobs dying on a particular node but working on other nodes [1].

sacct will show all submitted jobs but cannot, of course, provide data for a number of fields until the job has finished. Use the sstat command to get information about running programs. "Instrumenting" your jobs to gather information about them can include adding one or more sstat commands to batch jobs in multiple places.

Tip

Much of the information on this page can be used with sstat, but there are differences, particularly in available output fields (compare the output of sacct -e and sstat -e).

Examples below use angle brackets < > to indicate where you are supposed to replace arguments with your own values.

sacct basics

  1. By default only your own jobs are displayed. Use the --allusers or -a flag if necessary.
  2. Only jobs from a certain time window are displayed by default. That window varies in a confusing manner depending on the arguments you provide. See this section of the manual page. It is therefore recommended to always provide start (-S) and end (-E) times to be sure that you are seeing what you expect.
  3. You can choose output fields and control their width.
  4. Even the simplest of batch jobs contains multiple "steps" as far as SLURM is concerned. One of them, named "extern", represents the ssh to the compute node on behalf of your job. Job records consist of a primary entry for the job as a whole as well as entries for job steps. The Job Launch page has a more detailed description of each type of job step. You may find the -X flag helpful to omit clutter.
  5. Regular jobs are in the form: JobID[.JobStep]
  6. Array jobs are in the form: ArrayJobID_ArrayTaskID
  7. Jobs have multiple steps!! (Explained here.)
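
The two ID notations above can be taken apart with plain shell parameter expansion. This is only a sketch: parse_jobid is a hypothetical helper name, not a SLURM command, and it assumes IDs shaped like JobID[.JobStep] or ArrayJobID_ArrayTaskID[.JobStep].

```shell
# Split a sacct job ID into its job, array task and step parts.
# parse_jobid is a hypothetical helper, not part of SLURM.
parse_jobid() {
    id=$1
    step=
    task=
    # Everything after the first "." is the step (batch, extern, 0, ...)
    case $id in
        *.*) step=${id#*.}; id=${id%%.*} ;;
    esac
    # Everything after the first "_" is the array task ID
    case $id in
        *_*) task=${id#*_}; id=${id%%_*} ;;
    esac
    echo "job=$id task=${task:-none} step=${step:-none}"
}

parse_jobid 12345.batch      # job=12345 task=none step=batch
parse_jobid 12345_7.extern   # job=12345 task=7 step=extern
```

The job IDs shown are made up for illustration.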

Warning

Sacct retrieves data from a SQL database. Be careful when creating your sacct commands to limit the queries to the information you need. Narrow the search as much as possible. That database needs to be modified constantly as jobs start and complete, so we don't want it tied up answering sacct queries. If you want to look at a large amount of data in a variety of ways, consider saving the output to a text file and then working with that file.

Command Options of Note

Check the man page. There are other useful options.

  • -X show stats for the job allocation itself, ignoring steps (try it)
  • -R reasonlist show jobs not scheduled for given reason
  • -a allusers
  • -N nodelist only show jobs which ran on this/these nodes
  • -u userlist only show jobs run by this/these users
  • --name=namelist - only show jobs with this list of names
  • -n noheader
  • -p parsable puts a | between fields and at end of line
  • -P parsable2 does not put a | at end of line
  • --delimiter - use that char instead of | for -p or -P
  • --units=[KMGTP] - display in this unit
  • -k minimum time - looking for jobs with time limits in a range
  • -K maximum time - looking for jobs with time limits in a range
  • -q qoslist - list of qos used

Sorting or processing output

sacct can output with other delimiters if you specify either -p or -P and --delimiter=<characters>

sacct does not have a sort option. You need to sort its output by other methods. If you want to change the order of the information, specify the field names in the desired order.

This is a prime example of where it is kindest to the SLURM master daemon to run your query once, store the results in a text file, and then work with the contents of that file. That is much better than running many variations of a command pipeline beginning with sacct when you already have all of the desired output fields and are just trying to figure out the right text-processing logic.
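
One way to follow that advice is a small wrapper that runs a command only when its saved output does not exist yet. This is a minimal sketch; run_once and the cache file name are hypothetical, and you would need to delete the file yourself when you want fresh data.

```shell
# Run a command once, save its output, and reuse the saved copy afterwards.
# run_once is a hypothetical helper. Usage: run_once <cache_file> <command> [args...]
run_once() {
    cache_file=$1
    shift
    # Only run the command if the cache file is missing or empty
    if [ ! -s "$cache_file" ]; then
        "$@" > "$cache_file"
    fi
    cat "$cache_file"
}

# On the cluster you might use it like this (not executed here):
#   run_once ~/failed_today.txt sacct -X -P -n -s F -S today -E now -o user,jobid,state,nodelist,exitcode
#   run_once ~/failed_today.txt sacct ... | sort -t'|' -k5,5
```

Every later call with the same file name reads the text file and never contacts the database.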

sort failed jobs by exitcode
sacct -X -a -s F -o "user,jobid,state,nodelist,exitcode" -S noon -E now|sort -k5| less
sort failed jobs by exitcode, then by nodelist
sacct -X -a -s F -o "user,jobid,state,nodelist,exitcode" -S noon -E now|sort -k5,5 -k4,4| less

Useful sort options:

  1. -kKEYDEF or --key=KEYDEF Given just a number as the KEYDEF, it sorts by column
  2. -r or --reverse
  3. -n or --numeric-sort

See the sort manual page for other options, including multiple kinds of numeric sorts, as well as the syntax of KEYDEF.
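
Whitespace-separated columns can shift when a field is empty or contains a space, so it may be safer to sort parsable output (-P) and tell sort that | is the separator. The sketch below assumes five pipe-delimited fields in the order user, jobid, state, nodelist, exitcode; sort_failed is a hypothetical helper, and the sample lines are invented to show the ordering.

```shell
# Sort pipe-delimited sacct output by field 5 (exitcode), then field 4 (nodelist).
# sort_failed is a hypothetical helper that reads from standard input.
sort_failed() {
    LC_ALL=C sort -t'|' -k5,5 -k4,4
}

# Invented sample lines standing in for:
#   sacct -X -a -P -n -s F -o user,jobid,state,nodelist,exitcode -S noon -E now
printf '%s\n' \
    'alice|101|FAILED|compute-02|1:0' \
    'bob|102|FAILED|compute-01|1:0' \
    'alice|103|FAILED|compute-01|0:53' | sort_failed
```

The -n flag drops the header line so that it is not sorted in with the data.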

Start and End Times

It is best to always specify a -S start time and a -E end time.

Special time words: today, midnight, noon, now

Positive and negative deltas from now can be used, which is very handy, but remember to use the full unit word (e.g. "days" not "d" or "day")!!!

now[{+|-}count[seconds(default)|minutes|hours|days|weeks]]

Examples:
now-3days
now-2hours

Valid time formats are (square brackets indicate optional elements):

               HH:MM[:SS][AM|PM]
               MMDD[YY][-HH:MM[:SS]]
               MM.DD[.YY][-HH:MM[:SS]]
               MM/DD[/YY][-HH:MM[:SS]]
               YYYY-MM-DD[THH:MM[:SS]]
Examples:
0408 # April 8th
09:15 # 9:15am (24-hour time is assumed)
09:15pm # 9:15pm
-S $(date -d '21 days ago' +%D-%R) -E $(date -d '17 day ago' +%D-%R) # the date command can interpret many human-readable expressions, then express them using the format mentioned afterwards. Cool, eh!!!
04/15 # the bash shell will probably have problems with the forward slash unless you surround the string with double quotes
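
If you prefer the unambiguous YYYY-MM-DDTHH:MM:SS form from the list above, the date command can produce it as well. This sketch assumes GNU date (for the -d option); sacct_time is a hypothetical helper name.

```shell
# Turn a human-readable date expression into YYYY-MM-DDTHH:MM:SS.
# sacct_time is a hypothetical helper; it relies on GNU date's -d option.
sacct_time() {
    date -d "$1" +%Y-%m-%dT%H:%M:%S
}

sacct_time '2024-04-08 09:15'   # 2024-04-08T09:15:00

# On the cluster you might combine it with sacct (not executed here):
#   sacct -X -S "$(sacct_time '21 days ago')" -E "$(sacct_time '17 days ago')"
```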

Job Time Limit

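If you want to filter for jobs with a certain time limit, use the -k/--timelimit-min flag, optionally together with -K/--timelimit-max (the sacct manual page says that -K is ignored when used by itself). SLURM reads a value with a single colon as minutes:seconds, so write hours as HH:MM:SS or use the days-hours form. To show only jobs with a time limit between 48 and 72 hours that ran between 4 and 8 weeks ago:

sacct -S $(date -d '8 weeks ago' +%D-%R) -E $(date -d '4 weeks ago' +%D-%R) -k 48:00:00 -K 72:00:00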

If you know the exact time limit of the jobs you are looking for, set both min and max time limit to the same value. For jobs with a time limit of 30 minutes that ran in the last month:

sacct -S $(date -d 'last month' +%D-%R) --timelimit-min 30 --timelimit-max 30
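
Time limit strings are easy to misread, because SLURM accepts several forms: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds. A small conversion helper can remove the guesswork. This is a sketch; hours_to_limit is a hypothetical name and it only handles whole hours.

```shell
# Convert a whole number of hours into the days-hours:minutes:seconds form.
# hours_to_limit is a hypothetical helper, not part of SLURM.
hours_to_limit() {
    hours=$1
    printf '%d-%02d:00:00\n' $((hours / 24)) $((hours % 24))
}

hours_to_limit 48   # 2-00:00:00
hours_to_limit 30   # 1-06:00:00

# On the cluster (not executed here):
#   sacct -X -S now-8weeks -E now-4weeks -k "$(hours_to_limit 48)" -K "$(hours_to_limit 72)"
```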

Job State Values

Using the -s <states> option, you can prune your search by looking for only jobs which match the state you need, such as F for failed. (All of these work: f, failed, F, FAILED). You can specify more than one state if you separate them with commas. Note that you use a different flag for state with the squeue command, -t <states>.

Job states have short names consisting of one or two letters, and a full name. You can use either form when working with SLURM commands. They are shown here capitalized for emphasis but can be specified as lower-case.

Warning

Different steps of a job can have different end states. For example the "extern" step is often COMPLETED when the "batch" and overall steps are FAILED. The -X flag to sacct will show you only the overall job state, such as FAILED, which is useful for most cases. However, sometimes you need to check the state of all of a job's steps in order to see that a "batch" step ran OUT_OF_MEMORY.

Primary job states of interest:

Short Name   Long Name       Explanation
PD           PENDING         Job is waiting to start
R            RUNNING         Job is currently running
CG           COMPLETING      Job has ended, clean-up has begun
CD           COMPLETED       Job finished normally, with exit code 0
F            FAILED          Job finished abnormally, with a non-zero exit code
CA           CANCELLED       Job was cancelled by the user or a sysadmin
OOM          OUT_OF_MEMORY   Job was killed for exceeding its memory allocation
TO           TIMEOUT         Job was killed for exceeding its time limit
Complete list of job states

FROM https://slurm.schedmd.com/squeue.html#lbAG

BF BOOT_FAIL
Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
CA CANCELLED
Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD COMPLETED
Job has terminated all processes on all nodes with an exit code of zero.
CF CONFIGURING
Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
CG COMPLETING
Job is in the process of completing. Some processes on some nodes may still be active.
DL DEADLINE
Job terminated on deadline.
F FAILED
Job terminated with non-zero exit code or other failure condition.
NF NODE_FAIL
Job terminated due to failure of one or more allocated nodes.
OOM OUT_OF_MEMORY
Job experienced out of memory error.
PD PENDING
Job is awaiting resource allocation.
PR PREEMPTED
Job terminated due to preemption.
R RUNNING
Job currently has an allocation.
RD RESV_DEL_HOLD
Job is being held after requested reservation was deleted.
RF REQUEUE_FED
Job is being requeued by a federation.
RH REQUEUE_HOLD
Held job is being requeued.
RQ REQUEUED
Completing job is being requeued.
RS RESIZING
Job is about to change size.
RV REVOKED
Sibling was removed from cluster due to other cluster starting the job.
SI SIGNALING
Job is being signaled.
SE SPECIAL_EXIT
The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value.
SO STAGE_OUT
Job is staging out files.
ST STOPPED
Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUs have been retained by this job.
S SUSPENDED
Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
TO TIMEOUT
Job terminated upon reaching its time limit.

That complete list comes from this section of the sacct manual page, which has also been saved to a text file you can copy for your own reference: /jhpce/shared/jhpce/slurm/docs/job-states.txt
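
Once you have saved some output, a per-state tally is a quick way to see what happened to your jobs. The sketch below assumes pipe-delimited lines of the form jobid|state, such as those produced with -X -P -n -o jobid,state. count_states is a hypothetical helper. It keeps only the first word of the state, because cancelled jobs are reported as "CANCELLED by <uid>".

```shell
# Count jobs per state from pipe-delimited "jobid|state" lines on standard input.
# count_states is a hypothetical helper, not part of SLURM.
count_states() {
    awk -F'|' '{ split($2, w, " "); n[w[1]]++ }
        END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Invented sample lines standing in for:
#   sacct -X -P -n -S now-14days -E now -o jobid,state
printf '%s\n' '1|FAILED' '2|COMPLETED' '3|FAILED' '4|CANCELLED by 1000' | count_states
```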

Output Fields of Interest

All sacct fields (output of sacct -e)
Account             AdminComment        AllocCPUS           AllocNodes         
AllocTRES           AssocID             AveCPU              AveCPUFreq         
AveDiskRead         AveDiskWrite        AvePages            AveRSS             
AveVMSize           BlockID             Cluster             Comment            
Constraints         ConsumedEnergy      ConsumedEnergyRaw   Container          
CPUTime             CPUTimeRAW          DBIndex             DerivedExitCode    
Elapsed             ElapsedRaw          Eligible            End                
ExitCode            Flags               GID                 Group              
JobID               JobIDRaw            JobName             Layout             
MaxDiskRead         MaxDiskReadNode     MaxDiskReadTask     MaxDiskWrite       
MaxDiskWriteNode    MaxDiskWriteTask    MaxPages            MaxPagesNode       
MaxPagesTask        MaxRSS              MaxRSSNode          MaxRSSTask         
MaxVMSize           MaxVMSizeNode       MaxVMSizeTask       McsLabel           
MinCPU              MinCPUNode          MinCPUTask          NCPUS              
NNodes              NodeList            NTasks              Partition          
Priority            QOS                 QOSRAW              Reason             
ReqCPUFreq          ReqCPUFreqGov       ReqCPUFreqMax       ReqCPUFreqMin      
ReqCPUS             ReqMem              ReqNodes            ReqTRES            
Reservation         ReservationId       Reserved            ResvCPU            
ResvCPURAW          Start               State               Submit             
SubmitLine          Suspended           SystemComment       SystemCPU          
Timelimit           TimelimitRaw        TotalCPU            TRESUsageInAve     
TRESUsageInMax      TRESUsageInMaxNode  TRESUsageInMaxTask  TRESUsageInMin     
TRESUsageInMinNode  TRESUsageInMinTask  TRESUsageInTot      TRESUsageOutAve    
TRESUsageOutMax     TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin    
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot     UID                
User                UserCPU             WCKey               WCKeyID
What output fields are available?
sacct -e
See all fields for a job
sacct -o ALL -j <jobid>

The following fields are probably the ones you'll want. See this section of the manual page for the list and their meaning. Capitalization does not matter; it is used for readability.

  • TRES means Trackable RESources, such as RAM and CPUs.
  • A number of fields (not listed) are available to tell you on which node a maximum occurred. Similarly there are fields to tell you minimum, average and maximum values for some items.
  • User
  • JobID
  • JobName
  • Partition
  • State
  • ExitCode
  • Submit
  • Start
  • Elapsed
  • End
  • AllocNodes
  • NNodes - number of nodes
  • NodeList
  • ReqTRES - this is what you will be billed for
  • ReqNodes
  • ReqCPUS
  • TRESUsageInTot
  • CPUTime - (Elapsed)*(AllocCPUS) in HH:MM:SS format
  • MaxRSS - Maximum resident set size of all tasks in job
  • MaxVMSize - Maximum virtual memory size of all tasks in job
  • MaxDiskRead - Maximum number of bytes read by all tasks in job
  • MaxDiskWrite - Maximum number of bytes written by all tasks in job

About Memory Fields

Virtual Memory Size (VMSize) is the total memory size of a job. It includes both memory actually in RAM (the RSS) and parts of executables that never needed to be read from disk into RAM. For example, if routines in dynamically linked libraries were never called, those libraries were not loaded.

Resident set size (RSS) is the portion of memory (measured in megabytes) occupied by a job that is held in main memory (RAM). The rest of the memory required by the job exists in the swap space or file system, either because some parts of the occupied memory were paged out, or because some parts of the executable were never loaded.
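
Memory fields such as MaxRSS are usually printed with a unit suffix (for example 2048K or 1.5G), which makes them awkward to compare. The helper below normalizes such a value to megabytes. It is a sketch with assumptions: to_mb is a hypothetical name, suffixes are treated as powers of 1024, and a value without a suffix is assumed to be in bytes.

```shell
# Convert a sacct memory value such as 2048K, 512M or 1.5G to megabytes.
# to_mb is a hypothetical helper. Assumptions: 1024-based units, and a bare
# number (no K/M/G/T suffix) is treated as bytes.
to_mb() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)   # last character is the unit, if any
        num = v + 0                      # numeric prefix of the value
        if (unit == "K") num /= 1024
        else if (unit == "G") num *= 1024
        else if (unit == "T") num *= 1024 * 1024
        else if (unit != "M") num /= 1024 * 1024
        printf "%.1f\n", num
    }'
}

to_mb 2048K   # 2.0
to_mb 1.5G    # 1536.0
```

Passing --units=M to sacct avoids the conversion entirely when you control the query.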

Formatting fields

Each field has a default width, and longer values are truncated. That default is often insufficient.

You can put a "%NUMBER" after a field name to specify how many characters should be printed, e.g.

  • format=name%30 will print 30 characters of field name right justified.
  • format=name%-30 will print 30 characters left justified.

You can specify your format on the command line or define an environment variable to hold the desired string (see below).

Using Environment Variables

You can define environment variables in your shell to reduce the complexity of issuing sacct commands. You can also set these in shell scripts. Command line options will always override these settings.

SACCT_FORMAT

SLURM_TIME_FORMAT

Formatting Dates/Times

You can use most conversion specifications defined by the strftime(3) library function. This web page is a starting point, but what SLURM has chosen to implement may not match.

  • %a - abbreviated name of day of the week
  • %m - month as decimal, 01 to 12
  • %d - day of month as decimal
  • %H - hour as decimal in 24-hour notation
  • %M - minute as decimal, 00 to 59
  • %T - time in 24-hour notation (%H:%M:%S)

Day of week MM-DD HH:MM
export SLURM_TIME_FORMAT="%a %m-%d %H:%M" 
The start and end field widths shown below are suitable for the time format shown above.
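
Because the date command understands the same strftime specifications, you can preview a format string before exporting it as SLURM_TIME_FORMAT. This is a sketch assuming GNU date; preview_time_format is a hypothetical helper, and SLURM's own rendering may differ for specifications it does not implement.

```shell
# Preview how a strftime format string renders a given timestamp.
# preview_time_format is a hypothetical helper; it relies on GNU date's -d option.
# Usage: preview_time_format <format> <date expression>
preview_time_format() {
    LC_ALL=C date -d "$2" +"$1"
}

preview_time_format "%a %m-%d %H:%M" "2024-04-08 09:15"   # Mon 04-08 09:15
```

The rendered string is 15 characters long, which helps when choosing widths for the start and end fields.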

Resources requested, used
export SACCT_FORMAT="user,jobid,jobname,nodelist%12,start%-20,end%-20,state%20,reqtres%40,TRESUsageInTot%200"

Exit Error Codes

In addition to the job's "state", SLURM also records error codes. Unfortunately the vendor's Job Exit Codes page doesn't provide a meaning for the numerical values.

Error 0:53 often means that something wasn't readable or writable. For example, job output or error files couldn't be written in the directory in which the job ran (or where you told SLURM to put them with a directive).

A rough guide to exit codes:

0 → success
non-zero → failure
Exit code 1 indicates a general failure
Exit code 2 indicates incorrect use of shell builtins
Exit codes 3-124 indicate some error in job (check software exit codes)
Exit code 125 indicates out of memory
Exit code 126 indicates command cannot execute
Exit code 127 indicates command not found
Exit code 128 indicates invalid argument to exit
Exit codes 129-192 indicate jobs terminated by Linux signals
For these, subtract 128 from the number and match to signal code
Enter kill -l to list signal codes
Enter man signal for more information
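
The subtraction rule above can be automated. sacct prints ExitCode as two numbers separated by a colon: the exit status and the signal that ended the job, if any. The sketch below applies the guide to such a value; decode_exit is a hypothetical helper, and it uses the shell's own kill -l to look up signal names.

```shell
# Explain a sacct ExitCode value of the form <status>:<signal>.
# decode_exit is a hypothetical helper, not part of SLURM.
decode_exit() {
    status=${1%%:*}   # part before the colon
    sig=${1##*:}      # part after the colon, 0 if no signal was recorded
    if [ "$sig" -ne 0 ]; then
        echo "killed by signal $sig ($(kill -l "$sig"))"
    elif [ "$status" -eq 0 ]; then
        echo "success"
    elif [ "$status" -gt 128 ]; then
        n=$((status - 128))
        echo "exit status $status, likely signal $n ($(kill -l "$n"))"
    else
        echo "exit status $status"
    fi
}

decode_exit 137:0   # exit status 137, likely signal 9 (KILL)
decode_exit 0:9     # killed by signal 9 (KILL)
```

SLURM-specific values such as 0:53 do not correspond to ordinary signals, so treat any signal name printed for them with caution.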

Diagnostic Arguments

These can be useful to double-check what someone actually did.

See the full command issued to submit the job
sacct -o SubmitLine%250 -j <jobid> # may need to increase field width
See batch file used
sacct -B -j <jobid>
Directory used by the job to execute commands
sacct -o WorkDir -j <jobid>
See jobs given a time limit between 1 minute and 1 day
sacct -k 00:01:00 -K 1-0

Examples

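All jobs for username bob that (had a time limit of 2 days) and (were killed for running out of memory) in the past 3 months. Note that --timelimit-min filters on the requested time limit, not on elapsed wall time, and according to the sacct manual page it matches only that exact limit when given without --timelimit-max: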

sacct --user bob --starttime $(date -d '3 months ago' +%D-%R) --state OUT_OF_MEMORY --timelimit-min 2-00:00:00 --format JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime,ReqMem,MaxRSS,MaxDiskRead,MaxDiskWrite,State,ExitCode

How many jobs failed with out-of-memory errors in the last two weeks?

sacct -n -S now-14days -E now -X -s OOM | wc -l

How many jobs failed with other kinds of errors in the last two weeks?

sacct -n -S now-14days -E now -X -s F | wc -l

CPU & RAM expended by jobs dying from out-of-memory errors in last 2 hours

sacct -n -S now-2hours -E now -u <username> -X -s OOM -o cputimeraw,reqmem --units=G | awk '{timesum+=$1;ramsum+=$2} END{printf "%s (CPU hrs) \t%s (GB)\n",timesum/3600, ramsum}'
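
The same totals can be computed from saved parsable output. CPUTimeRAW is reported in seconds, so the sum is divided by 3600 to get hours. This is a sketch: sum_usage is a hypothetical helper, it assumes pipe-delimited cputimeraw|reqmem lines produced with -P and --units=G, and the sample lines are invented.

```shell
# Total CPU hours and requested memory from "cputimeraw|reqmem" lines on standard input.
# sum_usage is a hypothetical helper. CPUTimeRAW is in seconds; ReqMem is expected
# with a G suffix (awk uses the numeric prefix, so 4G counts as 4).
sum_usage() {
    awk -F'|' '{ secs += $1; ram += $2 }
        END { printf "%.2f CPU hrs\t%.1f GB\n", secs / 3600, ram }'
}

# Invented sample lines standing in for:
#   sacct -n -P -X -S now-2hours -E now -s OOM -o cputimeraw,reqmem --units=G
printf '%s\n' '7200|4G' '3600|2G' | sum_usage
```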


  1. In which case you can add the directive --exclude=compute-xxx to your job submission, then notify us via bitsupport so we can fix that node.