reportseff useful command examples¶
Caution
If you have not already read about basic job facts like job notation, steps, names and states, please first visit the "About SLURM Jobs" document here.
Reportseff Overview¶
seff is a program that looks up accounting data using the sacct command to show some statistics for a single completed job.
Example seff output
Job ID: 9709
Cluster: cms
User/Group: ttotoj/users
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:47:34
CPU Efficiency: 97.74% of 00:48:40 core-walltime
Job Wall-clock time: 00:48:40
Memory Utilized: 67.40 GB
Memory Efficiency: 22.47% of 300.00 GB
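The two efficiency figures in the seff output above can be reproduced by hand. This is a minimal sketch, assuming that CPU efficiency is CPU time divided by core-walltime (cores multiplied by elapsed time) and that memory efficiency is memory used divided by memory requested. The helper names `to_seconds` and `percent` are made up for illustration and are not part of seff.

```shell
# Hypothetical helpers (not part of seff) that redo the arithmetic.

# Convert HH:MM:SS to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print numerator/denominator as a percentage with two decimals.
percent() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f%%\n", 100 * n / d }'
}

cpu_used=$(to_seconds 00:47:34)      # "CPU Utilized"
core_wall=$(to_seconds 00:48:40)     # 1 core x "Job Wall-clock time"

percent "$cpu_used" "$core_wall"     # CPU efficiency
percent 67.40 300.00                 # memory efficiency
```

With the numbers from job 9709 this prints 97.74% and 22.47%, matching the seff report. For a multi-core job the denominator would be the wall-clock seconds multiplied by the core count.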
Reportseff is a Python script that makes it easier to display job efficiency in other cases, such as "all of a user's jobs for a time period" or "all of the elements of an array job". It also uses sacct to retrieve information. It was written so that you can use the same kinds of syntax as you would with sacct, for example when indicating time ranges or asking for changes to the output format. You can also pass other arguments straight to sacct if desired.
Home page for more information: https://github.com/troycomi/reportseff
Example reportseff output
module load reportseff
export LESS="-R"
reportseff --since now-3days --until now -s CD --format +reqcpus,reqmem
JobID State Elapsed TimeEff CPUEff MemEff ReqCPUS ReqMem
12394 COMPLETED 00:44:58 0.2% 92.3% 113.0% 50 1600G
12403 COMPLETED 00:45:12 0.2% 91.2% 115.2% 50 1600G
12413 COMPLETED 00:00:04 0.0% --- 0.0% 70 1800G
12415 COMPLETED 00:00:03 0.0% --- 0.0% 70 1800G
12416 COMPLETED 00:00:03 0.0% --- 0.0% 75 1800G
12418 COMPLETED 00:03:04 0.0% 1.8% 0.9% 50 1500G
12419 COMPLETED 00:03:24 0.0% --- 0.0% 50 1500G
12435 COMPLETED 00:07:03 0.0% 49.1% 46.5% 20 500G
12455 COMPLETED 06:22:15 26.5% 9.6% 38.1% 1 50G
12464 COMPLETED 00:01:41 0.0% 16.6% 3.2% 15 350G
12468 COMPLETED 00:02:56 0.0% 48.1% 54.0% 10 250G
12469 COMPLETED 08:39:19 1.8% 96.3% 97.7% 10 400G
12503 COMPLETED 06:10:40 1.3% 91.7% 152.1% 10 380G
12531 COMPLETED 03:24:26 0.7% 86.8% 159.9% 10 350G
12539 COMPLETED 00:37:51 0.1% 5.1% 46.8% 20 100G
Second example reportseff output
module load reportseff
export LESS="-R"
reportseff --since noon --until now -s CD --format +user
JobID State Elapsed TimeEff CPUEff MemEff User
5011394 COMPLETED 19:16:18 40.1% 99.5% 130.4% mnagle
5013943 COMPLETED 14:27:08 30.1% 99.9% 127.9% mnagle
5016834 COMPLETED 07:48:51 16.3% 98.4% 124.0% mnagle
5024400 COMPLETED 05:54:33 24.6% 8.0% 60.3% enorton
5025014 COMPLETED 04:28:52 18.7% 9.2% 3.4% jzhang5
5025583 COMPLETED 03:57:29 16.5% 89.4% 26.1% sli1
5025584 COMPLETED 04:11:32 17.5% 89.1% 26.9% sli1
5025586 COMPLETED 03:55:38 16.4% 82.8% 26.4% sli1
There are two ways to use reportseff:
As a module:
module load reportseff
export LESS="-R" # to provide colorized output
As a bash function:
Define a function on the command line or in your .bashrc
rs() { export LESS="-R";export PYTHONPATH=/jhpce/shared/jhpce/core/reportseff/2.3.2/;export PATH=/jhpce/shared/jhpce/core/reportseff/2.3.2/bin:$PATH; reportseff "$@"; }
Examples
module load reportseff
export LESS="-R"
reportseff --since noon --until now -s CD -u $USER
reportseff --since noon --until now -s CD -u $USER --format "user,jobid,state,nodelist,start,end,cpueff,memeff"
reportseff --since noon --until now -s CD -u $USER --format "+user,nodelist,start,end,cpueff,memeff"
You can define environment variables, as you do with sacct, to change your formatting. See here
Caution
You may not be able to see other users' job information unless you add --extra-args -a
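A sketch of the caution above, assuming that `--extra-args` hands its value to sacct unchanged and that `-a` is sacct's `--allusers` flag. The function name `rs_all` is hypothetical, and the wrapper assumes reportseff is already on your PATH (for example after `module load reportseff`).

```shell
# Hypothetical wrapper: completed jobs for all users since a given time.
rs_all() {
    local since="${1:-noon}"     # any time string sacct accepts
    reportseff --since "$since" --until now -s CD \
        --extra-args "-a" --format +user
}

# Usage:
#   rs_all              # since noon today
#   rs_all now-3days    # last three days
```

Whether you actually see other users' jobs still depends on how accounting privacy is configured on the cluster.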
Reportseff arguments (complete list)
login31:~% reportseff --help
Usage: reportseff [OPTIONS] [JOBS]...
Main entry point for reportseff.
Options:
--modified-sort If set, will sort outputs by modified time
of files
--color / --no-color Force color output. No color will use click
defaults
--format TEXT Comma-separated list of columns to include.
Options are any valid sacct input along with
CPUEff, MemEff, Energy, and TimeEff. In
systems with jobstat caching, GPU usage can
be added with GPUEff, GPUMem or GPU (for
both). A width and alignment may optionally
be provided after "%", e.g. JobID%>15 aligns
job id right with max width of 15
characters. Generally
NAME[[%:][ALIGNMENT][WIDTH[e$]?]]. When an
`e` or `$` is present after a width
argument, the output will be truncated to
the right.Prefix with a + to add to the
defaults. A single format token will
suppress the header line. Wrap in quotes to
pass a string literal, otherwise alignment
may be misinterpreted.
--slurm-format TEXT Filename pattern passed to sbatch. By
default, will handle patterns like
slurm_%j.out, %x_%j, or slurm_%A_%a. In
particular, the jobid is expected to start
with '_'. Setting this to the same entry as
used in sbatch will allow parsing slurm
outputs like `1234.out`. Array jobs must
have %A_%a to properly interface with sacct.
--debug Print raw db query to stderr
-u, --user TEXT Ignore jobs, return all jobs in last week
from user
--partition TEXT Only include jobs with the specified
partition
--extra-args TEXT Extra arguments to forward to sacct
-s, --state TEXT Only include jobs with the specified states
-S, --not-state TEXT Include jobs without the specified states
--since TEXT Only include jobs after this time. Can be
valid sacct or as a comma separated list of
time deltas, e.g. d=2,h=1 means 2 days, 1
hour before current time. Weeks, days,
hours, and minutes can use case-insensitive
abbreviations. Minutes is the minimum
resolution, while weeks is the coarsest.
--until TEXT Only include jobs before this time. Can be
valid sacct or as a comma separated list of
time deltas, e.g. d=2,h=1 means 2 days, 1
hour before current time. Weeks, days,
hours, and minutes can use case-insensitive
abbreviations. Minutes is the minimum
resolution, while weeks is the coarsest.
-n, --node / -N, --no-node Report node-level statistics. Adds `jobid`
to format for proper display.
-g, --node-and-gpu / -G, --no-node-gpu
Report each GPU for each node. Sets `node`
and adds `GPU` to format automatically.
-p, --parsable Output will be '|' delmited without a '|' at
the end.
--version Show the version and exit.
--help Show this message and exit.
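The `-p` option listed above makes the output easier to post-process. The following is a minimal sketch of a filter that lists jobs with low memory efficiency. It assumes the first line of the parsable output is a header naming the columns (including `JobID` and `MemEff`) and that values may or may not carry a trailing percent sign; neither detail has been checked against a real run. The function name `low_memeff` is hypothetical.

```shell
# Hypothetical filter: read '|'-delimited rows on stdin and print the JobID
# of every job whose MemEff is below a threshold (default 30).
low_memeff() {
    awk -F'|' -v limit="${1:-30}" '
        NR == 1 {                        # locate columns by header name
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            eff = $(col["MemEff"])
            sub(/%$/, "", eff)           # tolerate a trailing percent sign
            # skip rows with no measurement, shown as "---"
            if (eff != "---" && eff + 0 < limit) print $(col["JobID"])
        }'
}

# Intended use (assumes reportseff is on PATH):
#   reportseff --since d=2,h=1 -s CD -p | low_memeff 30
```

Looking columns up by header name rather than by position keeps the filter working when `--format` adds or reorders fields.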
sacct is a command used to display information about jobs. It has a number of subtleties, such as the time window reported on and the formatting of output. We hope that this page will help you get the information you need.
sacct can be used to investigate jobs' resource usage, nodes used, and exit codes. It can point to important information, such as jobs dying on a particular node but working on other nodes [1].
sacct will show all submitted jobs but cannot, of course, provide data for a number of fields until the job has finished. Use the sstat command to get information about running programs. "Instrumenting" your jobs to gather information about them can include adding one or more sstat commands to batch jobs in multiple places.
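A minimal sketch of such instrumentation, assuming the job script is a bash script and that the work runs in the batch step (addressed as `<jobid>.batch`). The function name `log_usage` and the choice of fields are illustrative only.

```shell
# Hypothetical helper to call at several points in a batch script.
# Prints a labelled sstat snapshot for the batch step of the current job.
log_usage() {
    local label="$1"
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "log_usage: not inside a SLURM job, skipping ($label)" >&2
        return 0
    fi
    echo "== usage at: $label =="
    sstat -j "${SLURM_JOB_ID}.batch" --format=JobID,MaxRSS,MaxVMSize,AveCPU
}

# In a job script:
#   log_usage "after loading data"
#   ... run the analysis ...
#   log_usage "after analysis"
```

The snapshots end up in the job's output file, so you can see roughly where memory or CPU use climbs.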
Examples below use angle brackets < > to indicate where you are supposed to replace arguments with your values.
ARGUMENTS¶
One of the appeals of this tool is that it uses sacct but lets you pass extra sacct arguments. It also uses the same time syntax that sacct does, so I was able to run reportseff -u someuser --since now-4days
I suppose that they chose their argument flags to avoid overlapping with those used by sacct, so you see things like --since and --until instead of sacct's -S and -E. But most sacct arguments will work.
Note that you can add sacct fields to reportseff's default ones with a simple --format +fieldname,otherfield
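The pieces above can be combined in a small wrapper. This is a sketch only: the function name `rs_user_days` and its defaults are hypothetical, and it assumes reportseff is on your PATH.

```shell
# Hypothetical wrapper: one user's jobs over the last N days, with extra
# sacct fields appended to reportseff's default columns.
rs_user_days() {
    local user="$1"
    local days="${2:-4}"
    local extra="${3:-reqcpus,reqmem}"
    reportseff -u "$user" --since "now-${days}days" --format "+${extra}"
}

# Usage:
#   rs_user_days someuser 4 nodelist,start,end
#
# Roughly the same time window with plain sacct would be:
#   sacct -u someuser -S now-4days -o jobid,state,elapsed,nodelist,start,end
```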
GPU support?¶
Another feature of interest is supposedly the ability to calculate memory use efficiency for GPU jobs. If we discovered that many users were typically not using all of the memory on GPUs, it would open the door to partitioning GPUs into slices and therefore increasing the number of available GPUs for zero dollars. However it is the case that we would have to install other software (parts of the Princeton jobstats suite) to enable the acquisition of GPU usage data. We probably won't have time to do that.
COLORIZED OUTPUT¶
It seems to use less as its pagination tool by default when output is longer than your terminal's size. Normally it produces colorized output, which is very nice, as Borat would say. But the colors go away and you see the ugly escape characters unless you have the LESS environment variable defined to include -R. So you may want to add a command like export LESS="-i -M -R" to your ~/.bashrc file. (The -i means "do case-insensitive searches, unless a capital is used", -M means "make the prompt show lines and current byte", and -R means "show color text where ESC sequences are present".)
Adding the flag --color did not help. There is a --no-color flag, which might be handy for generating reports lacking escape characters when redirecting the output to a file.
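One way to get both behaviours without thinking about it is a wrapper that checks whether standard output is a terminal. This is a sketch under the assumptions described above (less needs -R in LESS to render colors, and --no-color suppresses escape sequences); the function name `rs_report` is hypothetical.

```shell
# Hypothetical wrapper: colors in the pager on a terminal,
# plain text when the output is redirected or piped.
rs_report() {
    if [ -t 1 ]; then
        # Prepend -R to whatever LESS already holds, for this call only.
        LESS="-R${LESS:+ $LESS}" reportseff "$@"
    else
        reportseff --no-color "$@"
    fi
}

# Usage:
#   rs_report --since noon -u "$USER"               # colorized, paged
#   rs_report --since noon -u "$USER" > report.txt  # no escape characters
```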
MEMORY USE ABOVE 100%¶
Like other tools I've used which provide memory efficiency stats, this one shows jobs where efficiency is above 100%, which I perhaps mistakenly interpret as indicating that the user has exceeded their memory limits. (I think I've also seen this in pure sacct output, where no one is trying to combine data and calculate anything.) Do these cases represent instances of the user's processes grabbing more memory than "allowed" for brief-enough periods that cgroup limits are not resulting in killed processes? Or is there some other explanation that doesn't indicate that our understanding of memory stats or cgroups is flawed?
Ultimately I would like to be able to tell users / write in the docs an explanation for why you see values above 100% even if only to aid them in understanding how to utilize the info to make their cluster use better. (At least I've never seen any negative numbers!)
One explanation I've seen for memory numbers that are "off" is that someone is using MaxVMSize instead of MaxRSS, etc. I happen to see in output_renderer.py a line "MemEff": ["REQMEM", "NNodes", "AllocCPUS", "MaxRSS", "NTasks"] which might imply which fields are being used.
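To make the above-100% figures concrete, the peak memory they imply can be back-calculated. This is a sketch that assumes MemEff is simply peak usage (MaxRSS) divided by the requested memory, which the field list quoted above suggests but does not confirm; the real calculation may also scale by nodes, CPUs, or tasks. The helper name `peak_from_memeff` is hypothetical.

```shell
# Hypothetical helper: estimate peak memory (same unit as the request)
# from a MemEff percentage and the requested amount.
# Assumes MemEff = peak usage / requested memory * 100.
peak_from_memeff() {
    awk -v eff="$1" -v req="$2" 'BEGIN { printf "%.1f\n", eff / 100 * req }'
}

peak_from_memeff 113.0 1600    # job 12394 in the first example (ReqMem 1600G)
peak_from_memeff 152.1 380     # job 12503 in the first example (ReqMem 380G)
```

Under that assumption, job 12394 would have peaked at about 1808G against a 1600G request, and job 12503 at about 578G against 380G.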
[1] In which case you can add the directive --exclude=compute-xxx to your job submission, then notify us via bitsupport so we can fix that node.