Array jobs allow multiple instances of a program to be run via a single qsub command. This is often more convenient than issuing numerous repetitive qsub commands for the same program. The different instances of the job are known as “tasks”. Each task is identified by a numeric task ID, and the range of IDs is specified with the "-t START-END" option to qsub. Within the qsub script, the ID of the current task is available in the $SGE_TASK_ID environment variable.
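Because a script written for an array job depends entirely on $SGE_TASK_ID, it can be worth checking the variable before using it. The sketch below is one way to do that; the function name require_task_id is hypothetical, and the check for the literal value "undefined" assumes the behavior described in the Grid Engine qsub man page for jobs submitted without "-t".

```shell
# Hypothetical helper (not part of SGE): succeed only when SGE_TASK_ID
# holds a plain positive-looking integer.
require_task_id() {
    case "${SGE_TASK_ID:-}" in
        # Empty, the literal "undefined" (assumed for non-array jobs),
        # or anything containing a non-digit is rejected.
        ''|undefined|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Illustration with hand-set values; in a real job SGE sets the variable.
for SGE_TASK_ID in 1 undefined ""; do
    if require_task_id; then
        echo "task ID '$SGE_TASK_ID' accepted"
    else
        echo "task ID '$SGE_TASK_ID' rejected"
    fi
done
```

In a job script, a call such as "require_task_id || exit 1" near the top would stop a task early instead of letting it run against a nonexistent input file.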
As an example, suppose you have 3 data files you want to run your program against:
$ ls data*
data1  data2  data3
In this simple example, the SGE script just runs "cat" on each file.
$ more script1.sh
#$ -cwd
FILENAME="data$SGE_TASK_ID"
cat $FILENAME
exit 0
When the job is submitted, the "-t" option specifies the range of tasks to run. In our example, the command to submit 3 tasks, numbered 1, 2, and 3, is "qsub -t 1-3 script1.sh". Within the script, the $SGE_TASK_ID variable is set to 1, 2, and 3 respectively for the 3 instances of the script that get run.
$ qsub -t 1-3 script1.sh
Your job-array 5204694.1-3:1 ("script1.sh") has been submitted
$ qstat
job-ID  prior   name       user         state submit/start at     queue                 slots ja-task-ID
----------------------------------------------------------------------------------------------
5204694 0.00000 script1.sh mmill116     qw    06/27/2018 18:12:56                           1 1-3:1
$ qstat
job-ID  prior   name       user         state submit/start at     queue                 slots ja-task-ID
----------------------------------------------------------------------------------------------
5204694 0.59661 script1.sh mmill116     r     06/27/2018 18:12:59 shared.q@compute-087      1 1
5204694 0.54831 script1.sh mmill116     r     06/27/2018 18:12:59 shared.q@compute-086      1 2
5204694 0.53220 script1.sh mmill116     r     06/27/2018 18:12:59 shared.q@compute-054      1 3
$ qstat
$ ls
data1  data3       script1.sh.e5204694.1  script1.sh.e5204694.3  script1.sh.o5204694.2
data2  script1.sh  script1.sh.e5204694.2  script1.sh.o5204694.1  script1.sh.o5204694.3
Running this qsub produces 3 output files (and 3 matching error files), one per task, each with the task ID appended to its name, for example script1.sh.o5204694.1.
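Since $SGE_TASK_ID is an ordinary environment variable, the script logic can be tried without the scheduler by setting the variable by hand. The following is a minimal sketch of such a dry run, not an SGE feature: the scratch directory, the sample file contents, and the dryrun.o.N output names are assumptions made for illustration.

```shell
# Local dry run of the array-job logic; no scheduler is involved.
# Assumption: a scratch directory stands in for the job's working directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate three sample inputs and the script from the example above.
for i in 1 2 3; do
    echo "contents of data$i" > "data$i"
done

cat > script1.sh <<'EOF'
#$ -cwd
FILENAME="data$SGE_TASK_ID"
cat $FILENAME
exit 0
EOF

# Emulate "qsub -t 1-3" by setting SGE_TASK_ID for each task in turn.
for task in 1 2 3; do
    SGE_TASK_ID=$task sh script1.sh > "dryrun.o.$task"
done

# Each dry-run output should hold the contents of the matching input.
cat dryrun.o.2
```

Unlike a real submission, the tasks here run one after another on the local machine, so this only checks that each task ID selects the intended file.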
Now consider a more complicated scenario where the file names are not neatly numbered. One way to handle this is to create a file that contains a list of the file names, and then use the $SGE_TASK_ID number as a line number in that list to look up the file name for each task. For this example, let’s say we have 3 files:
$ ls
first  second  third
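We could create a file list using the “ls” command. Because the shell creates file-list before ls runs, the list file itself is filtered out so that it does not end up as the first entry:
$ ls | grep -v '^file-list$' > file-list
$ cat file-list
first
second
third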
We can now create an SGE script that uses the awk command to pull out the line of file-list whose line number matches the value of $SGE_TASK_ID (there are of course numerous other Unix tools that could be used instead of awk).
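$ cat script2.sh
#$ -cwd
# Pass the task ID to awk with -v; the single quotes keep the shell from expanding $1.
FILENAME=$(awk -v n="$SGE_TASK_ID" 'NR == n {print $1}' file-list)
cat "$FILENAME"
exit 0
$ qsub -t 1-3 script2.sh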
Submitting this array job runs 3 instances of the script2.sh script. Each instance reads its file name from the line of file-list whose line number matches the value of $SGE_TASK_ID. As in our previous example, the 3 tasks create 3 output files, and each output file contains the contents of the respective input file.
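As one of the alternatives to awk mentioned above, sed can print a single line by number. The sketch below also counts the entries in file-list so that the "-t" range matches the list length. It runs locally with $SGE_TASK_ID set by hand; the scratch directory and the NTASKS variable name are assumptions for illustration, and the qsub command is only echoed, not executed.

```shell
# Sketch only: sed-based line lookup plus a task range derived from the list.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' first second third > file-list

# Inside a job script SGE sets this; it is set by hand here for illustration.
SGE_TASK_ID=2

# "sed -n" suppresses default output; the "p" command prints only that line.
FILENAME=$(sed -n "${SGE_TASK_ID}p" file-list)
echo "task $SGE_TASK_ID -> $FILENAME"

# Count the entries so the -t range always matches the list length.
# tr strips the padding that some wc implementations add.
NTASKS=$(wc -l < file-list | tr -d ' ')
echo "qsub -t 1-$NTASKS script2.sh"
```

With the three-line list above, this prints "task 2 -> second" followed by "qsub -t 1-3 script2.sh". Deriving the range from the list avoids submitting tasks whose ID has no matching line.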