Submitting Jobs via Slurm

About Jobs

There are two ways you can submit jobs to a cluster: by using workflows or through any terminal or command-line interface. For the workflows option, please see Running Workflows.

After you’ve started a cluster, log in to the controller with your preferred method. The quickest way to submit a job is to transfer your file(s) to the cluster, then run the command sbatch.
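
If you copy files with scp, for example, the transfer might look like the sketch below. The SSH key path, username, and cluster hostname are placeholders, not values taken from this example:

# Copy demo_test1.sbatch from your local machine to your home directory on the controller.
# Replace the key path, username, and hostname with your own connection details.
scp -i ~/.ssh/my_cluster_key demo_test1.sbatch demo@democluster-60.example.com:~/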

In this example, we submitted the file demo_test1.sbatch with sbatch:

[demo@democluster-60 ~]$ ls
demo_test1.sbatch
[demo@democluster-60 ~]$ sbatch demo_test1.sbatch
Submitted batch job 2

After submitting a job, you can watch its progress with the command watch squeue, which will update every two seconds with the job's status in the ST column:

Every 2.0s: squeue

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4 test.part test demo CF 0:08 2 demo-democluster-00060-1-[0001-0002]


You can also use watch 'sinfo;echo;squeue' if you want to see general cluster information in addition to your job's progress:

Every 2.0s: sinfo; echo; squeue

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
test.partition1* up infinite 2 mix# demo-democluster-00060-1-[0001-0002]
test.partition1* up infinite 3 idle~ demo-democluster-00060-1-[0003-0005]
test.partition2 up infinite 5 idle~ demo-democluster-00060-2-[0001-0005]

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4 test.part test demo CF 0:26 2 demo-democluster-00060-1-[0001-0002]

When using watch squeue or watch 'sinfo;echo;squeue', the ST column will show CF while the node(s) configure. All of the rows beneath JOBID will clear when your job is finished:

Every 2.0s: sinfo; echo; squeue

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
test.partition1* up infinite 2 idle% demo-democluster-00060-1-[0001-0002]
test.partition1* up infinite 3 idle~ demo-democluster-00060-1-[0003-0005]
test.partition2 up infinite 5 idle~ demo-democluster-00060-2-[0001-0005]

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)


Once the job is finished, you can check its output with cat file_name. Our file demo_test1.sbatch included instructions to write the completed job's output to an std.out file and any errors to an std.err file:

[demo@democluster-60 ~]$ ls
demo_test1.sbatch std.err std.out
[demo@democluster-60 ~]$ cat std.err
[demo@democluster-60 ~]$ cat std.out
demo-democluster-00060-1-0001
demo-democluster-00060-1-0001
demo-democluster-00060-1-0002
demo-democluster-00060-1-0002

Using cat std.err didn’t return anything because the job executed without errors.
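
The batch script itself isn't shown above, but a minimal script that produces this kind of output might look like the sketch below. The directives and the hostname command are illustrative assumptions, not the exact contents of demo_test1.sbatch:

#!/bin/bash

#SBATCH --nodes=2               # request two compute nodes
#SBATCH --ntasks-per-node=2     # run two tasks on each node
#SBATCH --output=std.out        # write standard output to std.out
#SBATCH --error=std.err         # write standard error to std.err

srun hostname                   # each task prints the name of the node it ran on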

Common Slurm Commands

This section gives a quick overview of the commands you’ll use most often when interacting with clusters. You can use any of these commands in any terminal after logging in to a controller node.

Because the PW platform uses Slurm to manage jobs, you can use any of Slurm’s commands. For an extensive list of those options, see Slurm’s command guide. You can also prefix any command with man (such as man sacct) to open its manual page, which includes a description and a list of related Slurm commands.

About Job IDs

When we say “job ID” in this section, we mean the job ID that Slurm assigns to your work, which will appear when running many of these commands. ID numbers in the Workflow Monitor and the jobs folder on the PW platform are a separate identifier that tracks how many jobs you’ve run on the platform in total.

Each time you submit a job with one of these commands (such as salloc, sbatch, or srun), Slurm assigns it a new job ID.

About Fault Tolerance

Fault tolerance describes how well an infrastructure stays functional or online during service disruptions such as outages or natural disasters.

On the PW platform, cluster deletions are queue-based for fault tolerance.

The cluster startup process has no retries for fault tolerance, but the logs are visible so users can see any problems that occur.

For compute node startup requests, fault tolerance is implemented with retries via Slurm (by default, there is a new startup attempt approximately every 20 minutes).

Job Management

salloc

salloc allocates resources for your job without executing any tasks.

Using this command reserves resources before you need them by signaling the system to hold a specified number of nodes. For example, salloc -N 2 will reserve two compute nodes, for a total of three nodes, including the controller.

salloc is useful if you’re sharing a cluster with other users in your organization. Once a job is finished, the allocated nodes remain reserved for your use until you disconnect from the cluster, so another user can’t take over your nodes and you won’t have to wait for new nodes to become available or to start up between jobs.
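
A typical interactive session with salloc might look like the sketch below; the node count and the hostname command are only illustrations:

salloc -N 2        # reserve two compute nodes and start a shell inside the allocation
srun hostname      # run a command on the allocated nodes as a job step
exit               # leave the allocation and release the reserved nodes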

sbatch

sbatch submits a job script that will execute later. You can also configure the job’s resources with sbatch by adding these options:

  • --ntasks-per-node to specify the number of CPUs per node
  • -t to specify the maximum amount of time you want these resources to run, in the format hours:minutes:seconds (for example, 3:0:0)

For example, sbatch --ntasks-per-node=5 -t 3:0:0 demo_test1.sbatch would run the file demo_test1.sbatch, requesting 5 CPUs per node and a maximum run time of 3 hours. Note that sbatch options must come before the script name; anything after the script name is passed to the script as an argument.
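
The same options can also be set inside the batch script itself with #SBATCH directives, which keeps the command line short. A minimal sketch, assuming the same resource requests as above:

#!/bin/bash

#SBATCH --ntasks-per-node=5    # request 5 CPUs per node
#SBATCH -t 3:0:0               # limit the job to 3 hours of run time

echo "Hello, World!"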

srun

srun executes a job script. You can use the same options from salloc and sbatch with srun:

  • -N to specify the number of nodes
  • --ntasks-per-node to specify the number of CPUs per node
  • -t to specify the maximum amount of time you want these resources to run

For example, srun -N 1 --pty bash would request 1 compute node and open a pseudoterminal, creating an interactive command-line session.
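
For a non-interactive example, the sketch below runs a simple command across two nodes; the node count, task count, and command are only illustrations:

# Run hostname on two nodes with two tasks per node and a 10-minute time limit.
srun -N 2 --ntasks-per-node=2 -t 0:10:0 hostname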

scancel

scancel paired with a job ID ends a pending or running job or job step. For example:

[demo@democluster-60 ~]$ sbatch demo_test1.sbatch
Submitted batch job 6
[demo@democluster-60 ~]$ scancel
scancel: error: No job identification provided
[demo@democluster-60 ~]$ scancel 6

If you cancel a job, it will disappear from your queue.

Cluster Management

sinfo

sinfo shows information about the nodes and partitions you’re using. By default, sinfo displays partition names, availability, time limit, the number of nodes, state, and the node’s ID number (which is displayed as username-democluster-00019-1-[0001-0005]).

  • Please note that if you enter sinfo without setting up partitions, you’ll receive the error message slurm_load_partitions: Unable to contact slurm controller (connect failure).

squeue

squeue shows a list of running and pending jobs. By default, squeue shows job ID number, partition, username, job status, number of nodes, and node names for all queued and running jobs. You can also use these commands to adjust squeue’s output:

  • --user to see only one user’s jobs, such as --user=yourPWusername
  • --long to show non-abbreviated information and add the field timelimit
  • --start to estimate a job’s start time
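
These flags can be combined with squeue. For example, the commands below assume a username of demo:

# Show only demo's jobs, with non-abbreviated fields.
squeue --user=demo --long

# Estimate the start times of demo's pending jobs.
squeue --user=demo --start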

Notification Management

About Notifications

This section applies only to cloud clusters, not on-premises clusters.

By default, cloud clusters will send job start/finish notifications to the PW platform. You can change that setting or add it as an email notification by following the steps in Managing Notifications.

To enable additional job status notifications, you can also pair the flag --mail-type with the commands salloc, sbatch, or srun. For example, the command sbatch --mail-type=FAIL exampleScript.sbatch will send a notification if your job fails to start or complete.

You can add multiple notification events to the --mail-type flag at once and separate them with commas: sbatch --mail-type=BEGIN,END exampleScript.sbatch

Alternatively, you can add tags inside a Slurm batch file, as seen in this example:

#!/bin/bash

#SBATCH --mail-type=BEGIN,END

echo "Hello, World!"

Which method should I use?

Both methods above work equally well.

The primary difference is that a flag entered on the command line overrides any --mail-type settings inside your batch script, but it doesn’t write anything into the file itself.
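
For example, if exampleScript.sbatch contains the directive #SBATCH --mail-type=BEGIN,END, the command below would send only a failure notification for that run, without modifying the script:

sbatch --mail-type=FAIL exampleScript.sbatch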

Notification Events

The table below lists the events currently supported by the --mail-type flag.

Type             Notification Event
ALL              equivalent to BEGIN,END,FAIL,INVALID_DEPEND,REQUEUE,STAGE_OUT
NONE             does not send notifications; this is the default
BEGIN            job start
END              job end
FAIL             job failure
REQUEUE          job is requeued
INVALID_DEPEND   a job’s dependency cannot be satisfied, so the job will not run
STAGE_OUT        when a job has completed or been cancelled, but has not yet released its resources
TIME_LIMIT_50    when a job reaches 50% of its walltime* limit
TIME_LIMIT_80    when a job reaches 80% of its walltime* limit
TIME_LIMIT_90    when a job reaches 90% of its walltime* limit
TIME_LIMIT       when a job reaches its walltime* limit
ARRAY_TASKS      sends the other notification events for each array task instead of for the array as a whole; without this option, BEGIN, END, and FAIL notify once for the full array rather than once per task

*The walltime limit is the user-set limit for how long a job can run.

Please note that walltime limits are infinite by default. A walltime limit can be added when starting a job.
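
For example, the command below sets a one-hour walltime limit at submission time and asks for a notification when the job reaches 80% of that limit; the script name and limit are only illustrations:

sbatch -t 1:0:0 --mail-type=TIME_LIMIT_80 exampleScript.sbatch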

Troubleshooting

sacct

sacct shows a per-user summary of completed and running jobs. Running this command displays a table with each job’s ID number, name, partition, status, exit code, the account it ran under, and the number of CPUs it used.

For troubleshooting purposes, the State and ExitCode fields from running sacct are especially useful for determining whether a node has failed and, if so, why. If you reach out to us for help, one of our support engineers may ask you for the information you see after running sacct.
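
If you want to focus on just those fields, sacct’s --format option narrows the output; the exact field list below is only a suggestion:

# Show the job ID, name, partition, account, CPU count, state, and exit code for recent jobs.
sacct --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode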

scontrol

scontrol applies commands to specific jobs and nodes. Please note that many scontrol commands can only be executed as user root. You can pair these subcommands with a job ID:

  • suspend to pause a job's processes
  • resume to continue a job's processes
  • hold to make a job a lower priority, putting it “on hold” so higher priority jobs will run first
  • release to remove a job from the hold list
  • show job to get detailed information about a job
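
For example, the sketch below holds, releases, and then inspects a job with ID 6; the job ID is only an illustration:

scontrol hold 6        # lower the job's priority so higher priority jobs run first
scontrol release 6     # remove the hold so the job can be scheduled normally
scontrol show job 6    # print detailed information about the job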