
Scheduler and job submission (SLURM)

Overview

UCT HPC uses SLURM to schedule and manage jobs on the system.

Jobs are submitted to a queue and executed on compute nodes.
Work should not be run directly on the login node.


Key commands

sbatch job.sh        # submit a batch job script
squeue -u $USER      # list your queued and running jobs
sacct -j <jobid>     # show accounting details for a job
scancel <jobid>      # cancel a queued or running job
sinfo                # show partitions and node availability

Partitions

Compute resources are grouped into partitions.

Common partitions include:

  • ada — general CPU workloads
  • curie — additional CPU capacity
  • l40s, a100 — GPU-enabled nodes

Check current partitions and availability:

sinfo
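
To focus on a single partition, pass its name to sinfo or squeue (the partition names here are taken from the list above):

sinfo -p a100
squeue -p a100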

Job script

#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --partition=ada
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=logs/job-%j.out   # %j is replaced with the job ID
#SBATCH --error=logs/job-%j.err

module purge
module load python/miniconda3-py3.12

python script.py
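
Slurm does not create the logs/ directory named in --output and --error, so create it before submitting, then submit the script and confirm it is queued:

mkdir -p logs
sbatch job.sh
squeue -u $USER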

Resource requests

Each job specifies the resources it needs (CPUs, memory, and a time limit).

  • requesting more than needed → longer queue times
  • requesting less than needed → job failure (for example, out of memory) or reduced performance

Jobs with shorter time limits are often scheduled sooner.
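
One way to right-size future requests is to compare what a finished job actually used with what was requested, using sacct's accounting fields:

sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqMem,State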


Job lifecycle

  1. submit the job (sbatch)
  2. wait in the queue (state check shown below)
  3. run on a compute node
  4. complete or fail
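
While the job is waiting or running, squeue shows its current state (for example PD for pending and R for running):

squeue -j <jobid>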

Check output:

cat logs/job-<jobid>.out
cat logs/job-<jobid>.err

Interactive jobs

Use interactive jobs for short or exploratory work:

srun --partition=ada --cpus-per-task=2 --mem=8G --time=01:00:00 --pty bash
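
The same pattern works for exploratory GPU sessions by pointing at a GPU partition and adding a GPU request; this is a sketch, and the exact --gres string for this system may differ:

srun --partition=l40s --gres=gpu:1 --cpus-per-task=2 --mem=8G --time=01:00:00 --pty bash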

Login node usage

The login node is a shared access point.

Use it for:

  • connecting
  • editing files
  • submitting jobs

Do not use it for:

  • running computations
  • long-running processes

Processes that impact other users may be terminated.


Good practice

  • test jobs with small inputs (see the example after this list)
  • request only what you need
  • use logs to diagnose issues
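
For the first point, options passed on the sbatch command line override the matching #SBATCH directives, so the same script can be rerun as a quick test with a small time and memory request (the values here are only an example):

sbatch --time=00:10:00 --mem=4G job.sh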