Scheduler and job submission (SLURM)¶
Overview¶
UCT HPC uses the SLURM workload manager to schedule work on the cluster.
Jobs are submitted to a queue and run on compute nodes when resources become available.
Do not run computational work directly on the login node.
Key commands¶
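A quick reference to the standard SLURM client commands (all are stock SLURM tools; job IDs shown are placeholders):

```shell
sbatch job.sh      # submit a batch script to the queue
squeue -u $USER    # list your pending and running jobs
scancel <jobid>    # cancel a queued or running job
sinfo              # show partitions and node availability
sacct -j <jobid>   # accounting details for a completed job
```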
Partitions¶
Compute resources are grouped into partitions.
Common partitions include:
- ada: general CPU workloads
- curie: additional CPU capacity
- l40s, a100: GPU-enabled nodes
Check current partitions and availability:
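For example, using the standard SLURM `sinfo` tool (output columns may differ slightly between SLURM versions):

```shell
sinfo -s       # summary view: one line per partition
sinfo -p ada   # node states for a specific partition
```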
Job script¶
#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --partition=ada
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=logs/job-%j.out
#SBATCH --error=logs/job-%j.err
# start from a clean module environment, then load Python
module purge
module load python/miniconda3-py3.12
# run the workload
python script.py
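Save the script (as `job.sh`, say; the filename is arbitrary) and submit it with `sbatch`:

```shell
sbatch job.sh
# SLURM prints the assigned job ID, e.g.:
# Submitted batch job 123456
```

The `%j` in the `--output` and `--error` paths above is replaced with this job ID, so each run gets its own log files. Note that the `logs/` directory must exist before submission, or SLURM cannot write the log files.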
Resource requests¶
Each job specifies the resources it requires (CPUs, memory, wall time).
- more resources than needed → longer queue times
- fewer resources than needed → job failure or reduced performance
Shorter jobs are often scheduled sooner, since the scheduler can backfill them into gaps left by larger jobs.
Job lifecycle¶
- submit job (sbatch)
- wait in queue
- run on compute node
- complete or fail
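A job's progress through these states can be followed with standard SLURM tools (job ID is a placeholder):

```shell
squeue -u $USER    # ST column: PD = pending, R = running
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS   # after the job finishes
```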
Check output:
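The paths below follow the `--output`/`--error` pattern in the example script, with `%j` expanded to the job ID:

```shell
tail -f logs/job-<jobid>.out   # follow stdout while the job runs
cat logs/job-<jobid>.err       # inspect stderr after a failure
```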
Interactive jobs¶
Use for short or exploratory work:
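One common pattern is to request an interactive shell on a compute node with `srun` (a standard SLURM command; the partition name and resource sizes here are illustrative):

```shell
srun --partition=ada --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash
```

When the allocation starts, the prompt moves to a compute node; exiting the shell releases the resources.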
Login node usage¶
The login node is a shared access point.
Use it for:
- connecting
- editing files
- submitting jobs
Do not use it for:
- running computations
- long-running processes
Processes that impact other users may be terminated.
Good practice¶
- test jobs with small inputs
- request only what you need
- use logs to diagnose issues