Slurm

Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions: allocating resources to users, providing a framework for starting and monitoring jobs, and arbitrating contention for resources by managing a queue of pending work.

Interactive Sessions

Interactive sessions allow users to work directly on compute nodes with real-time access to resources. These are useful for development, debugging, and exploratory work that requires immediate feedback.

Move to the Node

We can request an interactive job with two compute nodes and 48 tasks for 30 minutes:

[..]$ srun -N 2 -n 48 -t 30 --pty /bin/bash
# This brings you to the node directly

The -N 2 flag specifies two nodes, -n 48 requests 48 tasks (by default, one CPU per task), and -t 30 sets a time limit of 30 minutes. The --pty option allocates a pseudo-terminal, allowing interactive use. Once the job starts, you'll be logged into one of the allocated compute nodes and can run commands interactively.
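Once inside the session, you can confirm where you landed and what was granted. Slurm exports standard environment variables such as SLURM_JOB_ID, SLURM_NNODES, and SLURM_NTASKS into the job environment; a quick sanity check (output will vary by cluster) might look like:

```shell
[..]$ hostname                 # name of the compute node you were placed on
[..]$ echo $SLURM_JOB_ID       # job ID of this interactive session
[..]$ echo $SLURM_NNODES       # number of allocated nodes (2 here)
[..]$ echo $SLURM_NTASKS       # number of allocated tasks (48 here)
```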

Recall that more information is available with srun --help, as well as in the manual pages via man srun.

Request Nodes

With salloc we are not moved to the compute node; this is useful when we want to keep working from the login node while holding an allocation open.

The salloc command requests a resource allocation but doesn't automatically log you into the compute node. Instead, it creates a resource allocation and then allows you to run multiple srun commands within that allocation. This is particularly useful for workflows that involve multiple parallel steps or when you need to maintain the same resource environment across multiple commands.

[..]$ salloc -N 1 -n 16 -t 60
# Allocation granted, but you remain on the login node
[..]$ srun python my_script.py # Runs on the allocated compute resources
[..]$ srun python another_script.py # Same allocation, different command
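Each srun inside the allocation launches a job step on the allocated resources. A step can also use only part of the allocation by passing its own -n; for example (reusing the placeholder script name from above):

```shell
[..]$ srun -n 4 python my_script.py   # a job step using 4 of the 16 allocated tasks
```

This lets a single allocation serve steps of different sizes without requesting new resources each time.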

When done with your allocation, release the resources with the exit command.
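Since salloc as used above places you in a sub-shell on the login node, exiting that shell ends the allocation. A typical session close (the exact relinquish message varies by Slurm version) might look like:

```shell
[..]$ exit             # leave the salloc sub-shell; the allocation is released
salloc: Relinquishing job allocation ...
[..]$ squeue -u $USER  # your job should no longer appear in the queue
```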