Basic usage
This section is split into two parts: the first is aimed at absolute Linux command line beginners, and the second covers working with modules and Slurm on an HPC cluster for the first time.
Entry to the Linux command line
Logging into the cluster for the first time can be daunting: how do you do anything?
First things first: when you have connected to Elja you will be greeted with a welcome message, and at the bottom your prompt will look something like this:
[user@elja-irhpc ~]$
Most of the commands featured in this guide have a --help flag or a man page; both give a quick reference when you are unsure how a command works. To use them, call for example mkdir --help or man mkdir.
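A quick illustration of both, again using mkdir:
[user@elja-irhpc ~]$ mkdir --help    # prints a short usage summary
[user@elja-irhpc ~]$ man mkdir       # opens the full manual page (press q to quit)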
Creating your first files
You are now in your home directory; let's create your first files and directories.
Here we will use the following commands:
- mkdir: stands for "make directory" and creates a directory
- touch: creates an empty generic file; you can give it whatever file ending you want
- cd: stands for "change directory" and moves us between directories
- ls: lists the files and directories in your current directory
- pwd: stands for "print working directory" and shows which directory we are in
- echo: prints out whatever text you give it
- cat: prints the contents of a file
[user@elja-irhpc ~]$ mkdir first_dir
[user@elja-irhpc ~]$ ls
first_dir
[user@elja-irhpc ~]$ cd first_dir
[user@elja-irhpc first_dir]$ pwd
/hpchome/user/first_dir
[user@elja-irhpc first_dir]$ touch first_file.txt
[user@elja-irhpc first_dir]$ echo "My first text" >> first_file.txt
[user@elja-irhpc first_dir]$ cat first_file.txt
My first text
Editing files
After creating your first files you might want to edit them, rename them or change their contents.
Here we will use the following commands:
- mv: moves and/or renames files
- rm: deletes files
- vimtutor: an interactive tutorial for learning the basics of Vim
- vim: a text editor (see cheatsheet)
Now let's say you want to rename your directory, remove your text file and create a new shell script.
[user@elja-irhpc first_dir]$ cd
[user@elja-irhpc ~]$ ls
first_dir
# mv can be used to move files to different locations or rename files
[user@elja-irhpc ~]$ mv first_dir/ my_scripts/
[user@elja-irhpc ~]$ ls
my_scripts
[user@elja-irhpc ~]$ cd my_scripts
[user@elja-irhpc my_scripts]$ ls
first_file.txt
[user@elja-irhpc my_scripts]$ rm first_file.txt
[user@elja-irhpc my_scripts]$ ls
# Returns nothing, the directory is now empty
[user@elja-irhpc my_scripts]$ touch script.sh
[user@elja-irhpc my_scripts]$ ls
script.sh
Now we have to add something to the file; this time we are going to use the Vim text editor. To get started, use the vimtutor command to learn the basics of working with text files in Vim:
[user@elja-irhpc my_scripts]$ vimtutor
Alright, now that you know how to use Vim, we can open our script:
[user@elja-irhpc my_scripts]$ vim script.sh
Add the following lines:
#!/bin/bash
# Here we specify the interpreter; for this example it is bash, but it could be changed if you were writing Python, for example
for n in {1..4}
do
echo "Hi $n times"
done
Now save the file using the :w Vim command and exit Vim using the :q command (or combine the two with :wq).
Changing file permissions and running the script
To make the script an executable file we will have to change its permissions.
Here we will use the following commands:
- chmod: used to change file permissions
This time we will make the file executable for the owner of the file only, but you could also give read, write and execute permissions to everybody (see cheatsheet).
[user@elja-irhpc my_scripts]$ ls -la # This will show us more detailed information about our file including permissions
-rw-r--r-- 1 user user 216 Jul 23 13:00 script.sh
As we can see, the user (owner) has read and write permissions on this file, while the group and others only have read permission.
We want to change this so that the user is the only one with read, write and execute permissions:
[user@elja-irhpc my_scripts]$ chmod 700 script.sh
[user@elja-irhpc my_scripts]$ ls -la
-rwx------ 1 user user 216 Jul 23 13:00 script.sh
Now we see rwx (read, write and execute) in the user section, while group and others have no permissions at all (that is what the 7 and the two zeros in chmod 700 mean), and we can now run the script:
[user@elja-irhpc my_scripts]$ ./script.sh
Hi 1 times
Hi 2 times
Hi 3 times
Hi 4 times
[user@elja-irhpc my_scripts]$ cd # Return to your home directory
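As a side note, chmod also accepts symbolic modes if you find them easier to remember than the numeric form; the following would have had the same effect as chmod 700 above:
[user@elja-irhpc ~]$ chmod u+x,go-rwx my_scripts/script.sh    # add execute for the owner, remove all permissions for group and others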
LMOD and Slurm
Our cluster uses LMOD to serve software and Slurm to control, run and manage jobs and tasks on the cluster.
LMOD
"Lmod is a Lua-based module system that easily handles the MODULEPATH Hierarchical problem. Environment Modules provide a convenient way to dynamically change the users' environment through modulefiles."
Getting started
Currently we are working with multiple different module trees and are in the process of phasing out the older ones. This guide displays and uses modules from the older tree. For further information on the new module tree please take a look at the libraries chapter.
You are now connected to the cluster and want to start working with some specific software. To display the available modules we can use the avail command:
[user@elja-irhpc ~]$ ml avail
This will display many lines of available modules, which can feel a bit daunting, but don't worry: we will go over more streamlined ways of doing this later on.
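Tip: if the listing scrolls by too quickly you can page through it with less; note that Lmod writes its listing to stderr, so a redirect is needed before piping:
[user@elja-irhpc ~]$ ml avail 2>&1 | less    # page through the list, press q to quit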
Alongside all these visible modules there are countless hidden packages, which can be displayed by adding the --show-hidden flag:
[user@elja-irhpc ~]$ ml --show-hidden avail
Now you will see even more lines of available modules that can be even more daunting!
Now let's say you want to use Golang (Go/Golang is an open source programming language). We can use the same commands as before, adding the name of the module/software/library we want to the end of the command:
[user@elja-irhpc ~]$ ml avail Go
---------------------------------------------------------------------------------------- /hpcapps/libsci-gcc/modules/all ----------------------------------------------------------------------------------------
Go/1.20.2 HDF5/1.12.2-gompi-2022a (D) ScaLAPACK/2.2.0-gompi-2023a-fb gompi/2021a gompi/2022b netCDF/4.6.2-gompi-2019a
HDF5/1.10.5-gompi-2019a HH-suite/3.3.0-gompi-2022a gompi/2019a gompi/2021b gompi/2023a (D) netCDF/4.8.0-gompi-2021a
HDF5/1.10.7-gompi-2021a PyGObject/3.42.1-GCCcore-11.3.0 gompi/2020a gompi/2022a netCDF-Fortran/4.6.0-gompi-2022a netCDF/4.9.0-gompi-2022a (D)
---------------------------------------------------------------------------------------- /hpcapps/lib-mimir/modules/all -----------------------------------------------------------------------------------------
HISAT2/2.2.1-gompi-2021b
-------------------------------------------------------------------------------------- /hpcapps/lib-edda/modules/all/Core ---------------------------------------------------------------------------------------
Go/1.17.6 gompi/2019b gompi/2021b gompi/2022a gompi/2022b gompi/2023a gompi/2023b
As we can see, the number of lines has drastically decreased and we can easily spot the module Go/1.20.2 at the top of the list.
To load the module into your environment we use the load command:
[user@elja-irhpc ~]$ ml load Go/1.20.2
To make sure the module is loaded we can run a simple Go command to test it:
[user@elja-irhpc ~]$ go version
go version go1.20.2 linux/amd64
You can list all of the modules you have loaded in your environment with the list command:
[user@elja-irhpc ~]$ ml list
Currently Loaded Modules:
1) Go/1.20.2
If you wish to remove the module from your environment you can use the unload command:
[user@elja-irhpc ~]$ ml unload Go
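Two other Lmod commands that often come in handy are purge, which unloads every module from your environment at once, and spider, which searches the whole module tree (including modules that a plain avail may not show):
[user@elja-irhpc ~]$ ml purge        # unload all currently loaded modules
[user@elja-irhpc ~]$ ml spider Go    # search the entire module tree for Go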
For more information on the module (ml) command you can take a look at this documentation provided by Lmod.
Slurm
Slurm is an open source cluster management and job scheduling system for Linux clusters.
Getting started
We can use Slurm to see the available partitions and to monitor the state of our jobs, along with other useful information.
For further information on Slurm and the different Slurm commands please take a look at their documentation.
Partitions
More information about the available partitions on the cluster can be found here.
The sinfo command will show us all partitions available to use:
[user@elja-irhpc ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
any_cpu up 2-00:00:00 2 drng@ compute-[19,51]
any_cpu up 2-00:00:00 46 mix compute-[5-10,29-37,41-47,49-50,52,55-59,61-63,73-75,77-86]
any_cpu up 2-00:00:00 8 alloc compute-[39-40,87-92]
any_cpu up 2-00:00:00 23 idle compute-[11-18,20-28,38,48,53-54,60,76]
48cpu_192mem up 7-00:00:00 1 drng@ compute-19
48cpu_192mem up 7-00:00:00 6 mix compute-[5-10]
48cpu_192mem up 7-00:00:00 17 idle compute-[11-18,20-28]
64cpu_256mem up 7-00:00:00 1 drng@ compute-51
64cpu_256mem up 7-00:00:00 40 mix compute-[29-37,41-47,49-50,52,55-59,61-63,73-75,77-86]
64cpu_256mem up 7-00:00:00 8 alloc compute-[39-40,87-92]
64cpu_256mem up 7-00:00:00 6 idle compute-[38,48,53-54,60,76]
Here we can see some of the available partitions and their state. Note that a partition appears on multiple lines, once for each node state, rather than once per name.
States
In the example above we can see the four different states that nodes within a partition can be in:
- drng@ -> Draining; these nodes will finish their running jobs and then reboot (for example after a maintenance period)
- mix -> These nodes are partially allocated: some resources are in use, some are still free
- alloc -> These nodes are fully allocated and in use
- idle -> These nodes are idle and free to use
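If you are only interested in a particular partition or node state, sinfo can filter the output for you, for example:
[user@elja-irhpc ~]$ sinfo --partition=any_cpu    # only show the any_cpu partition
[user@elja-irhpc ~]$ sinfo --states=idle          # only show nodes that are currently idle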
The queue
We can use the squeue command to look at our own jobs, and those of other users, in the queue:
[user@elja-irhpc ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1181224 128cpu_25 VG_2 user R 5-23:12:30 1 amd-compute-2
1182103 128cpu_25 VG_1 user R 5-04:19:08 1 amd-compute-3
1183036 128cpu_25 submit.s ppr R 1:34:51 1 amd-compute-1
1182505 128cpu_25 submit.s ppr R 2-01:16:33 1 amd-compute-4
1177152 48cpu_192 30P800V1 xmx R 20-01:10:04 1 compute-19
1177168 48cpu_192 30P800V1 xmx R 20-01:03:21 1 compute-19
1177167 48cpu_192 30P700V1 xmx R 20-01:03:24 1 compute-19
1181303 48cpu_192 30P300V1 xmx R 7-23:39:33 1 compute-8
1182920 48cpu_192 3CO_BBB_ user PD 0:00:00 1 (Resource)
As we can see, we (user) have three jobs in the queue: two are in the running (R) state while one is pending (PD).
The squeue command has many options for viewing information, but one of the most useful is the --user flag:
[user@elja-irhpc ~]$ squeue --user $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1181224 128cpu_25 VG_2 user R 5-23:12:30 1 amd-compute-2
1182103 128cpu_25 VG_1 user R 5-04:19:08 1 amd-compute-3
1182920 48cpu_192 3CO_BBB_ user PD 4:57:39 1 compute-9
The --user flag filters the results to only show jobs belonging to the selected user.
$USER is an environment variable that holds the username of the current user.
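A couple of other squeue options can also be handy, for example to check only your pending jobs or to see when Slurm expects them to start:
[user@elja-irhpc ~]$ squeue --user $USER --states=PENDING    # only show our pending jobs
[user@elja-irhpc ~]$ squeue --user $USER --start             # show estimated start times for pending jobs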
Detailed information
The scontrol command can be used to gather detailed information about nodes, partitions and jobs. Below are some useful examples:
[user@elja-irhpc ~]$ scontrol show partition 48cpu_192mem
PartitionName=48cpu_192mem
AllowGroups=HPC-Elja,HPC-Elja-ltd AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=8 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=compute-[5-28]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=2304 TotalNodes=24 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=2304,mem=4512000M,node=24,billing=2304
[user@elja-irhpc ~]$ scontrol show node compute-5
NodeName=compute-5 Arch=x86_64 CoresPerSocket=24
CPUAlloc=2 CPUEfctv=96 CPUTot=96 CPULoad=1.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=compute-5 NodeHostName=compute-5 Version=23.11.4
OS=Linux 4.18.0-553.8.1.el8_10.x86_64 #1 SMP Tue Jul 2 17:10:26 UTC 2024
RealMemory=188000 AllocMem=7800 FreeMem=161095 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=any_cpu,48cpu_192mem,long,MatlabWorkshop,kerfis
BootTime=2024-07-29T07:06:28 SlurmdStartTime=2024-07-29T07:07:23
LastBusyTime=2024-08-14T09:37:54 ResumeAfterTime=None
CfgTRES=cpu=96,mem=188000M,billing=96
AllocTRES=cpu=2,mem=7800M
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
[user@elja-irhpc ~]$ scontrol show job 1181224
JobId=1181224 JobName=VG_2
UserId=user(11111) GroupId=user(1111) MCS_label=N/A
Priority=11930 Nice=0 Account=phys-ui QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=5-23:22:58 TimeLimit=6-22:00:00 TimeMin=N/A
SubmitTime=2024-08-06T04:53:18 EligibleTime=2024-08-06T04:53:18
AccrueTime=2024-08-06T04:53:18
StartTime=2024-08-08T12:24:15 EndTime=2024-08-15T10:24:15 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-08-08T12:24:15 Scheduler=Main
Partition=128cpu_256mem AllocNode:Sid=elja-irhpc:733110
ReqNodeList=(null) ExcNodeList=(null)
NodeList=amd-compute-2
BatchHost=amd-compute-2
NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=252000M,node=1,billing=1
AllocTRES=cpu=256,mem=252000M,node=1,billing=256
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/hpchome/user/submit.slurm
WorkDir=/hpchome/user/job1
StdErr=/hpchome/user/job1/slurm-1181224.out
StdIn=/dev/null
StdOut=/hpchome/user/job1/slurm-1181224.out
Power=
MailUser=user@email.is MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
Using srun and salloc for interactive jobs
srun can allocate resources and launch jobs in a single command. srun has multiple flags, which you can find here, but for our purposes we will only use --partition and --pty.
--partition tells Slurm which partition to use for the job, and --pty executes a command in a pseudo terminal. The most common use of this on our cluster is to create an interactive job where you can work in the terminal of a compute node.
An example of this usage would look like this:
[user@elja-irhpc ~]$ srun --partition any_cpu --pty bash
[user@compute-15 ~]$ hostname
compute-15
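You can also pass the usual Slurm resource flags to srun to size the interactive session; a small example (adjust the numbers to your needs):
[user@elja-irhpc ~]$ srun --partition any_cpu --ntasks 1 --cpus-per-task 4 --mem 8G --time 01:00:00 --pty bash    # 4 cores and 8 GB of memory for at most one hour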
salloc will allocate resources for us, which we can then connect to using ssh. Below is what that would look like:
[user@elja-irhpc ~]$ salloc --partition any_cpu
salloc: Granted job allocation 1183054
salloc: Nodes compute-15 are ready for job
[user@elja-irhpc ~]$ ssh compute-15
Last login: Tue Aug 1 10:51:47 2023 from 12.16.71.2
[user@compute-15 ~]$
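When you are done, leave the compute node and then exit the salloc shell (or scancel the job ID) to release the allocation; that would look roughly like this:
[user@compute-15 ~]$ exit    # leave the compute node
[user@elja-irhpc ~]$ exit    # leave the salloc shell, releasing the allocation
salloc: Relinquishing job allocation 1183054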
For further information about interactive sessions please check out our chapter on interactive sessions.
Running jobs with sbatch
sbatch is used to submit a job script, which commonly contains multiple flags that help us "tune" the job to our use case.
For further information on sbatch check out our chapter on submitting batch jobs.
Let's create our first simple batch script:
[user@elja-irhpc ~]$ touch first_job_script.sh
The header of the batch script should contain the following lines.
[user@elja-irhpc ~]$ cat first_job_script.sh
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example uname@hi.is
#SBATCH --partition=48cpu_192mem # request node from a specific partition
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=48 # 48 cores per node (96 in total)
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-04:00:00 # run for 4 hours maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # Suppress multithread
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
Note that the flags in the header should be changed to fit our job. For example, we do not need two nodes and we definitely do not need all 48 cores, so we will change that.
After the header we can start adding environment variables and run various other commands, such as copying or moving files to other locations or creating temporary directories.
For our example we will set one variable, NUMTIMES, which will determine how many times our little command will loop.
First off, let's change the flags and add that variable:
[user@elja-irhpc ~]$ cat first_job_script.sh
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example uname@hi.is
#SBATCH --partition=any_cpu # request node from a specific partition
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # 1 task (core) on the node
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-00:05:00 # run for 5 minutes maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # Suppress multithread
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
NUMTIMES=5
As you can see, we changed the partition (since we don't really care which one we use), reduced the number of nodes and tasks per node, and lowered the time limit.
Now let's add the actual command we want to run and see if we can run it:
[user@elja-irhpc ~]$ cat first_job_script.sh
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example uname@hi.is
#SBATCH --partition=any_cpu # request node from a specific partition
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # 1 task (core) on the node
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-00:05:00 # run for 5 minutes maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # Suppress multithread
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
NUMTIMES=5
# We will use 'inline' python to run a small for loop
python -c "for n in range($NUMTIMES):print('Hi',n,'times');"
# After the command has finished slurm will clean up
[user@elja-irhpc ~]$ sbatch first_job_script.sh
Submitted batch job 1183060
Now let's take a look at the output files and see what we got:
[user@elja-irhpc ~]$ cat slurm_job_output.log
# Returns nothing
[user@elja-irhpc ~]$ cat slurm_job_errors.log
/var/spool/slurm/d/job1183060/slurm_script: line 17: python: command not found
We got an error saying that python was not found on the node. This is to be expected, since the nodes only run the basic operating system and we need to set up our environment properly first.
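If you are not sure which Python module to load, you can search for it the same way as before; note that the Python module we load below has a leading dot in its version, which marks it as hidden, so the --show-hidden flag is needed to list it:
[user@elja-irhpc ~]$ ml --show-hidden avail Python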
Let's add Python to our environment and see if it works:
[user@elja-irhpc ~]$ ml load GCCcore/12.3.0 Python/.3.11.3
[user@elja-irhpc ~]$ sbatch first_job_script.sh
Submitted batch job 1183063
[user@elja-irhpc ~]$ cat slurm_job_output.log
Hi 0 times
Hi 1 times
Hi 2 times
Hi 3 times
Hi 4 times
Now it works perfectly! Amazing!
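Once a job has finished you can still look up its accounting record with the sacct command, for example for the job above:
[user@elja-irhpc ~]$ sacct -j 1183063 --format=JobID,JobName,Partition,Elapsed,State    # summary of the finished job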
Canceling jobs
There might come a time when our job gets stuck in a loop, or we need to cancel it for one reason or another, and that's where the scancel command comes in handy.
scancel has some useful flags, but in most use cases we only provide it with a job ID. From within a job script, the $SLURM_JOB_ID environment variable holds the ID of the current job and can be passed to scancel.
Here are some examples of scancel usage:
Cancel a single job
[user@elja-irhpc ~]$ scancel 124291
Cancel all jobs belonging to user
[user@elja-irhpc ~]$ scancel --user=user
Cancel all of the user's pending jobs on the "any_cpu" partition
[user@elja-irhpc ~]$ scancel --user=user --state=PENDING --partition=any_cpu