Basic usage
This section is split into two parts: the first for absolute Linux command line beginners and the second for first-time users working with modules and Slurm on an HPC cluster.
Entry to the Linux command line
Logging into the cluster for the first time can be daunting. How do you do anything?
First things first: when you have connected to Elja, you will be greeted with a welcome message, and at the bottom you will see your prompt, which will look something like this:
[user@elja-irhpc ~]$
Most of the commands featured in this guide have a --help flag or a man page. For example, for mkdir you can call mkdir --help or man mkdir.
Creating your first files
You are now in your home directory. Let's create your first files and directories:
Here we will use the following commands:
- mkdir: stands for "make directory" and creates a directory
- touch: creates an empty file; you can use any file extension you want
- cd: stands for "change directory" and moves us between directories
- ls: lists the files and directories in your current directory
- pwd: stands for "print working directory" and shows which directory we are in
- echo: prints whatever text you give it
- cat: prints the contents of a file
[user@elja-irhpc ~]$ mkdir first_dir
[user@elja-irhpc ~]$ ls
first_dir
[user@elja-irhpc ~]$ cd first_dir
[user@elja-irhpc first_dir]$ pwd
/hpchome/user/first_dir
[user@elja-irhpc first_dir]$ touch first_file.txt
[user@elja-irhpc first_dir]$ echo "My first text" >> first_file.txt
[user@elja-irhpc first_dir]$ cat first_file.txt
My first text
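A quick note on the redirection used above: >> appends the text to the end of the file, while a single > would overwrite the file's contents.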
Editing files
After creating your first files, you might want to edit them, change their name or contents.
Here we will use the following commands:
- mv: moves and/or renames files
- rm: deletes files
- vimtutor: an interactive tutorial that teaches the basics of vim
- vim: A text editor (see cheatsheet)
Now let's say you want to rename your directory, remove your text file, and make a new shell script.
[user@elja-irhpc first_dir]$ cd
[user@elja-irhpc ~]$ ls
first_dir
# mv can be used to move files to different locations or rename files
[user@elja-irhpc ~]$ mv first_dir/ my_scripts/
[user@elja-irhpc ~]$ ls
my_scripts
[user@elja-irhpc ~]$ cd my_scripts
[user@elja-irhpc my_scripts]$ ls
first_file.txt
[user@elja-irhpc my_scripts]$ rm first_file.txt
[user@elja-irhpc my_scripts]$ ls
[user@elja-irhpc my_scripts]$ touch script.sh
[user@elja-irhpc my_scripts]$ ls
script.sh
Now we need to add something to the file. This time we are going to use the vim text editor. To get started, run the vimtutor command to learn the basics of working with text files in vim.
[user@elja-irhpc my_scripts]$ vimtutor
Alright, now that you know how to use vim, we can open our script.
[user@elja-irhpc my_scripts]$ vim script.sh
Add the following lines:
#!/bin/bash
# Here we specify the interpreter. For this example it is bash, but it could be changed if you were writing Python, for example
for n in {1..4}
do
echo "Hi $n times"
done
Now save the file using the :w vim command and exit vim using the :q command.
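You can also combine the two and use :wq to write the file and quit vim in a single command.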
Changing file permissions and running the script
To make the script an executable file, we will have to change its permissions.
Here we will use the following commands:
- chmod: used to change file permissions
This time we will make the file executable for the owner of the file only, but you can allow anybody to have read, write, and execute permissions on your file. (see cheatsheet)
[user@elja-irhpc my_scripts]$ ls -la # This will show us more detailed information about our file including permissions
.
..
-rw-r--r-- 1 user user 216 Jul 23 13:00 script.sh
As we can see, the user has read and write permissions on this file while group and others have read permissions.
We want to change this so the user is the only one with read, write, and execute permissions.
[user@elja-irhpc my_scripts]$ chmod 700 script.sh
[user@elja-irhpc my_scripts]$ ls -la
-rwx------ 1 user user 216 Jul 23 13:00 script.sh
Now we see that in the user section we have rwx (read, write, and execute) and we can easily run the script.
[user@elja-irhpc my_scripts]$ ./script.sh
Hi 1 times
Hi 2 times
Hi 3 times
Hi 4 times
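As a side note, chmod also accepts symbolic modes instead of octal numbers. A command like the following should have the same effect as chmod 700 above, setting read, write, and execute for the user and removing all permissions for group and others:
[user@elja-irhpc my_scripts]$ chmod u=rwx,go= script.sh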
[user@elja-irhpc my_scripts]$ cd # Return to your home directory
LMOD and Slurm
Our cluster uses LMOD to serve software and Slurm to control, run, and manage jobs and tasks on the cluster.
LMOD
Lmod is a Lua-based module system that easily handles the MODULEPATH Hierarchical problem. Environment Modules provide a convenient way to dynamically change the users' environment through modulefiles.
Getting started
Currently we are working with multiple different module trees and are in the process of phasing out the older ones. This guide will display and use modules from the older tree. For further information on the new module tree, please take a look at the libraries chapter.
You are now connected to the cluster and want to start working with some specific software. To display the available modules, we can use the avail command:
[user@elja-irhpc ~]$ ml avail
This will display many lines of available modules, which can feel a bit daunting, but don't worry: we will go over more streamlined ways to find what you need later on.
Alongside all these visible modules, there are countless hidden packages which can be displayed by adding the --show-hidden flag:
[user@elja-irhpc ~]$ ml --show-hidden avail
Now you will see even more lines of available modules that can be even more daunting!
Now let's say you want to use Golang (Go/Golang is an open source programming language). We can use the same command as before, adding the name of the module/software/library we want to use to the end of the command.
[user@elja-irhpc ~]$ ml avail Go
---------------------------------------------------------------------------------------- /hpcapps/libsci-gcc/modules/all ----------------------------------------------------------------------------------------
Go/1.20.2 HDF5/1.12.2-gompi-2022a (D) ScaLAPACK/2.2.0-gompi-2023a-fb gompi/2021a gompi/2022b netCDF/4.6.2-gompi-2019a
HDF5/1.10.5-gompi-2019a HH-suite/3.3.0-gompi-2022a gompi/2019a gompi/2021b gompi/2023a (D) netCDF/4.8.0-gompi-2021a
HDF5/1.10.7-gompi-2021a PyGObject/3.42.1-GCCcore-11.3.0 gompi/2020a gompi/2022a netCDF-Fortran/4.6.0-gompi-2022a netCDF/4.9.0-gompi-2022a (D)
---------------------------------------------------------------------------------------- /hpcapps/lib-mimir/modules/all -----------------------------------------------------------------------------------------
HISAT2/2.2.1-gompi-2021b
-------------------------------------------------------------------------------------- /hpcapps/lib-edda/modules/all/Core ---------------------------------------------------------------------------------------
Go/1.17.6 gompi/2019b gompi/2021b gompi/2022a gompi/2022b gompi/2023a gompi/2023b
As we can see, the number of lines has decreased drastically, and we can easily spot the module Go/1.20.2 at the top of the list.
To load the module into your environment, we use the load command:
[user@elja-irhpc ~]$ ml load Go/1.20.2
To make sure the module is loaded, we can run a simple Go command to test it:
[user@elja-irhpc ~]$ go version
go version go1.20.2 linux/amd64
You can list all of the modules you have in your environment with the list command:
[user@elja-irhpc ~]$ ml list
Currently Loaded Modules:
1) Go/1.20.2
If you wish to remove the module from your environment, you can use the unload command:
[user@elja-irhpc ~]$ ml unload Go
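Two other module commands often come in handy (a quick sketch; run ml --help on the cluster for the full list):
[user@elja-irhpc ~]$ ml purge        # unload all currently loaded modules
[user@elja-irhpc ~]$ ml spider Go    # search the whole module tree for modules matching "Go"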
For more information on the module (ml) command, you can take a look at this documentation provided by Lmod.
Slurm
Slurm is a system for managing and scheduling Linux clusters.
Getting started
We can use Slurm to see available partitions and monitor the state and other useful information of our jobs.
For further information on Slurm and the different Slurm commands, please take a look at their documentation
Partitions
More information about the available partitions on the cluster can be found here
The sinfo command will show us all partitions available to use:
[user@elja-irhpc ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
any_cpu up 2-00:00:00 2 drng@ compute-[19,51]
any_cpu up 2-00:00:00 46 mix compute-[5-10,29-37,41-47,49-50,52,55-59,61-63,73-75,77-86]
any_cpu up 2-00:00:00 8 alloc compute-[39-40,87-92]
any_cpu up 2-00:00:00 23 idle compute-[11-18,20-28,38,48,53-54,60,76]
48cpu_192mem up 7-00:00:00 1 drng@ compute-19
48cpu_192mem up 7-00:00:00 6 mix compute-[5-10]
48cpu_192mem up 7-00:00:00 17 idle compute-[11-18,20-28]
64cpu_256mem up 7-00:00:00 1 drng@ compute-51
64cpu_256mem up 7-00:00:00 40 mix compute-[29-37,41-47,49-50,52,55-59,61-63,73-75,77-86]
64cpu_256mem up 7-00:00:00 8 alloc compute-[39-40,87-92]
64cpu_256mem up 7-00:00:00 6 idle compute-[38,48,53-54,60,76]
Here we can see some of the available partitions and the states of their nodes. Note that each partition appears on several lines: the output is split by node state, not by partition name.
States
In the example above, we can see four different states that nodes within a partition can be in:
- drng@ -> Draining; these nodes will finish their running jobs but accept no new ones, and are waiting to reboot after the maintenance period
- mix -> These nodes are partially allocated: some of their CPUs are in use and some are free
- alloc -> These nodes are fully allocated; all of their CPUs are in use
- idle -> These nodes are idle and free to use
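If the full sinfo listing is too long, you can narrow it down, for example (assuming standard sinfo flags) to a single partition or to nodes in a particular state:
[user@elja-irhpc ~]$ sinfo --partition=any_cpu
[user@elja-irhpc ~]$ sinfo --states=idle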
The queue
We can use the squeue command to look at our jobs, and other users' jobs, in the queue.
[user@elja-irhpc ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1181224 128cpu_25 VG_2 user R 5-23:12:30 1 amd-compute-2
1182103 128cpu_25 VG_1 user R 5-04:19:08 1 amd-compute-3
1183036 128cpu_25 submit.s ppr R 1:34:51 1 amd-compute-1
1182505 128cpu_25 submit.s ppr R 2-01:16:33 1 amd-compute-4
1177152 48cpu_192 30P800V1 xmx R 20-01:10:04 1 compute-19
1177168 48cpu_192 30P800V1 xmx R 20-01:03:21 1 compute-19
1177167 48cpu_192 30P700V1 xmx R 20-01:03:24 1 compute-19
1181303 48cpu_192 30P300V1 xmx R 7-23:39:33 1 compute-8
1182920 48cpu_192 3CO_BBB_ user PD 0:00:00 1 (Resource)
As we can see, we (user) have three jobs in the queue: two are in the running (R) state while one is pending (PD).
The squeue command has many options to view information, but one of the most useful ones is the --user flag:
[user@elja-irhpc ~]$ squeue --user $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1181224 128cpu_25 VG_2 user R 5-23:12:30 1 amd-compute-2
1182103 128cpu_25 VG_1 user R 5-04:19:08 1 amd-compute-3
1182920 48cpu_192 3CO_BBB_ user PD 4:57:39 1 compute-9
The --user flag filters the results to only show the selected user.
$USER is an environment variable that holds the username of the current user.
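On recent Slurm versions there is also a --me shorthand, which should be equivalent to --user $USER:
[user@elja-irhpc ~]$ squeue --me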
Detailed information
The scontrol command can be used to gather detailed information about nodes, partitions, and jobs. Below we will see some useful commands:
[user@elja-irhpc ~]$ scontrol show partition 48cpu_192mem
PartitionName=48cpu_192mem
AllowGroups=HPC-Elja,HPC-Elja-ltd AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=8 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=compute-[5-28]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=2304 TotalNodes=24 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=2304,mem=4512000M,node=24,billing=2304
[user@elja-irhpc ~]$ scontrol show node compute-5
NodeName=compute-5 Arch=x86_64 CoresPerSocket=24
CPUAlloc=2 CPUEfctv=96 CPUTot=96 CPULoad=1.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=compute-5 NodeHostName=compute-5 Version=23.11.4
OS=Linux 4.18.0-553.8.1.el8_10.x86_64 #1 SMP Tue Jul 2 17:10:26 UTC 2024
RealMemory=188000 AllocMem=7800 FreeMem=161095 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=any_cpu,48cpu_192mem,long,MatlabWorkshop,kerfis
BootTime=2024-07-29T07:06:28 SlurmdStartTime=2024-07-29T07:07:23
LastBusyTime=2024-08-14T09:37:54 ResumeAfterTime=None
CfgTRES=cpu=96,mem=188000M,billing=96
AllocTRES=cpu=2,mem=7800M
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
[user@elja-irhpc ~]$ scontrol show job 1181224
JobId=1181224 JobName=VG_2
UserId=user(11111) GroupId=user(1111) MCS_label=N/A
Priority=11930 Nice=0 Account=phys-ui QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=5-23:22:58 TimeLimit=6-22:00:00 TimeMin=N/A
SubmitTime=2024-08-06T04:53:18 EligibleTime=2024-08-06T04:53:18
AccrueTime=2024-08-06T04:53:18
StartTime=2024-08-08T12:24:15 EndTime=2024-08-15T10:24:15 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-08-08T12:24:15 Scheduler=Main
Partition=128cpu_256mem AllocNode:Sid=elja-irhpc:733110
ReqNodeList=(null) ExcNodeList=(null)
NodeList=amd-compute-2
BatchHost=amd-compute-2
NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=252000M,node=1,billing=1
AllocTRES=cpu=256,mem=252000M,node=1,billing=256
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/hpchome/user/submit.slurm
WorkDir=/hpchome/user/job1
StdErr=/hpchome/user/job1/slurm-1181224.out
StdIn=/dev/null
StdOut=/hpchome/user/job1/slurm-1181224.out
Power=
MailUser=user@email.is MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
Using srun and salloc for interactive jobs
srun can allocate resources and launch jobs in a single command. srun has multiple flags which you can find here, but for our purposes, we will only use --partition and --pty.
--partition will tell Slurm which partition to use for the job, and --pty will execute a command using a pseudo terminal. The most common usage on our cluster for this is to create an interactive job where you can work in the terminal of a compute node.
An example of this usage would look like this:
[user@elja-irhpc ~]$ srun --partition any_cpu --pty bash
[user@compute-15 ~]$ hostname
compute-15
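srun accepts many more resource flags. A sketch of a slightly larger interactive request (the exact values are just an illustration) might look like this:
[user@elja-irhpc ~]$ srun --partition any_cpu --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash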
salloc will allocate resources for us which we can then connect to using ssh. Below is what that would look like:
[user@elja-irhpc ~]$ salloc --partition any_cpu
salloc: Granted job allocation 1183054
salloc: Nodes compute-15 are ready for job
[user@elja-irhpc ~]$ ssh compute-15
Last login: Tue Aug 1 10:51:47 2023 from 12.16.71.2
[user@compute-15 ~]$
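When you are done, exit the compute node and then exit the salloc shell (or scancel the job) to release the allocation.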
For further information about interactive sessions, please check out our chapter on interactive sessions
Running jobs with sbatch
sbatch is used to submit a job script. Such a script commonly contains multiple #SBATCH flags, which help us "tune" the job to our use case.
For further information on sbatch, check out our chapter on submitting batch jobs
Let's create our first simple batch script.
[user@elja-irhpc ~]$ touch first_job_script.sh
The header of the batch script should contain the following lines:
[user@elja-irhpc ~]$ cat first_job_script.sh
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example uname@hi.is
#SBATCH --partition=48cpu_192mem # request node from a specific partition
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=48 # 48 cores per node (96 in total)
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-04:00:00 # run for 4 hours maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # Suppress multithread
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
Note that the flags in the header should be changed to fit your job. For our example, we do not need two nodes, and we definitely do not need the full 48 cores per node, so we will change that.
After the header, we can start adding environment variables and run various other commands, such as copying/moving files to other locations or creating temporary directories.
For our example, we will set one variable that we will call NUMTIMES, which will determine how many times our little command will loop.
First off, let's change the flags and add that environment variable:
[user@elja-irhpc ~]$ cat first_job_script.sh
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example uname@hi.is
#SBATCH --partition=any_cpu # request node from a specific partition
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # 1 task (core) per node
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-00:05:00 # run for 5 minutes maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # Suppress multithread
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
NUMTIMES=5
As you can see, we changed the partition since we don't really care which partition we use, reduced the number of nodes and tasks per node, and lowered the time limit.
Now let's add the actual command we want to run and see if we can run it.
[user@elja-irhpc ~]$ cat first_job_script.sh
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example uname@hi.is
#SBATCH --partition=any_cpu # request node from a specific partition
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # 1 task (core) per node
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-00:05:00 # run for 5 minutes maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # Suppress multithread
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
NUMTIMES=5
# We will use 'inline' python to run a small for loop
python -c "for n in range($NUMTIMES):print('Hi',n,'times');"
# After the command has finished slurm will clean up
[user@elja-irhpc ~]$ sbatch first_job_script.sh
Submitted batch job 1183060
Now let's take a look at the output files and see what we got.
[user@elja-irhpc ~]$ cat slurm_job_output.log
# Returns nothing
[user@elja-irhpc ~]$ cat slurm_job_errors.log
/var/spool/slurm/d/job1183060/slurm_script: line 17: python: command not found
We got an error saying that python is not found on the node. This is to be expected since our nodes only run the basic operating system, and we need to have our environment properly set up first.
Let's add Python to our environment and see if it works:
[user@elja-irhpc ~]$ ml load GCCcore/12.3.0 Python/.3.11.3
[user@elja-irhpc ~]$ sbatch first_job_script.sh
Submitted batch job 1183063
[user@elja-irhpc ~]$ cat slurm_job_output.log
Hi 0 times
Hi 1 times
Hi 2 times
Hi 3 times
Hi 4 times
Now it works perfectly! Amazing!
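Once a job has finished, you can also look it up in the accounting records with sacct (assuming job accounting is enabled on the cluster), for example:
[user@elja-irhpc ~]$ sacct -j 1183063 --format=JobID,JobName,State,Elapsed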
Canceling jobs
There might come a time where our job gets stuck in a loop or we need to cancel it for one reason or another, and that's where the scancel command comes in handy.
scancel has some useful flags, but in most use cases we only provide it with a JobId. From inside a running job (for example within the batch script itself), the $SLURM_JOB_ID environment variable holds the ID of the current job and can be passed to scancel.
Here are some examples of scancel:
Cancel a single job:
[user@elja-irhpc ~]$ scancel 124291
Cancel all jobs belonging to a user:
[user@elja-irhpc ~]$ scancel --user=user
Cancel all pending jobs on partition "any_cpu":
[user@elja-irhpc ~]$ scancel --user=user --state=PENDING --partition=any_cpu
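scancel can also select jobs by name; for example, to cancel the job named VG_2 from the queue above (assuming it is yours):
[user@elja-irhpc ~]$ scancel --name=VG_2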