Basic usage
This section is split into two parts: the first for absolute Linux command line beginners and the second for first-time users working with modules and Slurm on an HPC cluster.
Entry to the Linux command line
Logging into the cluster for the first time can be daunting. How do you do anything?
First things first: when you have connected to Elja, you will be greeted with a welcome message, and at the bottom you will see your prompt, which will look something like this:
[user@elja-irhpc ~]$
Most of the commands featured in this guide have a --help flag or a man page. For example, for mkdir you can call mkdir --help or man mkdir.
Creating your first files
You are now in your home directory. Let's create your first files and directories:
Here we will use the following commands:
- mkdir: stands for "make directory" and creates a directory
- touch: creates an empty file; you can use any file extension you want
- cd: stands for "change directory" and moves us between directories
- ls: lists the files and directories in your current directory
- pwd: stands for "print working directory" and shows which directory we are in
- echo: prints whatever text you give it
- cat: prints the contents of a file
[user@elja-irhpc ~]$ mkdir first_dir
[user@elja-irhpc ~]$ ls
first_dir
[user@elja-irhpc ~]$ cd first_dir
[user@elja-irhpc first_dir]$ pwd
/hpchome/user/first_dir
[user@elja-irhpc first_dir]$ touch first_file.txt
[user@elja-irhpc first_dir]$ echo "My first text" >> first_file.txt
[user@elja-irhpc first_dir]$ cat first_file.txt
My first text
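A quick note on the redirection used above: >> appends the text to the end of the file, while a single > would overwrite the file's contents.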
Editing files
After creating your first files, you might want to edit them, change their name or contents.
Here we will use the following commands:
- mv: moves and/or renames files
- rm: deletes files
- vimtutor: an interactive tutorial that teaches the basics of vim
- vim: A text editor (see cheatsheet)
Now let's say you want to rename your directory, remove your text file, and make a new shell script.
[user@elja-irhpc first_dir]$ cd
[user@elja-irhpc ~]$ ls
first_dir
# mv can be used to move files to different locations or rename files
[user@elja-irhpc ~]$ mv first_dir/ my_scripts/
[user@elja-irhpc ~]$ ls
my_scripts
[user@elja-irhpc ~]$ cd my_scripts
[user@elja-irhpc my_scripts]$ ls
first_file.txt
[user@elja-irhpc my_scripts]$ rm first_file.txt
[user@elja-irhpc my_scripts]$ ls
[user@elja-irhpc my_scripts]$ touch script.sh
[user@elja-irhpc my_scripts]$ ls
script.sh
Now we need to add something to the file. This time we are going to use the vim text editor. To get started, run the vimtutor command to learn the basics of working with text files in vim.
[user@elja-irhpc my_scripts]$ vimtutor
Alright, now that you know how to use vim, we can open our script.
[user@elja-irhpc my_scripts]$ vim script.sh
Add the following lines:
#!/bin/bash
# Here we specify the interpreter. For this example it is bash, but it could be changed if you were writing Python, for example
for n in {1..4}
do
echo "Hi $n times"
done
Now save the file using the :w vim command and exit vim using the :q command.
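You can also combine the two and use :wq to write the file and quit vim in a single command.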
Changing file permissions and running the script
To make the script an executable file, we will have to change its permissions.
Here we will use the following commands:
- chmod: used to change file permissions
This time we will make the file executable for the owner of the file only, but you can allow anybody to have read, write, and execute permissions on your file. (see cheatsheet)
[user@elja-irhpc my_scripts]$ ls -la # This will show us more detailed information about our file including permissions
.
..
-rw-r--r-- 1 user user 216 Jul 23 13:00 script.sh
As we can see, the user has read and write permissions on this file while group and others have read permissions.
We want to change this so the user is the only one with read, write, and execute permissions.
[user@elja-irhpc my_scripts]$ chmod 700 script.sh
[user@elja-irhpc my_scripts]$ ls -la
-rwx------ 1 user user 216 Jul 23 13:00 script.sh
Now we see that in the user section we have rwx (read, write, and execute) and we can easily run the script.
[user@elja-irhpc my_scripts]$ ./script.sh
Hi 1 times
Hi 2 times
Hi 3 times
Hi 4 times
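As a side note, chmod also accepts symbolic modes instead of octal numbers. A command like the following should have the same effect as chmod 700 above, setting read, write, and execute for the user and removing all permissions for group and others:
[user@elja-irhpc my_scripts]$ chmod u=rwx,go= script.sh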
[user@elja-irhpc my_scripts]$ cd # Return to your home directory
LMOD and Slurm
Our cluster uses LMOD to serve software and Slurm to control, run, and manage jobs and tasks on the cluster.
LMOD
Lmod is a Lua-based module system that easily handles the MODULEPATH Hierarchical problem. Environment Modules provide a convenient way to dynamically change the users' environment through modulefiles.
Getting started
Currently we are working with multiple different module trees and are in the process of phasing out the older ones. This guide will display and use modules from the older tree. For further information on the new module tree, please take a look at the libraries chapter.
You are now connected to the cluster and want to start working with some specific software. To display the available modules, we can use the avail command:
[user@elja-irhpc ~]$ ml avail
This will display many lines of available modules, which can feel a bit daunting, but don't worry: we will go over more streamlined ways to find what you need later on.
Alongside all these visible modules, there are countless hidden packages which can be displayed by adding the --show-hidden flag:
[user@elja-irhpc ~]$ ml --show-hidden avail
Now you will see even more lines of available modules that can be even more daunting!
Now let's say you want to use Golang (Go/Golang is an open source programming language). We can use the same command as before, adding the name of the module/software/library we want to use to the end of the command.
[user@elja-irhpc ~]$ ml avail Go
---------------------------------------------------------------------------------------- /hpcapps/libsci-gcc/modules/all ----------------------------------------------------------------------------------------
Go/1.20.2 HDF5/1.12.2-gompi-2022a (D) ScaLAPACK/2.2.0-gompi-2023a-fb gompi/2021a gompi/2022b netCDF/4.6.2-gompi-2019a
HDF5/1.10.5-gompi-2019a HH-suite/3.3.0-gompi-2022a gompi/2019a gompi/2021b gompi/2023a (D) netCDF/4.8.0-gompi-2021a
HDF5/1.10.7-gompi-2021a PyGObject/3.42.1-GCCcore-11.3.0 gompi/2020a gompi/2022a netCDF-Fortran/4.6.0-gompi-2022a netCDF/4.9.0-gompi-2022a (D)
---------------------------------------------------------------------------------------- /hpcapps/lib-mimir/modules/all -----------------------------------------------------------------------------------------
HISAT2/2.2.1-gompi-2021b
-------------------------------------------------------------------------------------- /hpcapps/lib-edda/modules/all/Core ---------------------------------------------------------------------------------------
Go/1.17.6 gompi/2019b gompi/2021b gompi/2022a gompi/2022b gompi/2023a gompi/2023b
As we can see, the number of lines has decreased drastically, and we can easily spot the module Go/1.20.2 at the top of the list.
To load the module into your environment, we use the load command:
[user@elja-irhpc ~]$ ml load Go/1.20.2
To make sure the module is loaded, we can run a simple Go command to test it:
[user@elja-irhpc ~]$ go version
go version go1.20.2 linux/amd64
You can list all of the modules you have in your environment with the list command:
[user@elja-irhpc ~]$ ml list
Currently Loaded Modules:
1) Go/1.20.2
If you wish to remove the module from your environment, you can use the unload command:
[user@elja-irhpc ~]$ ml unload Go
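Two other module commands often come in handy (a quick sketch; run ml --help on the cluster for the full list):
[user@elja-irhpc ~]$ ml purge        # unload all currently loaded modules
[user@elja-irhpc ~]$ ml spider Go    # search the whole module tree for modules matching "Go"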
For more information on the module (ml) command, you can take a look at this documentation provided by Lmod.
Slurm
Slurm is a system for managing and scheduling Linux clusters.
Getting started
We can use Slurm to see available partitions and monitor the state and other useful information of our jobs.
For further information on Slurm and the different Slurm commands, please take a look at their documentation
Partitions
More information about the available partitions on the cluster can be found here
The sinfo command will show us all partitions available to use:
[user@elja-irhpc ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
any_cpu up 2-00:00:00 2 drng@ compute-[19,51]
any_cpu up 2-00:00:00 46 mix compute-[5-10,29-37,41-47,49-50,52,55-59,61-63,73-75,77-86]
any_cpu up 2-00:00:00 8 alloc compute-[39-40,87-92]
any_cpu up 2-00:00:00 23 idle compute-[11-18,20-28,38,48,53-54,60,76]
48cpu_192mem up 7-00:00:00 1 drng@ compute-19
48cpu_192mem up 7-00:00:00 6 mix compute-[5-10]
48cpu_192mem up 7-00:00:00 17 idle compute-[11-18,20-28]
64cpu_256mem up 7-00:00:00 1 drng@ compute-51
64cpu_256mem up 7-00:00:00 40 mix compute-[29-37,41-47,49-50,52,55-59,61-63,73-75,77-86]
64cpu_256mem up 7-00:00:00 8 alloc compute-[39-40,87-92]
64cpu_256mem up 7-00:00:00 6 idle compute-[38,48,53-54,60,76]
Here we can see some of the available partitions and the states of their nodes. Note that each partition appears on several lines: the output is split by node state, not by partition name.
States
In the example above, we can see four different states that nodes within a partition can be in:
- drng@ -> Draining; these nodes will finish their running jobs but accept no new ones, and are waiting to reboot after the maintenance period
- mix -> These nodes are partially allocated: some of their CPUs are in use and some are free
- alloc -> These nodes are fully allocated; all of their CPUs are in use
- idle -> These nodes are idle and free to use
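If the full sinfo listing is too long, you can narrow it down, for example (assuming standard sinfo flags) to a single partition or to nodes in a particular state:
[user@elja-irhpc ~]$ sinfo --partition=any_cpu
[user@elja-irhpc ~]$ sinfo --states=idle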
The queue
We can use the squeue command to look at our jobs, and other users' jobs, in the queue.
[user@elja-irhpc ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1181224 128cpu_25 VG_2 user R 5-23:12:30 1 amd-compute-2
1182103 128cpu_25 VG_1 user R 5-04:19:08 1 amd-compute-3
1183036 128cpu_25 submit.s ppr R 1:34:51 1 amd-compute-1
1182505 128cpu_25 submit.s ppr R 2-01:16:33 1 amd-compute-4
1177152 48cpu_192 30P800V1 xmx R 20-01:10:04 1 compute-19
1177168 48cpu_192 30P800V1 xmx R 20-01:03:21 1 compute-19
1177167 48cpu_192 30P700V1 xmx R 20-01:03:24 1 compute-19
1181303 48cpu_192 30P300V1 xmx R 7-23:39:33 1 compute-8
1182920 48cpu_192 3CO_BBB_ user PD 0:00:00 1 (Resource)
As we can see, we (user) have three jobs in the queue: two are in the running (R) state while one is pending (PD).
The squeue command has many options to view information, but one of the most useful ones is the --user flag:
[user@elja-irhpc ~]$ squeue --user $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1181224 128cpu_25 VG_2 user R 5-23:12:30 1 amd-compute-2
1182103 128cpu_25 VG_1 user R 5-04:19:08 1 amd-compute-3
1182920 48cpu_192 3CO_BBB_ user PD 4:57:39 1 compute-9
The --user flag filters the results to only show the selected user.
$USER is an environment variable that holds the username of the current user.
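On recent Slurm versions there is also a --me shorthand, which should be equivalent to --user $USER:
[user@elja-irhpc ~]$ squeue --me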
Detailed information
The scontrol command can be used to gather detailed information about nodes, partitions, and jobs. Below we will see some useful commands:
[user@elja-irhpc ~]$ scontrol show partition 48cpu_192mem
PartitionName=48cpu_192mem
AllowGroups=HPC-Elja,HPC-Elja-ltd AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=8 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=compute-[5-28]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=2304 TotalNodes=24 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=2304,mem=4512000M,node=24,billing=2304
[user@elja-irhpc ~]$ scontrol show node compute-5
NodeName=compute-5 Arch=x86_64 CoresPerSocket=24
CPUAlloc=2 CPUEfctv=96 CPUTot=96 CPULoad=1.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=compute-5 NodeHostName=compute-5 Version=23.11.4
OS=Linux 4.18.0-553.8.1.el8_10.x86_64 #1 SMP Tue Jul 2 17:10:26 UTC 2024
RealMemory=188000 AllocMem=7800 FreeMem=161095 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=any_cpu,48cpu_192mem,long,MatlabWorkshop,kerfis
BootTime=2024-07-29T07:06:28 SlurmdStartTime=2024-07-29T07:07:23
LastBusyTime=2024-08-14T09:37:54 ResumeAfterTime=None
CfgTRES=cpu=96,mem=188000M,billing=96
AllocTRES=cpu=2,mem=7800M
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
[user@elja-irhpc ~]$ scontrol show job 1181224
JobId=1181224 JobName=VG_2
UserId=user(11111) GroupId=user(1111) MCS_label=N/A
Priority=11930 Nice=0 Account=phys-ui QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=5-23:22:58 TimeLimit=6-22:00:00 TimeMin=N/A
SubmitTime=2024-08-06T04:53:18 EligibleTime=2024-08-06T04:53:18
AccrueTime=2024-08-06T04:53:18
StartTime=2024-08-08T12:24:15 EndTime=2024-08-15T10:24:15 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-08-08T12:24:15 Scheduler=Main
Partition=128cpu_256mem AllocNode:Sid=elja-irhpc:733110
ReqNodeList=(null) ExcNodeList=(null)
NodeList=amd-compute-2
BatchHost=amd-compute-2
NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=252000M,node=1,billing=1
AllocTRES=cpu=256,mem=252000M,node=1,billing=256
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/hpchome/user/submit.slurm
WorkDir=/hpchome/user/job1
StdErr=/hpchome/user/job1/slurm-1181224.out
StdIn=/dev/null
StdOut=/hpchome/user/job1/slurm-1181224.out
Power=
MailUser=user@email.is MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
Using srun and salloc for interactive jobs
srun can allocate resources and launch jobs in a single command. srun has multiple flags which you can find here, but for our purposes, we will only use --partition and --pty.
--partition will tell Slurm which partition to use for the job, and --pty will execute a command using a pseudo terminal. The most common usage on our cluster for this is to create an interactive job where you can work in the terminal of a compute node.
An example of this usage would look like this:
[user@elja-irhpc ~]$ srun --partition any_cpu --pty bash
[user@compute-15 ~]$ hostname
compute-15
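srun accepts many more resource flags. A sketch of a slightly larger interactive request (the exact values are just an illustration) might look like this:
[user@elja-irhpc ~]$ srun --partition any_cpu --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash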
salloc will allocate resources for us which we can then connect to using ssh. Below is what that would look like:
[user@elja-irhpc ~]$ salloc --partition any_cpu
salloc: Granted job allocation 1183054
salloc: Nodes compute-15 are ready for job
[user@elja-irhpc ~]$ ssh compute-15
Last login: Tue Aug 1 10:51:47 2023 from 12.16.71.2
[user@compute-15 ~]$
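When you are done, exit the compute node and then exit the salloc shell (or scancel the job) to release the allocation.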
For further information about interactive sessions, please check out our chapter on interactive sessions
Running jobs with sbatch
sbatch is used to submit a job script. Such a script commonly contains multiple #SBATCH flags, which help us "tune" the job to our use case.
For further information on sbatch, check out our chapter on submitting batch jobs
Let's create our first simple batch script.
[user@elja-irhpc ~]$ touch first_job_script.sh
The header of the batch script should contain the following lines:
[user@elja-irhpc ~]$ cat first_job_script.sh
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example uname@hi.is
#SBATCH --partition=48cpu_192mem # request node from a specific partition
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=48 # 48 cores per node (96 in total)
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-04:00:00 # run for 4 hours maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # Suppress multithread
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
Note that the flags in the header should be changed to fit your job. For our example, we do not need two nodes, and we definitely do not need the full 48 cores per node, so we will change that.
After the header, we can start adding environment variables and run various other commands, such as copying/moving files to other locations or creating temporary directories.
For our example, we will set one variable that we will call NUMTIMES, which will determine how many times our little command will loop.
First off, let's change the flags and add that environment variable:
[user@elja-irhpc ~]$ cat first_job_script.sh
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example uname@hi.is
#SBATCH --partition=any_cpu # request node from a specific partition
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # 1 task (core) per node
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-00:05:00 # run for 5 minutes maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # Suppress multithread
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
NUMTIMES=5
As you can see, we changed the partition since we don't really care which partition we use, reduced the number of nodes and tasks per node, and lowered the time limit.
Now let's add the actual command we want to run and see if we can run it.
[user@elja-irhpc ~]$ cat first_job_script.sh
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example uname@hi.is
#SBATCH --partition=any_cpu # request node from a specific partition
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # 1 task (core) per node
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-00:05:00 # run for 5 minutes maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # Suppress multithread
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
NUMTIMES=5
# We will use 'inline' python to run a small for loop
python -c "for n in range($NUMTIMES):print('Hi',n,'times');"
# After the command has finished slurm will clean up
[user@elja-irhpc ~]$ sbatch first_job_script.sh
Submitted batch job 1183060
Now let's take a look at the output files and see what we got.
[user@elja-irhpc ~]$ cat slurm_job_output.log
# Returns nothing
[user@elja-irhpc ~]$ cat slurm_job_errors.log
/var/spool/slurm/d/job1183060/slurm_script: line 17: python: command not found
We got an error saying that python is not found on the node. This is to be expected since our nodes only run the basic operating system, and we need to have our environment properly set up first.
Let's add Python to our environment and see if it works:
[user@elja-irhpc ~]$ ml load GCCcore/12.3.0 Python/.3.11.3
[user@elja-irhpc ~]$ sbatch first_job_script.sh
Submitted batch job 1183063
[user@elja-irhpc ~]$ cat slurm_job_output.log
Hi 0 times
Hi 1 times
Hi 2 times
Hi 3 times
Hi 4 times
Now it works perfectly! Amazing!
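Once a job has finished, you can also look it up in the accounting records with sacct (assuming job accounting is enabled on the cluster), for example:
[user@elja-irhpc ~]$ sacct -j 1183063 --format=JobID,JobName,State,Elapsed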
Canceling jobs
There might come a time where our job gets stuck in a loop or we need to cancel it for one reason or another, and that's where the scancel command comes in handy.
scancel has some useful flags, but in most use cases we only provide it with a JobId. From inside a running job (for example within the batch script itself), the $SLURM_JOB_ID environment variable holds the ID of the current job and can be passed to scancel.
Here are some examples of scancel:
Cancel a single job:
[user@elja-irhpc ~]$ scancel 124291
Cancel all jobs belonging to a user:
[user@elja-irhpc ~]$ scancel --user=user
Cancel all pending jobs on partition "any_cpu":
[user@elja-irhpc ~]$ scancel --user=user --state=PENDING --partition=any_cpu
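scancel can also select jobs by name; for example, to cancel the job named VG_2 from the queue above (assuming it is yours):
[user@elja-irhpc ~]$ scancel --name=VG_2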