Tuesday, October 1, 2019

Installing Slurm Workload Manager & Job Scheduler on Ubuntu 18.04

Enable universe repository
$ echo "deb http://archive.ubuntu.com/ubuntu bionic universe" | sudo tee -a /etc/apt/sources.list

Update package list
$ sudo apt update

Install slurm-wlm
$ sudo apt install slurm-wlm -y

Install the Slurm documentation. It includes the configurator.easy.html page, which is used to generate slurm.conf
$ sudo apt install slurm-wlm-doc -y

On a machine with a web browser, open /usr/share/doc/slurm-wlm-doc/html/configurator.easy.html to generate slurm.conf.

You can also access the configurator online at https://slurm.schedmd.com/configurator.easy.html, but the online version follows the latest Slurm release, so the config it generates may not suit the Slurm version you installed.

Fill in the form. Some of the values (such as the CPU count and memory) can be retrieved with the command
$ slurmd -C
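
As far as I know, slurmd -C prints a ready-made NodeName=... line followed by an UpTime line. A minimal sketch for keeping only the part that goes into slurm.conf is below; node_line is a hypothetical helper name, not something Slurm ships.

```shell
# Hypothetical helper: keep only the NodeName=... line from `slurmd -C` output.
# Reads from stdin so it can be tried on saved output as well.
node_line() {
    grep '^NodeName='
}

# Intended use (needs slurmd installed):
#   slurmd -C | node_line
```

The resulting line can be pasted into the COMPUTE NODES part of slurm.conf, or its values copied into the configurator form.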

Some of the settings that I changed from the defaults
- ControlMachine and NodeName: set both to the hostname of the system
- State Preservation: set StateSaveLocation to /var/spool/slurm-llnl
- Process Tracking: use Pgid instead of Cgroup
- Process ID Logging: set SlurmctldPidFile to /var/run/slurm-llnl/slurmctld.pid and SlurmdPidFile to /var/run/slurm-llnl/slurmd.pid


Once done, click Submit and copy the generated config file to /etc/slurm-llnl/slurm.conf. Below is my sample config, with only one node
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=myserver
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=myserver CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=myserver Default=YES MaxTime=INFINITE State=UP
DebugFlags=NO_CONF_HASH
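
Since ControlMachine and NodeName need to match the system hostname, a quick check before starting the daemons can save some debugging. The sketch below is only an illustration: conf_value and check_hostname are hypothetical helper names, and it assumes a single-node config like the one above, with one NodeName line.

```shell
# Hypothetical helper: print the first value of KEY in a slurm.conf-style FILE.
# Commented-out lines (starting with '#') are not matched.
conf_value() {  # usage: conf_value KEY FILE
    sed -n "s/^$1=\([^[:space:]]*\).*/\1/p" "$2" | head -n 1
}

# Hypothetical helper: succeed only if ControlMachine and NodeName both equal HOST.
# HOST defaults to the short hostname of this machine.
check_hostname() {  # usage: check_hostname FILE [HOST]
    conf=$1
    host=${2:-$(hostname -s)}
    ctl=$(conf_value ControlMachine "$conf")
    node=$(conf_value NodeName "$conf")
    [ "$ctl" = "$host" ] && [ "$node" = "$host" ]
}

# Intended use:
#   check_hostname /etc/slurm-llnl/slurm.conf && echo "hostname matches"
```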

Create the Slurm spool directory and give ownership to the slurm user
$ sudo mkdir -p /var/spool/slurm-llnl
$ sudo chown -R slurm:slurm /var/spool/slurm-llnl

Create the Slurm pid directory and give ownership to the slurm user
$ sudo mkdir -p /var/run/slurm-llnl/
$ sudo chown -R slurm:slurm /var/run/slurm-llnl

Start the Slurm controller (slurmctld) and enable it on boot
$ sudo systemctl start slurmctld
$ sudo systemctl enable slurmctld

Start slurmd and enable it on boot
$ sudo systemctl start slurmd
$ sudo systemctl enable slurmd

If slurmctld or slurmd fails to start, run the daemon in the foreground with verbose debug output to check for errors, then adjust slurm.conf accordingly.
$ sudo -u slurm slurmctld -Dcvvv
$ sudo slurmd -Dcvvv

Check the Slurm nodes using the scontrol command
$ scontrol show node
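
The output of scontrol show node is fairly long; the field that matters most here is State. A small sketch for pulling it out is below. node_states is a hypothetical helper name, and it assumes the output contains a State=... field for each node.

```shell
# Hypothetical helper: print the State value of each node found on stdin.
node_states() {
    grep -o 'State=[^[:space:]]*' | cut -d= -f2
}

# Intended use (needs a running slurmctld):
#   scontrol show node | node_states
```

If the node reports IDLE, a trivial job such as srun hostname should run on it.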
