Tuesday, October 1, 2019

Adding GPU as Resource for Slurm

To make a GPU a resource that Slurm can manage, create the file /etc/slurm-llnl/gres.conf with definitions of the GPUs available on the node. GRES stands for generic resources; they need to be declared so that Slurm can manage them.


The example below is for a node with one NVIDIA Tesla V100 GPU. The fields are:
Name - name of the resource, such as gpu, nic or mic
Type - arbitrary string identifying the type of device
File - fully qualified pathname of the device file(s) associated with the resource
Cores - specific CPU core numbers that can use this resource
$ sudo cat /etc/slurm-llnl/gres.conf
Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,1
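A node with several GPUs gets one gres.conf line per device file. Here is a small Python sketch that assembles such a line from the fields listed above. The function name gres_conf_line is hypothetical (it is not part of Slurm), and the Cores values you pass for other GPUs would depend on your own hardware.

```python
from typing import Optional, Sequence


def gres_conf_line(name: str, file: str, type_: Optional[str] = None,
                   cores: Optional[Sequence[int]] = None) -> str:
    """Build one gres.conf line from the Name/Type/File/Cores fields."""
    parts = [f"Name={name}"]
    if type_ is not None:
        parts.append(f"Type={type_}")
    parts.append(f"File={file}")
    if cores:
        # Cores is written as a comma-separated list of core numbers
        parts.append("Cores=" + ",".join(str(c) for c in cores))
    return " ".join(parts)


if __name__ == "__main__":
    # Reproduces the single-V100 line from the example above
    print(gres_conf_line("gpu", "/dev/nvidia0", type_="v100", cores=[0, 1]))
    # Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,1
```

Optional fields are simply left out when they are not given, so gres_conf_line("gpu", "/dev/nvidia1") yields a line with only Name and File.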



Add GresTypes and the node's gres resources to slurm.conf.
The format for a gres resource is grestype:optional-type:number-of-resources, so gpu:v100:1 below means one GPU of type v100.
$ sudo cat /etc/slurm-llnl/slurm.conf
...
GresTypes=gpu
NodeName=mynode CPUs=12 RealMemory=64091 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:v100:1
...
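To make the grestype:optional-type:number-of-resources format concrete, here is a simplified Python sketch that splits a Gres= value into its parts. The names GresSpec, parse_gres and parse_gres_list are hypothetical; the sketch requires an explicit count and does not handle extra options such as no_consume or K/M/G count suffixes, which the full slurm.conf syntax allows.

```python
from typing import List, NamedTuple, Optional


class GresSpec(NamedTuple):
    name: str                 # e.g. "gpu"
    gres_type: Optional[str]  # e.g. "v100", or None when omitted
    count: int                # number of resources on the node


def parse_gres(spec: str) -> GresSpec:
    """Parse one 'name[:type]:count' entry from a node's Gres= value."""
    fields = spec.split(":")
    if len(fields) == 2:
        name, count = fields
        gres_type: Optional[str] = None
    elif len(fields) == 3:
        name, gres_type, count = fields
    else:
        raise ValueError(f"unsupported Gres entry: {spec!r}")
    if not count.isdigit():
        raise ValueError(f"count must be a plain integer in {spec!r}")
    return GresSpec(name, gres_type, int(count))


def parse_gres_list(value: str) -> List[GresSpec]:
    """A node's Gres= value may hold several comma-separated entries."""
    return [parse_gres(item) for item in value.split(",") if item]


if __name__ == "__main__":
    print(parse_gres("gpu:v100:1"))
    # GresSpec(name='gpu', gres_type='v100', count=1)
```

Rejecting anything it does not understand, rather than guessing, keeps the sketch honest about how little of the syntax it covers.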


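Restart the Slurm services for the changes to take effect. Since slurm.conf changed, restart slurmctld on the controller as well as slurmd on the node:
$ sudo systemctl restart slurmctld
$ sudo systemctl restart slurmd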


Check that the gres is available; the node's entry should list Gres=gpu:v100:1
$ scontrol show node
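On a cluster with many nodes, reading the Gres field out of the scontrol output by eye gets tedious. Below is a rough Python sketch that pulls it out per node. It assumes node records are separated by blank lines and that fields are whitespace-separated key=value tokens; the function names are hypothetical, and values containing spaces (such as OS=...) keep only their first word, which does not matter for NodeName and Gres.

```python
import re
import subprocess
from typing import Dict, List


def parse_scontrol_nodes(text: str) -> List[Dict[str, str]]:
    """Split `scontrol show node` output into one key/value dict per node."""
    nodes: List[Dict[str, str]] = []
    for record in re.split(r"\n\s*\n", text.strip()):
        fields: Dict[str, str] = {}
        for token in record.split():
            if "=" in token:
                # Split on the first '=' only, so CfgTRES=cpu=12,... keeps its value
                key, _, value = token.partition("=")
                fields[key] = value
        if fields:
            nodes.append(fields)
    return nodes


def gres_by_node(text: str) -> Dict[str, str]:
    """Map each NodeName to the raw Gres value printed by scontrol."""
    return {
        node["NodeName"]: node.get("Gres", "")
        for node in parse_scontrol_nodes(text)
        if "NodeName" in node
    }


def show_node_gres() -> Dict[str, str]:
    """Run scontrol on a host with Slurm installed and return Gres per node."""
    result = subprocess.run(
        ["scontrol", "show", "node"],
        capture_output=True, text=True, check=True,
    )
    return gres_by_node(result.stdout)
```

With the configuration above in place, show_node_gres() should report gpu:v100:1 for mynode.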
