What is SCOW? SCOW is an HPC cluster management system built by PKU.
SCOW uses four virtual machines to run a Slurm cluster, which makes it a good environment for learning how to use Slurm.
See https://pkuhpc.github.io/OpenSCOW/docs/hpccluster; it works well.
Subsections of Build & Install
Install On Debian
Cluster Setting
1 Manager
1 Login Node
2 Compute nodes
| hostname | IP | role | quota |
| --- | --- | --- | --- |
| manage01 (slurmctld, slurmdbd) | 192.168.56.115 | manager | 2C4G |
| login01 (login) | 192.168.56.116 | login | 2C4G |
| compute01 (slurmd) | 192.168.56.117 | compute | 2C4G |
| compute02 (slurmd) | 192.168.56.118 | compute | 2C4G |
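Hostname resolution between these machines is assumed throughout. A typical way to provide it is to append the addresses above to /etc/hosts on every node; this is a sketch, adjust it to your own network:

```bash
# Map the cluster hostnames to their IPs (All Nodes)
cat >> /etc/hosts << EOF
192.168.56.115 manage01
192.168.56.116 login01
192.168.56.117 compute01
192.168.56.118 compute02
EOF
```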
Software Version:

| software | version |
| --- | --- |
| os | Debian 12 bookworm |
| slurm | 24.05.2 |
Important
When you see (All Nodes), run the following commands on all nodes.
When you see (Manager Node), run the following commands only on the manager node.
When you see (Login Node), run the following commands only on the login node.
Prepare Steps (All Nodes)
Modify the /etc/apt/sources.list file
Use the TUNA (Tsinghua) mirror:
cat > /etc/apt/sources.list << EOF
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
deb https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
EOF
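After switching to the mirror, refresh the package index so later installs use the new sources (a routine follow-up step):

```bash
# Refresh package lists from the new mirror (All Nodes)
apt update
```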
Use `systemctl status xxxx` to check whether the `xxxx` service is running.
Example slurmdbd.service
```text
# vim /usr/lib/systemd/system/slurmdbd.service
[Unit]
Description=Slurm DBD accounting daemon
After=network-online.target remote-fs.target munge.service mysql.service mysqld.service mariadb.service sssd.service
Wants=network-online.target
ConditionPathExists=/etc/slurm/slurmdbd.conf
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmdbd
EnvironmentFile=-/etc/default/slurmdbd
User=slurm
Group=slurm
RuntimeDirectory=slurmdbd
RuntimeDirectoryMode=0755
ExecStart=/usr/local/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null
[Install]
WantedBy=multi-user.target
```
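The unit above only starts if /etc/slurm/slurmdbd.conf exists (see ConditionPathExists). The following is a minimal sketch of that file, assuming MariaDB runs locally with an account named slurm; the StoragePass value is a placeholder you must replace, and slurmdbd requires the file to be owned by SlurmUser and not world-readable:

```bash
# Create a minimal slurmdbd.conf (Manager Node); values are illustrative
cat > /etc/slurm/slurmdbd.conf << 'EOF'
AuthType=auth/munge
DbdHost=manage01
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=change_me
StorageLoc=slurm_acct_db
EOF
chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
```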
Example slurmctld.service
```text
# vim /usr/lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target remote-fs.target munge.service sssd.service
Wants=network-online.target
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/slurmctld
EnvironmentFile=-/etc/default/slurmctld
User=slurm
Group=slurm
RuntimeDirectory=slurmctld
RuntimeDirectoryMode=0755
ExecStart=/usr/local/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null
[Install]
WantedBy=multi-user.target
```
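Likewise, slurmctld expects /etc/slurm/slurm.conf, and the same file is normally copied to every node. A minimal sketch matching the four-node layout above; the CPU/memory figures and partition name are assumptions, tune them to your VMs:

```bash
# Create a minimal slurm.conf (identical copy on All Nodes); values are illustrative
cat > /etc/slurm/slurm.conf << 'EOF'
ClusterName=demo
SlurmctldHost=manage01
SlurmUser=slurm
AuthType=auth/munge
ProctrackType=proctrack/linuxproc
SchedulerType=sched/backfill
SelectType=select/cons_tres
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=manage01
NodeName=compute[01-02] CPUs=2 RealMemory=3500 State=UNKNOWN
PartitionName=compute Nodes=compute[01-02] Default=YES MaxTime=INFINITE State=UP
EOF
```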
Example slurmd.service
```text
# vim /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network-online.target remote-fs.target sssd.service
Wants=network-online.target
#ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/slurmd
EnvironmentFile=-/etc/default/slurmd
RuntimeDirectory=slurm
RuntimeDirectoryMode=0755
ExecStart=/usr/local/sbin/slurmd --systemd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null
[Install]
WantedBy=multi-user.target
```
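Before starting the daemons, the slurm user and the spool/log directories referenced in the files above are typically created on every node. A sketch, assuming the paths used in the examples; skip whatever your packages already created:

```bash
# Create the slurm user and the directories used by the configs above (All Nodes)
useradd -r -m -d /var/lib/slurm -s /bin/bash slurm || true
mkdir -p /etc/slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
chown slurm:slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
```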
systemctl start slurmdbd
systemctl enable slurmdbd
Use `systemctl status slurmdbd` to check whether the `slurmdbd` service is running
systemctl start slurmctld
systemctl enable slurmctld
Use `systemctl status slurmctld` to check whether the `slurmctld` service is running
systemctl start slurmd
systemctl enable slurmd
Use `systemctl status slurmd` to check whether the `slurmd` service is running
Test Your Slurm Cluster (Login Node)
check cluster configuration
scontrol show config
check cluster status
sinfo
scontrol show partition
scontrol show node
submit job
srun -N2 hostname
scontrol show jobs
check job status
squeue -a
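Beyond srun, batch jobs can be submitted with sbatch. A small hypothetical script; the partition name compute is an assumption, use whatever `sinfo` reports on your cluster:

```bash
# test.sbatch -- a minimal batch job that prints the hostnames of two nodes
cat > test.sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --output=hello_%j.out
srun hostname
EOF
sbatch test.sbatch
squeue -a            # watch the job until it completes
cat hello_*.out      # the output file contains one hostname per node
```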
Install From Binary
Important
(All Nodes) means every node should install this component.
(Manager Node) means only the Manager node should install this component.
(Login Node) means only the Login node should install this component.
(Cmp) means only the Compute nodes should install this component.
Typically, three kinds of nodes are required to run Slurm:
1 Manager node, 1 Login node, and N Compute (Cmp) nodes,
but you can also choose to install all services on a single node.
Disable firewall, SELinux, dnsmasq, and swap (All Nodes); a command sketch follows below.
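A sketch of those four items, assuming a systemd distribution with firewalld and SELinux (adapt for Debian, where ufw/AppArmor may apply instead):

```bash
# Disable firewall, SELinux, dnsmasq and swap (All Nodes) -- illustrative only
systemctl disable --now firewalld dnsmasq 2>/dev/null || true
setenforce 0 || true
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
swapoff -a
sed -i '/ swap / s/^/#/' /etc/fstab    # keep swap off across reboots
```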
NFS Server (Manager Node). NFS provides the shared file system for the cluster; a server-side sketch follows the mount command below.
NFS Client (All Nodes). Every node should mount the NFS share.
Install NFS Client
mount <$nfs_server>:/data /data -o proto=tcp -o nolock
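For the NFS Server item above (Manager Node), a minimal export of /data might look like the following before the clients run the mount command; the export options and subnet are assumptions based on the addresses in the table:

```bash
# On the Manager Node: export /data over NFS (options are illustrative)
yum -y install nfs-utils                  # on Debian: apt install -y nfs-kernel-server
mkdir -p /data
echo '/data 192.168.56.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra
systemctl enable --now nfs-server         # on Debian the unit is nfs-kernel-server
```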
Munge (All Nodes). The auth/munge plugin will be built if the MUNGE authentication development library is installed. MUNGE is used as the default authentication mechanism.
Install Munge
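A sketch of the MUNGE setup, assuming the key is generated on the Manager Node and copied to the other nodes listed in the table above:

```bash
# Manager Node: install MUNGE and generate the shared key
yum -y install munge munge-libs munge-devel     # on Debian: apt install -y munge libmunge-dev
/usr/sbin/create-munge-key                       # newer releases ship `mungekey` instead
# Distribute the key to the other nodes
for host in login01 compute01 compute02; do
  scp /etc/munge/munge.key ${host}:/etc/munge/munge.key
done
# Then, on every node: fix permissions and start the daemon
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl enable --now munge
# Verify: a credential minted here should decode on another node
munge -n | ssh login01 unmunge
```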
Database (Manager Node). MySQL support for accounting will be built if the MySQL or MariaDB development library is present. A currently supported version of MySQL or MariaDB should be used.
Install MariaDB
install mariadb
yum -y install mariadb-server
systemctl start mariadb && systemctl enable mariadb
ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"
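With MariaDB running, slurmdbd also needs its own database and account. A typical sketch, using the conventional slurm_acct_db/slurm names with a placeholder password; authenticate to MariaDB however your root account is set up (for example `mysql -u root -p`):

```bash
# Create the accounting database and a dedicated slurm user (Manager Node)
SLURM_DB_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
mysql << EOF
CREATE DATABASE IF NOT EXISTS slurm_acct_db;
CREATE USER IF NOT EXISTS 'slurm'@'localhost' IDENTIFIED BY '${SLURM_DB_PASS}';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
EOF
echo "slurm DB password: ${SLURM_DB_PASS}"   # reuse as StoragePass in slurmdbd.conf
```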
[root@ay-zj-ecs operator]# kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/operator_install.yaml
namespace/slurm created
customresourcedefinition.apiextensions.k8s.io/slurmdeployments.slurm.ay.dev created
serviceaccount/slurm-operator-controller-manager created
role.rbac.authorization.k8s.io/slurm-operator-leader-election-role created
clusterrole.rbac.authorization.k8s.io/slurm-operator-manager-role created
clusterrole.rbac.authorization.k8s.io/slurm-operator-metrics-auth-role created
clusterrole.rbac.authorization.k8s.io/slurm-operator-metrics-reader created
clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-admin-role created
clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-editor-role created
clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-viewer-role created
rolebinding.rbac.authorization.k8s.io/slurm-operator-leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/slurm-operator-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/slurm-operator-metrics-auth-rolebinding created
service/slurm-operator-controller-manager-metrics-service created
deployment.apps/slurm-operator-controller-manager created
check operator status
kubectl -n slurm get pod
Expected Output
[root@ay-zj-ecs operator]# kubectl -n slurm get pod
NAME                                READY   STATUS    RESTARTS   AGE
slurm-operator-controller-manager   1/1     Running   0          27s