# Build & Install
## Install On Debian
### Cluster Setup
One manager node, one login node, and two compute nodes:

| hostname | IP | role | resources (CPU/RAM) |
|---|---|---|---|
| manage01 | 192.168.56.115 | manager | 2C4G |
| login01 | 192.168.56.116 | login | 2C4G |
| compute01 | 192.168.56.117 | compute | 2C4G |
| compute02 | 192.168.56.118 | compute | 2C4G |
Software versions:

| software | version |
|---|---|
| OS | Debian 12 "bookworm" |
| Slurm | 24.05.2 |
### Preparation Steps (All Nodes)
- Modify `/etc/network/interfaces` (only needed if you cannot get an IPv4 address). Append the following lines to the file, replacing `enps08` with your actual interface name (check with `ip a`):

```bash
allow-hotplug enps08
iface enps08 inet dhcp
```

Then restart the network:

```bash
systemctl restart networking
```
- Modify `/etc/apt/sources.list` to use the TUNA mirror:

```bash
cat > /etc/apt/sources.list << EOF
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
deb https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
EOF
```

- Update the apt cache:

```bash
apt clean all && apt update
```
- Set the hostname, running the matching command on each machine:

```bash
hostnamectl set-hostname manage01   # on the manager node
hostnamectl set-hostname login01    # on the login node
hostnamectl set-hostname compute01  # on compute01
hostnamectl set-hostname compute02  # on compute02
```
- Add all nodes to the hosts file:

```bash
cat >> /etc/hosts << EOF
192.168.56.115 manage01
192.168.56.116 login01
192.168.56.117 compute01
192.168.56.118 compute02
EOF
```
- Disable the firewall:

```bash
systemctl stop nftables && systemctl disable nftables
```
- Install `ntpdate`, set the timezone, and sync the server time:

```bash
apt-get -y install ntpdate
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
echo 'Asia/Shanghai' > /etc/timezone
ntpdate time.windows.com
```

- Add a cron job to sync the time every 5 minutes (`crontab -e`, then add the line below):

```bash
*/5 * * * * /usr/sbin/ntpdate time.windows.com
```
- Create an SSH key pair on each node:

```bash
ssh-keygen -t rsa -b 4096 -C $HOSTNAME
```

- Enable passwordless SSH login (All Nodes). On each node, copy its public key to every other node:

```bash
ssh-copy-id -i ~/.ssh/id_rsa.pub root@manage01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@login01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute02
```
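A quick optional check (not part of the original steps) is a short loop over the hostnames defined above; each command should print the remote hostname without asking for a password:

```bash
for h in manage01 login01 compute01 compute02; do
  ssh -o BatchMode=yes root@$h hostname
done
```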
### Install Components
- Install the NFS server (Manager Node).

There are several ways to install an NFS server:

- using `yum install -y nfs-utils`, see https://pkuhpc.github.io/SCOW/docs/hpccluster/nfs
- using `apt install -y nfs-kernel-server`, see https://www.linuxtechi.com/how-to-install-nfs-server-on-debian/
- or you can directly mount other shared storage.

Create the shared folder:

```bash
mkdir /data
chmod 755 /data
```

Edit `/etc/exports` and add the export:

```
/data *(rw,sync,insecure,no_subtree_check,no_root_squash)
```

Start the NFS server:

```bash
systemctl start rpcbind
systemctl start nfs-server
systemctl enable rpcbind
systemctl enable nfs-server
```

Check the exports:

```bash
showmount -e localhost
# Output
# Export list for localhost:
# /data *
```
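The client side is not shown in the original steps. A minimal sketch for the login and compute nodes, assuming the `/data` export above and Debian's `nfs-common` package:

```bash
# on login01, compute01, compute02 (NFS clients)
apt-get install -y nfs-common
mkdir -p /data
mount -t nfs manage01:/data /data
# make the mount persistent across reboots
echo "manage01:/data /data nfs defaults 0 0" >> /etc/fstab
```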
- Install the munge service.
- Add the `munge` user (All Nodes):

```bash
groupadd -g 1108 munge
useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
```

- Install `rng-tools-debian` (Manager Node):

```bash
apt-get install -y rng-tools-debian
```

Edit the service file `/usr/lib/systemd/system/rngd.service` so that the `[Service]` section reads:

```
[Service]
ExecStart=/usr/sbin/rngd -f -r /dev/urandom
```

Then reload systemd and start the service:

```bash
systemctl daemon-reload
systemctl start rngd
systemctl enable rngd
```

- Install the munge packages (the library and daemon are needed on every node):

```bash
apt-get install -y libmunge-dev libmunge2 munge
```
- Generate the secret key (Manager Node):

```bash
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
```

- Copy `munge.key` from the manager node to the remaining nodes:

```bash
scp -p /etc/munge/munge.key root@login01:/etc/munge/
scp -p /etc/munge/munge.key root@compute01:/etc/munge/
scp -p /etc/munge/munge.key root@compute02:/etc/munge/
```

- Set the ownership and permissions on `munge.key`, then start the service (All Nodes):

```bash
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl start munge
systemctl enable munge
```

Use `systemctl status munge` to check that the service is running.
- Test munge:

```bash
munge -n | ssh compute01 unmunge
```
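To check every node at once, a short loop like the following can be used (an optional extra, using the hostnames from the cluster table; each remote `unmunge` should report `STATUS: Success`):

```bash
for h in manage01 login01 compute01 compute02; do
  echo "== $h =="
  munge -n | ssh root@$h unmunge | grep STATUS
done
```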
- Install MariaDB (Manager Node):

```bash
apt-get install -y mariadb-server
```

- Set a root password and create the accounting database:

```bash
systemctl start mariadb
systemctl enable mariadb
ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"
mysql -uroot -p$ROOT_PASS -e 'create database slurm_acct_db'
```
- Create the `slurm` database user and grant it all privileges on the `slurm_acct_db` database:

```bash
mysql -uroot -p$ROOT_PASS
```

```sql
create user slurm;
grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
flush privileges;
```
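As an optional sanity check (not in the original steps), confirm that the `slurm` user can see the accounting database with the password granted above:

```bash
mysql -uslurm -p123456 -e 'show databases;' | grep slurm_acct_db
```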
- Create the Slurm user (All Nodes):

```bash
groupadd -g 1109 slurm
useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
```
### Install Slurm (All Nodes)
- Install the basic Debian package build requirements:

```bash
apt-get install -y build-essential fakeroot devscripts equivs
```

- Download and unpack the distributed tarball:

```bash
wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2 -O slurm-24.05.2.tar.bz2 && tar -xaf slurm*tar.bz2
```

- Change to the directory containing the Slurm source and configure the build:

```bash
cd slurm-24.05.2 && mkdir -p /etc/slurm && ./configure
```

- Compile and install Slurm:

```bash
make install
```
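A quick optional check that the build installed correctly is to print the version. Note that when building from source, the systemd unit files generated under `etc/` in the build tree may need to be copied to `/etc/systemd/system/` if `systemctl` cannot find the slurm services later:

```bash
slurmctld -V   # should print: slurm 24.05.2
slurmd -V
```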
### Modify Configuration Files (Manager Node)

- Modify `/etc/slurm/slurm.conf` (see the slurm.conf reference):

```bash
cp /root/slurm-24.05.2/etc/slurm.conf.example /etc/slurm/slurm.conf
vim /etc/slurm/slurm.conf
```
Focus on these options (the host names must match the hostnames set earlier):

```
SlurmctldHost=manage01
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=manage01
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=123456
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobContainerType=job_container/none
JobAcctGatherType=jobacct_gather/linux
```
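The file also needs node and partition definitions matching this cluster. A minimal sketch; the `CPUs`/`RealMemory` values below are assumptions based on the 2C4G quota in the table above, so adjust them to what `slurmd -C` reports on the compute nodes:

```
NodeName=compute[01-02] CPUs=2 RealMemory=3500 State=UNKNOWN
PartitionName=compute Nodes=compute[01-02] Default=YES MaxTime=INFINITE State=UP
```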
- Modify `/etc/slurm/slurmdbd.conf` (see the slurmdbd.conf reference):

```bash
cp /root/slurm-24.05.2/etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
vim /etc/slurm/slurmdbd.conf
```
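The options that usually need attention are the storage settings. A minimal sketch, assuming the database, user, and password created in the MariaDB step above and a log location under the `/var/log/slurm` directory created later:

```
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=123456
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
```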
- Modify `/etc/slurm/cgroup.conf`:

```bash
cp /root/slurm-24.05.2/etc/cgroup.conf.example /etc/slurm/cgroup.conf
```

- Send the configuration files to the other nodes:

```bash
scp -r /etc/slurm/*.conf root@login01:/etc/slurm/
scp -r /etc/slurm/*.conf root@compute01:/etc/slurm/
scp -r /etc/slurm/*.conf root@compute02:/etc/slurm/
```
- Create the required directories and grant privileges (All Nodes):

```bash
mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
mkdir /var/log/slurm
chown slurm: /var/log/slurm
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
```
- Start the Slurm services on each node.

On the manager node:

```bash
systemctl start slurmdbd && systemctl enable slurmdbd
systemctl start slurmctld && systemctl enable slurmctld
```

On the compute nodes (and on any other node that is defined as a compute node in `slurm.conf`):

```bash
systemctl start slurmd && systemctl enable slurmd
```
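If a daemon fails to start, its log file is the first place to look. The paths below assume the logs are configured under the `/var/log/slurm` directory created above; adjust them to whatever `SlurmctldLogFile`, `SlurmdLogFile`, and `LogFile` are set to in your configuration:

```bash
tail -n 50 /var/log/slurm/slurmctld.log   # manager node
tail -n 50 /var/log/slurm/slurmdbd.log    # manager node
tail -n 50 /var/log/slurm/slurmd.log      # compute nodes
```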
- Test Slurm.

Check the cluster configuration:

```bash
scontrol show config
```

Check the cluster status:

```bash
sinfo
scontrol show partition
scontrol show node
```

Submit a test job:

```bash
srun -N2 hostname
scontrol show jobs
```

Check the job status:

```bash
squeue -a
```
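The `srun` test above runs interactively. For a batch-style check, a minimal job script can be submitted with `sbatch` (optional; the script name and output pattern are just examples):

```bash
cat > hello.sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=hello_%j.out
srun hostname
EOF
sbatch hello.sbatch
squeue            # watch the job until it completes
cat hello_*.out   # should contain the two compute node hostnames
```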
## Install From Binary
Node-type markers used in this section:

- (All) means all node types should install this component.
- (Mgr) means only the manager node should install this component.
- (Auth) means only the auth node should install this component.
- (Cmp) means only the compute nodes should install this component.

Typically, three types of node are required to run Slurm: one manager (Mgr), one auth node (Auth), and N compute nodes (Cmp), but you can also choose to install all services on a single node.

### Prerequisites
- Change the hostname (All):

```bash
hostnamectl set-hostname (manager|auth|computeXX)
```

- Modify `/etc/hosts` (All):

```bash
echo "192.aa.bb.cc (manager|auth|computeXX)" >> /etc/hosts
```

- Disable the firewall, SELinux, dnsmasq, and swap (All). More detail here.
- NFS server (Mgr). NFS is used as the default file system for the Slurm accounting database.
- NFS client (All). All nodes should mount the NFS share.
- Munge (All). The auth/munge plugin will be built if the MUNGE authentication development library is installed. MUNGE is used as the default authentication mechanism.
- Database (Mgr). MySQL support for accounting will be built if the MySQL or MariaDB development library is present. A currently supported version of MySQL or MariaDB should be used.
### Install Slurm

- Create the `slurm` user (All):

```bash
groupadd -g 1109 slurm
useradd -m -c "slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
```
#### Build the RPM packages

- Install dependencies (Mgr):

```bash
yum -y install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel python3
```

- Build the RPM packages (Mgr):

```bash
wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2 -O slurm-24.05.2.tar.bz2
rpmbuild -ta --nodeps slurm-24.05.2.tar.bz2
```

The RPM files will be placed under the `$HOME/rpmbuild` directory of the user building them.

- Send the RPMs to the rest of the nodes (Mgr):

```bash
ssh root@<$rest_node> "mkdir -p /root/rpmbuild/RPMS/"
scp -p -r $HOME/rpmbuild/RPMS/x86_64 root@<$rest_node>:/root/rpmbuild/RPMS/x86_64
```

- Install the RPMs on the remaining nodes (run from the Mgr):

```bash
ssh root@<$rest_node> "yum localinstall /root/rpmbuild/RPMS/x86_64/slurm-*"
```
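Repeating these commands per node gets tedious; a short loop can do the same thing. This is only a sketch, and the node names below are placeholders for whatever `<$rest_node>` expands to in your cluster:

```bash
for rest_node in auth01 compute01 compute02; do   # placeholder hostnames
  ssh root@${rest_node} "mkdir -p /root/rpmbuild/RPMS/x86_64"
  scp -p -r $HOME/rpmbuild/RPMS/x86_64/. root@${rest_node}:/root/rpmbuild/RPMS/x86_64/
  ssh root@${rest_node} "yum -y localinstall /root/rpmbuild/RPMS/x86_64/slurm-*"
done
```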
- Modify the configuration files (Mgr):

```bash
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
chown slurm: /etc/slurm/slurmdbd.conf
```

`cgroup.conf` does not need to be changed. Edit `/etc/slurm/slurm.conf`; you can use this link as a reference. Edit `/etc/slurm/slurmdbd.conf`; you can use this link as a reference.
#### Install from the yum repo directly

- Install Slurm (All):

```bash
yum -y install slurm-wlm slurmdbd
```

- Modify the configuration files (All):

```bash
vim /etc/slurm-llnl/slurm.conf
vim /etc/slurm-llnl/slurmdbd.conf
```

`cgroup.conf` does not need to be changed. Edit `slurm.conf`; you can use this link as a reference. Edit `slurmdbd.conf`; you can use this link as a reference.
- Send the configuration files and create the required directories on the remaining nodes (Mgr):

```bash
scp -r /etc/slurm/*.conf root@<$rest_node>:/etc/slurm/
ssh root@<$rest_node> "mkdir /var/spool/slurmd && chown slurm: /var/spool/slurmd"
ssh root@<$rest_node> "mkdir /var/log/slurm && chown slurm: /var/log/slurm"
ssh root@<$rest_node> "mkdir /var/spool/slurmctld && chown slurm: /var/spool/slurmctld"
```

- Start the services on the manager node (Mgr):

```bash
systemctl start slurmdbd && systemctl enable slurmdbd
systemctl start slurmctld && systemctl enable slurmctld
```

- Start `slurmd` on the remaining nodes (All):

```bash
ssh root@<$rest_node> "systemctl start slurmd && systemctl enable slurmd"
```
### Test

- Show the cluster status:

```bash
scontrol show config
sinfo
scontrol show partition
scontrol show node
```

- Submit a test job:

```bash
srun -N2 hostname
scontrol show jobs
```

- Check the job status:

```bash
squeue -a
```