# Install On Debian

## Cluster Setup

- 1 manager node
- 1 login node
- 2 compute nodes
| hostname | IP | role | resources |
|---|---|---|---|
| manage01 (slurmctld, slurmdbd) | 192.168.56.115 | manager | 2 CPU / 4 GB RAM |
| login01 (login) | 192.168.56.116 | login | 2 CPU / 4 GB RAM |
| compute01 (slurmd) | 192.168.56.117 | compute | 2 CPU / 4 GB RAM |
| compute02 (slurmd) | 192.168.56.118 | compute | 2 CPU / 4 GB RAM |
Software versions:

| software | version |
|---|---|
| OS | Debian 12 (bookworm) |
| Slurm | 24.05.2 |
**Important**

- When you see **(All Nodes)**, run the following commands on all nodes.
- When you see **(Manager Node)**, run the following commands only on the manager node.
- When you see **(Login Node)**, run the following commands only on the login node.
## Prepare Steps (All Nodes)

- Modify `/etc/apt/sources.list` to use the TUNA mirror:

```bash
cat > /etc/apt/sources.list << EOF
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
deb https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
deb-src https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
EOF
```
- Update the apt cache:

```bash
apt clean all && apt update
```
- Set the hostname on each node (run the matching command on the corresponding host):

```bash
hostnamectl set-hostname manage01    # on manage01
hostnamectl set-hostname login01     # on login01
hostnamectl set-hostname compute01   # on compute01
hostnamectl set-hostname compute02   # on compute02
```
- Add the cluster hosts to `/etc/hosts`:

```bash
cat >> /etc/hosts << EOF
192.168.56.115 manage01
192.168.56.116 login01
192.168.56.117 compute01
192.168.56.118 compute02
EOF
```
- Disable the firewall:

```bash
systemctl stop nftables && systemctl disable nftables
```

- Install `ntpdate`:

```bash
apt-get -y install ntpdate
```
- Sync the server time:

```bash
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
echo 'Asia/Shanghai' >/etc/timezone
ntpdate time.windows.com
```
- Add a cron job to keep the time in sync. Run `crontab -e` and add:

```
*/5 * * * * /usr/sbin/ntpdate time.windows.com
```
- Create an SSH key pair on each node:

```bash
ssh-keygen -t rsa -b 4096 -C "$HOSTNAME"
```

- Copy the public key to the other nodes so that every node can SSH to the others without a password (run on each node, adjusting the targets):

```bash
# for example, on manage01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@login01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute02
# for example, on login01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@manage01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute02
```
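To confirm that the password-less login actually works, a quick check such as the following can be run from each node; the loop and the `BatchMode` option are not part of the original steps, just a convenient sketch:

```bash
# Prints each remote hostname; a password prompt or an error means key distribution failed
for host in manage01 login01 compute01 compute02; do
  ssh -o BatchMode=yes root@"$host" hostname
done
```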
## Install Components

- Install the NFS server (Manager Node)

  There are many ways to install an NFS server:
  - using `yum install -y nfs-utils` (on RHEL-family systems), see https://pkuhpc.github.io/SCOW/docs/hpccluster/nfs
  - using `apt install -y nfs-kernel-server`, see https://www.linuxtechi.com/how-to-install-nfs-server-on-debian/
  - or you can directly mount other shared storage.
Create the shared folder:

```bash
mkdir /data
chmod 755 /data
```

Add the export to `/etc/exports`:

```
/data *(rw,sync,insecure,no_subtree_check,no_root_squash)
```

Start the NFS server:

```bash
systemctl start rpcbind
systemctl start nfs-server
systemctl enable rpcbind
systemctl enable nfs-server
```

Check the NFS server:

```bash
showmount -e localhost
# Output
Export list for localhost:
/data *
```
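This only configures the export on the manager node. If the login and compute nodes are meant to see the same `/data` path (for example for shared job scripts and output files), they also need to mount it; the package name and mount options below are the usual Debian ones and are an assumption, not part of the original steps:

```bash
# On login01, compute01 and compute02 (sketch)
apt-get install -y nfs-common
mkdir -p /data
mount -t nfs 192.168.56.115:/data /data
# Optionally make the mount persistent across reboots
echo '192.168.56.115:/data /data nfs defaults,_netdev 0 0' >> /etc/fstab
```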
- Install the munge service

- Add the `munge` user (All Nodes):

```bash
groupadd -g 1108 munge
useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
```
- Install `rng-tools-debian` (Manager Node):

```bash
apt-get install -y rng-tools-debian
```

Modify the service script `/usr/lib/systemd/system/rngd.service` so that `rngd` feeds from `/dev/urandom`:

```
[Service]
ExecStart=/usr/sbin/rngd -f -r /dev/urandom
```

Reload and start the service:

```bash
systemctl daemon-reload
systemctl start rngd
systemctl enable rngd
```

- Install the munge packages (All Nodes):

```bash
apt-get install -y libmunge-dev libmunge2 munge
```
- Generate the secret key (Manager Node):

```bash
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
```

- Copy `munge.key` from the manager node to the rest of the nodes (run on the Manager Node):

```bash
scp -p /etc/munge/munge.key root@login01:/etc/munge/
scp -p /etc/munge/munge.key root@compute01:/etc/munge/
scp -p /etc/munge/munge.key root@compute02:/etc/munge/
```
- Grant privileges on `munge.key` and start the munge service (All Nodes):

```bash
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl start munge
systemctl enable munge
```

Use `systemctl status munge` to check that the service is running.

- Test munge:

```bash
munge -n | ssh compute01 unmunge
```
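The same credential test can be repeated against every node; a small loop such as the following (a sketch, not part of the original steps) makes that quick:

```bash
# Each node should report "STATUS: Success (0)"
for host in manage01 login01 compute01 compute02; do
  echo "== $host =="
  munge -n | ssh "$host" unmunge | grep STATUS
done
```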
- Install MariaDB (Manager Node):

```bash
apt-get install -y mariadb-server
```

- Create the database and user:

```bash
systemctl start mariadb
systemctl enable mariadb
ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"
mysql -uroot -p$ROOT_PASS -e 'create database slurm_acct_db'
```

- Create the user `slurm` and grant it all privileges on the database `slurm_acct_db`:

```bash
mysql -uroot -p$ROOT_PASS
```

```sql
create user slurm;
grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
flush privileges;
```
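Optionally, confirm that the `slurm` database user works before wiring it into slurmdbd; this check is not part of the original steps, just a convenient sanity test:

```bash
# Should print slurm_acct_db if the grant above succeeded
mysql -u slurm -p'123456' -e 'show databases;' | grep slurm_acct_db
```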
- Create the Slurm user (All Nodes, since the Slurm directories created later are owned by this user):

```bash
groupadd -g 1109 slurm
useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
```
## Install Slurm (All Nodes)

- Install the basic Debian package build requirements:

```bash
apt-get install -y build-essential fakeroot devscripts equivs
```

- Download and unpack the distributed tarball:

```bash
wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2 -O slurm-24.05.2.tar.bz2 &&
tar -xaf slurm*tar.bz2
```

- cd to the directory containing the Slurm source and configure:

```bash
cd slurm-24.05.2 && mkdir -p /etc/slurm && ./configure
```

- Compile and install Slurm:

```bash
make install
```
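Depending on how `configure` was run, `make install` may not install the systemd unit files. If `systemctl` cannot find the `slurmctld`, `slurmd` or `slurmdbd` units later on, copy them from the source tree; this step is an assumption, not part of the original guide:

```bash
# Run from the slurm-24.05.2 source directory on each node
# (check first with: systemctl list-unit-files | grep slurm)
cp etc/slurmdbd.service etc/slurmctld.service etc/slurmd.service /etc/systemd/system/
systemctl daemon-reload
```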
Modify the configuration files (Manager Node):

- Modify `/etc/slurm/slurm.conf` (refer to the slurm.conf documentation: https://slurm.schedmd.com/slurm.conf.html):

```bash
cp /root/slurm-24.05.2/etc/slurm.conf.example /etc/slurm/slurm.conf
vim /etc/slurm/slurm.conf
```

Focus on these options (the hostnames must match the cluster table above):

```
SlurmctldHost=manage01
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=manage01
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=123456
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobContainerType=job_container/none
JobAcctGatherType=jobacct_gather/linux
```
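The file also needs node and partition definitions for this cluster; they are not shown in the original options list, so the following is only a sketch (check the real CPU and memory figures with `slurmd -C` on a compute node and adjust accordingly):

```
# Sketch: node and partition definitions for the two compute nodes
NodeName=compute[01-02] CPUs=2 RealMemory=3500 State=UNKNOWN
PartitionName=compute Nodes=compute[01-02] Default=YES MaxTime=INFINITE State=UP
```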
- Modify `/etc/slurm/slurmdbd.conf` (refer to the slurmdbd.conf documentation: https://slurm.schedmd.com/slurmdbd.conf.html):

```bash
cp /root/slurm-24.05.2/etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
vim /etc/slurm/slurmdbd.conf
```
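The values in `slurmdbd.conf` have to line up with the MariaDB setup above; the snippet below is a sketch of the relevant keys (keep the remaining defaults from the example file):

```
AuthType=auth/munge
DbdHost=manage01
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=123456
StorageLoc=slurm_acct_db
```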
- Modify `/etc/slurm/cgroup.conf`:

```bash
cp /root/slurm-24.05.2/etc/cgroup.conf.example /etc/slurm/cgroup.conf
```
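If you want jobs to be constrained by cgroups, the copied example can be extended along these lines; this is only a sketch and takes effect only if `slurm.conf` also sets `ProctrackType=proctrack/cgroup` and `TaskPlugin=task/cgroup`, which the original guide does not show:

```
ConstrainCores=yes
ConstrainRAMSpace=yes
```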
- Send the configuration files to the other nodes:

```bash
scp -r /etc/slurm/*.conf root@login01:/etc/slurm/
scp -r /etc/slurm/*.conf root@compute01:/etc/slurm/
scp -r /etc/slurm/*.conf root@compute02:/etc/slurm/
```
- Create the Slurm directories and grant privileges (All Nodes):

```bash
mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
mkdir /var/log/slurm
chown slurm: /var/log/slurm
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
```
- Start the Slurm services on each node.

On the manager node (manage01):

```bash
systemctl start slurmdbd
systemctl enable slurmdbd
systemctl start slurmctld
systemctl enable slurmctld
```

On the compute nodes (compute01, compute02):

```bash
systemctl start slurmd
systemctl enable slurmd
```
## Test Your Slurm Cluster (Login Node)

- Check the cluster configuration:

```bash
scontrol show config
```

- Check the cluster status:

```bash
sinfo
scontrol show partition
scontrol show node
```

- Submit a job:

```bash
srun -N2 hostname
scontrol show jobs
```

- Check the job status:

```bash
squeue -a
```
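Beyond `srun`, a small batch script exercises `sbatch`, both compute nodes, and the shared `/data` directory at once. The script below is only an example; the file name and output path are assumptions (they rely on `/data` being mounted on the compute nodes):

```bash
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2                     # use both compute nodes
#SBATCH --ntasks-per-node=1
#SBATCH --output=/data/hello_%j.out   # assumes /data is the shared NFS mount

srun hostname
```

Save it as, for example, `/data/hello.sbatch`, submit it with `sbatch /data/hello.sbatch`, and read the output file once `squeue` shows the job has finished.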