# Install on Ubuntu
## Cluster Setting
- 1 Manager
- 1 Login Node
- 2 Compute nodes
| hostname | IP | role | quota |
|---|---|---|---|
| manage01 (slurmctld, slurmdbd) | 192.168.56.115 | manager | 2C4G |
| login01 (login) | 192.168.56.116 | login | 2C4G |
| compute01 (slurmd) | 192.168.56.117 | compute | 2C4G |
| compute02 (slurmd) | 192.168.56.118 | compute | 2C4G |
Software Version:
| software | version |
|---|---|
| os | Ubuntu 22.04 |
| slurm | 25.05.2 |
**Important**
- When you see (All Nodes), run the following commands on all nodes.
- When you see (Manager Node), run the following commands only on the manager node.
- When you see (Login Node), run the following commands only on the login node.
## Prepare Steps (All Nodes)
- Modify the /etc/apt/sources.list file to use the TUNA mirror
```bash
cat > /etc/apt/sources.list << EOF
# TUNA mirror for Ubuntu 22.04 (jammy)
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
EOF
```
- Update the apt cache
```bash
apt clean all && apt update
```
- Set the hosts file
```bash
cat >> /etc/hosts << EOF
10.119.2.36 juice-036
10.119.2.37 juice-037
10.119.2.38 juice-038
EOF
```
- Install the ntpdate package
```bash
apt-get -y install ntpdate
```
- Sync server time
```bash
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
echo 'Asia/Shanghai' > /etc/timezone
ntpdate ntp.aliyun.com
```
- Add a cron job to sync time
```bash
crontab -e
```
Add the following line:
```
*/5 * * * * /usr/sbin/ntpdate ntp.aliyun.com
```
- Create an ssh key pair on each node
```bash
ssh-keygen -t rsa -b 4096 -C "$HOSTNAME"
```
- Test ssh login to the other nodes without a password
(All Nodes)
```bash
ssh-copy-id -i ~/.ssh/id_rsa.pub root@juice-036
ssh-copy-id -i ~/.ssh/id_rsa.pub root@juice-037
ssh-copy-id -i ~/.ssh/id_rsa.pub root@juice-038
```
## Install Components
- Install NFS server (Manager Node)
There are many ways to install an NFS server, e.g. using `apt install -y nfs-kernel-server`; see https://www.linuxtechi.com/how-to-install-nfs-server-on-debian/
- create the shared folder
```bash
mkdir /data
chmod 755 /data
```
- modify /etc/exports (`vim /etc/exports`)
```
/data *(rw,sync,insecure,no_subtree_check,no_root_squash)
```
- start the nfs server
```bash
systemctl start rpcbind
systemctl start nfs-server
systemctl enable rpcbind
systemctl enable nfs-server
```
- check the nfs server
```bash
showmount -e localhost
# Output
Export list for localhost:
/data *
```
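The export above only sets up the server; the login and compute nodes still need to mount the share. A minimal sketch, assuming the manager node is reachable as `juice-036` (adjust the hostname to your manager node):
```bash
# On the login and compute nodes
apt install -y nfs-common
mkdir -p /data
mount -t nfs juice-036:/data /data
# Optionally make the mount persistent across reboots
echo 'juice-036:/data /data nfs defaults,_netdev 0 0' >> /etc/fstab
```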
- Install munge service
- install munge and the build dependencies (the munge package also creates the munge user) (All Nodes)
```bash
sudo apt install -y build-essential git wget munge libmunge-dev libmunge2 \
    mariadb-server libmariadb-dev libssl-dev libpam0g-dev \
    libhwloc-dev liblua5.3-dev libreadline-dev libncurses-dev \
    libjson-c-dev libyaml-dev libhttp-parser-dev libjwt-dev libdbus-glib-1-dev libbpf-dev libdbus-1-dev
```
- generate the munge key (Manager Node)
```bash
which mungekey
# if mungekey is available, use it to generate the key
sudo systemctl stop munge
sudo mungekey -c
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl start munge
```
- copy munge.key from the manager node to the other nodes (Manager Node)
```bash
sudo scp /etc/munge/munge.key juice-036:/tmp/munge.key
sudo scp /etc/munge/munge.key juice-037:/tmp/munge.key
sudo scp /etc/munge/munge.key juice-038:/tmp/munge.key
```
- grant privileges on munge.key
(All Nodes)
```bash
systemctl stop munge
sudo mv /tmp/munge.key /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl start munge
systemctl status munge
systemctl enable munge
```
Use `systemctl status munge` to check that the service is running.
- test munge
```bash
munge -n | ssh juice-036 unmunge
munge -n | ssh juice-037 unmunge
munge -n | ssh juice-038 unmunge
```
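To check every node in one pass, a small loop over the same hostnames works as well (hostnames are the ones added to /etc/hosts earlier):
```bash
# Each node should report STATUS: Success (0)
for h in juice-036 juice-037 juice-038; do
  echo "== $h =="
  munge -n | ssh "$h" unmunge | grep STATUS
done
```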
- Install MariaDB (Manager Node)
```bash
apt-get install -y mariadb-server
```
- set the database root password and create the accounting database
```bash
systemctl start mariadb
systemctl enable mariadb
ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED BY '${ROOT_PASS}';"
mysql -uroot -p"$ROOT_PASS" -e 'create database slurm_acct_db'
```
- create user `slurm` and grant all privileges on the `slurm_acct_db` database
```bash
mysql -uroot -p"$ROOT_PASS"
```
Then, inside the MariaDB shell:
```sql
create user slurm;
grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
flush privileges;
```
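A quick way to confirm the grant worked, using the credentials created above:
```bash
# Should print slurm_acct_db if the slurm user can reach the accounting database
mysql -u slurm -p123456 -e 'SHOW DATABASES;' | grep slurm_acct_db
```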
- create the Slurm system user
```bash
groupadd -g 1109 slurm
useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
```
## Install Slurm (All Nodes)
- Install basic Debian package build requirements:
```bash
apt-get install -y build-essential fakeroot devscripts equivs
```
- Unpack the distributed tarball:
```bash
wget https://download.schedmd.com/slurm/slurm-25.05.2.tar.bz2 -O slurm-25.05.2.tar.bz2 &&
tar -xaf slurm*tar.bz2
```
- cd to the directory containing the Slurm source and configure:
```bash
cd slurm-25.05.2 && mkdir -p /etc/slurm && ./configure --prefix=/usr --sysconfdir=/etc/slurm --enable-cgroupv2
```
- compile and install slurm
```bash
make install
```
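Because Slurm is built from source here, the systemd unit files may not end up installed by `make install`; the build tree generates them under `etc/`. A sketch of copying them manually, assuming the build directory used above:
```bash
# Skip this step if unit files already exist under /lib/systemd/system or /usr/lib/systemd/system
cp /root/slurm-25.05.2/etc/slurmd.service /etc/systemd/system/
cp /root/slurm-25.05.2/etc/slurmctld.service /etc/systemd/system/
cp /root/slurm-25.05.2/etc/slurmdbd.service /etc/systemd/system/
systemctl daemon-reload
```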
## Modify Configuration Files (Manager Node)
- modify /etc/slurm/slurm.conf (refer to the slurm.conf documentation)
```bash
cp /root/slurm-25.05.2/etc/slurm.conf.example /etc/slurm/slurm.conf
vim /etc/slurm/slurm.conf
```
Focus on these options:
```
SlurmctldHost=manage
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=manage
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=123456
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobContainerType=job_container/none
JobAcctGatherType=jobacct_gather/linux
```
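Besides the options above, slurm.conf also needs node and partition definitions at the bottom of the file. A minimal sketch based on the 2C4G sizing from the cluster table; the hostnames, CPU counts and RealMemory values are assumptions, adjust them to your hardware:
```
NodeName=compute0[1-2] CPUs=2 RealMemory=3800 State=UNKNOWN
PartitionName=compute Nodes=compute0[1-2] Default=YES MaxTime=INFINITE State=UP
```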
- modify /etc/slurm/slurmdbd.conf (refer to the slurmdbd.conf documentation)
```bash
cp /root/slurm-25.05.2/etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
vim /etc/slurm/slurmdbd.conf
```
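The options that usually need attention in slurmdbd.conf, matching the MariaDB setup above; the host names and password here are assumptions, keep them in sync with your database and with the accounting options in slurm.conf:
```
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=123456
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
```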
- modify /etc/slurm/cgroup.conf
```bash
cp /root/slurm-25.05.2/etc/cgroup.conf.example /etc/slurm/cgroup.conf
```
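A minimal cgroup.conf that enforces CPU and memory limits could look like this; enable only the constraints you actually want:
```
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
```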
- send the configuration files to the other nodes
```bash
scp -r /etc/slurm/*.conf root@juice-037:/etc/slurm/
scp -r /etc/slurm/*.conf root@juice-038:/etc/slurm/
```
- grant privileges on some directories (All Nodes)
```bash
mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
mkdir /var/log/slurm
chown slurm: /var/log/slurm
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
```
- start the slurm services on each node
(Manager Node)
```bash
systemctl start slurmdbd
systemctl enable slurmdbd
systemctl start slurmctld
systemctl enable slurmctld
```
(Compute Nodes)
```bash
systemctl start slurmd
systemctl enable slurmd
```
## Test Your Slurm Cluster (Login Node)
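Before submitting test jobs, the cluster usually has to be registered in the accounting database once. A minimal sketch, run on the manager node; the name must match the ClusterName in your slurm.conf (the example file sets `ClusterName=cluster`):
```bash
# Register the cluster with slurmdbd (-i skips the confirmation prompt)
sacctmgr -i add cluster cluster
# Confirm it is listed
sacctmgr show cluster
```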
- check cluster configuration
```bash
scontrol show config
```
- check cluster status
```bash
sinfo
scontrol show partition
scontrol show node
```
- submit a job
```bash
srun -N2 hostname
scontrol show jobs
```
- check job status
```bash
squeue -a
```
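Beyond `srun`, a simple batch script is a quick way to confirm that scheduling and output files work. A minimal sketch, assuming a default partition exists and that /data is the NFS share mounted on the login and compute nodes; the script path and job name are arbitrary:
```bash
cat > /data/hello.sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=/data/hello_%j.out
srun hostname
EOF
sbatch /data/hello.sbatch
squeue                  # watch the job run
cat /data/hello_*.out   # each allocated node should print its hostname
```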