Subsections of CheatSheet

Common Environment Variables

| Variable | Description |
| --- | --- |
| $SLURM_JOB_ID | The job ID. |
| $SLURM_JOBID | Deprecated. Same as $SLURM_JOB_ID. |
| $SLURM_SUBMIT_HOST | The hostname of the node used for job submission. |
| $SLURM_JOB_NODELIST | The list of nodes assigned to the job. |
| $SLURM_NODELIST | Deprecated. Same as $SLURM_JOB_NODELIST. |
| $SLURM_CPUS_PER_TASK | Number of CPUs per task. |
| $SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node. |
| $SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on this node. |
| $SLURM_CPUS_PER_GPU | Number of CPUs requested per allocated GPU. |
| $SLURM_MEM_PER_CPU | Memory per CPU. Same as --mem-per-cpu. |
| $SLURM_MEM_PER_GPU | Memory per GPU. |
| $SLURM_MEM_PER_NODE | Memory per node. Same as --mem. |
| $SLURM_GPUS | Number of GPUs requested. |
| $SLURM_NTASKS | Same as -n, --ntasks. The number of tasks. |
| $SLURM_NTASKS_PER_NODE | Number of tasks requested per node. |
| $SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket. |
| $SLURM_NTASKS_PER_CORE | Number of tasks requested per core. |
| $SLURM_NTASKS_PER_GPU | Number of tasks requested per GPU. |
| $SLURM_NPROCS | Same as -n, --ntasks. See $SLURM_NTASKS. |
| $SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. |
| $SLURM_ARRAY_JOB_ID | Job array's master job ID number. |
| $SLURM_ARRAY_TASK_ID | Job array ID (index) number. |
| $SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array. |
| $SLURM_ARRAY_TASK_MAX | Job array's maximum ID (index) number. |
| $SLURM_ARRAY_TASK_MIN | Job array's minimum ID (index) number. |

A full list of environment variables for SLURM can be found by visiting the SLURM page on environment variables.
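
A quick way to see several of these variables in action is to echo them from a small batch script. The following is a minimal sketch; the job name, partition, and node counts are placeholder assumptions and should be adapted to your cluster.

  #!/bin/bash
  #SBATCH --job-name=env-demo
  #SBATCH --partition=compute
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=1
  #SBATCH --output=env-demo_%j.out

  # Print a few of the variables from the table above
  echo "Job ID:        ${SLURM_JOB_ID}"
  echo "Submit host:   ${SLURM_SUBMIT_HOST}"
  echo "Node list:     ${SLURM_JOB_NODELIST}"
  echo "Tasks:         ${SLURM_NTASKS}"
  echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-unset}"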

File Operations

File Distribution

  • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
    • Features
      1. Distribute files: quickly copy files to all compute nodes assigned to the job, avoiding the hassle of distributing them manually. Faster than traditional scp or rsync, especially when distributing to many nodes.
      2. Simplify scripts: a single command distributes files to every node assigned to the job.
      3. Improve performance: transfers are parallelized, which speeds up distribution of large or numerous files.
    • Usage
      1. Standalone
      sbcast <source_file> <destination_path>
      2. Embedded in a job script
      #!/bin/bash
      #SBATCH --job-name=example_job
      #SBATCH --output=example_job.out
      #SBATCH --error=example_job.err
      #SBATCH --partition=compute
      #SBATCH --nodes=4
      
      # Use sbcast to distribute the file to the /tmp directory of each node
      sbcast data.txt /tmp/data.txt
      
      # Run your program using the distributed files
      srun my_program /tmp/data.txt
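
      # Optional sanity check (a suggested follow-up, not part of the original example):
      # list the broadcast file on every allocated node to confirm the transfer worked
      srun --ntasks-per-node=1 ls -l /tmp/data.txt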

File Collection

  1. File redirection: when submitting a job, use the #SBATCH --output and #SBATCH --error directives to redirect standard output and standard error to the specified files.

     #SBATCH --output=output.txt
     #SBATCH --error=error.txt

    Or

    sbatch -N2 -w "compute[01-02]" -o result/file/path xxx.slurm
  2. Manual copy to a destination: use scp or rsync inside the job to copy result files from the compute nodes back to the submit node (see the sketch after this list).

  3. Using a shared file system: if a shared file system (such as NFS, Lustre, or GPFS) is configured in the cluster, result files can be written directly to a shared directory. That way, the result files generated by all nodes automatically end up in the same location.

  4. Using sgather, the reverse of sbcast, which gathers a file from each node allocated to the job back to the submit host.
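
For option 2 above, a minimal sketch of the scp approach is shown below. The result paths and the pre-existing ~/results directory on the submit host are assumptions, and passwordless SSH from the compute nodes back to the submit host is required.

  #!/bin/bash
  #SBATCH --job-name=collect-results
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=1
  #SBATCH --output=collect_%j.out

  # Each task writes its result to node-local storage
  srun bash -c 'hostname > /tmp/result_$(hostname).txt'

  # Copy the node-local results back to the submit host
  # ($SLURM_SUBMIT_HOST is listed in the table of environment variables above)
  srun bash -c 'scp /tmp/result_$(hostname).txt ${SLURM_SUBMIT_HOST}:~/results/'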

Submit Jobs

3 Types of Jobs

  • srun is used to submit a job for execution or initiate job steps in real time.

    • Example
      1. Run a command
      srun -N2 /bin/hostname
      2. Run a script
      srun -N1 test.sh
  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

    • Example

      1. Submit a batch job (command-line options such as -N override the matching #SBATCH directives in the script)
      sbatch -N2 -w "compute[01-02]" -o job.stdout /data/jobs/batch-job.slurm
      #!/bin/bash
      
      #SBATCH -N 1
      #SBATCH --job-name=cpu-N1-batch
      #SBATCH --partition=compute
      #SBATCH --mail-type=end
      #SBATCH --mail-user=xxx@email.com
      #SBATCH --output=%j.out
      #SBATCH --error=%j.err
      
      srun -l /bin/hostname  # you can still use srun <command> inside the script
      srun -l pwd
      
      2. Submit a job array in which each task processes a different data partition
      sbatch /data/jobs/parallel.slurm
      #!/bin/bash
      #SBATCH -N 2 
      #SBATCH --job-name=cpu-N2-parallel
      #SBATCH --partition=compute
      #SBATCH --time=01:00:00
      #SBATCH --array=1-4  # define a job array with 4 tasks, one per data partition
      #SBATCH --ntasks-per-node=1  # run only one task per node
      #SBATCH --output=process_data_%A_%a.out
      #SBATCH --error=process_data_%A_%a.err
      
      TASK_ID=${SLURM_ARRAY_TASK_ID}
      
      DATA_PART="data_part_${TASK_ID}.txt"  # make sure this file exists (see the split command below)
      
      if [ -f ${DATA_PART} ]; then
          echo "Processing ${DATA_PART} on node $(hostname)"
          # python process_data.py --input ${DATA_PART}
      else
          echo "File ${DATA_PART} does not exist!"
      fi
      
      # Prepare the data partitions before submitting the job, for example:
      split -l 1000 data.txt data_part_ && \
      mv data_part_aa data_part_1 && \
      mv data_part_ab data_part_2
      
  • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

    • Example
      1. Allocate resources (similar to reserving an interactive environment)
      salloc -N2 bash
      This command creates a job that allocates 2 nodes and spawns a bash shell on the node where salloc was run; from that shell you can execute srun commands on the allocated nodes. After your computing task finishes, remember to shut down the job:
      scancel <job_id>
      Alternatively, when you exit the shell, the resources are released.
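
Putting the pieces together, a typical interactive session might look like the sketch below. The partition name, time limit, and program name are placeholder assumptions.

  # Request an interactive allocation of 2 nodes for 30 minutes
  salloc -N2 --time=00:30:00 --partition=compute bash

  # Inside the spawned shell, launch work on the allocated nodes
  srun -l hostname            # by default, one task per allocated node
  srun -N1 -n1 ./my_program   # a single task on one node (hypothetical program)

  # Exit the shell to release the allocation (or run: scancel $SLURM_JOB_ID)
  exit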