Common Environment Variables
| Variable | Description |
|---|---|
| $SLURM_JOB_ID | The job ID. |
| $SLURM_JOBID | Deprecated. Same as $SLURM_JOB_ID. |
| $SLURM_SUBMIT_HOST | The hostname of the node used for job submission. |
| $SLURM_JOB_NODELIST | The list of nodes assigned to the job. |
| $SLURM_NODELIST | Deprecated. Same as $SLURM_JOB_NODELIST. |
| $SLURM_CPUS_PER_TASK | Number of CPUs per task. |
| $SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node. |
| $SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on this node. |
| $SLURM_CPUS_PER_GPU | Number of CPUs requested per allocated GPU. |
| $SLURM_MEM_PER_CPU | Memory per CPU. Same as --mem-per-cpu. |
| $SLURM_MEM_PER_GPU | Memory per GPU. |
| $SLURM_MEM_PER_NODE | Memory per node. Same as --mem. |
| $SLURM_GPUS | Number of GPUs requested. |
| $SLURM_NTASKS | Same as -n, --ntasks. The number of tasks. |
| $SLURM_NTASKS_PER_NODE | Number of tasks requested per node. |
| $SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket. |
| $SLURM_NTASKS_PER_CORE | Number of tasks requested per core. |
| $SLURM_NTASKS_PER_GPU | Number of tasks requested per GPU. |
| $SLURM_NPROCS | Same as -n, --ntasks. See $SLURM_NTASKS. |
| $SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. |
| $SLURM_ARRAY_JOB_ID | Job array's master job ID number. |
| $SLURM_ARRAY_TASK_ID | Job array ID (index) number. |
| $SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array. |
| $SLURM_ARRAY_TASK_MAX | Job array's maximum ID (index) number. |
| $SLURM_ARRAY_TASK_MIN | Job array's minimum ID (index) number. |
A full list of environment variables for SLURM can be found by visiting the SLURM page on environment variables.
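To illustrate how these variables are typically read inside a job, here is a minimal sketch of an array job script; the script contents, the program name (my_program), and the input naming scheme (input_1.txt, input_2.txt, …) are hypothetical:

```bash
#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --array=1-4
#SBATCH --ntasks=1

# Log where and what we are running (all variables are set by SLURM at runtime)
echo "Job ${SLURM_JOB_ID}, array task ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_COUNT}"
echo "Submitted from ${SLURM_SUBMIT_HOST}, running on ${SLURM_JOB_NODELIST}"

# Use the array index to pick this task's input file (hypothetical naming scheme)
srun ./my_program "input_${SLURM_ARRAY_TASK_ID}.txt"
```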
File Operations
File Distribution
sbcast
is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
- Features
  - Distribute files: quickly copies files to all compute nodes assigned to the job, avoiding the hassle of manual distribution. Faster than traditional scp or rsync, especially when distributing to multiple nodes.
  - Simplify scripts: one command distributes files to all nodes assigned to the job.
  - Improve performance: transfers are parallelized, which improves distribution speed, especially for large files or many files.
- Usage
- Standalone

  ```bash
  sbcast <source_file> <destination_path>
  ```

- Embedded in a job script

  ```bash
  #!/bin/bash
  #SBATCH --job-name=example_job
  #SBATCH --output=example_job.out
  #SBATCH --error=example_job.err
  #SBATCH --partition=compute
  #SBATCH --nodes=4

  # Use sbcast to distribute the file to the /tmp directory of each node
  sbcast data.txt /tmp/data.txt

  # Run your program using the distributed file
  srun my_program /tmp/data.txt
  ```
File Collection
File Redirection
When submitting a job, you can use the #SBATCH --output and #SBATCH --error directives to redirect standard output and standard error to specified files.

```bash
#SBATCH --output=output.txt
#SBATCH --error=error.txt
```

Or pass the output path on the command line:

```bash
sbatch -N2 -w "compute[01-02]" -o result/file/path xxx.slurm
```
Send the Destination Address Manually
Use scp or rsync in the job to copy result files from the compute nodes back to the submit node (see the sketch below).

Using NFS
If a shared file system (such as NFS, Lustre, or GPFS) is configured in the computing cluster, result files can be written directly to the shared directory. In this way, the result files generated by all nodes are automatically stored in the same location.

Using sbcast (see File Distribution above)
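As a sketch of the manual-copy approach, the following job-script fragment uses rsync to push a result file back to the submit host; the program name, output path, results directory, and the availability of passwordless SSH from compute nodes to the submit node are all assumptions:

```bash
#!/bin/bash
#SBATCH --job-name=collect_demo
#SBATCH --nodes=1

# Compute step writes its output to node-local disk (hypothetical program and path)
srun my_program --out /tmp/results.dat

# Copy the result back to the submit host; assumes passwordless SSH between
# compute and submit nodes. $SLURM_SUBMIT_HOST is set by SLURM at runtime.
rsync -av /tmp/results.dat "${SLURM_SUBMIT_HOST}:~/results/"
```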
Submit Jobs
3 Types of Jobs
srun
is used to submit a job for execution or initiate job steps in real time.
- Examples
  - Run a command

    ```bash
    srun -N2 /bin/hostname
    ```

  - Run a script

    ```bash
    srun -N1 test.sh
    ```
sbatch
is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
- Examples
  - Submit a batch job

    ```bash
    sbatch -N2 -w "compute[01-02]" -o job.stdout /data/jobs/batch-job.slurm
    ```

  - Submit a parallel job to process different data partitions (a sketch of such a script follows below)

    ```bash
    sbatch /data/jobs/parallel.slurm
    ```
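The contents of /data/jobs/parallel.slurm are not shown in this cheat sheet; as a hedged sketch, such a script might use a job array so that each task processes its own data partition. The program name and the partition naming scheme are hypothetical:

```bash
#!/bin/bash
#SBATCH --job-name=parallel_demo
#SBATCH --array=0-3
#SBATCH --ntasks=1

# Each array task processes one data partition; the naming scheme
# (partition_0.dat, partition_1.dat, ...) is a hypothetical example.
srun my_program "/data/partitions/partition_${SLURM_ARRAY_TASK_ID}.dat"
```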
salloc
is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
- Example
  - Allocate resources (similar to creating a virtual machine)

    The following command creates a job that allocates 2 nodes and spawns a bash shell, in which you can execute srun commands. After your computing task finishes, remember to shut down your job.

    ```bash
    salloc -N2 bash
    ```

    When you exit the shell, the resources are released. You can also cancel the job manually:

    ```bash
    scancel <job_id>
    ```
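To round out the salloc workflow, here is a sketch of a full interactive session; my_task is a hypothetical program:

```bash
# Request two nodes and start a shell inside the allocation
salloc -N2 bash

# Inside the allocated shell, srun launches tasks on the allocated nodes:
srun hostname        # prints the name of each allocated node
srun -n4 ./my_task   # hypothetical parallel task across the allocation

# Leaving the shell ends the job and releases the nodes
exit
```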