Subsections of CheatSheet

Common Environment Variables

| Variable | Description |
| --- | --- |
| $SLURM_JOB_ID | The job ID. |
| $SLURM_JOBID | Deprecated. Same as $SLURM_JOB_ID. |
| $SLURM_SUBMIT_HOST | The hostname of the node used for job submission. |
| $SLURM_JOB_NODELIST | The list of nodes assigned to the job. |
| $SLURM_NODELIST | Deprecated. Same as $SLURM_JOB_NODELIST. |
| $SLURM_CPUS_PER_TASK | Number of CPUs per task. |
| $SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node. |
| $SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on this node. |
| $SLURM_CPUS_PER_GPU | Number of CPUs requested per allocated GPU. |
| $SLURM_MEM_PER_CPU | Memory per CPU. Same as --mem-per-cpu. |
| $SLURM_MEM_PER_GPU | Memory per GPU. |
| $SLURM_MEM_PER_NODE | Memory per node. Same as --mem. |
| $SLURM_GPUS | Number of GPUs requested. |
| $SLURM_NTASKS | Same as -n, --ntasks. The number of tasks. |
| $SLURM_NTASKS_PER_NODE | Number of tasks requested per node. |
| $SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket. |
| $SLURM_NTASKS_PER_CORE | Number of tasks requested per core. |
| $SLURM_NTASKS_PER_GPU | Number of tasks requested per GPU. |
| $SLURM_NPROCS | Same as -n, --ntasks. See $SLURM_NTASKS. |
| $SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. |
| $SLURM_ARRAY_JOB_ID | Job array's master job ID number. |
| $SLURM_ARRAY_TASK_ID | Job array ID (index) number. |
| $SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array. |
| $SLURM_ARRAY_TASK_MAX | Job array's maximum ID (index) number. |
| $SLURM_ARRAY_TASK_MIN | Job array's minimum ID (index) number. |

A full list of environment variables for SLURM can be found by visiting the SLURM page on environment variables.
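
A quick way to see several of these variables in action is to echo them from a small batch script. The following is a minimal sketch; the job name, partition, and node counts are placeholder assumptions and should be adapted to your cluster.

  #!/bin/bash
  #SBATCH --job-name=env-demo
  #SBATCH --partition=compute
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=1
  #SBATCH --output=env-demo_%j.out

  # Print a few of the variables from the table above
  echo "Job ID:        ${SLURM_JOB_ID}"
  echo "Submit host:   ${SLURM_SUBMIT_HOST}"
  echo "Node list:     ${SLURM_JOB_NODELIST}"
  echo "Tasks:         ${SLURM_NTASKS}"
  echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-unset}"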

File Operations

File Distribution

  • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
    • Features
      1. Distribute files: quickly copy files to all compute nodes assigned to the job, avoiding the hassle of distributing them manually. Faster than traditional scp or rsync, especially when distributing to many nodes.
      2. Simplify scripts: a single command distributes files to every node assigned to the job.
      3. Improve performance: transfers are parallelized, which speeds up distribution of large or numerous files.
    • Usage
      1. Standalone
      sbcast <source_file> <destination_path>
      2. Embedded in a job script
      #!/bin/bash
      #SBATCH --job-name=example_job
      #SBATCH --output=example_job.out
      #SBATCH --error=example_job.err
      #SBATCH --partition=compute
      #SBATCH --nodes=4
      
      # Use sbcast to distribute the file to the /tmp directory of each node
      sbcast data.txt /tmp/data.txt
      
      # Run your program using the distributed files
      srun my_program /tmp/data.txt
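
      # Optional sanity check (a suggested follow-up, not part of the original example):
      # list the broadcast file on every allocated node to confirm the transfer worked
      srun --ntasks-per-node=1 ls -l /tmp/data.txt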

File Collection

  1. File redirection: when submitting a job, use the #SBATCH --output and #SBATCH --error directives to redirect standard output and standard error to the specified files.

     #SBATCH --output=output.txt
     #SBATCH --error=error.txt

    Or

    sbatch -N2 -w "compute[01-02]" -o result/file/path xxx.slurm
  2. Manual copy to a destination: use scp or rsync inside the job to copy result files from the compute nodes back to the submit node (see the sketch after this list).

  3. Using a shared file system: if a shared file system (such as NFS, Lustre, or GPFS) is configured in the cluster, result files can be written directly to a shared directory. That way, the result files generated by all nodes automatically end up in the same location.

  4. Using sgather, the reverse of sbcast, which gathers a file from each node allocated to the job back to the submit host.
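
For option 2 above, a minimal sketch of the scp approach is shown below. The result paths and the pre-existing ~/results directory on the submit host are assumptions, and passwordless SSH from the compute nodes back to the submit host is required.

  #!/bin/bash
  #SBATCH --job-name=collect-results
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=1
  #SBATCH --output=collect_%j.out

  # Each task writes its result to node-local storage
  srun bash -c 'hostname > /tmp/result_$(hostname).txt'

  # Copy the node-local results back to the submit host
  # ($SLURM_SUBMIT_HOST is listed in the table of environment variables above)
  srun bash -c 'scp /tmp/result_$(hostname).txt ${SLURM_SUBMIT_HOST}:~/results/'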

Submit Jobs

3 Types of Jobs

  • srun is used to submit a job for execution or initiate job steps in real time.

    • Example
      1. Run a command
      srun -N2 /bin/hostname
      2. Run a script
      srun -N1 test.sh
  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

    • Example

      1. Submit a batch job (command-line options such as -N override the matching #SBATCH directives in the script)
      sbatch -N2 -w "compute[01-02]" -o job.stdout /data/jobs/batch-job.slurm
      #!/bin/bash
      
      #SBATCH -N 1
      #SBATCH --job-name=cpu-N1-batch
      #SBATCH --partition=compute
      #SBATCH --mail-type=end
      #SBATCH --mail-user=xxx@email.com
      #SBATCH --output=%j.out
      #SBATCH --error=%j.err
      
      srun -l /bin/hostname  # you can still use srun <command> inside the script
      srun -l pwd
      
      2. Submit a job array in which each task processes a different data partition
      sbatch /data/jobs/parallel.slurm
      #!/bin/bash
      #SBATCH -N 2 
      #SBATCH --job-name=cpu-N2-parallel
      #SBATCH --partition=compute
      #SBATCH --time=01:00:00
      #SBATCH --array=1-4  # define a job array with 4 tasks, one per data partition
      #SBATCH --ntasks-per-node=1  # run only one task per node
      #SBATCH --output=process_data_%A_%a.out
      #SBATCH --error=process_data_%A_%a.err
      
      TASK_ID=${SLURM_ARRAY_TASK_ID}
      
      DATA_PART="data_part_${TASK_ID}.txt"  # make sure this file exists (see the split command below)
      
      if [ -f ${DATA_PART} ]; then
          echo "Processing ${DATA_PART} on node $(hostname)"
          # python process_data.py --input ${DATA_PART}
      else
          echo "File ${DATA_PART} does not exist!"
      fi
      
      # Prepare the data partitions before submitting the job, for example:
      split -l 1000 data.txt data_part_ && \
      mv data_part_aa data_part_1 && \
      mv data_part_ab data_part_2
      
  • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

    • Example
      1. Allocate resources (similar to reserving an interactive environment)
      salloc -N2 bash
      This command creates a job that allocates 2 nodes and spawns a bash shell on the node where salloc was run; from that shell you can execute srun commands on the allocated nodes. After your computing task finishes, remember to shut down the job:
      scancel <job_id>
      Alternatively, when you exit the shell, the resources are released.
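
Putting the pieces together, a typical interactive session might look like the sketch below. The partition name, time limit, and program name are placeholder assumptions.

  # Request an interactive allocation of 2 nodes for 30 minutes
  salloc -N2 --time=00:30:00 --partition=compute bash

  # Inside the spawned shell, launch work on the allocated nodes
  srun -l hostname            # by default, one task per allocated node
  srun -N1 -n1 ./my_program   # a single task on one node (hypothetical program)

  # Exit the shell to release the allocation (or run: scancel $SLURM_JOB_ID)
  exit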