{{toc}}

h1. How to run jobs on the euclides nodes

Use slurm to submit jobs to the euclides nodes (node1-8); ssh login access to those nodes will be restricted in the near future.

*Please read through this entire wiki page so everyone can make efficient use of this cluster.*

h2. alexandria

*Please do not use alexandria as a compute node* - its hardware is different from that of the nodes. It hosts our file server and other services that are important to us.

You should use alexandria to
- transfer files
- compile your code
- submit jobs to the nodes

If you need to debug, please start an interactive job on one of the nodes using slurm. For instructions see below.

h2. euclides nodes

Job submission to the euclides nodes is handled by the slurm jobmanager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).
*Important: In order to run jobs, you need to be added to the slurm accounting system - please contact Kerstin.*

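Once you have been added, you can check which slurm account(s) you belong to with sacctmgr, e.g. (a quick sketch; the account names shown are whatever has been set up for your group):
<pre>
sacctmgr show associations user=<yourusername> format=User,Account,Partition
</pre>
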
All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...).

If you are already familiar with another jobmanager, the following information may be helpful to you: http://slurm.schedmd.com/rosetta.pdf

h3. Scheduling of Jobs

At this point there are two queues, called partitions in slurm (see the sinfo example below for how to inspect their limits):
* *normal* is the default partition your jobs will be sent to if you do not specify otherwise. At this point the time limit is two days and jobs can only run on 1 node.
* *debug* is meant for debugging; you can only run one job at a time, and other jobs you submit will remain in the queue. The time limit is 12 hours.
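
You can check the current limits of each partition yourself with sinfo's format options, for example:
<pre>
sinfo -o "%P %l %D %N"   # partition, time limit, number of nodes, node list
</pre>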

We have also set up a scheduler that goes beyond first come, first served - some jobs will be favoured over others depending on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many resources it will consume.
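
If you are curious how this affects your own jobs, slurm provides commands to inspect job priorities and recorded usage (assuming the fair-share setup described above uses slurm's multifactor priority plugin):
<pre>
sprio -l   # priority factors (age, fair-share, job size) of pending jobs
sshare     # recent usage and fair-share factor of you and your group
</pre>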

This serves as a starting point; we may have to adjust parameters once the slurm jobmanager is in regular use. Job scheduling is a complex issue and we still need to build expertise and gain experience with the user needs in our groups. Please feel free to speak up if there is something that can be improved without creating an unfair disadvantage for other users.

You can run interactive jobs on both partitions.

h3. Running an interactive job with slurm

To run an interactive job with slurm in the default partition, use

<pre>
srun -u --pty bash
</pre>

If you want to use tcsh, use

<pre>
srun -u --pty tcsh
</pre>

In case you want to open x11 applications, use the --x11=first option, e.g.
<pre>
srun --x11=first -u --pty bash
</pre>

In case the 'normal' partition is overcrowded, you can use the 'debug' partition:
<pre>
srun --account cosmo_debug -p debug -u --pty bash  # if you are part of the Cosmology group
srun --account euclid_debug -p debug -u --pty bash # if you are part of the EuclidDM group
</pre>
As soon as a slot is open, slurm will log you in to an interactive session on one of the nodes.

h3. Running a simple one-core batch job with slurm using the default partition

* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>

* To run slurm, create a file myjob.slurm containing the following information:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal

/bin/hostname
</pre>

* To submit a batch job use:
<pre>
sbatch myjob.slurm
</pre>

* To see the status of your job, use (two handy variations are shown after this list):
<pre>
squeue
</pre>

* To kill a job use:
<pre>
scancel <jobid>
</pre>
You can get the <jobid> from squeue.

* For some more information on your job use
<pre>
scontrol show job <jobid>
</pre>
You can get the <jobid> from squeue.
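
Two variations of the status commands above that are often handy (a sketch; sacct only works if job accounting records are stored, which the slurm accounting setup mentioned above suggests):
<pre>
squeue -u <yourusername>                        # list only your own jobs
sacct -j <jobid> --format=JobID,State,Elapsed   # information about a job that has already finished
</pre>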

h3. Running a simple one-core batch job with slurm using the debug partition

Change the partition to debug and add the appropriate account, depending on whether you are part of the euclid or cosmology group.

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p debug
#SBATCH --account=[cosmo_debug/euclid_debug]

/bin/hostname
</pre>

h3. Accessing a node where a job is running or starting additional processes on a node

You can attach an srun command to an already existing job (batch or interactive). This means you can start an interactive session on a node where one of your jobs is running, or start an additional process there.

First determine the jobid of the desired job using squeue, then use

<pre>
srun --jobid <jobid> [options] <executable>
</pre>
Or, more concretely:
<pre>
srun --jobid <jobid> -u --pty bash  # to start an interactive session
srun --jobid <jobid> ps -eaFAl      # to get detailed process information
</pre>

The processes will only run on cores that have been allocated to you.
This works for batch as well as interactive jobs.
*Important: Once the original job finishes, any process attached in this fashion will be killed.*

h3. Batch script for running a multi-core job

MPI is installed on alexandria.

To run a 4-core job for an executable compiled with MPI, you can use
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -n 4

mpirun <programname>
</pre>
and it will automatically start on the number of cores specified.

To ensure that the job is executed on only one node, add
<pre>
#SBATCH -N 1
</pre>
to the job script.

If you would like to run a program that itself starts processes, you can use the environment variable $SLURM_NPROCS, which is automatically defined for slurm jobs, to explicitly pass the number of cores the program may use.
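For example, a minimal sketch (<programname> and its --nthreads option are placeholders for whatever your own code expects):
<pre>
#!/bin/bash
#SBATCH -n 4

# pass the number of allocated cores to the program explicitly
<programname> --nthreads $SLURM_NPROCS
</pre>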

To check if your job is actually running on the specified number of cores, you can check the PSR column of
<pre>
ps -eaFAl
# or ps -eaFAl | egrep "<yourusername>|UID" if you just want to see your jobs
</pre>