h1. How to run jobs on the euclides nodes

Use slurm to submit jobs to the euclides nodes (node1-8); ssh login access to those nodes will be restricted in the near future.

*Please read through this entire wiki page so everyone can make efficient use of this cluster.*

h2. alexandria

*Please do not use alexandria as a compute node* - its hardware is different from that of the compute nodes, and it hosts our file server and other services that are important to us.

You should use alexandria to
- transfer files
- compile your code
- submit jobs to the nodes

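For example, a file transfer to the cluster could look like this (the username and target path are placeholders, a minimal sketch):
<pre>
# run on your own machine; copies data into your work directory on alexandria
scp mydata.tar.gz <username>@alexandria:/path/to/your/workdir/
</pre>
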
If you need to debug, please start an interactive job on one of the nodes using slurm. For instructions see below.

h2. euclides nodes

Job submission to the euclides nodes is handled by the slurm job manager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).
*Important: In order to run jobs, you need to be added to the slurm accounting system - please contact Kerstin*

All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...).

If you are already familiar with another job manager, the following information may be helpful to you: http://slurm.schedmd.com/rosetta.pdf

h3. Scheduling of Jobs

At this point there are two queues, called partitions in slurm:
* *normal* - the default partition your jobs are sent to if you do not specify otherwise. It currently has a time limit of two days, and jobs can only run on 1 node.
* *debug* - meant for debugging. You can only run one job at a time; any further jobs you submit will remain in the queue.

We have also set up a scheduler that goes beyond plain first come, first served: some jobs will be favoured over others depending on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued, and how many resources it will consume.

This serves as a starting point; we may have to adjust parameters once the slurm job manager is in regular use. Job scheduling is a complex issue, and we still need to build expertise and gain experience with the user needs in our groups. Please feel free to speak up if there is something that can be improved without creating an unfair disadvantage for other users.

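If you want to see how this plays out for the jobs currently queued, slurm's standard priority commands can help (a sketch; the exact columns shown depend on how the scheduler is configured on euclides):
<pre>
sprio -l    # per-job priority, broken down into factors (age, fair-share, ...)
sshare -a   # recent usage and fair-share values for all accounts and users
</pre>
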
You can run interactive jobs on both partitions.

h3. Running an interactive job with slurm

To run an interactive job with slurm in the default partition, use

<pre>
srun -u --pty bash -i
</pre>

In case the 'normal' partition is overcrowded, you can use the 'debug' partition:
<pre>
srun --account cosmo_debug -p debug -u --pty bash -i # if you are part of the Cosmology group
srun --account euclid_debug -p debug -u --pty bash -i # if you are part of the EuclidDM group
</pre>
As soon as a slot is open, slurm will log you in to an interactive session on one of the nodes.
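A typical interactive session then looks like this (node3 is just an example; you may land on any of node1-8):
<pre>
srun -u --pty bash -i
hostname        # prints the node you were assigned, e.g. node3
# ... debug your code ...
exit            # ends the session and frees the slot for other users
</pre>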

h3. Running a simple one-core batch job with slurm using the default partition

* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>
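sinfo accepts standard filtering and formatting flags if you want more detail, for example:
<pre>
sinfo -p normal   # show only the 'normal' partition
sinfo -l          # long format, including time limits and node states
</pre>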

* To run a batch job, create a file myjob.slurm containing the following:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user=<put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal

/bin/hostname
</pre>

* To submit a batch job use:
<pre>
sbatch myjob.slurm
</pre>
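sbatch replies with the id of the newly submitted job, which the commands below need; the response typically looks like:
<pre>
Submitted batch job 1234
</pre>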

* To see the status of your job, use
<pre>
squeue
</pre>
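By default squeue lists everyone's jobs; to show only your own, use:
<pre>
squeue -u $USER
</pre>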

* To kill a job use:
<pre>
scancel <jobid>
</pre>
You can get the <jobid> from the squeue output.

* For some more information on your job use
<pre>
scontrol show job <jobid>
</pre>
You can get the <jobid> from the squeue output.

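Once a job has finished it no longer shows up in squeue. Since the cluster uses slurm accounting, sacct can usually still report on completed jobs (a sketch, assuming the accounting records are kept):
<pre>
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS
</pre>
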
h3. Running a simple one-core batch job with slurm using the debug partition

Change the partition to debug and add the appropriate account, depending on whether you're part of the euclid or cosmology group.

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user=<put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p debug
#SBATCH --account=[cosmo_debug/euclid_debug]

/bin/hostname
</pre>

h3. Batch script for running a multi-core job

To run a 4-core job you can use:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user=<put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -n 4

<mpirun call/program>
</pre>
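The last line is a placeholder for your actual command. For an MPI code it could look like this (program name hypothetical; the -np value should match the -n requested from slurm):
<pre>
mpirun -np 4 ./my_mpi_program   # hypothetical MPI program, compiled on alexandria
</pre>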