{{toc}}

h1. How to run jobs on the euclides nodes

Use slurm to submit jobs to the euclides nodes (node1-8); ssh login access to those nodes will be restricted in the near future.

*Please read through this entire wiki page so everyone can make efficient use of this cluster.*

h2. alexandria

*Please do not use alexandria as a compute node* - its hardware is different from that of the nodes. It hosts our file server and other services that are important to us.

You should use alexandria to
- transfer files
- compile your code
- submit jobs to the nodes

If you need to debug, please start an interactive job on one of the nodes using slurm. For instructions see below.

h2. euclides nodes

Job submission to the euclides nodes is handled by the slurm jobmanager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).
*Important: In order to run jobs, you need to be added to the slurm accounting system - please contact Kerstin.*

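Once you have been added, you can check which slurm account(s) you belong to with sacctmgr, e.g. (a quick sketch; the account names shown are whatever has been set up for your group):
<pre>
sacctmgr show associations user=<yourusername> format=User,Account,Partition
</pre>
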
All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...).

If you are already familiar with another jobmanager, the following information may be helpful to you: http://slurm.schedmd.com/rosetta.pdf

h3. Scheduling of Jobs

At this point there are two queues, called partitions in slurm (see the sinfo example below for how to inspect their limits):
* *normal* is the default partition your jobs will be sent to if you do not specify otherwise. At this point the time limit is two days and jobs can only run on 1 node.
* *debug* is meant for debugging; you can only run one job at a time, and other jobs you submit will remain in the queue. The time limit is 12 hours.
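
You can check the current limits of each partition yourself with sinfo's format options, for example:
<pre>
sinfo -o "%P %l %D %N"   # partition, time limit, number of nodes, node list
</pre>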

We have also set up a scheduler that goes beyond first come, first served - some jobs will be favoured over others depending on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many resources it will consume.
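
If you are curious how this affects your own jobs, slurm provides commands to inspect job priorities and recorded usage (assuming the fair-share setup described above uses slurm's multifactor priority plugin):
<pre>
sprio -l   # priority factors (age, fair-share, job size) of pending jobs
sshare     # recent usage and fair-share factor of you and your group
</pre>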

This serves as a starting point; we may have to adjust parameters once the slurm jobmanager is in regular use. Job scheduling is a complex issue and we still need to build expertise and gain experience with the user needs in our groups. Please feel free to speak up if there is something that can be improved without creating an unfair disadvantage for other users.

You can run interactive jobs on both partitions.

h3. Running an interactive job with slurm

To run an interactive job with slurm in the default partition, use

<pre>
srun -u --pty bash
</pre>

If you want to use tcsh, use

<pre>
srun -u --pty tcsh
</pre>

In case you want to open x11 applications, use the --x11=first option, e.g.
<pre>
srun --x11=first -u --pty bash
</pre>

In case the 'normal' partition is overcrowded, you can use the 'debug' partition:
<pre>
srun --account cosmo_debug -p debug -u --pty bash  # if you are part of the Cosmology group
srun --account euclid_debug -p debug -u --pty bash # if you are part of the EuclidDM group
</pre>
As soon as a slot is open, slurm will log you in to an interactive session on one of the nodes.

h3. Running a simple one-core batch job with slurm using the default partition

* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>

* To run slurm, create a file myjob.slurm containing the following information:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal

/bin/hostname
</pre>

* To submit a batch job use:
<pre>
sbatch myjob.slurm
</pre>

* To see the status of your job, use (two handy variations are shown after this list):
<pre>
squeue
</pre>

* To kill a job use:
<pre>
scancel <jobid>
</pre>
You can get the <jobid> from squeue.

* For some more information on your job use
<pre>
scontrol show job <jobid>
</pre>
You can get the <jobid> from squeue.
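
Two variations of the status commands above that are often handy (a sketch; sacct only works if job accounting records are stored, which the slurm accounting setup mentioned above suggests):
<pre>
squeue -u <yourusername>                        # list only your own jobs
sacct -j <jobid> --format=JobID,State,Elapsed   # information about a job that has already finished
</pre>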

h3. Running a simple one-core batch job with slurm using the debug partition

Change the partition to debug and add the appropriate account, depending on whether you are part of the euclid or cosmology group.

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p debug
#SBATCH --account=[cosmo_debug/euclid_debug]

/bin/hostname
</pre>

h3. Accessing a node where a job is running or starting additional processes on a node

You can attach an srun command to an already existing job (batch or interactive). This means you can start an interactive session on a node where one of your jobs is running, or start an additional process there.

First determine the jobid of the desired job using squeue, then use

<pre>
srun --jobid <jobid> [options] <executable>
</pre>
Or, more concretely:
<pre>
srun --jobid <jobid> -u --pty bash  # to start an interactive session
srun --jobid <jobid> ps -eaFAl      # to get detailed process information
</pre>

The processes will only run on cores that have been allocated to you.
This works for batch as well as interactive jobs.
*Important: Once the original job finishes, any process attached in this fashion will be killed.*

h3. Batch script for running a multi-core job

MPI is installed on alexandria.

To run a 4-core job for an executable compiled with MPI, you can use
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -n 4

mpirun <programname>
</pre>
and it will automatically start on the number of cores specified.

To ensure that the job is executed on only one node, add
<pre>
#SBATCH -N 1
</pre>
to the job script.

If you would like to run a program that itself starts processes, you can use the environment variable $SLURM_NPROCS, which is automatically defined for slurm jobs, to explicitly pass the number of cores the program may use.
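For example, a minimal sketch (<programname> and its --nthreads option are placeholders for whatever your own code expects):
<pre>
#!/bin/bash
#SBATCH -n 4

# pass the number of allocated cores to the program explicitly
<programname> --nthreads $SLURM_NPROCS
</pre>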

To check if your job is actually running on the specified number of cores, you can check the PSR column of
<pre>
ps -eaFAl
# or ps -eaFAl | egrep "<yourusername>|UID" if you just want to see your jobs
</pre>