Slurm » History » Version 76

Martin Kuemmel, 07/24/2017 07:21 AM

{{toc}}

h1. Hardware overview

You access the Euclid cluster through either cosmogw.kosmo.physik.uni-muenchen.de or cosmofs1.kosmo.physik.uni-muenchen.de.

* cosmogw and cosmofs1 are gateway machines and should *not* be used for computing
* there are 21 compute nodes named euclides01--euclides11, euclides12-os--euclides17-os (called the os-machines hereafter) and euclides18--euclides21
* euclides01-05 are available via cosmofs1; euclides05 can only be used for debugging, see below
* euclides06-21 (including the os-machines) are available via cosmogw
* euclides01--euclides11 each have 32 logical CPUs and 64GB of RAM
* euclides12--euclides21 each have 56 logical CPUs and 128GB of RAM
* the os-machines are connected through a 1Gbit link, all others via a 10Gbit link
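
The node-to-gateway mapping in the list above can be sketched as a small bash function. The function name @euclides_gateway@ is hypothetical and not part of the cluster setup; it only encodes the ranges stated above (01-05 via cosmofs1, 06-21 via cosmogw).

```shell
#!/bin/bash
# Hypothetical helper: print the gateway that serves a given euclides node number.
# The ranges are taken from the hardware overview; adjust them if the layout changes.
euclides_gateway() {
    local n=$((10#$1))   # force base 10 so that "08" is not parsed as octal
    if [ "$n" -ge 1 ] && [ "$n" -le 5 ]; then
        echo "cosmofs1.kosmo.physik.uni-muenchen.de"
    elif [ "$n" -ge 6 ] && [ "$n" -le 21 ]; then
        echo "cosmogw.kosmo.physik.uni-muenchen.de"
    else
        echo "unknown node number: $1" >&2
        return 1
    fi
}
```

For example, @ssh "$(euclides_gateway 14)"@ would connect to the gateway from which euclides14-os can be reached.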

h1. How to run jobs on the euclides nodes (using Slurm)

Use slurm to submit jobs or log in to the euclides nodes (euclides01-21).

*Please read through this entire wikipage so everyone can make efficient use of this cluster*

h2. cosmogw/cosmofs1

*Please do not use cosmogw or cosmofs1 as a compute node* - their hardware is different from the nodes. They host our file server and other services that are important to us.

You should use cosmogw or cosmofs1 to

* transfer files
* compile your code
* submit jobs to the nodes via the slurm queues

If you need to debug and would like to log in to a node, please start an interactive job on one of the nodes using slurm. For instructions see below.

h2. euclides nodes

Job submission to the euclides nodes is handled by the slurm jobmanager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).

*Important: In order to run jobs, you need to be added to the slurm accounting system - please contact the admin*

All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...).

If you are already familiar with another jobmanager, the following information may be helpful to you: http://slurm.schedmd.com/rosetta.pdf

h3. Scheduling of Jobs

At this point there are four queues, called partitions in slurm:

* on cosmofs1:
** *normal*, which is the default partition your jobs will be sent to if you do not specify otherwise. At this point there is a time limit of two days. Jobs can currently only run on 1 node.
** *debug*, which is meant for debugging. You can only run one job at a time; other jobs submitted will remain in the queue. The time limit is 12 hours.
* on cosmogw:
** *normal*, which is the default partition your jobs will be sent to if you do not specify otherwise. At this point there is a time limit of four days. This queue comprises the computing nodes euclides06-11 and euclides18-21.
** *slow*, which comprises the os-machines euclides12-os--euclides17-os (slow due to the 1Gbit link). There is a time limit of 12 hours on this queue.

The default memory per core is 2GB. If you need more or less, please specify it with the --mem or --mem-per-cpu option.
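
To judge whether a request fits on a node, it can help to multiply the per-core memory by the number of cores. The following is a minimal sketch; the function name @job_mem_mb@ is hypothetical, and it assumes memory values are given in MB as in the --mem-per-cpu examples on this page.

```shell
#!/bin/bash
# Hypothetical helper: total memory in MB that a job requests
# when it asks for <cores> cores with --mem-per-cpu=<mb_per_core>.
job_mem_mb() {
    local cores=$1 mb_per_core=$2
    echo $((cores * mb_per_core))
}
```

With the 2GB default, a job using all 32 logical CPUs of euclides01--euclides11 requests roughly the full 64GB of such a node (@job_mem_mb 32 2000@), so asking for more memory per core reduces the number of cores that fit on one node.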

We have also set up a scheduler that goes beyond first come, first served: some jobs will be favoured over others depending on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many resources it will consume.

This serves as a starting point; we may have to adjust parameters once the slurm jobmanager is in use. Job scheduling is a complex issue and we still need to build expertise and gain experience of what the user needs in our groups are. Please feel free to speak out if there is something that can be improved without creating an unfair disadvantage for other users.

You can run interactive jobs on both partitions.

h3. Running an interactive job with slurm (a.k.a. logging in)

To run an interactive job with slurm in the default partition, use

<pre>
srun -u --pty bash
</pre>

If you want to use tcsh, use

<pre>
srun -u --pty tcsh
</pre>

If you need more memory per job, use

<pre>
srun -u --mem-per-cpu=8000 --pty tcsh
</pre>

In case you want to open x11 applications, use the --x11=first option, e.g.

<pre>
srun --x11=first -u --pty bash
</pre>

In case the 'normal' partition on cosmofs1 is overcrowded, you can use the 'debug' partition:

<pre>
srun --account cosmo_debug -p debug -u --pty bash   # if you are part of the Cosmology group
srun --account euclid_debug -p debug -u --pty bash  # if you are part of the EuclidDM group
</pre>

As soon as a slot is open, slurm will log you in to an interactive session on one of the nodes.
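
The two debug commands differ only in the account name, so they could be wrapped in a small helper. This is a sketch under that assumption; the function name @debug_srun_cmd@ is hypothetical, and it only prints the command rather than running it.

```shell
#!/bin/bash
# Hypothetical helper: print the srun command for an interactive session
# in the debug partition. First argument: group (cosmo or euclid);
# optional second argument: shell to start (default: bash).
debug_srun_cmd() {
    local group=$1 shell=${2:-bash}
    case "$group" in
        cosmo|euclid)
            echo "srun --account ${group}_debug -p debug -u --pty $shell"
            ;;
        *)
            echo "usage: debug_srun_cmd cosmo|euclid [shell]" >&2
            return 1
            ;;
    esac
}
```

Printing the command first makes it easy to check the account and partition before actually starting the session.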

h3. limited ssh access

If you have an active job (batch or interactive), you can log in to the node the job is running on. Your ssh session will be killed if the job terminates. Your ssh session will be restricted to the same resources as your job (so you cannot accidentally bypass the job scheduler and harm other users' jobs).

h3. Running a simple one-core batch job with slurm using the default partition

* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>

* To run a job, create a file myjob.slurm containing the following information:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal

/bin/hostname
</pre>

* To submit a batch job, use:
<pre>
sbatch myjob.slurm
</pre>

* To see the status of your job, use:
<pre>
squeue
</pre>

* To kill a job, use:
<pre>
scancel <jobid>
</pre>
You can get the <jobid> from squeue.

* For some more information on your job, use:
<pre>
scontrol show job <jobid>
</pre>
You can get the <jobid> from squeue.
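
Instead of looking up the job ID in squeue by hand, it can be taken from the line that sbatch prints on submission ("Submitted batch job <jobid>"). A minimal sketch, with a hypothetical function name @jobid_from_sbatch@:

```shell
#!/bin/bash
# Hypothetical helper: read sbatch's output on stdin and print the job ID
# from the "Submitted batch job <jobid>" line.
jobid_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Typical use on the gateway (requires slurm, so it is only shown as a comment):
#   jobid=$(sbatch myjob.slurm | jobid_from_sbatch)
#   squeue -j "$jobid"
#   scontrol show job "$jobid"
```

The captured ID can then be passed to squeue, scancel or scontrol without copying it manually.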

h3. Running a simple one-core batch job with slurm using the debug partition

Change the partition to debug and add the appropriate account, depending on whether you are part of the euclid or cosmology group.

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH --account [cosmo_debug/euclid_debug]
#SBATCH -p debug

/bin/hostname
</pre>

h3. Accessing a node where a job is running or starting additional processes on a node

You can attach an srun command to an already existing job (batch or interactive). This means you can start an interactive session on a node where a job of yours is running or start an additional process.

First determine the jobid of the desired job using squeue, then use

<pre>
srun --jobid <jobid> [options] <executable>
</pre>

Or, more concretely:

<pre>
srun --jobid <jobid> -u --pty bash  # to start an interactive session
srun --jobid <jobid> ps -eaFAl      # to get detailed process information
</pre>

The processes will only run on cores that have been allocated to you. This works for batch as well as interactive jobs.
*Important: If the original job that was submitted is finished, any process attached in this fashion will be killed.*

h3. Batch script for running a multi-core job

MPI is installed on cosmofs1.

To run a 4 core job for an executable compiled with MPI you can use

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -n 4

mpirun <programname>

</pre>

and it will automatically start on the number of cores specified.

To ensure that the job is being executed on only one node, add

<pre>
#SBATCH -N 1
</pre>

to the job script.

If you would like to run a program that itself starts processes, you can use the environment variable $SLURM_NPROCS, which is automatically defined for slurm jobs, to explicitly pass the number of cores the program can run on.
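
A minimal sketch of passing $SLURM_NPROCS on to a program: the function name @ncores@, the program name and its --threads option are hypothetical. The fallback to 1 is an assumption that lets the same script also run outside of slurm.

```shell
#!/bin/bash
# Number of cores granted by slurm; falls back to 1 when the script
# is run outside of a slurm job (SLURM_NPROCS unset).
ncores() {
    echo "${SLURM_NPROCS:-1}"
}

# Hypothetical program that starts its own worker processes:
#   ./my_program --threads "$(ncores)"
```

This keeps the core count in one place (the #SBATCH -n line) instead of repeating it in the program call.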

To check if your job is actually running on the specified number of cores, you can check the PSR column of

<pre>
ps -eaFAl
# or ps -eaFAl | grep -E "<yourusername>|UID" if you just want to see your jobs
</pre>

h3. environment for jobs

By default, slurm does not initialize the environment (using .bashrc, .profile, .tcshrc, ...).

To use your usual system environment, add the following line in the submission script:

<pre>
#SBATCH --get-user-env
</pre>

h2. desdb node

Some specific jobs in cosmodb, such as the "catalog ingest", need to be performed on the machines desdb1/2. For those jobs there is the slurm account "euclid_cat_ing" with the partition "cat_ing". Only selected persons from the Euclid group have access to this node. Please specify "-p cat_ing" and "--account euclid_cat_ing" on the command line or in the slurm script.

h2. Software specific setup

h3. Python environment

You can use the python 2.7.3 installed on the euclides cluster by using

<pre>
source /data2/users/ccsoft/etc/setup_all
source /data2/users/ccsoft/etc/setup_python2.7.3
</pre>

h2. Notes For Euclid users

For those submitting jobs to the euclides* nodes through the Cosmo DM pipeline, here are some things that need to be specified for customized job submissions, since a different interface to slurm is used.

* To use more memory per block, specify max_memory = 6000 (for 6G) and so on inside the block definition, or in the submit file (in case you want to use it for all blocks).

* If you want to run on multiple nodes/cores, then use nodes='<number of nodes>:ppn=<number of cores>' inside the block definition of a particular block, or in the submit file in case you want to use it for all blocks.

* If you want to use a larger wall time, then specify wall_mod=<wall time in minutes> inside the module definition.

* Note that queue=serial does not work on cosmofs1 (we usually use it for c2pap).

h1. Admin

There is a user "slurm", which however is not really necessary for the administration work. The slurm administrator needs sudo access. Some scripts for adding a user and similar tasks are in "/data1/users/slurm". With sudo access the admin can execute those scripts. In the mysql database there is the username "slurmdb" with a password.

h2. Slurm configuration

h3. Slurm configuration file

The currently valid versions of the configuration file are "/data1/users/slurm/slurm.conf" and "/data1/users/slurm/cosmo/slurm.conf" on cosmofs1 and cosmogw, respectively. To apply a modified slurm configuration, the script "newconfig.sh" can be used.

The script

* copies the configuration file to the submit node and restarts the submit service;
* copies the configuration file to all computing nodes and triggers the reconfiguration there.

Then the slurm daemon needs to be started on the submit node and all computing nodes with the script "restart.sh".

*Note:* Right now the slurmd daemons do not start properly on cosmogw. Even if the start reports a failure, the slurmd daemon is there and working.

h2. User management

h3. Overview of users, accounts, etc.

No sudo access needed:

<pre>
/usr/local/bin/sacctmgr show account withassoc
</pre>

h3. Adding a new user

As root on @cosmofs1@,

<pre>
cd /data1/users/slurm/
./add_user.sh UserName account(cosmo or euclid)
/usr/local/bin/scontrol reconfigure
</pre>

h3. To increase memory, cores etc for a user

The script above contains various commands for changing user settings, e.g.

<pre>
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpCPUs=32
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpMem=128000
</pre>

h2. Troubleshooting

h3. Information on a particular node

The command "/usr/local/bin/scontrol show node <nodename>" gives detailed information on a particular node (status, reason for being down and so on).

h3. Node in state "drain"

When @sinfo@ shows a node in "drain" state, run

<pre>
/usr/local/bin/scontrol update nodename=NODE_NAME state=resume
</pre>

to put it back into operation.
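
To find the affected nodes without scanning the sinfo table by eye, the output can be filtered. This is a sketch, assuming sinfo is called with one "node state" pair per line; the function name @drained_nodes@ is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "<node> <state>" pairs on stdin and print the
# names of nodes whose state starts with "drain" (drained, draining, ...).
# sort -u removes duplicates when a node is listed for several partitions.
drained_nodes() {
    awk '$2 ~ /^drain/ { print $1 }' | sort -u
}

# On the gateway (requires slurm, so it is only shown as a comment):
#   sinfo -h -N -o "%N %T" | drained_nodes
```

Each name printed this way can be used as NODE_NAME in the scontrol command above, after checking the reason for the drain state with "scontrol show node".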

h2. Nodes down

Sometimes nodes are reported as "down". This seems to happen as a result of network problems. Here is some "troubleshooting":https://computing.llnl.gov/linux/slurm/troubleshoot.html#nodes for this situation. Also, after a reboot of cosmofs1 some manual work on slurm might be necessary to get going again.

If a job does not finish and remains in the state "CG", then the sequence

<pre>
/usr/local/bin/scontrol update NodeName=euclides13-os State=down Reason=hung_proc
/usr/local/bin/scontrol update NodeName=euclides13-os State=resume Reason=hung_proc
</pre>

brings the node back again.

h2. History

* May 18th 2017: On cosmogw, three nodes were reported as "DOWN" despite running the slurmd daemon and having connections to the slurmctld daemon on the control node; it turns out that with a normal "/etc/init.d/slurm start" on the control machine only nodes that are *not* DOWN are considered; "/etc/init.d/slurm startclean" must be used to establish new connections to all nodes and take them back into the queue;

* May 2nd 2017: the control daemon on cosmofs1 was no longer working and could not be restarted; the corresponding commands "/etc/init.d/slurm status/start" gave no feedback at all and the log files were empty; the relevant daemon on the nodes, "slurmd", was running smoothly; a comparison revealed that the difference was whether the command "/usr/local/bin/scontrol show daemon" returns the daemon name or nothing, and in the latter case nothing happens and the daemon does not run well; further investigation showed that the machine name given in "slurm.conf" as "ControlMachine=" needs to be identical to the name returned by the command "hostname"; this was no longer the case, likely as a result of moving the machines to the new sub-net (the exact mechanism is unclear);

* April 24th 2017: taking euclides11 out of the queues to free it for the new OS and the slurm test on it; euclides10 is now the development node;

* April 07th 2017: Applying "/usr/local/bin/scontrol show node euclides11" for the debug partition, euclides11 says "Reason=Node unexpectedly rebooted [root@2016-12-14T13:25:01]"; internet research suggested changing "ReturnToService=" from 1 to 2 in the configuration file; after applying the new configuration file and restarting, the debug node works again;

* April 06th 2017: After the reconfiguration of the cluster the slurm configuration file was adjusted (to reflect the new machine names); also minor changes had to be applied to the scripts "newconfig.sh" and "restart.sh" to loop over the new names; the new configuration files were applied and slurm restarted; all computing nodes for the normal partition came up, the debug partition stayed down;

* March 29th 2017: euclides7 is in drain state; "/usr/local/bin/scontrol show node euclides2" says "Reason=Epilog error"; when resumed, it seems to work normally;

* March 28th 2017: euclides2 is in drain state; when resumed, it goes into drain state when using it the next time; "/usr/local/bin/scontrol show node euclides2" says "Reason=Prolog error"; after a reboot the machine was in status "idle*"; when resumed, it worked again;
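
The May 2nd 2017 entry traces the failure to a mismatch between "ControlMachine=" in slurm.conf and the output of "hostname". A minimal sketch of how that comparison could be scripted follows; the function name @control_machine@ is hypothetical, and it assumes the setting appears on a single line of the form ControlMachine=<name>.

```shell
#!/bin/bash
# Hypothetical helper: print the ControlMachine value from a slurm.conf file.
# Only the first matching line is used; trailing comments are ignored.
control_machine() {
    sed -n 's/^[[:space:]]*ControlMachine=\([^[:space:]#]*\).*/\1/p' "$1" | head -n 1
}

# On the control machine (shown as a comment, since it depends on the local setup):
#   [ "$(control_machine /data1/users/slurm/slurm.conf)" = "$(hostname)" ] \
#       || echo "ControlMachine does not match hostname"
```

Running such a check after network or naming changes might catch the problem before the control daemon silently fails to start.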