Slurm » History » Version 134

Martin Kuemmel, 01/11/2024 12:38 PM

1 21 Kerstin Paech
{{toc}}
2 21 Kerstin Paech
3 92 Martin Kuemmel
*Please read through this entire wikipage so everyone can make efficient use of this cluster*
4 92 Martin Kuemmel
5 53 Sebastian Bocquet
h1. Hardware overview
6 53 Sebastian Bocquet
7 90 Martin Kuemmel
You access the Euclid cluster through cosmogw.kosmo.physik.uni-muenchen.de
8 67 Martin Kuemmel
9 90 Martin Kuemmel
* cosmogw is a gateway machine and should *not* be used for computing;
* there are 21 compute nodes named euclides01--euclides11 and euclides12--euclides21;
* euclides01-euclides11 each have 32 logical CPUs and 64GB of RAM;
* euclides12-euclides21 each have 56 logical CPUs and 128GB of RAM;
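
As a quick cross-check of the numbers above, the total capacity of the compute nodes can be worked out with shell arithmetic. This is only a sketch based on the node counts listed here; the variable names are made up for illustration.

```shell
#!/bin/bash
# Node counts and per-node resources as listed above.
small_nodes=11; small_cpus=32; small_ram_gb=64     # euclides01-11
large_nodes=10; large_cpus=56; large_ram_gb=128    # euclides12-21

total_cpus=$(( small_nodes * small_cpus + large_nodes * large_cpus ))
total_ram_gb=$(( small_nodes * small_ram_gb + large_nodes * large_ram_gb ))

echo "logical CPUs: ${total_cpus}, RAM: ${total_ram_gb} GB"
```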
13 53 Sebastian Bocquet
14 1 Kerstin Paech
h1. How to run jobs on the euclides nodes (using Slurm)
15 1 Kerstin Paech
16 9 Kerstin Paech
Use slurm to submit jobs or to log in to the euclides nodes (euclides01-21).
17 9 Kerstin Paech
18 90 Martin Kuemmel
h2. Control node cosmogw
19 1 Kerstin Paech
20 1 Kerstin Paech
The machine cosmogw is the login node and submit node for the slurm queues, so please do not use it as a simple compute node; its hardware is different from that of the compute nodes. It hosts our file server and other services that are important to us.
21 90 Martin Kuemmel
22 1 Kerstin Paech
You should use cosmogw to:
23 92 Martin Kuemmel
* transfer files;
* develop your code;
* compile your code;
* submit jobs to the nodes via the slurm queues.
27 51 Sebastian Bocquet
28 51 Sebastian Bocquet
If you need to debug and would like to log in to a node, please start an interactive job on one of the nodes using slurm. For instructions see below.
29 51 Sebastian Bocquet
30 51 Sebastian Bocquet
h2. euclides nodes
31 1 Kerstin Paech
32 1 Kerstin Paech
Job submission to the euclides nodes is handled by the slurm jobmanager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/). 
33 1 Kerstin Paech
*Important: In order to run jobs, you need to be added to the slurm accounting system - please contact the admin*
34 4 Kerstin Paech
35 92 Martin Kuemmel
All slurm commands listed below have very helpful man pages (e.g. 'man slurm', 'man squeue', ...). 
36 1 Kerstin Paech
37 4 Kerstin Paech
If you are already familiar with another jobmanager, the following information may be helpful to you: http://slurm.schedmd.com/rosetta.pdf
38 1 Kerstin Paech
39 75 Martin Kuemmel
h3. Scheduling of Jobs
40 69 Martin Kuemmel
41 92 Martin Kuemmel
At this point there are four queues, called partitions in slurm:
* on cosmogw:
** *normal*, which is the default partition your jobs are sent to if you do not specify otherwise. At this point there is a time limit of four days; this queue comprises the computing nodes euclides01-19;
** *lowpri*, which also comprises the computing nodes euclides01-19; it is a so-called preemptible queue that gives users access to more resources; however, jobs are re-queued (canceled and re-scheduled) if the resources are demanded on the normal queue;
** *eucliddevel*, which is intended for software development, especially if the 'normal' partition is full; this queue comprises the computing nodes euclides20-21; people from the Euclid group have an account on this queue; each user is allowed to use up to 56 cpus;
** *cosmodevel*, which is intended for software development, especially if the 'normal' partition is full; this queue comprises the computing nodes euclides20-21; people from the cosmology group have an account on this queue; each user is allowed to use up to 4 cpus; note that this queue can be preempted, meaning that the users of the queue eucliddevel take precedence;
48 1 Kerstin Paech
49 92 Martin Kuemmel
50 38 Kerstin Paech
The default memory per core is 2GB. If you need more or less, please specify it with the --mem or --mem-per-cpu option.
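
To pick a value for --mem-per-cpu, it can help to divide the total memory a job needs by the number of tasks. The helper below is a minimal sketch; the function name @mem_per_cpu_mb@ is made up, and it assumes one core per task and 1 GB = 1024 MB.

```shell
#!/bin/bash
# Print the --mem-per-cpu value (in MB) needed so that <ntasks> single-core
# tasks together get <total_gb> GB. Rounds up to the next full MB.
mem_per_cpu_mb() {
    local total_gb=$1 ntasks=$2
    echo $(( (total_gb * 1024 + ntasks - 1) / ntasks ))
}

# Example: 24 GB shared by 4 tasks
echo "#SBATCH --ntasks=4"
echo "#SBATCH --mem-per-cpu=$(mem_per_cpu_mb 24 4)"
```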
51 9 Kerstin Paech
52 9 Kerstin Paech
We have also set up a scheduler that goes beyond first come, first served: some jobs will be favoured over others depending on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many resources it will consume.
55 9 Kerstin Paech
56 92 Martin Kuemmel
Job scheduling is a complex issue, and we can and have to adjust to the users' needs whenever possible. Please feel free to speak out if there is something that can be improved without creating an unfair disadvantage for other users.
58 9 Kerstin Paech
59 92 Martin Kuemmel
You can run interactive jobs on all partitions.
60 9 Kerstin Paech
61 41 Kerstin Paech
h3. Running an interactive job with slurm (a.k.a. logging in)
62 1 Kerstin Paech
63 9 Kerstin Paech
To run an interactive job with slurm in the default partition, use

<pre>
srun -u --pty bash
</pre>

If you want to use tcsh, use

<pre>
srun -u --pty tcsh
</pre>

If you need more memory than the default (here 8000 MB per core), use

<pre>
srun -u --mem-per-cpu=8000 --pty tcsh
</pre>

In case you want to open x11 applications, use the --x11=first option, e.g.
<pre>
srun --x11=first -u --pty bash
</pre>

Opening an interactive session on one of the development partitions is done with:
<pre>
srun --account=<euclid_dev/cosmo_dev> --partition=<eucliddevel/cosmodevel> --x11=first -u --pty bash
</pre>
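
The srun variants above differ only in a few options, so they can be wrapped in a small shell function for your .bashrc. This is a sketch, not an official helper: the function name @interactive_cmd@ is made up, and it only prints the command line (a dry run) using the options shown above.

```shell
#!/bin/bash
# Build the srun command line for an interactive bash session.
# Usage: interactive_cmd [partition] [account] [mem_per_cpu_mb]
# Empty arguments are skipped, so the slurm defaults apply.
interactive_cmd() {
    local partition=$1 account=$2 mem=$3
    local cmd="srun"
    if [ -n "$account" ];   then cmd+=" --account=$account"; fi
    if [ -n "$partition" ]; then cmd+=" --partition=$partition"; fi
    if [ -n "$mem" ];       then cmd+=" --mem-per-cpu=$mem"; fi
    echo "$cmd -u --pty bash"
}

interactive_cmd                               # default partition
interactive_cmd eucliddevel euclid_dev 8000   # development partition
```

To actually open the session on cosmogw, run the printed line, for instance with @eval "$(interactive_cmd cosmodevel cosmo_dev)"@.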
90 1 Kerstin Paech
91 44 Kerstin Paech
h3. limited ssh access
92 44 Kerstin Paech
93 44 Kerstin Paech
If you have an active job (batch or interactive), you can log in to the node the job is running on. Your ssh session will be killed when the job terminates. Your ssh session will be restricted to the same resources as your job (so you cannot accidentally bypass the job scheduler and harm other users' jobs).
94 44 Kerstin Paech
95 77 Martin Kuemmel
h3. Running a simple one core batch job with slurm using the default partition
96 1 Kerstin Paech
97 1 Kerstin Paech
* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>

* To run a job, create a file myjob.slurm containing the following information:
<pre>
#!/bin/bash
#SBATCH --output=slurm.%N.%j.out
#SBATCH --error=slurm.%N.%j.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal
#SBATCH --ntasks=1

/bin/hostname
</pre>
Note that %N and %j resolve to the node name and the slurm job ID number, respectively (e.g. "slurm.euclides09.461336.out"). Including this information in the file names automatically produces run-specific log files that are not overwritten by subsequent runs and remain available for a long-term evaluation of the slurm runs.
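
To illustrate how the %N and %j placeholders end up in the file name, the following sketch mimics the substitution in plain bash. Slurm does this itself; the function @expand_log_name@ is made up purely for illustration.

```shell
#!/bin/bash
# Mimic how the %N (node name) and %j (job ID) placeholders are filled in.
expand_log_name() {
    local pattern=$1 node=$2 jobid=$3
    echo "$pattern" | sed -e "s/%N/${node}/g" -e "s/%j/${jobid}/g"
}

expand_log_name "slurm.%N.%j.out" euclides09 461336
```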
115 1 Kerstin Paech
116 1 Kerstin Paech
* To submit a batch job use:
<pre>
sbatch myjob.slurm
</pre>

* To see the status of your job, use:
<pre>
squeue
</pre>

* To kill a job use:
<pre>
scancel <jobid>
</pre>You can get the <jobid> from squeue.

* For some more information on your job use:
<pre>
scontrol show job <jobid>
</pre>You can get the <jobid> from squeue.
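
When scripting the steps above, the job ID can also be captured directly at submission instead of looking it up in squeue. The sketch below parses the "Submitted batch job <jobid>" line that sbatch prints; the function name @jobid_from_sbatch@ is made up, and a sample line stands in for the real sbatch output so that the snippet also runs outside the cluster.

```shell
#!/bin/bash
# Extract the numeric job ID from the line sbatch prints on success,
# e.g. "Submitted batch job 461336".
jobid_from_sbatch() {
    awk '/^Submitted batch job/ {print $4}'
}

# On the cluster one would pipe the real command:
#   jobid=$(sbatch myjob.slurm | jobid_from_sbatch)
#   scontrol show job "$jobid"
# Here a sample line stands in for the sbatch output.
jobid=$(echo "Submitted batch job 461336" | jobid_from_sbatch)
echo "job id: $jobid"
```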
135 1 Kerstin Paech
136 77 Martin Kuemmel
h3. Running a simple one core batch job with slurm using the lowpri partition
137 10 Kerstin Paech
138 77 Martin Kuemmel
Change the partition to lowpri and add the appropriate account, depending on whether you are part of the euclid or cosmology group.

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH --account=[euclid_lowpri/cosmo_lowpri]
#SBATCH --partition=lowpri
#SBATCH --ntasks=1

/bin/hostname
</pre>
153 10 Kerstin Paech
154 22 Kerstin Paech
h3. Accessing a node where a job is running or starting additional processes on a node
155 22 Kerstin Paech
156 25 Kerstin Paech
You can attach an srun command to an already existing job (batch or interactive). This means you can start an interactive session on a node where a job of yours is running or start an additional process.

First determine the jobid of the desired job using squeue, then use

<pre>
srun --jobid <jobid> [options] <executable>
</pre>
Or, more concretely:
<pre>
srun --jobid <jobid> -u --pty bash  # to start an interactive session
srun --jobid <jobid> ps -eaFAl      # to get detailed process information
</pre>

The processes will only run on cores that have been allocated to you. This works for batch as well as interactive jobs.
*Important: If the original job that was submitted is finished, any process attached in this fashion will be killed.*
175 22 Kerstin Paech
176 10 Kerstin Paech
177 6 Kerstin Paech
h3. Batch script for running a multi-core job
178 6 Kerstin Paech
179 61 Martin Kuemmel
mpi is installed on cosmofs1.
180 17 Kerstin Paech
181 18 Kerstin Paech
To run a 4 core job for an executable compiled with mpi you can use
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH --ntasks=4

mpirun <programname>
</pre>
and it will automatically start the number of tasks specified.

To ensure that the job is being executed on only one node, add
<pre>
#SBATCH --nodes=1
</pre>
to the job script.

If you would like to run a program that itself starts processes, you can use the environment variable $SLURM_NPROCS that is automatically defined for slurm jobs to explicitly pass the number of cores the program can run on.

To check if your job is actually running on the specified number of cores, you can check the PSR column of
<pre>
ps -eaFAl
# or ps -eaFAl | egrep "<yourusername>|UID" if you just want to see your jobs
</pre>
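
For programs that start their own processes or threads, the core count from slurm can be passed on as sketched below. This is only an illustration: @my_program@ and its @--threads@ option are placeholders, and the fallback to 1 is an assumption so that the script also runs outside of slurm. $SLURM_NTASKS is the current name of the variable; $SLURM_NPROCS is the older name mentioned above.

```shell
#!/bin/bash
# Number of tasks slurm granted to this job; falls back to 1 when the
# script runs outside of slurm (e.g. while testing on cosmogw).
ncores() {
    echo "${SLURM_NTASKS:-${SLURM_NPROCS:-1}}"
}

# "my_program" and its "--threads" option are placeholders.
echo "would run: my_program --threads $(ncores)"
```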
211 27 Jiayi Liu
212 28 Kerstin Paech
h3. environment for jobs
213 27 Jiayi Liu
214 29 Kerstin Paech
By default, slurm does not initialize the environment (using .bashrc, .profile, .tcshrc, ...)
215 29 Kerstin Paech
216 28 Kerstin Paech
To use your usual system environment, add the following line in the submission script:
217 27 Jiayi Liu
<pre>
#SBATCH --get-user-env
</pre>
220 1 Kerstin Paech
221 87 Martin Kuemmel
h3. Slurm reporting and accounting
222 87 Martin Kuemmel
223 88 Martin Kuemmel
For information on job usage and cluster utilization for slurm jobs the slurm command "sreport" can be used. E.g. the command:
<pre>
sreport user topusage start=01/15/18 -t percent
</pre>
shows the top ten users in percent since January 15th 2018. For more information please look at "man sreport".

For accounting on specific jobs the slurm command "sacct" can be used. E.g. the command:
<pre>
sacct -j 18551 --format=JobID,JobName,MaxRSS,Elapsed
</pre>
displays information (elapsed time, memory usage, ...) on the job number "18551".

As usual with slurm commands there are plenty of options to refine the search and format the desired output. The command:
<pre>
sacct --format Start,End,User,JobID,state,partition -u <name> --starttime 2023-07-01
</pre>
lists all jobs of the user <name> since 1st July 2023. The information provided includes the end state, the end time and so on.
For more details please use "man sacct".
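
The MaxRSS column of sacct is typically printed with a unit suffix (e.g. "2097152K"), which is awkward to compare with the MB values used for --mem-per-cpu. The helper below is a rough sketch for converting such values; the function name @rss_to_mb@ is made up, and it assumes integer values with a K, M or G suffix, or plain bytes when there is no suffix.

```shell
#!/bin/bash
# Convert a MaxRSS value as printed by sacct (e.g. "2097152K") to whole MB.
# Assumption: integer value with a K, M or G suffix, or plain bytes.
rss_to_mb() {
    local value=$1 number=${1%[KMG]}
    case $value in
        *K) echo $(( number / 1024 )) ;;
        *M) echo $(( number )) ;;
        *G) echo $(( number * 1024 )) ;;
        *)  echo $(( number / 1024 / 1024 )) ;;
    esac
}

rss_to_mb 2097152K
```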
241 131 Martin Kuemmel
242 131 Martin Kuemmel
243 87 Martin Kuemmel
244 78 Martin Kuemmel
h3. Some points on the 'normal' versus 'lowpri' queue on cosmogw
245 78 Martin Kuemmel
246 93 Martin Kuemmel
The allowances for each user on the *normal* partition are 304 CPUs and 661335 MB, which corresponds to about 1/3 of the entire cluster (euclides01-19). In short, every user is allowed to use up to 1/3 of the cluster in the normal partition.

On the partition *lowpri* (for low priority) there are no limits on the CPU numbers or RAM consumption, meaning the user can take all available resources up to the *entire* cluster! However, jobs on the partition "lowpri" have a lower priority through the so-called preemption mechanism. This means that if all nodes are busy (partially through the lowpri queue) and an additional job is submitted to the "normal" partition, slurm will re-queue (meaning cancel and re-schedule to the lowpri queue) job(s) on the "lowpri" partition to get the job on the "normal" partition running.

Here is an example scenario to illustrate the opportunities the "lowpri" partition offers:
I want to submit a number of jobs for 744 cpus in total. The entire cluster has 744 cpus, which means that in the optimal case I get 1/3 of the cluster on the "normal" partition, and it takes at least three cycles to get all my jobs finished. However, if I submit to the "lowpri" partition, in the case of an empty cluster I can use the *entire* cluster and finish in only one cycle. Of course it may happen that other users submit lots of jobs to the "normal" partition afterwards and many of my jobs are re-queued. That would then delay the finishing of my jobs on the "lowpri" partition correspondingly. To highlight some aspects of using the "lowpri" partition:
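
The arithmetic behind this scenario can be written down as a small calculation. The sketch below uses the numbers quoted above (744 cpus requested, a per-user limit of 304 cpus on "normal"); the function name @cycles@ is made up, and a "cycle" is simply taken to be one batch of jobs running at the same time.

```shell
#!/bin/bash
# Number of scheduling "cycles" needed to push <requested> cpus through a
# partition that grants at most <allowance> cpus at a time (rounded up).
cycles() {
    local requested=$1 allowance=$2
    echo $(( (requested + allowance - 1) / allowance ))
}

echo "normal: $(cycles 744 304) cycles"     # per-user limit of 304 cpus
echo "lowpri: $(cycles 744 744) cycle(s)"   # empty cluster, no per-user limit
```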
252 78 Martin Kuemmel
253 78 Martin Kuemmel
* it is relevant especially when you want to submit several jobs that significantly exceed the user allowance on the "normal" partition and need the entire cluster to get finished;
* on average, the available resources on the "lowpri" partition are much *larger* than on the "normal" partition, especially during the night or on the weekend;
* please note that *no job ever gets lost* on the "lowpri" partition; if re-queuing occurs, the user gets an email (Subject: "SLURM Job_id=2563 Name=test_mpi_gather.slurm Failed, Run time 00:01:58, PREEMPTED, ExitCode 0") when the job is stopped and subsequently when it starts again and when it finishes (see 1.);
* the "lowpri" partition also has a queue which decides which job comes first (of course only in the case of an oversubscription);
* the preemption mechanism tries to minimize the number of re-queued jobs necessary to get the job in the "normal" partition going; so, if 8 cpus are requested and the "lowpri" partition contains one job using 8 cpus, three jobs using 4 cpus and several dozen jobs using 1 cpu, only the job with 8 cpus is re-scheduled, independent of the run times and other parameters.
258 78 Martin Kuemmel
259 79 Martin Kuemmel
To submit a job to the "lowpri" partition please insert the following lines into the slurm batch script (see also example above):
<pre>
#SBATCH --account=<your account>
#SBATCH -p lowpri
</pre>

with <your account> being either "cosmo_lowpri" or "euclid_lowpri".
266 79 Martin Kuemmel
267 80 Martin Kuemmel
There are two typical scenarios where a user can gain from the lowpri queue:
* if a job stores intermediate results at regular intervals and picks up from there once started again; then even a long job loses only the computing time since the last storage point if it is re-scheduled;
* if a single job needs only a small amount of computing time (perhaps <12h) but a lot of jobs need to be run; then the loss of computing time is rather small if a job is re-scheduled;
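
The first scenario (storing intermediate results and picking up from there) can be sketched in a batch script as follows. This is a minimal illustration of the pattern, not a recommended implementation: @CHECKPOINT@, @TOTAL_STEPS@ and @run_steps@ are made-up names, and the echo stands in for the real work of one step.

```shell
#!/bin/bash
# Resume-from-checkpoint pattern for jobs that may be re-queued.
# CHECKPOINT and TOTAL_STEPS are illustrative names.
CHECKPOINT=${CHECKPOINT:-checkpoint.txt}
TOTAL_STEPS=${TOTAL_STEPS:-5}

run_steps() {
    local start=0 step
    # Pick up after the last step that was recorded as finished.
    if [ -f "$CHECKPOINT" ]; then
        start=$(cat "$CHECKPOINT")
    fi
    for (( step = start + 1; step <= TOTAL_STEPS; step++ )); do
        echo "processing step $step"      # replace with the real work
        echo "$step" > "$CHECKPOINT"      # record progress after each step
    done
}

run_steps
```

If the job is re-queued after step 3, the next start finds "3" in the checkpoint file and continues with step 4, so only the work since the last recorded step is lost.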
270 80 Martin Kuemmel
271 58 Martin Kuemmel
h2. desdb node
272 58 Martin Kuemmel
273 58 Martin Kuemmel
Some specific jobs in cosmodb, such as the "catalog ingest", need to be performed on the machines desdb1/2. For those jobs there is the slurm account "euclid_cat_ing" with the partition "cat_ing". Only selected persons from the Euclid group have access to this node. Please specify "-p cat_ing" and "--account euclid_cat_ing" on the command line or in the slurm script.
274 28 Kerstin Paech
275 28 Kerstin Paech
h2. Software specific setup
276 28 Kerstin Paech
277 28 Kerstin Paech
h3. Python environment 
278 28 Kerstin Paech
279 28 Kerstin Paech
You can use the python 2.7.3 installed on the euclides cluster by using
280 27 Jiayi Liu
281 27 Jiayi Liu
<pre>
source /data2/users/ccsoft/etc/setup_all
source /data2/users/ccsoft/etc/setup_python2.7.3
</pre>
285 32 Shantanu Desai
286 32 Shantanu Desai
287 34 Shantanu Desai
h2. Notes For Euclid users
288 32 Shantanu Desai
289 35 Shantanu Desai
For those submitting jobs to the euclides* nodes through the Cosmo DM pipeline, here are some things which need to be specified for customized job submissions, since a different interface to slurm is used.

* To use more memory per block, specify max_memory = 6000 (for 6G) and so on inside the block definition or in the submit file (in case you want to use it for all blocks).

* If you want to run on multiple nodes/cores, then use nodes='<number of nodes>:ppn=<number of cores>' inside the block definition of a particular block or in the submit file in case you want to use it for all blocks.

* If you want to use a larger wall time, then specify wall_mod=<wall time in minutes> inside the module definition.

* Note that queue=serial does not work on cosmofs1 (we usually use it for c2pap).
302 45 Roy Henderson
303 107 Martin Kuemmel
h1. Running specific applications on the cluster
304 107 Martin Kuemmel
305 107 Martin Kuemmel
h2. idl
306 107 Martin Kuemmel
307 107 Martin Kuemmel
We do have idl installed on our cluster with sufficient licences. Running idl on a computing node requires a specific setup plus user-specific adjustments. This process is too complicated to explain here. Please ask Matthias Klein, our local expert, if you want to use idl on the cluster.
308 107 Martin Kuemmel
309 107 Martin Kuemmel
h2. jupyter notebook
310 107 Martin Kuemmel
311 107 Martin Kuemmel
Jupyter notebook is a very handy tool for prototyping in the software development process. Jupyter notebook is a client-server application, where the client runs in a browser. On the cluster it is possible to run the jupyter notebook server on cosmogw but have the client run locally on your laptop.
312 107 Martin Kuemmel
313 107 Martin Kuemmel
With this approach the user has on the one hand the cluster environment for processing and file storage and on the other hand a convenient development environment on the local browser with a minimum of data transfer.
314 107 Martin Kuemmel
315 107 Martin Kuemmel
To run jupyter notebook in this setup you need to do the following (following the process described "here":https://ljvmiranda921.github.io/notebook/2018/01/31/running-a-jupyter-notebook/):

* From your local host connect to cosmogw: <pre>local$ ssh -Y <user>@cosmogw.kosmo.physik.uni-muenchen.de</pre>
* On cosmogw, open jupyter notebook for a specific port: <pre>cosmogw$ jupyter notebook --no-browser --port=<your remote port></pre>
* On the local host, forward the port <your remote port> to the port <your local port>: <pre>local$ ssh -N -f -L localhost:<your local port>:localhost:<your remote port> <user>@cosmogw.kosmo.physik.uni-muenchen.de</pre>
* Now you can connect to the jupyter notebook server running on cosmogw by opening in your browser: <pre>localhost:<your local port></pre>
* When you do this the first time, you may have to authenticate yourself using the token shown when firing up the jupyter notebook on cosmogw.
* While the local port number <your local port> is almost arbitrary (it just should not be used by other services on your local machine), the remote port number <your remote port> needs to be unique in order to not interfere with other users. I would recommend to always use the *same <your remote port>* and to use the number 8000+"your birth day" (which would be 8014 for me) to generate some kind of random number for all users.
* After you are done, please do not forget to kill the port forwarding process on your local machine. You can find the relevant process number with <pre>ps -ef | grep <your local port></pre>
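
The port convention and the forwarding command from the list above can be combined into two small shell functions for your local .bashrc. This is a sketch under assumptions: the function names, the user name @jdoe@ and the local port 8888 are made up, and the command is only printed, not executed.

```shell
#!/bin/bash
# Remote port following the "8000 + day of birth" convention from above.
remote_port() {
    local day=$1
    echo $(( 8000 + 10#$day ))    # 10# avoids octal parsing of e.g. "08"
}

# Print (not run) the forwarding command; user and local port are examples.
tunnel_cmd() {
    local user=$1 local_port=$2 remote=$3
    echo "ssh -N -f -L localhost:${local_port}:localhost:${remote} ${user}@cosmogw.kosmo.physik.uni-muenchen.de"
}

tunnel_cmd jdoe 8888 "$(remote_port 14)"
```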
324 107 Martin Kuemmel
325 109 Martin Kuemmel
This process may seem a bit complicated, but with aliases and pre-defined functions in your .bashrc/.profilerc the setup becomes quite natural, and the speedup and convenience make it worthwhile in any case.

While jupyter notebooks are quite handy for prototyping or SW development, no processing is allowed on cosmogw, since this puts too much load onto this gateway machine! When going for processing or production, please export your code from jupyter notebook!
328 107 Martin Kuemmel
329 120 Martin Kuemmel
h1. Connecting to cosmogw
330 120 Martin Kuemmel
331 120 Martin Kuemmel
h2. How to setup a VNC connection to cosmogw
332 120 Martin Kuemmel
333 120 Martin Kuemmel
Virtual Network Computing (VNC) offers a convenient and fast way to connect to cosmogw. The user sets up a desktop on cosmogw and then connects directly to this desktop. Applications such as xterms, editors or browsers are kept alive between connections and the time delay is minimal. It is like working on your own desktop.
334 120 Martin Kuemmel
335 120 Martin Kuemmel
To setup the VNC connection you have to do the following:
336 120 Martin Kuemmel
337 120 Martin Kuemmel
* ssh from your laptop to cosmogw or any other access machine:
 <pre>$ ssh <your name>@cosmogw.kosmo.physik.uni-muenchen.de</pre>

* on cosmogw start a vnc session using:
<pre>$ vncserver</pre>
  Some notes on this:

  ** the first time you have to pick a password which will be asked for when establishing a remote connection;
  ** the command 'vncserver' has lots of options, such as the geometry of the desktop: 'vncserver -geometry 2500x1400';
  ** logs and more information on the current VNC session are available in '$HOME/.vnc/' (see the 'xstartup' and log files there);

* the command 'vncserver' gives a session number, which in this case is '3':
  $ vncserver
  New 'cosmogw:3 (<your name>)' desktop is cosmogw:3

  Starting applications specified in /home/<your name>/.vnc/xstartup
  Log file is /home/<your name>/.vnc/cosmogw:3.log

* to be able to connect to this session you need to establish an ssh tunnel from your laptop to cosmogw via:
<pre>$ ssh -C -L 5901:localhost:5903 -N -f <your name>@cosmogw.kosmo.physik.uni-muenchen.de</pre>
  There are some magic numbers:

  ** 5900 is the base port number for a VNC connection
  ** 5903 = 5900 + 3 connects to session 3 established with 'vncserver'
  ** 5901 = 5900 + 1 sends the connection to display 1 on your laptop (--> localhost:1)

* now you can start the client on your laptop and connect to localhost:1 (my client also accepts localhost:5901); you will be asked for the password declared for 'vncserver'.

* after this setup, the next connection is established by creating the tunnel and connecting with the client

* when you re-connect to the VNC the desktop is in the state you left it, meaning all the shells, editors etc. are still there.

* from that desktop you can start interactive slurm shells or start slurm scripts etc.;
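
The port numbers in the tunnel command follow directly from the session numbers, so the command can be generated rather than typed by hand. The functions below are a sketch for a local .bashrc; the names @vnc_port@ and @vnc_tunnel_cmd@ and the user name @jdoe@ are made up, and the command is only printed, not executed.

```shell
#!/bin/bash
# TCP port that belongs to a VNC display/session number (5900 + N).
vnc_port() {
    echo $(( 5900 + $1 ))
}

# Print the tunnel command for a remote session and a local display.
vnc_tunnel_cmd() {
    local user=$1 remote_session=$2 local_display=$3
    echo "ssh -C -L $(vnc_port "$local_display"):localhost:$(vnc_port "$remote_session") -N -f ${user}@cosmogw.kosmo.physik.uni-muenchen.de"
}

vnc_tunnel_cmd jdoe 3 1
```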
370 120 Martin Kuemmel
371 45 Roy Henderson
h1. Admin
372 45 Roy Henderson
373 102 Martin Kuemmel
There is a user "slurm" which however is not really necessary for the administration work. The slurm administrator needs sudo access. Some scripts re-starting slurm, adding a user and similar things are in "/data1/users/slurm/cosmo". With the sudo access the admin can execute those scripts. In the mysql database there is the username "slurmdb" with password.
374 63 Martin Kuemmel
375 63 Martin Kuemmel
h2. Slurm configuration
376 63 Martin Kuemmel
377 1 Kerstin Paech
h3. Slurm configuration file
378 63 Martin Kuemmel
379 102 Martin Kuemmel
The currently valid version of the configuration file is "/data1/users/slurm/cosmo/slurm.conf" on cosmogw. To apply a modified slurm configuration, the script "newconfig.sh" can be used.
380 63 Martin Kuemmel
381 63 Martin Kuemmel
The script 
382 63 Martin Kuemmel
383 1 Kerstin Paech
* copies the configuration file to the submit node and restarts the submit service;
384 63 Martin Kuemmel
* copies the configuration file to all computing nodes and triggers the reconfiguration there;
385 63 Martin Kuemmel
386 102 Martin Kuemmel
Then the slurm daemon needs to be started on the submit node and all computing nodes with the script "restart.sh". 
387 72 Martin Kuemmel
388 72 Martin Kuemmel
*Note:* Right now the slurmd daemons do not properly start on cosmogw. Even if the start fails, the slurmd daemon is there and working.
389 72 Martin Kuemmel
390 62 Martin Kuemmel
h2. User management
391 1 Kerstin Paech
392 62 Martin Kuemmel
h3. Overview over users, accounts, etc.
393 62 Martin Kuemmel
394 50 Sebastian Bocquet
No sudo access needed:
395 50 Sebastian Bocquet
<pre>
396 50 Sebastian Bocquet
/usr/local/bin/sacctmgr show account withassoc
397 1 Kerstin Paech
</pre>
398 1 Kerstin Paech
399 62 Martin Kuemmel
h3. Adding a new user
400 45 Roy Henderson
401 62 Martin Kuemmel
As root on @cosmofs1@,
402 45 Roy Henderson
403 45 Roy Henderson
<pre>
cd /data1/users/slurm/
./add_user.sh <UserName> <account>   # <account> is cosmo or euclid
/usr/local/bin/scontrol reconfigure
</pre>
408 62 Martin Kuemmel
409 45 Roy Henderson
h3. To increase memory, cores etc for a user
410 45 Roy Henderson
411 45 Roy Henderson
The script above contains various commands for changing user settings, e.g.

<pre>
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpCPUs=32
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpMem=128000
</pre>
417 114 Martin Kuemmel
418 114 Martin Kuemmel
h3. Modifying a running job
419 114 Martin Kuemmel
420 114 Martin Kuemmel
It is possible to change the parameters of a running job with 'scontrol'. E.g. to allow for more time, use:

<pre>
/usr/local/bin/scontrol update jobid=<job_id> TimeLimit=<new_timelimit>
</pre>
425 114 Martin Kuemmel
426 62 Martin Kuemmel
h2. Trouble shooting
427 1 Kerstin Paech
428 63 Martin Kuemmel
h3. Information on a particular node
429 1 Kerstin Paech
430 63 Martin Kuemmel
The command "/usr/local/bin/scontrol show node <nodename>" gives detailed information on a particular node (status, reason for being down and so on)
431 63 Martin Kuemmel
432 63 Martin Kuemmel
h3. Node in state "drain"
433 63 Martin Kuemmel
434 50 Sebastian Bocquet
When <pre>sinfo</pre> shows a node in the "drain" state, run
<pre>
/usr/local/bin/scontrol update nodename=NODE_NAME state=resume
</pre>
to put it back into operation.
440 48 Martin Kuemmel
441 94 Martin Kuemmel
h3. Restart
442 94 Martin Kuemmel
443 94 Martin Kuemmel
A fully running slurm requires:

* running the data base mysql;
* running the slurm data base daemon slurmdbd (for accounting);
* running slurmctld on cosmogw;
* running slurmd on all nodes;
449 94 Martin Kuemmel
450 94 Martin Kuemmel
h4. mysql
451 94 Martin Kuemmel
452 103 Martin Kuemmel
Mysql is started with 'systemctl start mysql'. The log is in '/var/log/mysqld.log'. At one re-start (January 2019) the log said "/usr/sbin/mysqld: Can't create/write to file '/var/run/mysqld/mysqld.pid'". At that point '/var/run/mysqld' did not exist (it had somehow disappeared). It had to be created and given to the owner 'mysql'. Then the file mysqld.pid is created and mysql seems to be working fine.
453 94 Martin Kuemmel
454 94 Martin Kuemmel
h4. slurmdbd
455 94 Martin Kuemmel
456 97 Martin Kuemmel
It should be started with 'systemctl start slurmdbd'. However, this does not always seem to work (at least not in January 2019). It is possible to start the daemon directly with '/usr/local/sbin/slurmdbd'. The log is in '/var/log/slurm/slurmdbd.log'.
457 94 Martin Kuemmel
458 94 Martin Kuemmel
h4. slurmctld and slurmd
459 1 Kerstin Paech
460 1 Kerstin Paech
A re-start of the slurm daemons ('slurmctld' on cosmogw and 'slurmd' on the nodes) is done by executing the script:
461 1 Kerstin Paech
/data1/users/slurm/cosmo/restart.sh
462 130 Martin Kuemmel
463 130 Martin Kuemmel
h4. Problems running MPI
464 130 Martin Kuemmel
465 130 Martin Kuemmel
Over the years there have been reports that some nodes cannot be used by Markov Chain Monte Carlo jobs, since the jobs do not start. It looks like this might be caused by RAM that is still occupied by RAM disk storage from processes long gone.
466 130 Martin Kuemmel
467 130 Martin Kuemmel
In March 2023 up to 12GB of the RAM of one node was locked this way. Often this comes from failed cosmodm pipeline runs. Supposedly cosmodm mops up the RAM disk, but only backwards, in the sense that the RAM disk is cleaned at the *beginning* of a new run.
468 130 Martin Kuemmel
469 130 Martin Kuemmel
The script
470 130 Martin Kuemmel
<pre>
471 130 Martin Kuemmel
toNodes.sh "command -opt1 v1 -opt2 v2 ..."
472 130 Martin Kuemmel
</pre>
473 130 Martin Kuemmel
executes a command on all cluster nodes. This helps to clean out the RAM disk and so on.
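
The source of toNodes.sh is not shown on this page. As an illustration only, a loop of the following kind would have the same effect; the function names @node_names@ and @run_on_nodes@ are hypothetical, and it assumes password-less ssh from cosmogw to the nodes euclides01 to euclides21.

```shell
#!/bin/bash
# Print the names of all compute nodes, euclides01 ... euclides21.
node_names() {
    local i
    for i in $(seq -w 1 21); do
        echo "euclides${i}"
    done
}

# Run one command string on every node via ssh and report failures.
run_on_nodes() {
    local cmd="$1" node
    for node in $(node_names); do
        echo "== ${node} =="
        ssh -o BatchMode=yes -o ConnectTimeout=5 "${node}" "${cmd}" \
            || echo "command failed on ${node}" >&2
    done
}

# Example: show RAM disk usage on all nodes
# (that the RAM disk is mounted at /dev/shm is an assumption):
#   run_on_nodes "df -h /dev/shm"
```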
474 94 Martin Kuemmel
475 48 Martin Kuemmel
h2. Nodes down
476 48 Martin Kuemmel
477 1 Kerstin Paech
Sometimes nodes are reported as "down". This seems to happen as a result of network problems. Here is some "troubleshooting":https://computing.llnl.gov/linux/slurm/troubleshoot.html#nodes for this situation. Also after a re-boot of cosmofs1 some manual work on slurm might be necessary to get going again.
478 63 Martin Kuemmel
479 76 Martin Kuemmel
If a job does not finish and remains in the state "CG", then the sequence:
480 76 Martin Kuemmel
<pre>
481 98 Martin Kuemmel
/usr/local/bin/scontrol update NodeName=euclides01 State=down Reason=hung_proc
482 98 Martin Kuemmel
/usr/local/bin/scontrol update NodeName=euclides01 State=resume Reason=hung_proc
483 76 Martin Kuemmel
</pre>
484 76 Martin Kuemmel
brings the node back again.
485 76 Martin Kuemmel
486 1 Kerstin Paech
h2. History
487 89 Martin Kuemmel
488 117 Martin Kuemmel
h3. Incident by machine
489 117 Martin Kuemmel
490 123 Martin Kuemmel
|_. Node |_. Date |_. Reason |_. Solution |_. Comment |
491 134 Martin Kuemmel
| euclides13 | Jan 11th  24 |  drain | resume with scontrol | |
492 133 Martin Kuemmel
| euclides03 | Nov 8th  23 |  down* | kill and restart the slurmd, resume scontrol | |
493 132 Martin Kuemmel
| euclides06 | Nov 8th  23 |  draining | resume with scontrol | |
494 128 Martin Kuemmel
| euclides16 | February 4th  22 |  drain | resume | |
495 128 Martin Kuemmel
| euclides15 | February 4th  22 |  drain | resume | |
496 126 Martin Kuemmel
| euclides19 | October 15th  21 |  down* | restart and resume | |
497 124 Martin Kuemmel
| euclides05 | June 22nd  21 |  down* | restart and resume | |
498 121 Martin Kuemmel
| euclides01 | Feb. 9th  21 |  down* | restart and resume | |
499 118 Martin Kuemmel
| euclides14 | Dec. 7th  20 |  down | restart and resume | |
500 118 Martin Kuemmel
| euclides12 | Nov. 18th  20 |  down | restart and resume | |
501 118 Martin Kuemmel
| euclides14 | Nov. 4th  20 |  drain | resume | |
502 118 Martin Kuemmel
| euclides14 | Oct. 16th  20 |  draining |  | |
503 118 Martin Kuemmel
| euclides16 | Sept. 17th  20 |  down* |  restart and resume | |
504 118 Martin Kuemmel
| euclides01 | June 30th  20 |  down* | restart and resume | |
505 118 Martin Kuemmel
| euclides02 | June 30th  20 |  down |  resume | |
506 119 Martin Kuemmel
| euclides12 | Jan. 18th  18 |  CG |  specific procedure (see below) | |
507 117 Martin Kuemmel
508 117 Martin Kuemmel
h3. Detailed description of the incidents
509 117 Martin Kuemmel
510 127 Martin Kuemmel
* October 15th 2021: it was not possible to connect to euclides19. After a reboot (button "Power Cycle System") everything worked again. It looks like "down*" is really associated with some kind of network problem;
511 127 Martin Kuemmel
512 125 Martin Kuemmel
* June 22nd 2021: euclides05 was in state "down*". Setting it to down and then resume set the status to "idle*". It turned out that slurmd was no longer working and had to be restarted.
513 125 Martin Kuemmel
514 116 Martin Kuemmel
* December 7th 2020: euclides14 was in state "down". Root ssh was no longer possible. Re-started the machine and brought it back with "scontrol". The entire process was *really* slow. Also it looks like the machine had not been responsive to my test jobs for some days. Maybe it went from running a job directly to "down".
515 116 Martin Kuemmel
516 115 Martin Kuemmel
* November 18th 2020: euclides12 was in state "down". ssh was possible as root but not as a normal user. Re-started the machine and brought it back with "scontrol";
517 115 Martin Kuemmel
518 113 Martin Kuemmel
* November 4th 2020: euclides14 was in state "drain". Resumed with "scontrol";
519 112 Martin Kuemmel
520 112 Martin Kuemmel
* October 16th 2020: euclides14 was in state "draining". Resumed with "scontrol";
521 112 Martin Kuemmel
522 111 Martin Kuemmel
* September 17th 2020: euclides16 was in state "down*" (unable to ssh to it). euclides16 had to be rebooted;
523 111 Martin Kuemmel
524 106 Martin Kuemmel
* June 30th 2020: euclides01 was in state "down*" (unable to ssh to it) and euclides02 in state "down". euclides02 could be integrated into the queue via scontrol, euclides01 had to be rebooted;
525 106 Martin Kuemmel
526 104 Martin Kuemmel
* June 4th 2020: another user (Aditya) got the error (as below):
527 104 Martin Kuemmel
<pre>
528 104 Martin Kuemmel
$srun --x11=first -u -n 20 --mem=3000 --pty  bash
529 104 Martin Kuemmel
srun: error: plugin_load_from_file: dlopen(/usr/local/lib/slurm/select_cons_res.so): /usr/local/lib/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
530 104 Martin Kuemmel
srun: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
531 104 Martin Kuemmel
srun: fatal: Can't find plugin for select/cons_res
532 104 Martin Kuemmel
</pre>
533 104 Martin Kuemmel
That was a setup problem as well. However, just unsetting "export LD_BIND_NOW=0" did not do the job. The user had a setup for IDL and for connecting with a virtual desktop. Moving to a setup derived from mine solved the problem. It is, however, possible to connect with a virtual desktop and work with IDL in slurm;
534 104 Martin Kuemmel
535 100 Martin Kuemmel
* January 2nd 2020: one user (Thomas) got the error:
536 100 Martin Kuemmel
<pre>
537 99 Martin Kuemmel
[cosmogw][~] $ srun -u --mem-per-cpu=2000 --x11=first --cpus-per-task=56 --pty bash 
538 1 Kerstin Paech
srun: error: plugin_load_from_file: dlopen(/usr/local/lib/slurm/select_cons_res.so): /usr/local/lib/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
539 99 Martin Kuemmel
srun: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
540 100 Martin Kuemmel
srun: fatal: Can't find plugin for select/cons_res
541 101 Martin Kuemmel
</pre>
542 1 Kerstin Paech
It turned out that the problem was caused by setting "export LD_BIND_NOW=1", which had been done for another issue. After unsetting it, slurm worked normally.
543 104 Martin Kuemmel
544 104 Martin Kuemmel
* December 30th 2019: somehow the controller daemon slurmctld on cosmogw was down. After a re-start everything was okay.
545 104 Martin Kuemmel
546 85 Martin Kuemmel
* January 23rd 2018: Jobs on euclides12 are no longer finishing. They end up in the state "CG" and hang there forever. In the slurmd log there is the entry "[2018-01-23T10:12:17.477] [18153] error: Unable to establish controller machine" basically every 15 minutes or so. ssh from euclides12 to cosmogw via name and IP address was possible, so it is difficult to interpret this error message. In the end the problem was solved by:
547 81 Martin Kuemmel
** stopping slurmd
548 81 Martin Kuemmel
** removing /var/run/slurmd.pid
549 81 Martin Kuemmel
** creating /var/run/slurmd.pid via touch
550 81 Martin Kuemmel
** re-starting slurmd again
551 86 Martin Kuemmel
** euclides12 had sometimes caused problems before; maybe this was the culmination.
552 81 Martin Kuemmel
553 73 Martin Kuemmel
* May 18th 2017: On cosmogw, three nodes were reported as "DOWN" despite running the slurmd daemon and having connections to the slurmctld daemon on the control node; it turned out that with a normal "/etc/init.d/slurm start" on the control machine only nodes that are *not* DOWN are considered; "/etc/init.d/slurm startclean" must be used to establish new connections to all nodes and take them back into the queue;
554 73 Martin Kuemmel
555 66 Martin Kuemmel
* May 2nd 2017: the control daemon on cosmofs1 was no longer working; also it could not be re-started; the corresponding commands "/etc/init.d/slurm status/start" were not giving back any kind of feedback, the log files were empty; the relevant daemon on the nodes, "slurmd", was running smoothly; a comparison revealed that the difference was whether the command "/usr/local/bin/scontrol show daemon" returns the daemon name or nothing, and in the latter case nothing happens and the daemon does not run well; further investigation showed that the machine name given in "slurm.conf" as "ControlMachine=" needs to be identical to the name returned by the command "hostname"; this was no longer the case, likely as a result of moving the machines to the new sub-net (the exact mechanism is unclear);
556 66 Martin Kuemmel
557 65 Martin Kuemmel
* April 24th 2017: taking euclides11 out of the queues to free it for the new OS and the slurm test on it; euclides10 is now the development node;
558 63 Martin Kuemmel
559 63 Martin Kuemmel
* April 07th 2017: Applying "/usr/local/bin/scontrol show node euclides11" for the debug partition euclides11 says "Reason=Node unexpectedly rebooted [root@2016-12-14T13:25:01]"; internet research suggested changing "ReturnToService=" from 1 to 2 in the configuration file; after applying the new configuration file and restarting, the debug node works again;
560 63 Martin Kuemmel
561 63 Martin Kuemmel
* April 06th 2017: After the reconfiguration of the cluster the slurm configuration file was adjusted (to reflect the new machine names); also minor changes had to be applied to the scripts "newconfig.sh" and "restart.sh" to loop over the new names; the new configuration files were applied and slurm restarted; all computing nodes for the normal partition came up, the debug partition stayed down;
562 63 Martin Kuemmel
563 63 Martin Kuemmel
* March 29th 2017: euclides7 is in drain state; "/usr/local/bin/scontrol show node euclides2" says "Reason=Epilog error"; when resumed, seems to work normally;
564 63 Martin Kuemmel
565 63 Martin Kuemmel
* March 28th 2017: euclides2 is in drain state; when resumed, it goes into drain state when using it the next time; "/usr/local/bin/scontrol show node euclides2" says "Reason=Prolog error"; after a reboot the machine was in status "idle*"; when resumed, it worked again;
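
The check behind the May 2nd 2017 entry (the "ControlMachine=" value in "slurm.conf" must be identical to the output of "hostname") can be scripted. This is a minimal sketch under assumptions: the function names are made up, the location of slurm.conf is not given on this page and has to be passed in, and newer slurm versions may use a different keyword for the control host.

```shell
#!/bin/bash
# Print the value of the first ControlMachine= entry in a slurm.conf-style file.
control_machine() {
    sed -n 's/^[[:space:]]*ControlMachine=\([^[:space:]#]*\).*/\1/p' "$1" | head -n 1
}

# Succeed only if the configured control machine equals the given host name.
check_control_machine() {
    local conf="$1" host="$2" configured
    configured=$(control_machine "${conf}")
    if [ "${configured}" = "${host}" ]; then
        echo "OK: ControlMachine=${configured} matches hostname"
    else
        echo "MISMATCH: ControlMachine=${configured}, hostname=${host}"
        return 1
    fi
}

# Example on the control machine (the path is a placeholder):
#   check_control_machine /path/to/slurm.conf "$(hostname)"
```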