- Hardware overview
- How to run jobs on the euclides nodes (using Slurm)
- Control node cosmogw
- euclides nodes
- Scheduling of Jobs
- Running an interactive job with slurm (a.k.a. logging in)
- limited ssh access
- Running a simple one core batch job with slurm using the default partition
- Running a simple one core batch job with slurm using the lowpri partition
- Accessing a node where a job is running or starting additional processes on a node
- Batch script for running a multi-core job
- environment for jobs
- Slurm reporting and accounting
- Some points on the 'normal' versus 'lowpri' queue on cosmogw
- desdb node
- Software specific setup
- Notes For Euclid users
- Running specific applications on the cluster
- Connecting to cosmogw
- Admin
Please read through this entire wikipage so everyone can make efficient use of this cluster
Hardware overview¶
You access the Euclid cluster through cosmogw.kosmo.physik.uni-muenchen.de
- cosmogw is a gateway machine and should not be used for computing;
- there are 21 compute nodes named euclides01--euclides11 and euclides12--euclides21;
- euclides01-euclides11 each have 32 logical CPUs and 64GB of RAM;
- euclides12-euclides21 each have 56 logical CPUs and 128GB of RAM;
How to run jobs on the euclides nodes (using Slurm)¶
Use slurm to submit jobs or login to the euclides nodes (euclides01-21).
Control node cosmogw¶
The machine cosmogw is the login node and the submit node for the slurm queue, so please do not use it as a simple compute node; its hardware is different from that of the nodes. It hosts our file server and other services that are important to us.
You should use cosmogw to:
- transfer files;
- develop your code;
- compile your code;
- submit jobs to the nodes via the slurm queues;
If you need to debug and would like to log in to a node, please start an interactive job on one of the nodes using slurm. For instructions see below.
euclides nodes¶
Job submission to the euclides nodes is handled by the slurm jobmanager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).
Important: In order to run jobs, you need to be added to the slurm accounting system - please contact the admin
All slurm commands listed below have very helpful man pages (e.g. 'man slurm', 'man squeue', ...).
If you are already familiar with another jobmanager the following information may be helpful to you http://slurm.schedmd.com/rosetta.pdf.
Scheduling of Jobs¶
At this point there are four queues (called partitions in slurm) on cosmogw:
- normal, which is the default partition your jobs will be sent to if you do not specify otherwise. At this point there is a time limit of four days; this queue comprises the computing nodes euclides01-19;
- lowpri, which also comprises the computing nodes euclides01-19; it is a so-called preemptible queue, allowing more resources for the users; however, jobs are re-queued (canceled and re-scheduled) if the resources are demanded on the normal queue;
- eucliddevel, which is intended for software development, especially if the 'normal' partition is full; this queue comprises the computing nodes euclides20-21; people from the Euclid group have an account on this queue; each user is allowed to use up to 56 cpus;
- cosmodevel, which is intended for software development, especially if the 'normal' partition is full; this queue comprises the computing nodes euclides20-21; people from the cosmology group have an account on this queue; each user is allowed to use up to 4 cpus; note that this queue is preemptible, meaning that the users of the queue eucliddevel take precedence;
The default memory per core used is 2GB, if you need more or less, please specify with the --mem or --mem-per-cpu option.
We have also set up a scheduler that goes beyond first come, first served: some jobs will be favoured over others depending
on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many
resources it will consume.
Job scheduling is a complex issue and we can and have to adjust to the users' needs whenever possible. Please feel free to speak out if
there is something that can be improved without creating an unfair disadvantage for other users.
You can run interactive jobs on all partitions.
Running an interactive job with slurm (a.k.a. logging in)¶
To run an interactive job with slurm in the default partition, use
srun -u --pty bash
If you want to use tcsh use
srun -u --pty tcsh
If you want to use a larger memory per job do
srun -u --mem-per-cpu=8000 --pty tcsh
In case you want to open x11 applications, use the --x11=first option, e.g.
srun --x11=first -u --pty bash
Opening an interactive session on one of the development partitions is done with:
srun --account=<euclid_dev/cosmo_dev> --partition=<eucliddevel/cosmodevel> --x11=first -u --pty bash
limited ssh access¶
If you have an active job (batch or interactive), you can log in to the node the job is running on. Your ssh session will be killed if the job terminates. Your ssh session will be restricted to the same resources as your job (so you cannot accidentally bypass the job scheduler and harm other users' jobs).
Running a simple one core batch job with slurm using the default partition¶
- To see what queues are available to you (called partitions in slurm), run:
sinfo
- To run a batch job, create a file myjob.slurm containing the following information:
#!/bin/bash
#SBATCH --output=slurm.%N.%j.out
#SBATCH --error=slurm.%N.%j.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal
#SBATCH --ntasks=1
/bin/hostname
Note that %N and %j resolve to the node name and the slurm job ID number, respectively (e.g. "slurm.euclides09.461336.out"). Including this information in the log file names automatically generates run-specific log files that are not overwritten by subsequent runs and remain available for a long-term evaluation of the slurm runs.
- To submit a batch job use:
sbatch myjob.slurm
- To see the status of your job, use
squeue
- To kill a job use:
scancel <jobid>
the <jobid> you can get from using squeue.
- For some more information on your job use
scontrol show job <jobid>
the <jobid> you can get from using squeue.
Running a simple one core batch job with slurm using the lowpri partition¶
Change the partition to lowpri and add the appropriate account depending on whether you are part of
the euclid or cosmology group.
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH --account=[euclid_lowpri/cosmo_lowpri]
#SBATCH --partition=lowpri
#SBATCH --ntasks=1
/bin/hostname
Accessing a node where a job is running or starting additional processes on a node¶
You can attach an srun command to an already existing job (batch or interactive). This
means you can start an interactive session on a node where a job of yours is running
or start an additional process.
First determine the jobid of the desired job using squeue, then use
srun --jobid <jobid> [options] <executable>
Or, more concretely:
srun --jobid <jobid> -u --pty bash # to start an interactive session
srun --jobid <jobid> ps -eaFAl # to get detailed process information
The processes will only run on cores that have been allocated to you. This works
for batch as well as interactive jobs.
Important: If the original job that was submitted is finished, any process
attached in this fashion will be killed.
Batch script for running a multi-core job¶
mpi is installed on cosmofs1.
To run a 4 core job for an executable compiled with mpi you can use
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH --ntasks=4
mpirun <programname>
and it will automatically start on the number of cores specified.
To ensure that the job is being executed on only one node, add
#SBATCH -N 1
to the job script.
If you would like to run a program that itself starts processes, you can use the
environment variable $SLURM_NPROCS that is automatically defined for slurm
jobs to explicitly pass the number of cores the program can run on.
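A minimal sketch of passing the allocated core count on to a program is shown here. The program name `my_program` and its `--threads` option are hypothetical placeholders, and falling back to 1 outside of a slurm job is an assumption made for convenience.

```shell
#!/bin/bash
# Return the number of cores slurm allocated to this job.
# Outside of a slurm job SLURM_NPROCS is unset, so fall back to 1 (assumption).
worker_count() {
    echo "${SLURM_NPROCS:-1}"
}

# Dry run: print the command line instead of executing the hypothetical program.
echo "my_program --threads $(worker_count)"
```

Inside a job submitted with `--ntasks=4` this prints a command line with `--threads 4`; replace the `echo` with the real call once the option name of your program is known.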
To check if your job is actually running on the specified number of cores, you can check
the PSR column of
ps -eaFAl # or ps -eaFAl | egrep "<yourusername>|UID" if you just want to see your jobs
environment for jobs¶
By default, slurm does not initialize the environment (using .bashrc, .profile, .tcshrc, ...)
To use your usual system environment, add the following line in the submission script:
#SBATCH --get-user-env
Slurm reporting and accounting¶
For information on job usage and cluster utilization for slurm jobs the slurm command "sreport" can be used. E.g. the command:
sreport user topusage start=01/15/18 -t percent
shows the top ten users in percent since January 15th 2018. For more information please look at "man sreport".
For accounting on specific jobs the slurm command "sacct" can be used. E.g. the command:
sacct -j 18551 --format=JobID,JobName,MaxRSS,Elapsed
displays information (elapsed time, memory usage, ...) on the job number "18551".
As usual with slurm commands there are many options to refine the search and format the desired output. The command:
sacct --format Start,End,User,JobID,state,partition -u <name> --starttime 2023-07-01
lists all jobs of the user <name> since 1st July 2023. The information provided includes the end state, the end time and so on.
For more details please use "man sacct".
Some points on the 'normal' versus 'lowpri' queue on cosmogw¶
The allowances for each user on the normal partition are 304 CPUs and 661335 MB of memory, which corresponds to about 1/3 of the entire cluster (euclides01-19). In short, every user is allowed to use up to 1/3 of the cluster in the normal partition.
On the partition lowpri (for low priority) there are no limits on the CPU numbers or RAM consumption, meaning the user can take all available resources up to the entire cluster! However, jobs on the partition "lowpri" have a lower priority through the so called preemption mechanism. This means if all nodes are busy (partially through the lowpri queue) and an additional job is submitted to the "normal" partition, slurm will re-queue (meaning cancel and re-schedule to the lowpri-queue) job(s) on the "lowpri" partition to get the job on the "normal" partition running.
Here is an example scenario to illustrate the opportunities the "lowpri" partition offers:
I want to submit a number of jobs for in total 744 CPUs. The entire cluster has 744 CPUs in total; this means that in the optimal case I get 1/3 of the cluster on the "normal" partition, and it takes at least three cycles to get all my jobs finished. However, if I submit to the "lowpri" partition, in the case of an empty cluster I can use the entire cluster and finish in only one cycle. Of course it may happen that other users submit lots of jobs to the "normal" partition afterwards and many of my jobs are re-queued. That would then delay the completion of my jobs on the "lowpri" partition correspondingly.
To highlight some aspects of using the "lowpri" partition:
- it is relevant especially when you want to submit several jobs that significantly exceed the user allowance on the "normal" partition and need the entire cluster to get finished;
- on average, the available resources on the "lowpri" partition are much larger than on the "normal" partition, especially during the night or on the weekend;
- please note that no job ever gets lost on the "lowpri" partition; if re-queuing occurs, the user gets an email (Subject: "SLURM Job_id=2563 Name=test_mpi_gather.slurm Failed, Run time 00:01:58, PREEMPTED, ExitCode 0") when the job is stopped and subsequently when it starts again and when it finishes (see 1.);
- on the "lowpri" partition there is also a queue which decides which job comes first (of course only in the case of an oversubscription);
- the preemption mechanism tries to minimize the number of re-queued jobs necessary to get the job in the "normal" partition going; so, if 8 cpus are requested and the "lowpri" partition contains one job using 8 cpus, three jobs using 4 cpus and several dozen jobs using 1 cpu, only the job with 8 cpus is re-scheduled, independent of the run times and other parameters.
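The "three cycles versus one cycle" estimate in the example scenario is a ceiling division, which can be sketched as follows. The function name `min_cycles` is illustrative, the CPU figures are the ones quoted in the text, and the estimate assumes that all jobs take about the same time and that nobody else uses the cluster.

```shell
#!/bin/bash
# Minimum number of scheduling rounds when at most "allowance" CPUs
# may be used at once: ceil(requested / allowance) in integer arithmetic.
min_cycles() {
    local requested="$1" allowance="$2"
    echo $(( (requested + allowance - 1) / allowance ))
}

min_cycles 744 304   # normal partition with the 304 CPU allowance: prints 3
min_cycles 744 744   # lowpri partition on an empty cluster: prints 1
```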
To submit a job to the "lowpri" partition please insert the following lines into the slurm batch script (see also example above):
#SBATCH --account=<your account>
#SBATCH -p lowpri
with <your account> being either "cosmo_lowpri" or "euclid_lowpri".
There are two typical scenarios where a user can gain from the lowpri queue:
- if a job stores intermediate results at regular intervals and picks up from there once started again; then even a long job loses only the computing time since the last storage point if it is re-scheduled;
- if a single job needs only a small amount of computing time (perhaps <12h) but a lot of jobs need to be run; then the loss of computing time is rather small if a job is re-scheduled;
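The first scenario (storing intermediate results and picking up from there) can be sketched in a few lines of shell. This is only an illustration of the pattern, not a cluster convention: the file name `progress.ckpt`, the variable names and the step counter are all assumptions, and a real job would store its actual intermediate results alongside the step number.

```shell
#!/bin/bash
# Restartable job body: progress is recorded after every step, so a re-queued
# job continues after the last completed step instead of starting over.
# CHECKPOINT and TOTAL_STEPS are illustrative names.
CHECKPOINT="${CHECKPOINT:-progress.ckpt}"
TOTAL_STEPS="${TOTAL_STEPS:-10}"

run_steps() {
    local step=0
    # Resume from the stored step number if a non-empty checkpoint file exists.
    if [ -s "$CHECKPOINT" ]; then
        step=$(cat "$CHECKPOINT")
    fi
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$(( step + 1 ))
        # ... the real work for this step would go here ...
        # Write to a temporary file first so that a preemption in the middle
        # of the write never leaves a half-written checkpoint behind.
        echo "$step" > "$CHECKPOINT.tmp" && mv "$CHECKPOINT.tmp" "$CHECKPOINT"
    done
    echo "finished at step $step"
}
```

Calling `run_steps` at the end of the batch script starts the work; after a re-queue the same script continues from the last recorded step.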
desdb node¶
Some specific jobs in cosmodb, such as the "catalog ingest", need to be performed on the machines desdb1/2. For those jobs there is the slurm account "euclid_cat_ing" with the partition "cat_ing". Only selected persons from the Euclid group have access to this node. Please specify "-p cat_ing" and "--account euclid_cat_ing" on the command line or in the slurm script.
Software specific setup¶
Python environment¶
You can use the python 2.7.3 installed on the euclides cluster by using
source /data2/users/ccsoft/etc/setup_all
source /data2/users/ccsoft/etc/setup_python2.7.3
Notes For Euclid users¶
For those submitting jobs to euclides* nodes through Cosmo DM pipeline here are some things which need to be specified for customized job submissions,
since a different interface to slurm is used.
- To use larger memory per block, specify max_memory = 6000 (for 6G) and so on, inside the block definition or in the submit file (in
case you want to use it for all blocks)
- If you want to run on multiple nodes/cores then use
nodes='<number of nodes>:ppn=<number of cores>' inside the block definition of a particular block or in the submit file in case you want
to use it for all blocks.
- If you want to use a larger wall time then specify wall_mod=<wall time in minutes> inside the module definition
- note that queue=serial does not work on cosmofs1 (we usually use it for c2pap)
Running specific applications on the cluster¶
idl¶
We do have idl installed on our cluster with sufficient licences. Running idl on a computing node requires a specific setup plus user-specific adjustments. This process is too complicated to explain here. Please ask Matthias Klein, our local expert, if you want to use idl on the cluster.
jupyter notebook¶
Jupyter notebook is a very handy tool for prototyping in the software development process. Jupyter notebook is a client-server application, where the client runs in a browser. On the cluster it is possible to run the jupyter notebook server on cosmogw but have the client run locally on your laptop.
With this approach the user has on the one hand the cluster environment for processing and file storage and on the other hand a convenient development environment on the local browser with a minimum of data transfer.
To run jupyter notebook in this setup you need to (following the process here):
- From your local host connect to cosmogw:
local$ ssh -Y <user>@cosmogw.kosmo.physik.uni-muenchen.de
- On cosmogw, open jupyter notebook for a specific port:
cosmogw$ jupyter notebook --no-browser --port=<your remote port>
- On the local host, forward the port <your remote port> to the port <your local port>
local$ ssh -N -f -L localhost:<your local port>:localhost:<your remote port> <user>@cosmogw.kosmo.physik.uni-muenchen.de
- Now you can connect to the jupyter notebook server running on cosmogw by connecting in your browser
localhost:<your local port>
- When you do this the first time, you may have to authenticate yourself using the token shown when firing up the jupyter notebook on cosmogw
- While the local port number <your local port> is almost arbitrary (it just should not be used by other services on your local machine), the remote port number <your remote port> needs to be unique in order not to interfere with other users. I would recommend always using the same <your remote port> and using the number 8000+"your birth day" (which would be 8014 for me) to spread the port numbers somewhat randomly across users.
- After you are done, please do not forget to kill the port forwarding process on your local machine. You can find the relevant process number with
ps -ef | grep <your local port>
While this process seems to be a bit complicated, with aliases and pre-defined functions in your .bashrc/.profile the setup becomes quite natural, and the speedup and convenience make it worthwhile in any case.
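Such pre-defined functions could look roughly like the following sketch for a local `.bashrc`. The function names `jn_remote_port` and `jn_tunnel_cmd` and the variable `JN_HOST` are hypothetical; only the ssh options and the 8000 + birth day convention are taken from the steps above. The tunnel function prints the command rather than running it, so it can be inspected first.

```shell
#!/bin/bash
# Hypothetical helpers for a local ~/.bashrc; all names are illustrative.
JN_HOST="cosmogw.kosmo.physik.uni-muenchen.de"

# Remote port following the "8000 + day of birth" convention.
# Pass the day without a leading zero (e.g. 8, not 08).
jn_remote_port() {
    echo $(( 8000 + $1 ))
}

# Print the port-forwarding command (dry run) for a user, a local port
# and a remote port, using the same ssh options as in the steps above.
jn_tunnel_cmd() {
    local user="$1" local_port="$2" remote_port="$3"
    echo "ssh -N -f -L localhost:${local_port}:localhost:${remote_port} ${user}@${JN_HOST}"
}
```

For example, `jn_tunnel_cmd <user> 8888 "$(jn_remote_port 14)"` prints the forwarding command for remote port 8014; wrapping the call in `eval` or copying the printed line executes it.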
While jupyter notebooks are quite handy for prototyping or SW development, no processing is allowed on cosmogw, since this puts too much load onto this gateway machine! When going for processing or production, please export your code from jupyter notebook!
Connecting to cosmogw¶
How to setup a VNC connection to cosmogw¶
Virtual Network Computing (VNC) offers a convenient and fast way to connect to cosmogw. The user sets up a desktop on cosmogw and then connects directly to this desktop. Applications such as xterms, editors or browsers are kept alive between connections and the time delay is indeed minimal. It is like working on your own desktop.
To setup the VNC connection you have to do the following:
- ssh from your laptop to cosmogw or any other access machine:
$ ssh <your name>@cosmogw.kosmo.physik.uni-muenchen.de
- on cosmogw start a vnc session using:
$ vncserver
Some notes on this:
- the first time you have to pick a password which will be asked for when establishing a remote connection;
- the command 'vncserver' has lots of options, such as the geometry of the desktop: 'vncserver -geometry 2500x1400';
- logs and more information on the current VNC session are available in '$HOME/.vnc/';
- the command 'vncserver' gives a session number, which in this case is '3':
$ vncserver
New 'cosmogw:3 (<your name>)' desktop is cosmogw:3
Starting applications specified in /home/<your name>/.vnc/xstartup
Log file is /home/<your name>/.vnc/cosmogw:3.log
- to be able to connect to this session you need to establish an ssh tunnel from your laptop to cosmogw via:
$ ssh -C -L 5901:localhost:5903 -N -f <your name>@cosmogw.kosmo.physik.uni-muenchen.de
There are some magic numbers:
- 5900 is the base port number for a VNC connection
- 5903 = 5900 + 3 connects to session 3 established with 'vncserver'
- 5901 = 5900 + 1 sends the connection to display 1 on your laptop (--> localhost:1)
- now you can start the client on your laptop and connect to localhost:1 (my client also accepts localhost:5901); you will be asked for the password declared for 'vncserver'.
- after this setup, subsequent connections only require creating the tunnel and connecting with the client
- when you re-connect to the VNC the desktop is in the state you left it, meaning all the shells, editors etc. are still there.
- from that desktop you can start interactive slurm shells or start slurm scripts etc.;
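The port arithmetic behind the tunnel can be written down as a small helper. This is a sketch only: the function names `vnc_port` and `vnc_tunnel_cmd` are hypothetical, and the tunnel command is printed instead of executed so it can be checked before use.

```shell
#!/bin/bash
# VNC display/session N listens on TCP port 5900 + N.
VNC_BASE_PORT=5900

vnc_port() {
    echo $(( VNC_BASE_PORT + $1 ))
}

# Print the ssh tunnel command (dry run) that maps a local display number
# to a remote vncserver session number on cosmogw.
vnc_tunnel_cmd() {
    local user="$1" session="$2" display="$3"
    echo "ssh -C -N -f -L $(vnc_port "$display"):localhost:$(vnc_port "$session") ${user}@cosmogw.kosmo.physik.uni-muenchen.de"
}
```

With session 3 on cosmogw and display 1 on the laptop, `vnc_tunnel_cmd <your name> 3 1` prints a command forwarding local port 5901 to remote port 5903, matching the example above.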
Admin¶
There is a user "slurm" which however is not really necessary for the administration work. The slurm administrator needs sudo access. Some scripts re-starting slurm, adding a user and similar things are in "/data1/users/slurm/cosmo". With the sudo access the admin can execute those scripts. In the mysql database there is the username "slurmdb" with password.
Slurm configuration¶
Slurm configuration file¶
The currently valid version of the configuration file is "/data1/users/slurm/cosmo/slurm.conf" on cosmogw. To apply a modified slurm configuration, the script "newconfig.sh" can be used.
The script
- copies the configuration file to the submit node and restarts the submit service;
- copies the configuration file to all computing nodes and triggers the reconfiguration there;
Then the slurm daemons need to be restarted on the submit node and all computing nodes with the script "restart.sh".
Note: Right now the slurmd daemons do not properly start on cosmogw. Even if the start fails, the slurmd daemon is there and working.
User management¶
Overview over users, accounts, etc.¶
No sudo access needed:
/usr/local/bin/sacctmgr show account withassoc
Adding a new user¶
As root on cosmofs1:
cd /data1/users/slurm/
./add_user.sh UserName account(cosmo or euclid)
/usr/local/bin/scontrol reconfigure
To increase memory, cores etc for a user¶
Inside script above, various commands for changing user settings, e.g.
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpCPUs=32
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpMem=128000
Modifying a running job¶
It is possible to change the parameters of a running job with 'scontrol'. E.g. to allow for more time use:
/usr/local/bin/scontrol update jobid=<job_id> TimeLimit=<new_timelimit>
Trouble shooting¶
Information on a particular node¶
The command "/usr/local/bin/scontrol show node <nodename>" gives detailed information on a particular node (status, reason for being down and so on)
Node in state "drain"¶
When a node is in "drain" state when calling
sinfo
run
/usr/local/bin/scontrol update nodename=NODE_NAME state=resume
to put it back to operation.
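If several nodes are drained at once, the resume commands can be generated from the `sinfo` output. The sketch below is an illustration, not an existing admin script: the function name `drained_resume_cmds` is hypothetical, and it only prints the `scontrol` commands so that the reason for the drain can be checked with `scontrol show node` before anything is resumed.

```shell
#!/bin/bash
# Read "<node> <state>" lines on stdin and print one resume command for
# every node whose state starts with "drain" (drained, draining).
# sort -u removes duplicates, since a node is listed once per partition.
drained_resume_cmds() {
    while read -r node state; do
        case "$state" in
            drain*) echo "/usr/local/bin/scontrol update nodename=$node state=resume" ;;
        esac
    done | sort -u
}
```

On the cluster the input would come from `sinfo -N -h -o "%N %T" | drained_resume_cmds`, where `-N` lists nodes individually, `-h` suppresses the header and `%N %T` prints node name and state.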
Restart¶
A fully running slurm requires:
- running the database mysql;
- running the slurm database daemon slurmdbd (for accounting);
- running slurmctld on cosmogw;
- slurmd on all nodes;
mysql¶
Mysql is started with 'systemctl start mysql'. The log is in '/var/log/mysqld.log'. At one re-start (January 2019) the log said "/usr/sbin/mysqld: Can't create/write to file '/var/run/mysqld/mysqld.pid'". It turned out that '/var/run/mysqld' did not exist (it had somehow disappeared). It had to be created and its ownership given to the user 'mysql'. Then the file mysqld.pid is created and mysql seems to be working fine.
slurmdbd¶
It should be started with 'systemctl start slurmdbd'. However, this does not always seem to work (at least not in January 2019). It is possible to start the daemon directly with '/usr/local/sbin/slurmdbd'. The log of slurm is in '/var/log/slurm/slurmdbd.log'.
slurmctld and slurmd¶
A re-start of the slurm daemons ('slurmctld' on cosmogw and 'slurmd' on the nodes) is done by executing the script:
/data1/users/slurm/cosmo/restart.sh
Problems running MPI¶
Over the years there have been reports that some nodes cannot be used by Markov Chain Monte Carlo jobs, since the jobs do not start. It looks like this might be caused by RAM that is still occupied by RAM disk storage from processes long gone.
In March 2023 up to 12GB of the RAM of one node was locked by this. Often this comes from failed cosmodm pipeline runs. Supposedly cosmodm mops up the RAM disk, but only retroactively, in the sense that the RAM disk is cleaned at the beginning of a new run.
The script:
toNodes.sh "command -opt1 v1 -opt2 v2 ..."
executes a command on all cluster nodes. This helps to clean out the RAM disk and so on.
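How a helper of this kind can work is sketched below. This is not the content of toNodes.sh, which is not shown on this page; the function names are hypothetical, the node names follow the hardware overview, and the use of ssh is an assumption. The loop prints the per-node command instead of running it.

```shell
#!/bin/bash
# Generate the node names euclides01 ... euclides21 (zero-padded).
node_names() {
    local i=1
    while [ "$i" -le 21 ]; do
        printf 'euclides%02d\n' "$i"
        i=$(( i + 1 ))
    done
}

# Dry run: print the ssh command that would run the given command on each node.
for_all_nodes() {
    local node
    for node in $(node_names); do
        echo "ssh $node $*"
    done
}
```

For example, `for_all_nodes df -h /dev/shm` lists one ssh command per node for inspecting a RAM disk mounted at `/dev/shm` (the mount point is an assumption as well).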
Nodes down¶
Sometimes nodes are reported as "down". This seems to happen as a result of network problems. Here is some troubleshooting for this situation. Also after a re-boot of cosmofs1 some manual work on slurm might be necessary to get going again.
If a job does not finish and remains in the state "CG" then the sequence:
/usr/local/bin/scontrol update NodeName=euclides01 State=down Reason=hung_proc
/usr/local/bin/scontrol update NodeName=euclides01 State=resume Reason=hung_proc
brings the node back again.
History¶
Incident by machine¶
Node | Date | Reason | Solution | Comment |
---|---|---|---|---|
euclides13 | Jan 11th 24 | drain | resume with scontrol | |
euclides03 | Nov 8th 23 | down* | kill and restart the slurmd, resume scontrol | |
euclides06 | Nov 8th 23 | draining | resume with scontrol | |
euclides16 | February 4th 22 | drain | resume | |
euclides15 | February 4th 22 | drain | resume | |
euclides19 | October 15th 21 | down* | restart and resume | |
euclides05 | June 22nd 21 | down* | restart and resume | |
euclides01 | Feb. 9th 21 | down* | restart and resume | |
euclides14 | Dec. 7th 20 | down | restart and resume | |
euclides12 | Nov. 18th 20 | down | restart and resume | |
euclides14 | Nov. 4th 20 | drain | resume | |
euclides14 | Oct. 16th 20 | draining | resume with scontrol | |
euclides16 | Sept. 17th 20 | down* | restart and resume | |
euclides01 | June 30th 20 | down* | restart and resume | |
euclides02 | June 30th 20 | down | resume | |
euclides12 | Jan. 18th 18 | CG | specific procedure (see below) | |
Detailed description of the incident¶
- January 3rd 24: all new machines went down since the fuse of the power socket they were connected to blew. All except one came back again after power was restored.
- October 15th 21: it was not possible to connect to euclides19. After a reboot (button "Power Cycle System") everything worked again. It looks like down* is really associated with some kind of network problem;
- June 22nd 2021: euclides05 was in state "down*". Setting it to down and resume set the status to "idle*". It turned out that slurmd was no longer working and had to be restarted.
- December 7th 2020: euclides14 was in state "down". Root ssh was no longer possible. Re-started the machine and brought it back with "scontrol". The entire process was really slow. Also it looks like the machine had not been responsive to my test jobs for some days. Maybe it went from running a job directly to "down".
- November 18th 2020: euclides12 was in state "down". Root ssh was possible but not as user. Re-started the machine and brought it back with "scontrol";
- November 4th 2020: euclides14 was in state "drain". Resumed with "scontrol";
- October 16th 2020: euclides14 was in state "draining". Resumed with "scontrol";
- September 17th 2020: euclides16 in state "down*" (unable to ssh to). euclides16 had to be rebooted;
- June 30th 2020: euclides01 in state "down*" (unable to ssh to) and euclides02 in state "down". euclides02 could be integrated into the queue via scontrol, euclides01 had to be rebooted;
- June 4th 2020: another user (Aditya) got the error (as below):
$ srun --x11=first -u -n 20 --mem=3000 --pty bash
srun: error: plugin_load_from_file: dlopen(/usr/local/lib/slurm/select_cons_res.so): /usr/local/lib/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
srun: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
srun: fatal: Can't find plugin for select/cons_res
That was a setup problem as well. However, just unsetting "export LD_BIND_NOW=0" did not do the job. The user had a setup for IDL and connecting with a virtual desktop. Moving to a setup derived from mine solved the problem. However it is possible to connect with virtual desktop and work with IDL in slurm;
- January 2nd 2020: one user (Thomas) got the error:
[cosmogw][~] $ srun -u --mem-per-cpu=2000 --x11=first --cpus-per-task=56 --pty bash
srun: error: plugin_load_from_file: dlopen(/usr/local/lib/slurm/select_cons_res.so): /usr/local/lib/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
srun: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
srun: fatal: Can't find plugin for select/cons_res
It turned out that the problem was caused by setting "export LD_BIND_NOW=1" for another issue. After unsetting this, slurm worked normally.
- December 30th 2019: somehow the controller daemon slurmctld on cosmogw was down. After a re-start everything was okay.
- January 23rd 2018: Jobs on euclides12 are no longer finishing. They end up in the state "CG" and hang there forever. In the slurmd log there is the entry "[2018-01-23T10:12:17.477] [18153] error: Unable to establish controller machine" basically every 15mins or so. ssh from euclides12 to cosmogw via name and IP address was possible, so it is difficult to interpret this error message. At the end the problem was solved by:
- stopping slurmd
- removing /var/run/slurmd.pid
- creating /var/run/slurmd.pid via touch
- re-starting slurmd again
- euclides12 had before this sometimes created problems, maybe this was the culmination now.
- May 18th 2017: On cosmogw, three nodes were reported as "DOWN" despite running the slurmd daemon and having connections to the slurmctl daemon on the control node; turns out that with a normal "/etc/init.d/slurm start" on the control machine only nodes are considered that are not DOWN; "/etc/init.d/slurm startclean" must be used to establish new connections to all nodes to take them back into the queue;
- May 2nd 2017: the control daemon on cosmofs1 was no longer working; also it could not be re-started; the corresponding commands "/etc/init.d/slurm status/start" were not giving back any kind of feedback, the log files were empty; the relevant daemon on the nodes, "slurmd", was running smoothly; a comparison revealed that the difference was whether the command "/usr/local/bin/scontrol show daemon" returns the daemon name or nothing, and in the latter case nothing happens and the daemon does not run well; further investigation showed that the machine name given in "slurm.conf" as "ControlMachine=" needs to be identical to the name returned by the command "hostname"; this was no longer the case, likely as a result of moving the machines to the new sub-net (the exact mechanism is unclear);
- April 24th 2017: taking euclides11 out of the queues to free it for the new OS and the slurm test on it; euclides10 is now the development node;
- April 07th 2017: Applying "/usr/local/bin/scontrol show node euclides11" for the debug partition euclides11 says "Reason=Node unexpectedly rebooted [root@2016-12-14T13:25:01]"; internet research suggested to change "ReturnToService=" from 1 to 2 in the configuration file; after applying and restarting the new configuration file the debug nodes works again.;
- April 06th 2017: After the reconfiguration of the cluster the slurm configuration file was adjusted (to reflect the new machine names); also minor changes had to be applied to the scripts "newconfig.sh" and "restart.sh" to loop over the new names; the new configuration files were applied and slurm restarted; all computing nodes for the normal partition came up, the debug partition stayed down;
- March 29th 2017: euclides7 is in drain state; "/usr/local/bin/scontrol show node euclides2" says "Reason=Epilog error"; when resumed, seems to work normally;
- March 28th 2017: euclides2 is in drain state; when resumed, it goes into drain state when using it the next time; "/usr/local/bin/scontrol show node euclides2" says "Reason=Prolog error"; after a reboot the machine was in status "idle*"; when resumed, it worked again;