Version 72 (Martin Kuemmel, 05/18/2017 02:04 PM) → Version 73/136 (Martin Kuemmel, 05/18/2017 02:15 PM)
{{toc}}
h1. Hardware overview
You access the Euclid cluster through either cosmogw.kosmo.physik.uni-muenchen.de or cosmofs1.kosmo.physik.uni-muenchen.de
* cosmogw and cosmofs1 are gateway machines and should *not* be used for computing
* there are 11 compute nodes named euclides01--euclides11
* euclides01-05 are available via cosmofs1; euclides05 can only be used for debugging (see below)
* euclides06-11 are available via cosmogw
* each node has 32 logical CPUs and 64GB of RAM
h1. How to run jobs on the euclides nodes (using Slurm)
Use slurm to submit jobs or to log in to the euclides nodes (euclides01-11).
*Please read through this entire wikipage so everyone can make efficient use of this cluster*
h2. cosmogw/cosmofs1
*Please do not use cosmogw or cosmofs1 as a compute node* - their hardware is different from that of the nodes. They host our file server and other services that are important to us.
You should use cosmogw or cosmofs1 to
* transfer files
* compile your code
* submit jobs to the nodes via the slurm queues
If you need to debug and would like to log in to a node, please start an interactive job on one of the nodes using slurm. For instructions see below.
h2. euclides nodes
Job submission to the euclides nodes is handled by the slurm jobmanager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).
*Important: In order to run jobs, you need to be added to the slurm accounting system - please contact the admin*
All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...).
If you are already familiar with another jobmanager the following information may be helpful to you http://slurm.schedmd.com/rosetta.pdf.
h3. Scheduling of Jobs
At this point there are three queues, called partitions in slurm:
* on cosmofs1:
** *normal*, the default partition your jobs are sent to unless you specify otherwise. The time limit is currently two days, and jobs can only run on 1 node.
** *debug*, which is meant for debugging. You can only run one job at a time; other submitted jobs remain in the queue. The time limit is 12 hours.
* on cosmogw:
** *normal*, the default partition your jobs are sent to unless you specify otherwise. The time limit is currently two days, and jobs can only run on 1 node.
The default memory per core is 2GB. If you need more or less, please specify it with the --mem or --mem-per-cpu option.
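As a rough illustration of how a --mem-per-cpu request adds up, the sketch below multiplies a core count by a per-core memory value and compares the total with the 64GB per node given in the hardware overview. The variable names and example values are assumptions for illustration, and the memory slurm actually makes available on a node may be somewhat lower than the nominal 64GB. The script only prints the resulting srun line; it does not submit anything.

```shell
#!/bin/bash
# Sketch: estimate the total memory of a request before submitting it.
# All values below are example assumptions; adjust them to your job.
cores=4                      # number of cores you intend to request with -n
mem_per_cpu_mb=8000          # value you intend to pass to --mem-per-cpu (in MB)
node_mem_mb=$((64 * 1024))   # nominal 64GB per node, from the hardware overview

total_mb=$((cores * mem_per_cpu_mb))

if [ "$total_mb" -le "$node_mem_mb" ]; then
    verdict="fits"
else
    verdict="does not fit"
fi

echo "Request of ${total_mb} MB ${verdict} on a single node."
# Print (do not run) the corresponding interactive command.
echo "srun -n ${cores} --mem-per-cpu=${mem_per_cpu_mb} -u --pty bash"
```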
We have also set up a scheduler that goes beyond first come, first served: some jobs will be favoured over others depending on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many resources it will consume.
This serves as a starting point; we may have to adjust parameters once the slurm jobmanager is in use. Job scheduling is a complex issue and we still need to build expertise and gain experience of what the user needs in our groups are. Please feel free to speak out if there is something that can be improved without creating an unfair disadvantage for other users.
You can run interactive jobs on both partitions.
h3. Running an interactive job with slurm (a.k.a. logging in)
To run an interactive job with slurm in the default partition, use
<pre>
srun -u --pty bash
</pre>
If you want to use tcsh use
<pre>
srun -u --pty tcsh
</pre>
If you need more memory per core, use
<pre>
srun -u --mem-per-cpu=8000 --pty tcsh
</pre>
To open X11 applications, use the --x11=first option, e.g.
<pre>
srun --x11=first -u --pty bash
</pre>
If the 'normal' partition on cosmofs1 is overcrowded, you can use the 'debug' partition:
<pre>
srun --account cosmo_debug -p debug -u --pty bash # if you are part of the Cosmology group
srun --account euclid_debug -p debug -u --pty bash # if you are part of the EuclidDM group
</pre>
As soon as a slot is open, slurm will log you in to an interactive session on one of the nodes.
h3. limited ssh access
If you have an active job (batch or interactive), you can log in via ssh to the node the job is running on. Your ssh session will be killed when the job terminates, and it is restricted to the same resources as your job (so you cannot accidentally bypass the job scheduler and harm other users' jobs).
h3. Running a simple one-core batch job with slurm using the default partition
* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>
* To run a batch job, create a file myjob.slurm containing the following information:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal
/bin/hostname
</pre>
* To submit a batch job use:
<pre>
sbatch myjob.slurm
</pre>
* To see the status of your job, use
<pre>
squeue
</pre>
* To kill a job use:
<pre>
scancel <jobid>
</pre>
You can get the <jobid> from squeue.
* For some more information on your job use
<pre>
scontrol show job <jobid>
</pre>
You can get the <jobid> from squeue.
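The submit and inspect steps can also be chained in a small script. The sketch below assumes that sbatch confirms a submission with a line of the form "Submitted batch job <jobid>" and that a file myjob.slurm exists in the current directory; the helper name extract_jobid is made up for this example. The slurm commands are only called when sbatch is actually found on the machine.

```shell
#!/bin/bash
# Sketch: pull the job ID out of sbatch's confirmation line.
extract_jobid() {
    # sbatch prints e.g. "Submitted batch job 12345"; keep the last word.
    local line="$1"
    echo "${line##* }"
}

# Only talk to slurm when it is installed and the job script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f myjob.slurm ]; then
    jobid=$(extract_jobid "$(sbatch myjob.slurm)")
    # Show the job details only if a job ID was actually returned.
    if [ -n "$jobid" ]; then
        scontrol show job "$jobid"
    fi
fi
```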
h3. Running a simple one-core batch job with slurm using the debug partition
Change the partition to debug and add the appropriate account, depending on whether you are part of the euclid or the cosmology group.
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH --account [cosmo_debug/euclid_debug]
#SBATCH -p debug
/bin/hostname
</pre>
h3. Accessing a node where a job is running or starting additional processes on a node
You can attach an srun command to an already existing job (batch or interactive). This
means you can start an interactive session on a node where a job of yours is running
or start an additional process.
First determine the jobid of the desired job using squeue, then use
<pre>
srun --jobid <jobid> [options] <executable>
</pre>
Or, more concretely:
<pre>
srun --jobid <jobid> -u --pty bash # to start an interactive session
srun --jobid <jobid> ps -eaFAl # to get detailed process information
</pre>
The processes will only run on cores that have been allocated to you. This works
for batch as well as interactive jobs.
*Important: If the original job that was submitted is finished, any process
attached in this fashion will be killed.*
h3. Batch script for running a multi-core job
MPI is installed on cosmofs1.
To run a 4-core job for an executable compiled with MPI, you can use
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -n 4
mpirun <programname>
</pre>
and it will automatically start on the number of cores specified.
To ensure that the job is executed on only one node, add
<pre>
#SBATCH -N 1
</pre>
to the job script.
If you would like to run a program that itself starts processes, you can use the
environment variable $SLURM_NPROCS that is automatically defined for slurm
jobs to explicitly pass the number of cores the program can run on.
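A minimal sketch of such a job script is shown below. The program name my_program and its --workers flag are hypothetical placeholders, not something installed on the cluster. The fallback to 1 is an assumption added so that the script also behaves sensibly when started outside of slurm.

```shell
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH -n 4
# Sketch: hand the allocated core count to a program that starts its own workers.
# SLURM_NPROCS is set by slurm inside a job; fall back to 1 when run outside of one.
ncores="${SLURM_NPROCS:-1}"
echo "Starting with ${ncores} worker process(es)"
# Hypothetical program and flag, replace with your own:
# ./my_program --workers "${ncores}"
```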
To check whether your job is actually running on the specified number of cores, you can check
the PSR column of
<pre>
ps -eaFAl
# or ps -eaFAl | egrep "<yourusername>|UID" if you just want to see your jobs
</pre>
h3. environment for jobs
By default, slurm does not initialize the environment (using .bashrc, .profile, .tcshrc, ...).
To use your usual system environment, add the following line in the submission script:
<pre>
#SBATCH --get-user-env
</pre>
h2. desdb node
Some specific jobs in cosmodb, such as the "catalog ingest", need to be performed on the machines desdb1/2. For those jobs there is the slurm account "euclid_cat_ing" with the partition "cat_ing". Only selected persons from the Euclid group have access to this node. Please specify "-p cat_ing" and "--account euclid_cat_ing" on the command line or in the slurm script.
h2. Software specific setup
h3. Python environment
You can use the python 2.7.3 installed on the euclides cluster by using
<pre>
source /data2/users/ccsoft/etc/setup_all
source /data2/users/ccsoft/etc/setup_python2.7.3
</pre>
h2. Notes For Euclid users
For those submitting jobs to the euclides* nodes through the Cosmo DM pipeline, here are some settings that need to be specified for customized job submissions, since a different interface to slurm is used.
* To use more memory per block, specify max_memory = 6000 (for 6G) and so on inside the block definition, or in the submit file if you want it to apply to all blocks.
* To run on multiple nodes/cores, use nodes='<number of nodes>:ppn=<number of cores>' inside the block definition of a particular block, or in the submit file if you want it to apply to all blocks.
* To use a larger wall time, specify wall_mod=<wall time in minutes> inside the module definition.
* Note that queue=serial does not work on cosmofs1 (we usually use it for c2pap).
h1. Admin
There is a user "slurm", which is however not really necessary for the administration work. The slurm administrator needs sudo access. Some scripts for adding a user and similar tasks are in "/data1/users/slurm"; with sudo access the admin can execute them. In the mysql database there is a user "slurmdb" with a password.
h2. Slurm configuration
h3. Slurm configuration file
The currently valid versions of the configuration file are "/data1/users/slurm/slurm.conf" and "/data1/users/slurm/cosmo/slurm.conf" on cosmofs1 and cosmogw, respectively. To apply a modified slurm configuration, use the script "newconfig.sh".
The script
* copies the configuration file to the submit node and restarts the submit service;
* copies the configuration file to all computing nodes and triggers the reconfiguration there;
Then the slurm daemon needs to be started on the submit node and on all computing nodes with the script "restart.sh".
*Note:* Right now the slurmd daemons do not start properly on cosmogw. Even if the start is reported as failed, the slurmd daemon is there and working.
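The contents of "newconfig.sh" and "restart.sh" are not reproduced here. As a hedged sketch of how such a script might loop over the compute nodes, the example below generates the node names for one submit host and prints a dry-run plan. The helper name node_names and the zero-padded naming scheme (taken from the hardware overview) are assumptions, and nothing is copied or restarted.

```shell
#!/bin/bash
# Sketch: generate the node names for one submit host and print a dry-run plan.
node_names() {
    # $1 = first node number, $2 = last node number
    local i
    for i in $(seq "$1" "$2"); do
        printf 'euclides%02d\n' "$i"
    done
}

# euclides01-05 belong to cosmofs1, euclides06-11 to cosmogw (see hardware overview).
for node in $(node_names 1 5); do
    # Dry run only: the real copy and reconfigure steps live in newconfig.sh.
    echo "would update slurm.conf on ${node}"
done
```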
h2. User management
h3. Overview over users, accounts, etc.
No sudo access needed:
<pre>
/usr/local/bin/sacctmgr show account withassoc
</pre>
h3. Adding a new user
As root on @cosmofs1@,
<pre>
cd /data1/users/slurm/
./add_user.sh UserName account(cosmo or euclid)
/usr/local/bin/scontrol reconfigure
</pre>
h3. To increase memory, cores etc for a user
The add_user.sh script above contains various commands for changing user settings, e.g.
<pre>
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpCPUs=32
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpMem=128000
</pre>
h2. Troubleshooting
h3. Information on a particular node
The command "/usr/local/bin/scontrol show node <nodename>" gives detailed information on a particular node (status, reason for being down and so on).
h3. Node in state "drain"
When @sinfo@ reports a node in "drain" state, run
<pre>
/usr/local/bin/scontrol update nodename=NODE_NAME state=resume
</pre>
to put it back into operation.
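When several nodes are affected, the drained ones can be picked out of the sinfo output first. The sketch below assumes input lines of the form "<node> <state>", as produced by @sinfo -h -N -o '%N %T'@; the helper name drained_nodes and the canned example states are made up for illustration. It only prints the resume commands so they can be reviewed before being run.

```shell
#!/bin/bash
# Sketch: list nodes whose state starts with "drain" from "<node> <state>" lines.
drained_nodes() {
    awk '$2 ~ /^drain/ { print $1 }'
}

# Example with canned input (made-up node states):
printf 'euclides01 idle\neuclides02 drained\neuclides03 allocated\n' | drained_nodes

# On the cluster, feed it real data and print the resume commands for review.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -N -o '%N %T' | drained_nodes | sort -u | while read -r node; do
        echo "/usr/local/bin/scontrol update nodename=${node} state=resume"
    done
fi
```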
h3. Nodes down
Sometimes nodes are reported as "down". This seems to happen as a result of network problems. Here is some "troubleshooting":https://computing.llnl.gov/linux/slurm/troubleshoot.html#nodes for this situation. Also, after a reboot of cosmofs1 some manual work on slurm might be necessary to get it going again.
h2. History
* May 18th 2017: On cosmogw, three nodes were reported as "DOWN" despite running the slurmd daemon and having connections to the slurmctld daemon on the control node; it turns out that a normal "/etc/init.d/slurm start" on the control machine only considers nodes that are *not* DOWN; "/etc/init.d/slurm startclean" must be used to establish new connections to all nodes and take them back into the queue;
* May 2nd 2017: the control daemon on cosmofs1 was no longer working and could not be restarted; the corresponding commands "/etc/init.d/slurm status/start" gave no feedback of any kind and the log files were empty; the relevant daemon on the nodes, "slurmd", was running smoothly; a comparison revealed that the difference was whether the command "/usr/local/bin/scontrol show daemon" returns the daemon name or nothing, and in the latter case nothing happens and the daemon does not run properly; further investigation showed that the machine name given in "slurm.conf" as "ControlMachine=" needs to be identical to the name returned by the command "hostname"; this was no longer the case, probably as a result of moving the machines to the new sub-net (the exact mechanism is unclear);
* April 24th 2017: taking euclides11 out of the queues to free it for the new OS and the slurm test on it; euclides10 is now the development node;
* April 07th 2017: Applying "/usr/local/bin/scontrol show node euclides11" for the debug partition, euclides11 says "Reason=Node unexpectedly rebooted [root@2016-12-14T13:25:01]"; internet research suggested changing "ReturnToService=" from 1 to 2 in the configuration file; after applying the new configuration file and restarting, the debug node works again;
* April 06th 2017: After the reconfiguration of the cluster the slurm configuration file was adjusted (to reflect the new machine names); minor changes also had to be applied to the scripts "newconfig.sh" and "restart.sh" to loop over the new names; the new configuration files were applied and slurm restarted; all computing nodes for the normal partition came up, the debug partition stayed down;
* March 29th 2017: euclides7 is in drain state; "/usr/local/bin/scontrol show node euclides2" says "Reason=Epilog error"; when resumed, it seems to work normally;
* March 28th 2017: euclides2 is in drain state; when resumed, it goes into drain state when using it the next time; "/usr/local/bin/scontrol show node euclides2" says "Reason=Prolog error"; after a reboot the machine was in status "idle*"; when resumed, it worked again;