Slurm » History » Version 18
Kerstin Paech, 09/27/2013 06:50 AM
h1. How to run jobs on the euclides nodes

Use slurm to submit jobs to the euclides nodes (node1-8); ssh login access to those nodes will be restricted in the near future.

*Please read through this entire wiki page so everyone can make efficient use of this cluster*

h2. alexandria

*Please do not use alexandria as a compute node* - its hardware is different from that of the nodes. It hosts our file server and other services that are important to us.

You should use alexandria to do the following (a typical workflow is sketched below):
* transfer files
* compile your code
* submit jobs to the nodes
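
A typical workflow might look like the following sketch (paths, file names and the build step are placeholders for whatever your own project uses):
<pre>
# on your own machine: copy code/data over to the cluster
scp mydata.tar.gz <yourusername>@alexandria:/path/to/your/workdir

# on alexandria: unpack, compile and submit
cd /path/to/your/workdir
tar xzf mydata.tar.gz
make
sbatch myjob.slurm
</pre>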

If you need to debug, please start an interactive job on one of the nodes using slurm. For instructions, see below.

h2. euclides nodes

Job submission to the euclides nodes is handled by the slurm jobmanager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).
*Important: In order to run jobs, you need to be added to the slurm accounting system - please contact Kerstin*

All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...).

If you are already familiar with another jobmanager, the following information may be helpful to you: http://slurm.schedmd.com/rosetta.pdf
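
For example, if you are coming from PBS/Torque, the most common commands map roughly like this (a rough guide only, see the rosetta document above for details):
<pre>
qsub myjob.pbs   ->  sbatch myjob.slurm   # submit a batch job
qstat            ->  squeue               # list queued/running jobs
qdel <jobid>     ->  scancel <jobid>      # cancel a job
qsub -I          ->  srun -u --pty bash   # interactive session
</pre>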

h3. Scheduling of Jobs

At this point there are two queues, called partitions in slurm (see the sinfo sketch below for how to check their current limits):
* *normal*, the default partition your jobs will be sent to if you do not specify otherwise. The time limit is currently two days, and a job can only run on 1 node.
* *debug*, which is meant for debugging. You can only run one job at a time; other jobs submitted will remain in the queue. The time limit is 12 hours.
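
To check the current limits and state of the partitions yourself, something like the following should work (the exact output depends on the slurm version):
<pre>
sinfo -o "%P %l %a %D %N"   # partition, time limit, availability, node count, node list
</pre>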

We have also set up a scheduler that goes beyond simple first come, first served: some jobs will be favoured over others depending on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many resources it will consume.

This serves as a starting point; we may have to adjust parameters once the slurm jobmanager is in regular use. Job scheduling is a complex issue and we still need to build expertise and gain experience with the needs of the users in our groups. Please feel free to speak up if there is something that can be improved without creating an unfair disadvantage for other users.
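
If you want to see how these factors affect your own jobs, slurm provides commands to inspect job priorities and fair-share usage; depending on how the scheduler is configured here, the following may be useful:
<pre>
sprio -l    # priority factors (age, fair-share, ...) of pending jobs
sshare      # fair-share usage of your account(s)
</pre>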

You can run interactive jobs on both partitions.

h3. Running an interactive job with slurm

To run an interactive job with slurm in the default partition, use

<pre>
srun -u --pty bash
</pre>

If you want to use tcsh instead, use

<pre>
srun -u --pty tcsh
</pre>

In case the 'normal' partition is overcrowded, you can use the 'debug' partition instead:
<pre>
srun --account cosmo_debug -p debug -u --pty bash  # if you are part of the Cosmology group
srun --account euclid_debug -p debug -u --pty bash # if you are part of the EuclidDM group
</pre>
As soon as a slot is open, slurm will log you in to an interactive session on one of the nodes.
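
If you need more than a single core interactively (e.g. for testing a threaded or MPI code), you can ask for more tasks in the same way as for batch jobs; a sketch (the partition limits still apply):
<pre>
srun -n 4 -u --pty bash   # interactive session with 4 tasks allocated
</pre>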

h3. Running a simple one-core batch job with slurm using the default partition

* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>

* To run a batch job, create a file myjob.slurm containing the following information:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal

/bin/hostname
</pre>

* To submit a batch job use:
<pre>
sbatch myjob.slurm
</pre>
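On submission, sbatch prints the id of the new job, which you will need for the commands below; the output looks something like:
<pre>
Submitted batch job 1234
</pre>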

* To see the status of your job, use
<pre>
squeue
</pre>
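By default squeue lists all jobs in the queue; to see only your own, you can filter by user, e.g.
<pre>
squeue -u <yourusername>
</pre>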

* To kill a job use:
<pre>
scancel <jobid>
</pre>
You can get the <jobid> from squeue.

* For some more information on your job, use
<pre>
scontrol show job <jobid>
</pre>
Again, the <jobid> is the one reported by squeue.
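
Once a job has finished it no longer shows up in squeue; since accounting is enabled on this cluster, you should be able to look up finished jobs with sacct, e.g.
<pre>
sacct -j <jobid>   # accounting information for a (possibly finished) job
</pre>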

h3. Running a simple one-core batch job with slurm using the debug partition

Change the partition to debug and add the appropriate account, depending on whether you're part of the euclid or cosmology group.

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p debug
#SBATCH --account=[cosmo_debug/euclid_debug]

/bin/hostname
</pre>
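
Alternatively, the partition and account can be given on the command line when submitting (command-line options take precedence over the #SBATCH lines in the script), for example:
<pre>
sbatch -p debug --account=cosmo_debug myjob.slurm
</pre>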


h3. Batch script for running a multi-core job

MPI is installed on alexandria.

To run a 4-core job for an executable compiled with MPI you can use
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -n 4

mpirun <programname>

</pre>
and mpirun will automatically start the program with the number of tasks (cores) specified by -n.

To ensure that the job is executed on only one node, add
<pre>
#SBATCH -N 1
</pre>
to the job script (note the capital -N, which sets the number of nodes, whereas lowercase -n sets the number of tasks).
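
Putting this together, a sketch of a 4-task MPI job restricted to a single node would look like:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
# -N 1: use a single node, -n 4: run 4 tasks (cores) on it
#SBATCH -N 1
#SBATCH -n 4

mpirun <programname>
</pre>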

To check if your job is actually running on the specified number of cores, you can check the PSR column of
<pre>
ps -eaFAl
# or ps -eaFAl | egrep "<yourusername>|UID" if you just want to see your own processes
</pre>