Slurm » History » Version 31
Shantanu Desai, 10/21/2013 11:37 AM
1 | 21 | Kerstin Paech | {{toc}} |
---|---|---|---|
2 | 21 | Kerstin Paech | |
3 | 1 | Kerstin Paech | h1. How to run jobs on the euclides nodes |
4 | 1 | Kerstin Paech | |
5 | 7 | Kerstin Paech | Use slurm to submit jobs to the euclides nodes (node1-8); ssh login access to those nodes will be restricted in the near future. |
6 | 1 | Kerstin Paech | |
7 | 9 | Kerstin Paech | *Please read through this entire wikipage so everyone can make efficient use of this cluster* |
8 | 9 | Kerstin Paech | |
9 | 1 | Kerstin Paech | h2. alexandria |
10 | 1 | Kerstin Paech | |
11 | 1 | Kerstin Paech | *Please do not use alexandria as a compute node* - its hardware is different from that of the nodes. It hosts our file server and other services that are important to us. |
12 | 1 | Kerstin Paech | |
13 | 1 | Kerstin Paech | You should use alexandria to |
14 | 1 | Kerstin Paech | - transfer files |
15 | 1 | Kerstin Paech | - compile your code |
16 | 1 | Kerstin Paech | - submit jobs to the nodes |
17 | 1 | Kerstin Paech | |
18 | 1 | Kerstin Paech | If you need to debug, please start an interactive job on one of the nodes using slurm. For instructions see below. |
19 | 1 | Kerstin Paech | |
20 | 1 | Kerstin Paech | h2. euclides nodes |
21 | 1 | Kerstin Paech | |
22 | 1 | Kerstin Paech | Job submission to the euclides nodes is handled by the slurm jobmanager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/). |
23 | 1 | Kerstin Paech | *Important: In order to run jobs, you need to be added to the slurm accounting system - please contact Kerstin* |
24 | 1 | Kerstin Paech | |
25 | 4 | Kerstin Paech | All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...). |
26 | 4 | Kerstin Paech | |
27 | 4 | Kerstin Paech | If you are already familiar with another jobmanager, the following information may be helpful: http://slurm.schedmd.com/rosetta.pdf |
28 | 1 | Kerstin Paech | |
29 | 1 | Kerstin Paech | h3. Scheduling of Jobs |
30 | 1 | Kerstin Paech | |
31 | 9 | Kerstin Paech | At this point there are two queues, called partitions in slurm: |
32 | 9 | Kerstin Paech | * *normal*, the default partition your jobs are sent to unless you specify otherwise. There is currently a time limit of |
33 | 9 | Kerstin Paech | two days. Jobs can currently only run on 1 node. |
34 | 16 | Kerstin Paech | * *debug*, which is meant for debugging. You can only run one job at a time; any other jobs you submit will remain in the queue. The time limit is |
35 | 16 | Kerstin Paech | 12 hours. |
36 | 1 | Kerstin Paech | |
37 | 9 | Kerstin Paech | We have also set up a scheduler that goes beyond first come, first served: some jobs will be favoured over others depending |
38 | 9 | Kerstin Paech | on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many |
39 | 9 | Kerstin Paech | resources it will consume. |
40 | 9 | Kerstin Paech | |
41 | 9 | Kerstin Paech | This serves as a starting point; we may have to adjust parameters once the slurm jobmanager is in use. Job scheduling is a complex |
42 | 9 | Kerstin Paech | issue and we still need to build expertise and learn what the users in our groups need. Please feel free to speak up if |
43 | 9 | Kerstin Paech | there is something that can be improved without creating an unfair disadvantage for other users. |
44 | 9 | Kerstin Paech | |
45 | 9 | Kerstin Paech | You can run interactive jobs on both partitions. |
46 | 9 | Kerstin Paech | |
47 | 1 | Kerstin Paech | h3. Running an interactive job with slurm |
48 | 1 | Kerstin Paech | |
49 | 9 | Kerstin Paech | To run an interactive job with slurm in the default partition, use |
50 | 1 | Kerstin Paech | |
51 | 1 | Kerstin Paech | <pre> |
52 | 14 | Kerstin Paech | srun -u --pty bash |
53 | 1 | Kerstin Paech | </pre> |
54 | 9 | Kerstin Paech | |
55 | 15 | Shantanu Desai | If you want to use tcsh, use |
56 | 15 | Shantanu Desai | |
57 | 15 | Shantanu Desai | <pre> |
58 | 15 | Shantanu Desai | srun -u --pty tcsh |
59 | 15 | Shantanu Desai | </pre> |
60 | 15 | Shantanu Desai | |
61 | 30 | Shantanu Desai | If you need more memory for your job, request it per core with --mem-per-cpu (in MB), e.g. |
62 | 30 | Shantanu Desai | |
63 | 30 | Shantanu Desai | <pre> |
64 | 31 | Shantanu Desai | srun -u --mem-per-cpu=8000 --pty tcsh |
65 | 30 | Shantanu Desai | </pre> |
66 | 30 | Shantanu Desai | |
67 | 20 | Kerstin Paech | In case you want to open X11 applications, use the --x11=first option, e.g. |
68 | 20 | Kerstin Paech | <pre> |
69 | 20 | Kerstin Paech | srun --x11=first -u --pty bash |
70 | 20 | Kerstin Paech | </pre> |
71 | 20 | Kerstin Paech | |
72 | 9 | Kerstin Paech | If the 'normal' partition is overcrowded, you can use the 'debug' partition instead: |
73 | 9 | Kerstin Paech | <pre> |
74 | 14 | Kerstin Paech | srun --account cosmo_debug -p debug -u --pty bash # if you are part of the Cosmology group |
75 | 14 | Kerstin Paech | srun --account euclid_debug -p debug -u --pty bash # if you are part of the EuclidDM group |
76 | 12 | Kerstin Paech | </pre> As soon as a slot is open, slurm will log you in to an interactive session on one of the nodes. |
77 | 1 | Kerstin Paech | |
78 | 10 | Kerstin Paech | h3. Running a simple one core batch job with slurm using the default partition |
79 | 1 | Kerstin Paech | |
80 | 1 | Kerstin Paech | * To see what queues are available to you (called partitions in slurm), run: |
81 | 1 | Kerstin Paech | <pre> |
82 | 1 | Kerstin Paech | sinfo |
83 | 1 | Kerstin Paech | </pre> |
84 | 1 | Kerstin Paech | |
85 | 1 | Kerstin Paech | * To run a batch job, create a file myjob.slurm containing the following information: |
86 | 1 | Kerstin Paech | <pre> |
87 | 1 | Kerstin Paech | #!/bin/bash |
88 | 1 | Kerstin Paech | #SBATCH --output=slurm.out |
89 | 1 | Kerstin Paech | #SBATCH --error=slurm.err |
90 | 1 | Kerstin Paech | #SBATCH --mail-user <put your email address here> |
91 | 1 | Kerstin Paech | #SBATCH --mail-type=BEGIN |
92 | 8 | Kerstin Paech | #SBATCH -p normal |
93 | 1 | Kerstin Paech | |
94 | 1 | Kerstin Paech | /bin/hostname |
95 | 1 | Kerstin Paech | </pre> |
96 | 1 | Kerstin Paech | |
97 | 1 | Kerstin Paech | * To submit a batch job use: |
98 | 1 | Kerstin Paech | <pre> |
99 | 1 | Kerstin Paech | sbatch myjob.slurm |
100 | 1 | Kerstin Paech | </pre> |
101 | 1 | Kerstin Paech | |
102 | 1 | Kerstin Paech | * To see the status of your job, use |
103 | 1 | Kerstin Paech | <pre> |
104 | 1 | Kerstin Paech | squeue |
105 | 1 | Kerstin Paech | </pre> |
106 | 1 | Kerstin Paech | |
107 | 11 | Kerstin Paech | * To kill a job use: |
108 | 11 | Kerstin Paech | <pre> |
109 | 11 | Kerstin Paech | scancel <jobid> |
110 | 11 | Kerstin Paech | </pre> You can get the <jobid> from squeue. |
111 | 11 | Kerstin Paech | |
112 | 1 | Kerstin Paech | * For some more information on your job use |
113 | 1 | Kerstin Paech | <pre> |
114 | 1 | Kerstin Paech | scontrol show job <jobid> |
115 | 11 | Kerstin Paech | </pre> You can get the <jobid> from squeue. |
116 | 1 | Kerstin Paech | |
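The submit, monitor and cancel steps above can be chained in a small script. The following is a minimal sketch, assuming sbatch prints its usual confirmation line of the form "Submitted batch job <jobid>"; the helper name extract_jobid is hypothetical and not part of slurm.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job id out of sbatch's confirmation
# line, which is assumed to read "Submitted batch job <jobid>".
extract_jobid() {
    # Print the last field only if the line starts with "Submitted"
    # and the last field consists of digits.
    printf '%s\n' "$1" | awk '$1 == "Submitted" && $NF ~ /^[0-9]+$/ { print $NF }'
}

# Usage on the cluster (commented out here because it needs slurm):
#   jobid=$(extract_jobid "$(sbatch myjob.slurm)")
#   squeue -j "$jobid"             # status of just this job
#   scontrol show job "$jobid"     # more information on the job
#   scancel "$jobid"               # kill the job if necessary
```

If sbatch reports an error instead of a job id, extract_jobid prints nothing, so a script can check for an empty result before calling squeue or scancel.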
117 | 10 | Kerstin Paech | h3. Running a simple one core batch job with slurm using the debug partition |
118 | 10 | Kerstin Paech | |
119 | 10 | Kerstin Paech | Change the partition to debug and add the appropriate account, depending on whether you are part of |
120 | 10 | Kerstin Paech | the euclid or cosmology group. |
121 | 10 | Kerstin Paech | |
122 | 10 | Kerstin Paech | <pre> |
123 | 10 | Kerstin Paech | #!/bin/bash |
124 | 10 | Kerstin Paech | #SBATCH --output=slurm.out |
125 | 10 | Kerstin Paech | #SBATCH --error=slurm.err |
126 | 10 | Kerstin Paech | #SBATCH --mail-user <put your email address here> |
127 | 10 | Kerstin Paech | #SBATCH --mail-type=BEGIN |
128 | 10 | Kerstin Paech | #SBATCH -p debug |
129 | 10 | Kerstin Paech | #SBATCH --account <cosmo_debug or euclid_debug> |
130 | 10 | Kerstin Paech | |
131 | 10 | Kerstin Paech | /bin/hostname |
132 | 10 | Kerstin Paech | </pre> |
133 | 10 | Kerstin Paech | |
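If you switch between groups or generate job scripts automatically, the choice of account can be scripted. This is only a sketch: the function names debug_account and make_debug_job and the group keywords "cosmology" and "euclid" are hypothetical, while the account names are the ones listed above.

```shell
#!/bin/bash
# Hypothetical helper: map a group keyword to the debug account named above.
debug_account() {
    case "$1" in
        cosmology) echo "cosmo_debug" ;;
        euclid)    echo "euclid_debug" ;;
        *)         echo "unknown group: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: write a debug-partition job script to stdout.
make_debug_job() {
    account=$(debug_account "$1") || return 1
    cat <<EOF
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH -p debug
#SBATCH --account=$account

/bin/hostname
EOF
}
```

For example, make_debug_job cosmology > myjob_debug.slurm writes a script that can then be submitted with sbatch myjob_debug.slurm.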
134 | 22 | Kerstin Paech | h3. Accessing a node where a job is running or starting additional processes on a node |
135 | 22 | Kerstin Paech | |
136 | 25 | Kerstin Paech | You can attach an srun command to an already existing job (batch or interactive). This |
137 | 22 | Kerstin Paech | means you can start an interactive session on a node where a job of yours is running |
138 | 26 | Kerstin Paech | or start an additional process. |
139 | 22 | Kerstin Paech | |
140 | 22 | Kerstin Paech | First determine the jobid of the desired job using squeue, then use |
141 | 22 | Kerstin Paech | |
142 | 22 | Kerstin Paech | <pre> |
143 | 22 | Kerstin Paech | srun --jobid <jobid> [options] <executable> |
144 | 22 | Kerstin Paech | </pre> |
145 | 22 | Kerstin Paech | Or, more concretely: |
146 | 22 | Kerstin Paech | <pre> |
147 | 22 | Kerstin Paech | srun --jobid <jobid> -u --pty bash # to start an interactive session |
148 | 22 | Kerstin Paech | srun --jobid <jobid> ps -eaFAl # to get detailed process information |
149 | 22 | Kerstin Paech | </pre> |
150 | 22 | Kerstin Paech | |
151 | 24 | Kerstin Paech | The processes will only run on cores that have been allocated to you. This works |
152 | 24 | Kerstin Paech | for batch as well as interactive jobs. |
153 | 23 | Kerstin Paech | *Important: If the original job that was submitted is finished, any process |
154 | 23 | Kerstin Paech | attached in this fashion will be killed.* |
155 | 22 | Kerstin Paech | |
156 | 10 | Kerstin Paech | |
157 | 6 | Kerstin Paech | h3. Batch script for running a multi-core job |
158 | 6 | Kerstin Paech | |
159 | 17 | Kerstin Paech | mpi is installed on alexandria. |
160 | 17 | Kerstin Paech | |
161 | 18 | Kerstin Paech | To run a 4 core job for an executable compiled with mpi you can use |
162 | 6 | Kerstin Paech | <pre> |
163 | 6 | Kerstin Paech | #!/bin/bash |
164 | 6 | Kerstin Paech | #SBATCH --output=slurm.out |
165 | 6 | Kerstin Paech | #SBATCH --error=slurm.err |
166 | 6 | Kerstin Paech | #SBATCH --mail-user <put your email address here> |
167 | 6 | Kerstin Paech | #SBATCH --mail-type=BEGIN |
168 | 6 | Kerstin Paech | #SBATCH -n 4 |
169 | 1 | Kerstin Paech | |
170 | 18 | Kerstin Paech | mpirun <programname> |
171 | 1 | Kerstin Paech | |
172 | 1 | Kerstin Paech | </pre> |
173 | 18 | Kerstin Paech | and it will automatically start on the number of cores specified. |
174 | 1 | Kerstin Paech | |
175 | 18 | Kerstin Paech | To ensure that the job is being executed on only one node, add |
176 | 18 | Kerstin Paech | <pre> |
177 | 18 | Kerstin Paech | #SBATCH -N 1 |
178 | 18 | Kerstin Paech | </pre> |
179 | 18 | Kerstin Paech | to the job script. |
180 | 17 | Kerstin Paech | |
181 | 19 | Kerstin Paech | If you would like to run a program that itself starts processes, you can use the |
182 | 19 | Kerstin Paech | environment variable $SLURM_NPROCS that is automatically defined for slurm |
183 | 19 | Kerstin Paech | jobs to explicitly pass the number of cores the program can run on. |
184 | 19 | Kerstin Paech | |
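As an illustration of passing $SLURM_NPROCS on to a program, here is a minimal sketch of a job script. The program name ./myprogram and its --nproc option are hypothetical placeholders, and the fallback to 1 core outside of slurm is an assumption made so the script can be tried by hand.

```shell
#!/bin/bash
# Request 4 cores (-n), all on one node (-N).
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH -n 4
#SBATCH -N 1

# SLURM_NPROCS is defined by slurm inside a job. Fall back to 1 so the
# script can also be tried outside of slurm.
core_count() {
    echo "${SLURM_NPROCS:-1}"
}

# ./myprogram and --nproc are hypothetical: substitute your own program
# and whatever option it uses to set its number of worker processes.
echo "would run: ./myprogram --nproc $(core_count)"
```

Replace the echo line with the actual call once the option name of your program is known; the point is only that the core count comes from the job allocation rather than being hard-coded.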
185 | 17 | Kerstin Paech | To check if your job is actually running on the specified number of cores, you can check |
186 | 17 | Kerstin Paech | the PSR column of |
187 | 17 | Kerstin Paech | <pre> |
188 | 17 | Kerstin Paech | ps -eaFAl |
189 | 17 | Kerstin Paech | # or ps -eaFAl | egrep "<yourusername>|UID" if you just want to see your jobs |
190 | 6 | Kerstin Paech | </pre> |
191 | 27 | Jiayi Liu | |
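Reading the PSR column by eye gets tedious for jobs with many processes. The sketch below summarises it instead. It assumes the ps on the nodes is the usual Linux procps version, which accepts -o pid,psr,comm; the function name distinct_cores is hypothetical. Note that PSR is a snapshot of the core a process was last assigned to, so it is worth repeating the check a few times.

```shell
#!/bin/bash
# Hypothetical helper: read "PID PSR COMMAND" lines (with a header line)
# on stdin and print the distinct core numbers on one line.
distinct_cores() {
    awk 'NR > 1 { print $2 }' | sort -n -u | paste -s -d ' ' -
}

# Usage on a node (commented out because it depends on the local ps):
#   ps -u "$USER" -o pid,psr,comm | distinct_cores
```

For a 4 core job you would expect to see up to four different core numbers in the output.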
192 | 28 | Kerstin Paech | h3. environment for jobs |
193 | 27 | Jiayi Liu | |
194 | 29 | Kerstin Paech | By default, slurm does not initialize the environment (using .bashrc, .profile, .tcshrc, ...) |
195 | 29 | Kerstin Paech | |
196 | 28 | Kerstin Paech | To use your usual system environment, add the following line in the submission script: |
197 | 27 | Jiayi Liu | <pre> |
198 | 27 | Jiayi Liu | #SBATCH --get-user-env |
199 | 1 | Kerstin Paech | </pre> |
200 | 1 | Kerstin Paech | |
201 | 28 | Kerstin Paech | |
202 | 28 | Kerstin Paech | h2. Software specific setup |
203 | 28 | Kerstin Paech | |
204 | 28 | Kerstin Paech | h3. Python environment |
205 | 28 | Kerstin Paech | |
206 | 28 | Kerstin Paech | You can use the python 2.7.3 installed on the euclides cluster by using |
207 | 27 | Jiayi Liu | |
208 | 27 | Jiayi Liu | <pre> |
209 | 27 | Jiayi Liu | source /data2/users/ccsoft/etc/setup_all |
210 | 27 | Jiayi Liu | source /data2/users/ccsoft/etc/setup_bin |
211 | 27 | Jiayi Liu | source /data2/users/ccsoft/etc/setup_pybrew |
212 | 27 | Jiayi Liu | source /data2/users/ccsoft/etc/setup_python2.7.3 |
213 | 27 | Jiayi Liu | </pre>