OMP thread placement on KNL SNC4

Mike_B_2 · ‎02-02-2017

We're seeing an unexpected (Intel) OMP thread placement in SNC4 mode.

The thread placement is different on each KNL node. While that by itself may be OK, we're seeing entire numa nodes not being used, and some numa nodes getting double or triple the number of threads.

A nominal aprun launch is 8 OMP threads under a single MPI rank on a KNL node, in SNC4 mode. All of the nodes tested result in different core placements based on process id. Using the results of numactl -H (which lists what I'm calling process ids), it appears numa nodes are being skipped. If we add KMP_AFFINITY verbose, we can see the process id to core id mapping, and we consistently observe that the threads are placed in order of sequential core id. However, since the core id to process id mapping is seemingly random, we end up with threads being clumped onto some numa nodes, leaving other numa nodes entirely empty.

How do we ensure threads are distributed across all numa regions? We would also like control that keeps the thread assignment sequential within a numa node (close?), while still using all numa nodes, say 2 threads per numa node. The default thread assignment, where some numa nodes are left empty, is not what we expect in SNC4 mode.

It seems either the OMP runtime is placing the threads incorrectly, or the output of numactl is incorrect wrt which process ids are in the numa nodes.

thanks,

Mike

Lawrence_M_Intel · ‎02-03-2017

Can you please provide the output of numactl -H and the exact MPI command you are using, as well as the version of MPI and Intel Compilers?

Mike_B_2 · ‎02-03-2017

Hey Larry, sure.

Intel 17.0.1, MPICH 7.4.3 (Cray)

Below is the output from 2 nodes. We first noted the difference in thread placement on cores, and then realized in some cases numa nodes were being skipped. The KMP_AFFINITY verbose showed that thread assignment was being made by sequential order of the next core id, I think as expected. However, due to the core-id to proc-id mapping it seems like the proc ids are not ending up in all the numa nodes.

2 nodes 193, 195, both snc4/flat

OMP_NUM_THREADS=8

193 uses all 4 numa nodes, 195 skips numa node 2, and wraps back to numa node 0

I've included the numactl view of the 2 nodes, they look the same to me.

tt-login2 1028% aprun -n 1 -d 8 -j 1 -cc none -L 193 ./xthi_knl.intel | sort -n -k 5
Hello from rank 0, thread 0, on nid00193. (proc affinity = 0)
Hello from rank 0, thread 1, on nid00193. (proc affinity = 1)
Hello from rank 0, thread 2, on nid00193. (proc affinity = 18)
Hello from rank 0, thread 3, on nid00193. (proc affinity = 19)
Hello from rank 0, thread 4, on nid00193. (proc affinity = 36)
Hello from rank 0, thread 5, on nid00193. (proc affinity = 37)
Hello from rank 0, thread 6, on nid00193. (proc affinity = 52)
Hello from rank 0, thread 7, on nid00193. (proc affinity = 53)

tt-login2 1030% aprun -n 1 -d 8 -j 1 -cc none -L 195 ./xthi_knl.intel | sort -n -k 5
Hello from rank 0, thread 0, on nid00195. (proc affinity = 0)
Hello from rank 0, thread 1, on nid00195. (proc affinity = 1)
Hello from rank 0, thread 2, on nid00195. (proc affinity = 18)
Hello from rank 0, thread 3, on nid00195. (proc affinity = 19)
Hello from rank 0, thread 4, on nid00195. (proc affinity = 52)
Hello from rank 0, thread 5, on nid00195. (proc affinity = 53)
Hello from rank 0, thread 6, on nid00195. (proc affinity = 2)
Hello from rank 0, thread 7, on nid00195. (proc affinity = 3)

tt-login2 1029% aprun -n 1 -L 193 numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
node 0 size: 24059 MB
node 0 free: 22796 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 1 size: 24232 MB
node 1 free: 23645 MB
node 2 cpus: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 2 size: 24233 MB
node 2 free: 23969 MB
node 3 cpus: 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 3 size: 24233 MB
node 3 free: 23966 MB
node 4 cpus:
node 4 size: 4039 MB
node 4 free: 4023 MB
node 5 cpus:
node 5 size: 4039 MB
node 5 free: 4023 MB
node 6 cpus:
node 6 size: 4039 MB
node 6 free: 4023 MB
node 7 cpus:
node 7 size: 4037 MB
node 7 free: 4020 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 21 21 21 31 41 41 41
1: 21 10 21 21 41 31 41 41
2: 21 21 10 21 41 41 31 41
3: 21 21 21 10 41 41 41 31
4: 31 41 41 41 10 41 41 41
5: 41 31 41 41 41 10 41 41
6: 41 41 31 41 41 41 10 41
7: 41 41 41 31 41 41 41 10

tt-login2 1031% aprun -n 1 -L 195 numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
node 0 size: 24059 MB
node 0 free: 22724 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 1 size: 24232 MB
node 1 free: 23639 MB
node 2 cpus: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 2 size: 24233 MB
node 2 free: 23931 MB
node 3 cpus: 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 3 size: 24233 MB
node 3 free: 23932 MB
node 4 cpus:
node 4 size: 4039 MB
node 4 free: 4023 MB
node 5 cpus:
node 5 size: 4039 MB
node 5 free: 4023 MB
node 6 cpus:
node 6 size: 4039 MB
node 6 free: 4023 MB
node 7 cpus:
node 7 size: 4037 MB
node 7 free: 4020 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 21 21 21 31 41 41 41
1: 21 10 21 21 41 31 41 41
2: 21 21 10 21 41 41 31 41
3: 21 21 21 10 41 41 41 31
4: 31 41 41 41 10 41 41 41
5: 41 31 41 41 41 10 41 41
6: 41 41 31 41 41 41 10 41
7: 41 41 41 31 41 41 41 10
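
The comparison between the two nodes can be reproduced with a short script. The Python sketch below is not part of the original test; it assumes the `numactl -H` text has been captured as a string, and the names `parse_numactl_cpus`, `threads_per_node` and `cpu_line` are hypothetical helpers. To keep it short, the CPU lists are rebuilt from the ranges visible in the output above rather than pasted in full.

```python
import re
from collections import Counter


def parse_numactl_cpus(text):
    """Return {cpu_id: numa_node} from the 'node N cpus:' lines of numactl -H."""
    cpu_to_node = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) cpus:(.*)", line.strip())
        if m:
            node = int(m.group(1))
            for cpu in m.group(2).split():
                cpu_to_node[int(cpu)] = node
    return cpu_to_node


def threads_per_node(cpu_to_node, affinities):
    """Count threads per CPU-bearing NUMA node, keeping empty nodes as 0."""
    counts = Counter({node: 0 for node in set(cpu_to_node.values())})
    counts.update(cpu_to_node[cpu] for cpu in affinities)
    return dict(sorted(counts.items()))


def cpu_line(node, starts, width):
    """Rebuild one 'node N cpus:' line from the block starts seen in the output."""
    cpus = [c for s in starts for c in range(s, s + width)]
    return f"node {node} cpus: " + " ".join(map(str, cpus))


# CPU lists as posted for nid00193 and nid00195 (identical on both nodes).
NUMACTL_TEXT = "\n".join([
    "available: 8 nodes (0-7)",
    cpu_line(0, (0, 68, 136, 204), 18),
    cpu_line(1, (18, 86, 154, 222), 18),
    cpu_line(2, (36, 104, 172, 240), 16),
    cpu_line(3, (52, 120, 188, 256), 16),
    "node 4 cpus:",  # MCDRAM-only nodes 4-7 list no CPUs
])

cpu_to_node = parse_numactl_cpus(NUMACTL_TEXT)

# "proc affinity" values reported by xthi on each node, in thread order.
runs = {
    "nid00193": [0, 1, 18, 19, 36, 37, 52, 53],
    "nid00195": [0, 1, 18, 19, 52, 53, 2, 3],
}
for nid, affinities in runs.items():
    print(nid, threads_per_node(cpu_to_node, affinities))
# nid00193 {0: 2, 1: 2, 2: 2, 3: 2}
# nid00195 {0: 4, 1: 2, 2: 0, 3: 2}
```

The tally makes the imbalance on nid00195 explicit: numa node 2 receives no threads and numa node 0 receives four, even though both nodes report the same CPU lists.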

SergeyKostrov · ‎02-03-2017

Could you also verify current Cluster and MCDRAM modes by executing hwloc-dump-hwdata?

Mike_B_2 · ‎02-03-2017

The attached spreadsheet shows the results from running the above command on 3 different SNC4/flat nodes.

Columns are node id, process id, core id (that is mapped to process id, as identified by KMP_AFFINITY verbose), hyper thread id (id'd by verbose, always 0 in these cases), and numa node (as id'd by numactl -H for that process id). Highlights are the threads used by these runs. The rest just includes the complete output of verbose with numa nodes added by process id. This view is sorted by core id, which shows why the threads were being assigned in this order. Originally each node id set was sorted by process id, which showed assignment by process id; re-sorting it as it is now helped us understand what was happening with the thread to core id assignment.

SergeyKostrov · ‎02-03-2017

One more question: Did you set OMP_NESTED and OMP_NUM_THREADS environment variables? Something like, ... OMP_NESTED=1 OMP_NUM_THREADS=4,64 ...

Mike_B_2 · ‎02-03-2017

Still waiting on a node allocation to try hwloc-dump-hwdata.

OMP_NUM_THREADS was definitely set, but just to 8. Do we also need the 4 to drive the numa nodes correctly?

I did not have OMP_NESTED set. (but can try that)

Is that required in this situation, even if we're not nesting threads?

SergeyKostrov · ‎02-03-2017

>>...Is that required in this situation, even if we're not nesting threads?

I wanted to get that information. My next question is whether you tried to use the OMP_PLACES and I_MPI_DEBUG (set to level 5: I_MPI_DEBUG=5) environment variables.

Mike_B_2 · ‎02-03-2017

yes, we tried both cores and threads with OMP_PLACES, no joy

I have not tried I_MPI_DEBUG; I'm assuming that's an Intel MPI variable. Since we're using Cray MPICH, I'm assuming that MPI_DEBUG=5 would be OK. I can give that a shot once I can get onto some KNL SNC4 nodes.

jimdempseyatthecove · ‎02-03-2017

>>yes, we tried both cores and threads with OMP_PLACES, no joy

Did you set different places for each of the different ranks?

Jim Dempsey

Mike_B_2 · ‎02-03-2017

Jim, no, but there is only one MPI rank in this problem.

Do you mean a separate OMP_PLACES entry for each of the numa nodes? What would that look like?

OMP_PLACES=cores,cores,cores,cores

Mike_B_2 · ‎02-03-2017

Sergey, I don't seem to have a hwloc-dump* on either the front-end or on the KNL compute node. I see several other hwloc- variants, but no hwloc-dump

Is there some other way to get this info or run hwloc?

jimdempseyatthecove · ‎02-03-2017

>>but there is only one MPI rank in this problem

If you are only running 1 rank, then why run the application as an MPI application???

If running MPI with 1 rank .OR. non-MPI, then set the environment variable prior to issuing the command line (either to directly run the program or issuing mpiexec/mpirun). See: https://software.intel.com/en-us/node/522691#AFFINITY_TYPES
Note, when the KNL is configured as SNC4 or SNC2, although you have only 1 package, interpret the link documentation as if you have as many packages as you have nodes. For 0 or 1 ranks, consider using KMP_AFFINITY=scatter (then the 8 or whatever number of threads would be distributed equally).

======================

If you are running multiple ranks see: https://software.intel.com/en-us/node/528776 to see how you can use the ":" to specify different environment variables for each rank. IOW when configured as SNC4 and running 4 ranks, use different placement environment variables (on each side of the :'s)

Use the second long command-line syntax to set different argument sets for different MPI program runs. For example, the following command executes two different binaries with different argument sets:

$ mpiexec.hydra -f <hostfile> -env <VAR1> <VAL1> -n 2 ./a.out : \
-env <VAR2> <VAL2> -n 2 ./b.out

Note, in your case, you would run 1 process/rank on each node, same program name, but different -env environment variable settings.

Jim Dempsey

Mike_B_2 · ‎02-03-2017

The affinity test code is an MPI+OMP program. We just use one rank when tests are focused on OMP thread placement. BTW, all of this testing is only run on one node at a time.

tt-login1 1036% setenv KMP_AFFINITY scatter
tt-login1 1037% aprun -n 1 -d 8 -j 1 -cc none -L 274 ./xthi_knl.intel
Hello from rank 0, thread 0, on nid00274. (core affinity = 0)
Hello from rank 0, thread 1, on nid00274. (core affinity = 1)
Hello from rank 0, thread 4, on nid00274. (core affinity = 36)
Hello from rank 0, thread 5, on nid00274. (core affinity = 37)
Hello from rank 0, thread 2, on nid00274. (core affinity = 18)
Hello from rank 0, thread 3, on nid00274. (core affinity = 19)
Hello from rank 0, thread 7, on nid00274. (core affinity = 21)
Hello from rank 0, thread 6, on nid00274. (core affinity = 20)

No effect from using scatter. In this run, 4 threads are placed on numa node 1, and no threads ended up on numa node 3.

Lawrence_M_Intel · ‎02-05-2017

Oh, this is Cray.

See if the attached affinity.docx helps.

aprun -n 1 -cc depth -j 4 -d 8

Will give you 1 rank with 2 cores (8 threads). Then set KMP_HW_SUBSET=1T to get 1 thread per core.

aprun -n 4 -N 1 -cc depth -j 4 -d 8 -S 4

should give you 4 ranks with 2 cores each, 1 rank per numa node

I did a little testing with aprun in snc4, but it takes forever to get a node on theta, so I didn't do much.

Andrey_C_Intel1 · ‎02-06-2017

Mike,

The OpenMP runtime in compiler 17 update 1 (the version you are using) is unfortunately not aware of the NUMA nodes, so it tries to use cores in a linear manner, as you mentioned. More functionality will be available in upcoming versions of the Intel compiler, starting with 17 update 2, to be released soon.

I don't currently see a general solution to your problem from the OpenMP runtime side, besides explicit affinity binding, which has to be different on different systems and thus does not look like a viable solution. As a workaround for the three particular systems you mentioned in the attached .xls table, it is possible to use KMP_HW_SUBSET=32c@18, asking the library to skip the first 18 cores until core numbering becomes regular for the next 32 cores on all three systems. But this workaround may not work on other systems with yet different core numbering.

So my only hope for now is that Larry's suggestion will work for you. In future compiler releases it will be possible to specify the nodes and tiles to be used by the runtime, first via the KMP_HW_SUBSET environment variable.

Regards,
Andrey

Mike_B_2 · ‎02-06-2017

Larry, I looked at sections 3.2 and 6. I tried the example for aprun in section 6. I'm still getting empty numa nodes--no threads assigned.

OMP_NUM_THREADS=8
OMP_PROC_BIND=spread

tt-login1 1027% aprun -n 1 -cc none -j 4 -d 32 -L 276 ./xthi_knl.intel | sort -n -k 4
Hello from rank 0, thread 0, on nid00276. (core affinity = 0)    numa node 0
Hello from rank 0, thread 1, on nid00276. (core affinity = 156)  numa node 1
Hello from rank 0, thread 2, on nid00276. (core affinity = 57)   numa node 3
Hello from rank 0, thread 3, on nid00276. (core affinity = 195)  numa node 3
Hello from rank 0, thread 4, on nid00276. (core affinity = 8)    numa node 0
Hello from rank 0, thread 5, on nid00276. (core affinity = 146)  numa node 0
Hello from rank 0, thread 6, on nid00276. (core affinity = 31)   numa node 1
Hello from rank 0, thread 7, on nid00276. (core affinity = 169)  numa node 1

I'm focused on a case with a single MPI rank, 8 OMP threads, one thread per core, single KNL node, running in SNC4 mode. Attempting to get even distribution of threads in all 4 numa regions/nodes.

Andrey, do you still want the output from the verbose and KMP_SETTINGS=1?

Your latest comment indicates that the OMP runtime in 17.0.1 is not aware of numa nodes. We are seeking a generic solution that will work on any KNL node. As you point out, the core mapping can be different on each KNL, so a hardcoded core list isn't appropriate--although it would of course allow us to run on a specific node.

If either of you had other tests that you think might work, I'm happy to run those.

Mike

SergeyKostrov · ‎02-06-2017

>>...I'm focused on a case with a single MPI rank, 8 OMP threads, one thread per core, single KNL node, running in SNC4 mode.
>>Attempting to get even distribution of threads in all 4 numa regions/nodes.

Mike, is that what you want to achieve?

1. KNL Server is set to modes: Cluster=SNC4 / MCDRAM=Hybrid50-50

2.

[guest@c002-n002 WorkTest]$ mpiicpc -O3 -xMIC-AVX512 -qopenmp test12.c -o test12.out
[guest@c002-n002 WorkTest]$
[guest@c002-n002 WorkTest]$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 0 size: 24450 MB
node 0 free: 23343 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 1 size: 24576 MB
node 1 free: 23879 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 2 size: 24576 MB
node 2 free: 23856 MB
node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 3 size: 24576 MB
node 3 free: 23785 MB
node 4 cpus:
node 4 size: 2048 MB
node 4 free: 1975 MB
node 5 cpus:
node 5 size: 2048 MB
node 5 free: 1973 MB
node 6 cpus:
node 6 size: 2048 MB
node 6 free: 1975 MB
node 7 cpus:
node 7 size: 2048 MB
node 7 free: 1972 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 21 21 21 31 41 41 41
1: 21 10 21 21 41 31 41 41
2: 21 21 10 21 41 41 41 31
3: 21 21 21 10 41 41 31 41
4: 31 41 41 41 10 41 41 41
5: 41 31 41 41 41 10 41 41
6: 41 41 41 31 41 41 10 41
7: 41 41 31 41 41 41 41 10
[guest@c002-n002 WorkTest]$
[guest@c002-n002 WorkTest]$ mpirun -host c002-n002 -np 1 -env OMP_NUM_THREADS=8 -env KMP_AFFINITY=granularity=fine,proclist=[0,1,16,17,48,49,32,33],explicit ./test12.out
Sum: 1.5708
Completed in 321.13 secs
[guest@c002-n002 WorkTest]$

3. As you can see, 8 OpenMP threads are used and bindings are as follows:

OpenMP Threads 0 and 1 -> node 0 cpus: 0 1 ...
OpenMP Threads 2 and 3 -> node 1 cpus: 16 17 ...
OpenMP Threads 4 and 5 -> node 2 cpus: 48 49 ...
OpenMP Threads 6 and 7 -> node 3 cpus: 32 33

SergeyKostrov · ‎02-06-2017

>>...I'm focused on a case with a single MPI rank, 8 OMP threads, one thread per core, single KNL node, running in SNC4 mode.
>>Attempting to get even distribution of threads in all 4 numa regions/nodes.

In my test case two cores from every node are used.
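
An explicit proclist like the one in the test above does not have to be typed by hand. The Python sketch below is a hypothetical helper (`build_proclist` and `kmp_affinity` are illustrative names, not part of any Intel or Cray tool) that takes the first few CPU ids from every NUMA node that has CPUs. It assumes the ids listed first for each node are hardware thread 0 of distinct cores, which matches the `numactl --hardware` output in this thread but is not guaranteed elsewhere. Only the first 16 ids per node from that output are reproduced here.

```python
def build_proclist(node_cpus, per_node):
    """Pick the first `per_node` CPU ids from each NUMA node that has CPUs."""
    proclist = []
    for node in sorted(node_cpus):
        cpus = node_cpus[node]
        if not cpus:
            continue  # memory-only node (MCDRAM in flat/hybrid mode)
        if per_node > len(cpus):
            raise ValueError(f"node {node} has only {len(cpus)} CPU ids")
        proclist.extend(cpus[:per_node])
    return proclist


def kmp_affinity(proclist):
    """Format the list as an explicit KMP_AFFINITY value."""
    ids = ",".join(str(cpu) for cpu in proclist)
    return f"granularity=fine,proclist=[{ids}],explicit"


# First 16 CPU ids of each node, transcribed from the output above.
# Note that node 2 starts at 48 and node 3 at 32 on this system.
node_cpus = {
    0: list(range(0, 16)),
    1: list(range(16, 32)),
    2: list(range(48, 64)),
    3: list(range(32, 48)),
    4: [], 5: [], 6: [], 7: [],
}

proclist = build_proclist(node_cpus, per_node=2)
print(kmp_affinity(proclist))
# granularity=fine,proclist=[0,1,16,17,48,49,32,33],explicit
```

With two ids per node this yields the same value that was passed to `mpirun` above. The sketch covers only the single-rank case discussed here; with several ranks per node each rank would need its own list.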

Mike_B_2 · ‎02-06-2017

Sergey,

Yes, that's exactly the layout that I was after. For some reason when I used the proclist earlier, I thought it assigned by core id, and not the proc id, which made it different for each KNL (due to missing tiles).

I've repeated this with aprun on 2 different KNL nodes, and it works the same as what you show. Great.

Now, is this layout (distributed threads per numa region) possible without the proclist, to make it easier for users? Ideally they wouldn't have to check the numa node proc list, and could just have it as an env setting.

thanks much,

Mike

Lawrence_M_Intel · ‎02-06-2017

When you use the proclist you are completely overriding what aprun is doing and you need a different proclist for every MPI rank on a node.

Is my document TL;DR? The whole point is to get aprun to give you a CPU affinity mask including all the threads on all the cores in your MPI rank and then use OMP environment variables to subset that for nesting or different OpenMP thread layout.

It is true that with my document you can't run a single OpenMP program with 4 cores/2 threads per core with 1 core per numa node. But you really don't want to do that. You should run at least one MPI rank per numa node. Note also that on the 68 core part, two of the nodes have 9 tiles (18 cores) and two have 8 tiles (16 cores).
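
The uneven split on the 68 core part can be checked against the `numactl -H` listings posted earlier in the thread for nid00193 and nid00195. The Python sketch below is only a worked count: the logical CPU totals per node are taken from those listings, and it relies on KNL having 4 hardware threads per core and 2 cores per tile. The name `cores_and_tiles` is hypothetical.

```python
HW_THREADS_PER_CORE = 4  # each KNL core exposes 4 hardware threads
CORES_PER_TILE = 2       # each KNL tile holds 2 cores

# Number of CPU ids listed per node by numactl -H on nid00193/nid00195.
logical_cpus = {0: 72, 1: 72, 2: 64, 3: 64}


def cores_and_tiles(n_logical):
    """Convert a logical CPU count into (cores, tiles)."""
    cores, leftover = divmod(n_logical, HW_THREADS_PER_CORE)
    if leftover:
        raise ValueError("logical CPU count is not a multiple of 4")
    return cores, cores // CORES_PER_TILE


for node, n in logical_cpus.items():
    cores, tiles = cores_and_tiles(n)
    print(f"node {node}: {n} logical CPUs, {cores} cores, {tiles} tiles")
# node 0: 72 logical CPUs, 18 cores, 9 tiles
# node 1: 72 logical CPUs, 18 cores, 9 tiles
# node 2: 64 logical CPUs, 16 cores, 8 tiles
# node 3: 64 logical CPUs, 16 cores, 8 tiles

total_cores = sum(cores_and_tiles(n)[0] for n in logical_cpus.values())
print(total_cores)  # 68
```

So a layout with one rank per numa node has 18 cores available on two of the nodes and 16 on the other two, which matters when choosing a thread count per rank that fits every node.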