
COARRAY process pinning bug

jimdempseyatthecove
Honored Contributor III

I am experimenting with COARRAYs on KNL (64 cores, 256 logical processors) on a Windows system. Being Windows, this means there are 4 Processor Groups, each with 64 (group-relative) logical processors. These groups conveniently map one-to-one onto the 4 NUMA nodes.

My expectation was that the implicit behavior of mpiexec would use the FOR_COARRAY_NUM_IMAGES value and partition the system's available logical processors in a meaningful (optimal) manner. The experimental program was exhibiting odd behavior, so I added code to query the system as to where each image was located and how it was pinned (a sketch of that query appears just before the transcript below).

With FOR_COARRAY_NUM_IMAGES=4 I would expect each of the 4 images to occupy one of the 4 NUMA nodes (64 threads each, in different nodes). This is usually the case, though sometimes I see two images in the same node.

With FOR_COARRAY_NUM_IMAGES=8 I would expect each of the 8 images to occupy one half of a NUMA node (32 threads each, two images per node, no overlap). This is not the case.
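
For reference, the DisplayThreadInfo() routine seen in the transcript below does its query along these lines. This is a minimal sketch, not the actual code: the hand-declared GROUP_AFFINITY layout and the interfaces to GetCurrentThread and GetThreadGroupAffinity (Windows 7 and later; on x64, bind(C) matches the Win32 calling convention) reflect my assumptions about one workable way to ask where the calling thread is pinned.

module group_affinity_query
    use, intrinsic :: iso_c_binding
    implicit none

    ! Win32 GROUP_AFFINITY: one Mask bit per group-relative logical processor.
    type, bind(C) :: GROUP_AFFINITY
        integer(c_intptr_t) :: Mask          ! KAFFINITY bit mask
        integer(c_short)    :: Group         ! processor group number
        integer(c_short)    :: Reserved(3)
    end type GROUP_AFFINITY

    interface
        ! Returns the calling thread's pseudo-handle (the -2 in the output below).
        function GetCurrentThread() bind(C, name='GetCurrentThread')
            import :: c_intptr_t
            integer(c_intptr_t) :: GetCurrentThread
        end function GetCurrentThread

        ! Reports the processor group and pin mask of the given thread.
        function GetThreadGroupAffinity(hThread, GroupAffinity) &
                bind(C, name='GetThreadGroupAffinity')
            import :: c_intptr_t, c_int, GROUP_AFFINITY
            integer(c_intptr_t), value        :: hThread
            type(GROUP_AFFINITY), intent(out) :: GroupAffinity
            integer(c_int)                    :: GetThreadGroupAffinity
        end function GetThreadGroupAffinity
    end interface

contains

    subroutine show_my_pinning()
        type(GROUP_AFFINITY) :: ga
        integer :: lp
        if (GetThreadGroupAffinity(GetCurrentThread(), ga) /= 0) then
            write(*,'(A,I0,A,I0,A)', advance='no') &
                'Image ', this_image(), ' ProcessorGroup ', ga%Group, ' Pinned:'
            do lp = 0, 63        ! 64 logical processors per group on this box
                if (btest(ga%Mask, lp)) write(*,'(1X,I0)', advance='no') lp
            end do
            write(*,*)
        end if
    end subroutine show_my_pinning

end module group_affinity_query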

C:\test\TestHEEVRK>set FOR_COARRAY_NUM_IMAGES=8

C:\test\TestHEEVRK>.\TestHEEVRK\x64\DebugCAF\TestHEEVRK 500 8
 TestHEEVRK #1
 TestHEEVRK #1
 TestHEEVRK #1
 TestHEEVRK #1
 TestHEEVRK #1
 TestHEEVRK #1
 TestHEEVRK #1
 TestHEEVRK #1

DisplayThreadInfo() Image 1
 xxxxxxxxxxxxxxxxxxxxx
 Image           1
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
 GetCurrentProcess()                    -1
retGetProcessAffinityMask           1
ProcessAffinityMask       FFFFFFFFFFFFFFFF
SystemAffinityMask       FFFFFFFFFFFFFFFF
retGetProcessGroupAffinity           1
 GroupCount      4
 GroupArray      0      1      2      3
Stopping here

DisplayThreadInfo() Image 2
 xxxxxxxxxxxxxxxxxxxxx
 Image           2
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 GetCurrentProcess()                    -1
retGetProcessAffinityMask           1
ProcessAffinityMask       FFFFFFFFFFFFFFFF
SystemAffinityMask       FFFFFFFFFFFFFFFF
retGetProcessGroupAffinity           1
 GroupCount      4
 GroupArray      0      1      2      3
Stopping here

DisplayThreadInfo() Image 3
 xxxxxxxxxxxxxxxxxxxxx
 Image           3
Thread 0 GetCurrentThread() -2 ProcessorGroup 2 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 GetCurrentProcess()                    -1
retGetProcessAffinityMask           1
ProcessAffinityMask       FFFFFFFFFFFFFFFF
SystemAffinityMask       FFFFFFFFFFFFFFFF
retGetProcessGroupAffinity           1
 GroupCount      4
 GroupArray      0      1      2      3
Stopping here

DisplayThreadInfo() Image 4
 xxxxxxxxxxxxxxxxxxxxx
 Image           4
Thread 0 GetCurrentThread() -2 ProcessorGroup 3 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 GetCurrentProcess()                    -1
retGetProcessAffinityMask           1
ProcessAffinityMask       FFFFFFFFFFFFFFFF
SystemAffinityMask       FFFFFFFFFFFFFFFF
retGetProcessGroupAffinity           1
 GroupCount      4
 GroupArray      0      1      2      3
Stopping here

DisplayThreadInfo() Image 5
 xxxxxxxxxxxxxxxxxxxxx
 Image           5
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 GetCurrentProcess()                    -1
retGetProcessAffinityMask           1
ProcessAffinityMask       FFFFFFFFFFFFFFFF
SystemAffinityMask       FFFFFFFFFFFFFFFF
retGetProcessGroupAffinity           1
 GroupCount      4
 GroupArray      0      1      2      3
Stopping here

DisplayThreadInfo() Image 6
 xxxxxxxxxxxxxxxxxxxxx
 Image           6
Thread 0 GetCurrentThread() -2 ProcessorGroup 1 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 GetCurrentProcess()                    -1
retGetProcessAffinityMask           1
ProcessAffinityMask       FFFFFFFFFFFFFFFF
SystemAffinityMask       FFFFFFFFFFFFFFFF
retGetProcessGroupAffinity           1
 GroupCount      4
 GroupArray      0      1      2      3
Stopping here

DisplayThreadInfo() Image 7
 xxxxxxxxxxxxxxxxxxxxx
 Image           7
Thread 0 GetCurrentThread() -2 ProcessorGroup 2 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 GetCurrentProcess()                    -1
retGetProcessAffinityMask           1
ProcessAffinityMask       FFFFFFFFFFFFFFFF
SystemAffinityMask       FFFFFFFFFFFFFFFF
retGetProcessGroupAffinity           1
 GroupCount      4
 GroupArray      0      1      2      3
Stopping here

DisplayThreadInfo() Image 8
 xxxxxxxxxxxxxxxxxxxxx
 Image           8
Thread 0 GetCurrentThread() -2 ProcessorGroup 3 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 GetCurrentProcess()                    -1
retGetProcessAffinityMask           1
ProcessAffinityMask       FFFFFFFFFFFFFFFF
SystemAffinityMask       FFFFFFFFFFFFFFFF
retGetProcessGroupAffinity           1
 GroupCount      4
 GroupArray      0      1      2      3
Stopping here

C:\test\TestHEEVRK>


In the above, Images 1 and 2 are pinned properly, each occupying a different half (32 distinct logical processors) of ProcessorGroup 0 (NUMA node 0).

Images 3 through 8, however, are each pinned to all 64 logical processors of a full processor group, assigned round-robin starting at ProcessorGroup 2 rather than at ProcessorGroup 1.

The problem with this is that images are now likely to preempt one another, and to incur unnecessarily high numbers of cache evictions.

The ideal pinning for this configuration would have been:

Image 1: group 0, group logical processors 0:31
Image 2: group 0, group logical processors 32:63
Image 3: group 1, group logical processors 0:31
Image 4: group 1, group logical processors 32:63
Image 5: group 2, group logical processors 0:31
Image 6: group 2, group logical processors 32:63
Image 7: group 3, group logical processors 0:31
Image 8: group 3, group logical processors 32:63
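
Until the launcher does this on its own, a possible workaround (my assumption of a viable approach, sketched here and untested) is for each image to repin its own thread at startup via SetThreadGroupAffinity, deriving the group and half-group mask from this_image(). It hard-codes this box's geometry (4 groups of 64 logical processors, 8 images) and moves only the calling thread; any helper threads the coarray runtime spawns would need the same treatment.

subroutine pin_to_half_group()
    use, intrinsic :: iso_c_binding
    use group_affinity_query   ! GROUP_AFFINITY and GetCurrentThread from the sketch above
    implicit none
    interface
        ! Pins the given thread to one group/mask; previous-affinity pointer may be NULL.
        function SetThreadGroupAffinity(hThread, GroupAffinity, PrevAffinity) &
                bind(C, name='SetThreadGroupAffinity')
            import :: c_intptr_t, c_int, c_ptr, GROUP_AFFINITY
            integer(c_intptr_t), value       :: hThread
            type(GROUP_AFFINITY), intent(in) :: GroupAffinity
            type(c_ptr), value               :: PrevAffinity
            integer(c_int)                   :: SetThreadGroupAffinity
        end function SetThreadGroupAffinity
    end interface
    type(GROUP_AFFINITY) :: ga
    integer :: img
    img = this_image()
    ga%Group    = int((img - 1) / 2, c_short)   ! images 1:2 -> group 0, 3:4 -> group 1, ...
    ga%Reserved = 0
    if (mod(img, 2) == 1) then
        ga%Mask = int(z'FFFFFFFF', c_intptr_t)              ! group LPs 0:31
    else
        ga%Mask = ishft(int(z'FFFFFFFF', c_intptr_t), 32)   ! group LPs 32:63
    end if
    if (SetThreadGroupAffinity(GetCurrentThread(), ga, c_null_ptr) == 0) &
        print *, 'Image', img, ': repin failed'
end subroutine pin_to_half_group

With this called as the first executable statement of each image, the 8 images would land on the layout listed above.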

Jim Dempsey
