Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Re: COARRAY process pinning bug

jimdempseyatthecove
Honored Contributor III

Additional information.

The test application performs affinity pinning as described in the report above.

It appears that an MKL call is not respecting the affinity pinning of the calling thread.

I suspect it is using the process affinity rather than the affinity of the thread (of the process) making the call.

tempA(:,:) = InputA(:,:,I)  for Image            1
 xxxxxxxxxxxxxxxxxxxxx
 Image           1
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 0 1 2 3 4 5 6 7
 HEEVR for Image            1
OMP: Warning #123: Ignoring invalid OS proc ID 8.
OMP: Warning #123: Ignoring invalid OS proc ID 16.
OMP: Warning #123: Ignoring invalid OS proc ID 24.
OMP: Warning #123: Ignoring invalid OS proc ID 32.
OMP: Warning #123: Ignoring invalid OS proc ID 40.
OMP: Warning #123: Ignoring invalid OS proc ID 48.
OMP: Warning #123: Ignoring invalid OS proc ID 56.
 OutputZ(:,:,I) = localZ(:,:) for Image            1
 OutputM(I) = M for Image            1
 OutputISUPPZ(:,I) = localISUPPZ(:) for Image            1
 End RunTest for Image            1
Total calc.time =     2.3700[sec]

 tempA(:,:) = InputA(:,:,I)[1]  for Image            2
 xxxxxxxxxxxxxxxxxxxxx
 Image           2
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 8 9 10 11 12 13 14 15
 HEEVR for Image            2
OMP: Warning #123: Ignoring invalid OS proc ID 0.
OMP: Warning #123: Ignoring invalid OS proc ID 16.
OMP: Warning #123: Ignoring invalid OS proc ID 24.
OMP: Warning #123: Ignoring invalid OS proc ID 32.
OMP: Warning #123: Ignoring invalid OS proc ID 40.
OMP: Warning #123: Ignoring invalid OS proc ID 48.
OMP: Warning #123: Ignoring invalid OS proc ID 56.
 OutputZ(:,:,I)[1] = localZ(:,:) for Image            2
 OutputM(I)[1] = M for Image            2
 OutputISUPPZ(:,I)[1] = localISUPPZ(:) for Image            2
 End RunTest for Image            2
Total calc.time =     4.3500[sec]

 tempA(:,:) = InputA(:,:,I)[1]  for Image            3
 xxxxxxxxxxxxxxxxxxxxx
 Image           3
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 16 17 18 19 20 21 22 23
 HEEVR for Image            3
OMP: Warning #123: Ignoring invalid OS proc ID 0.
OMP: Warning #123: Ignoring invalid OS proc ID 8.
OMP: Warning #123: Ignoring invalid OS proc ID 24.
OMP: Warning #123: Ignoring invalid OS proc ID 32.
OMP: Warning #123: Ignoring invalid OS proc ID 40.
OMP: Warning #123: Ignoring invalid OS proc ID 48.
OMP: Warning #123: Ignoring invalid OS proc ID 56.

The process affinity (ProcessorGroup 0) has all 64 bits assigned.

The calling images (in this setup) take 8 logical processors (of group 0) each.
MKL is ignoring the calling thread's affinity; instead it uses the process affinity and picks every 8th logical processor (which happens to be the first thread in each L2).

Granted, for a single thread calling MKL (from a single image), I'd want it to choose every L2...
... however, with multiple images (on the same SMP), the intention is to pin the calling thread to a subset of the process (e.g. the L2s of half a NUMA node).

The choice of using adjacent OS proc IDs was a test to see what is happening in MKL.

Jim Dempsey

Barbara_P_Intel
Moderator

I don't know how smart Intel MPI is regarding pinning. This article has information about how to pin processes to processors using CAF.

 

jimdempseyatthecove
Honored Contributor III

Barbara, thanks for the article link.

At issue are the following.

When using coarrays, you run the application (executable) directly, as opposed to launching it via mpiexec.

My (clumsy) attempt at using FOR_COARRAY_CONFIG_FILE led to the observation that the config file runs the images as specified within the config file, rather than the program.exe launched via the command line (this is somewhat clumsy).

I_MPI_PIN_PROCESSOR_LIST assigns one processor from the list to each rank; it does not allow multiple logical processors on the host to be assigned to a given rank. Note, it would be useful for a rank to be given multiple host logical processors, to facilitate use of the threaded MKL library (as well as OpenMP).

For example, it would be nice to be able to use (*** NOT IMPLEMENTED ***)

I_MPI_PIN_PROCESSOR_LIST={0:32},{64:32},{128:32},{192:64}

for 4 ranks, each rank "pinned" to half the OS procs of each of 4 NUMA nodes.
Then, as each rank calls the threaded MKL library, each rank's use of MKL would stay within its pinned list of OS procs. BTW, the above is the same as the OMP_PLACES syntax.

The issues I am experiencing are:
the inability to pin a rank to multiple processors, and, when adjusting pinning within the rank, MKL's failure to use the calling thread's affinity for its instance of the call into MKL. It uses the process's affinity mask and thus causes thrashing.

I will experiment with the config file to see what happens.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

FOR_COARRAY_CONFIG_FILE=TestHEEVRK.MPI.config
FOR_COARRAY_NUM_IMAGES=8

TestHEEVRK.MPI.config:
-genv I_MPI_PIN_PROCESSOR_LIST=0,32,64,96,128,160,192,224 -n 8 .\TestHEEVRK\x64\DebugCAF\TestHEEVRK 500 8

Note, the above pins each process to 1 OS proc, spread every 32 OS procs (2 in each of the 4 processor groups).
However, the process is not pinned; rather, the startup thread assumes the pinning.


Summary of pinning, before re-pinning by the process:

Image 1
retGetProcessAffinityMask 1
ProcessAffinityMask FFFFFFFFFFFFFFFF
SystemAffinityMask FFFFFFFFFFFFFFFF
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 0

Image 2
retGetProcessAffinityMask 1
ProcessAffinityMask FFFFFFFFFFFFFFFF
SystemAffinityMask FFFFFFFFFFFFFFFF
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 32

Image 3
retGetProcessAffinityMask 1
ProcessAffinityMask FFFFFFFFFFFFFFFF
SystemAffinityMask FFFFFFFFFFFFFFFF
Thread 0 GetCurrentThread() -2 ProcessorGroup 1 Pinned: 0

Image 4
retGetProcessAffinityMask 1
ProcessAffinityMask FFFFFFFFFFFFFFFF
SystemAffinityMask FFFFFFFFFFFFFFFF
Thread 0 GetCurrentThread() -2 ProcessorGroup 1 Pinned: 32

Image 5
retGetProcessAffinityMask 1
ProcessAffinityMask FFFFFFFFFFFFFFFF
SystemAffinityMask FFFFFFFFFFFFFFFF
Thread 0 GetCurrentThread() -2 ProcessorGroup 2 Pinned: 0

Image 6
retGetProcessAffinityMask 1
ProcessAffinityMask FFFFFFFFFFFFFFFF
SystemAffinityMask FFFFFFFFFFFFFFFF
Thread 0 GetCurrentThread() -2 ProcessorGroup 2 Pinned: 32

Image 7
retGetProcessAffinityMask 1
ProcessAffinityMask FFFFFFFFFFFFFFFF
SystemAffinityMask FFFFFFFFFFFFFFFF
Thread 0 GetCurrentThread() -2 ProcessorGroup 3 Pinned: 0

Image 8
retGetProcessAffinityMask 1
ProcessAffinityMask FFFFFFFFFFFFFFFF
SystemAffinityMask FFFFFFFFFFFFFFFF
Thread 0 GetCurrentThread() -2 ProcessorGroup 3 Pinned: 32

vvvvvvvvv experiment 2 follows vvvvvvvvv

An experiment with using the OMP_PLACES syntax for the pinning:

-genv I_MPI_PIN_PROCESSOR_LIST=0:32,32:32,64:32,96:32,128:32,160:32,192:32,224:32 -n 8 .\TestHEEVRK\x64\DebugCAF\TestHEEVRK

This produces the same pinning as above (ignoring the speculative :32).

Removing the :32's

TestHEEVRK.MPI.config:
-genv I_MPI_PIN_PROCESSOR_LIST=0,32,64,96,128,160,192,224 -n 8 .\TestHEEVRK\x64\DebugCAF\TestHEEVRK 500 8

My pinning test program uses an augmented version of OMP_PLACES parsing:

OMP_PLACES={0:32},{32:32},{64:32},{96:32},{128:32},{160:32},{192:32},{224:32}

Resulting in the following (omitting the pinning info prior to augmentation): each main process thread is pinned as desired (note: the process's thread, not the process itself).

Image 1
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Image 2
Thread 0 GetCurrentThread() -2 ProcessorGroup 0 Pinned: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

Image 3
Thread 0 GetCurrentThread() -2 ProcessorGroup 1 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Image 4
Thread 0 GetCurrentThread() -2 ProcessorGroup 1 Pinned: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

Image 5
Thread 0 GetCurrentThread() -2 ProcessorGroup 2 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Image 6
Thread 0 GetCurrentThread() -2 ProcessorGroup 2 Pinned: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

Image 7
Thread 0 GetCurrentThread() -2 ProcessorGroup 3 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Image 8
Thread 0 GetCurrentThread() -2 ProcessorGroup 3 Pinned: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

Now comes the crux of the problem.

The test program makes an MKL call.

What is problematic:

1) MKL does not restrict its pinning to that of the calling thread. Instead, it appears to use the system affinity (all processor groups on Windows).
2) While MKL attempts to pin as it wishes across all processor groups (chosen procs within each group), it is stymied by the calling thread's pinning.
In other words, MKL finds that it is only able to pin within the calling thread's pinned OS procs.

For example, Image 6 (note there is no warning about pinning to OS proc ID 160, which is processor group 2, pin 32):

Image 6
retGetProcessAffinityMask 1
ProcessAffinityMask FFFFFFFFFFFFFFFF
SystemAffinityMask FFFFFFFFFFFFFFFF
Thread 0 GetCurrentThread() -2 ProcessorGroup 2 Pinned: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
HEEVR for Image 6
OMP: Warning #123: Ignoring invalid OS proc ID 0.
OMP: Warning #123: Ignoring invalid OS proc ID 32.
OMP: Warning #123: Ignoring invalid OS proc ID 64.
OMP: Warning #123: Ignoring invalid OS proc ID 96.
OMP: Warning #123: Ignoring invalid OS proc ID 128.
OMP: Warning #123: Ignoring invalid OS proc ID 192.
OMP: Warning #123: Ignoring invalid OS proc ID 224.

A further complication. Using, in the MPI config file:

-genv I_MPI_PIN_PROCESSOR_LIST=0,32,64,96,128,160,192,224 -genv KMP_AFFINITY=disabled -n 8 .\TestHEEVRK\x64\DebugCAF\TestHEEVRK 500 8

KMP_AFFINITY=disabled was used in the hope that MKL would observe this setting; it does not.

The warnings still appear.

What this means is that the end user will experience severe performance penalties with the threaded MKL library when running:

multiple ranks (of the same mpiexec) on the same host
multiple mpiexec's with rank(s) on the same host (but with explicit pinning to avoid conflicts)
within a rank, different OpenMP threads with separate affinities (plural)

While some might not care about the last item, most should care about the first two.

Suggested corrective measures:

Have mpiexec (hydra) change the affinity of the process as opposed to the affinity of the startup thread (then set the startup thread to the new process affinity).

Have MKL use the calling thread's affinity, which is the process affinity if the app/thread has not re-pinned, or else whatever pinning the calling thread has chosen.

To handle the situation where someone is currently using OpenMP within a rank .AND. pinning the OpenMP threads (default 1 OS proc per OMP thread), and calling MKL (and thrashing with other ranks/processes), you could have an environment variable and/or MKL function that selects between process affinity and thread affinity (you choose the default).

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

Pared-down test program attached.

System: Xeon Phi 7210
Configured as 4 NUMA nodes
64 cores, 4 threads per core.

The attached .zip is a project folder under Windows 10 x64 Pro
MS VS 2019 Version 16.8.3
Intel Parallel Studio XE Cluster Edition 2020u4

Using Debug build (use binary in ...\Debug folder)

SET FOR_COARRAY_CONFIG_FILE=
SET FOR_COARRAY_NUM_IMAGES=n    (vary n)

After specifying the number of images:

.\CAF_MKL_ISSUE\x64\Debug\CAF_MKL_ISSUE 50 16

The 50 is the size of the problem; the 16 is the number of times to execute the problem.
Ensure the number of times exceeds the number of images.

My system has (as configured)

4 ProcessorGroups
4 NUMA nodes
64 OS procs per Process Group (each group one NUMA node)

On this system, use 1 through 4 images.
Using more has problems:

C:\test\TestHEEVRK>set FOR_COARRAY_NUM_IMAGES=8

C:\test\TestHEEVRK>.\CAF_MKL_ISSUE\x64\Debug\CAF_MKL_ISSUE 50 16
 TestProgram
 TestProgram
 TestProgram

 TestProgram

 TestProgram

Image 1
 (3(A,I0),A)Array(          50 ,          50 ,          16 )
 BuildTestData() allocate data Image            1
 TestProgram

Image 5
 BuildTestData() allocate data Image            5
 TestProgram

Image 7
 BuildTestData() allocate data Image            7
 TestProgram

Image 3
 BuildTestData() allocate data Image            3

Image 8
 BuildTestData() allocate data Image            8

Image 6
 BuildTestData() allocate data Image            6
Image 4
 BuildTestData() allocate data Image            4
Image 2
 BuildTestData() allocate data Image            2
 BuildTestData() allocate done Image            2
 BuildTestData() allocate done Image            8
 BuildTestData() allocate done Image            4
 BuildTestData() allocate done Image            6
 BuildTestData() allocate done Image            1
 Building test data
 BuildTestData() allocate done Image            5
 BuildTestData() allocate done Image            7
 BuildTestData() allocate done Image            3
 Done Building test data
 Starting timed section
OMP: Warning #123: Ignoring invalid OS proc ID 32.
OMP: Warning #123: Ignoring invalid OS proc ID 64.
OMP: Warning #123: Ignoring invalid OS proc ID 96.
OMP: Warning #123: Ignoring invalid OS proc ID 128.
OMP: Warning #123: Ignoring invalid OS proc ID 160.
OMP: Warning #123: Ignoring invalid OS proc ID 192.
OMP: Warning #123: Ignoring invalid OS proc ID 224.
Image 1 Total calc.time =     0.2300[sec]
OMP: Warning #123: Ignoring invalid OS proc ID 0.
OMP: Warning #123: Ignoring invalid OS proc ID 64.
OMP: Warning #123: Ignoring invalid OS proc ID 96.
OMP: Warning #123: Ignoring invalid OS proc ID 128.
OMP: Warning #123: Ignoring invalid OS proc ID 160.
OMP: Warning #123: Ignoring invalid OS proc ID 192.
OMP: Warning #123: Ignoring invalid OS proc ID 224.
Image 2 Total calc.time =     1.4900[sec]
Image 7 Total calc.time =     1.5100[sec]
Image 5 Total calc.time =     1.5100[sec]
Image 4 Total calc.time =     1.5100[sec]
Image 3 Total calc.time =     1.5100[sec]
Image 6 Total calc.time =     1.5100[sec]
Image 8 Total calc.time =     1.5100[sec]
 All images Total calc.time =    1.50999450683594

 

Jim Dempsey

AbhishekD_Intel
Moderator

Hi Jim,


Thanks for the in-depth details. We are looking into this issue and will get back to you with updates.



Warm Regards,

Abhishek


MRajesh_intel
Moderator

Hi,


Can you please let us know the oneAPI version used?


Regards

Rajesh.


MRajesh_intel
Moderator

Hi,


We are closing this thread as we no longer support KNL machines. Please visit the system requirements for further information. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Link: https://software.intel.com/content/www/us/en/develop/articles/oneapi-math-kernel-library-system-requirements.html


Have a good day.


Regards

Rajesh

