Intel® MPI Library

Pinning processes to specific cores?

ClayB
New Contributor I

I'm wondering if Intel MPI has the facility to allow me to pin processes, not only to specific nodes within my cluster, but to specific cores within those nodes. With Open MPI, I can set up a rankfile that gives me this fine-grained capability; that is, I can assign each MPI process rank to a specific node and a given core (logical or physical) on that node.

Granted, the rankfile idea from Open MPI is merely theoretical since the OS I have on the machines doesn't seem to abide by the assignments I make, but at least the possibility is there. I haven't found that level of control within the Intel MPI documentation. Any pointers, or is this all just a pipe dream on my part?

ClayB
New Contributor I

Clay -

Have you tried looking in the Intel® MPI Library Developer Reference for Linux* OS (https://software.intel.com/en-us/mpi-developer-reference-linux)? The title is a bit misleading; it might be read as a reference for folks who are developing the Intel MPI Library rather than for developers using it, which makes it easy to overlook.

There is a subsection on "Process Pinning" in the "Tuning Performance" chapter. It describes all kinds of pinning methods controlled through environment variables (such as I_MPI_PIN_MODE and I_MPI_PIN_PROCESSOR_LIST). Surely there's something in there that you can finagle to fit your needs.

Hope that helps.

--clay

ClayB
New Contributor I

Thanks, Clay. That does seem like it would help with what I'm trying to do.

Unfortunately, it's not working as advertised. I set the environment variable I_MPI_PIN to 1 to enable process pinning in the Intel MPI library, then set I_MPI_PIN_PROCESSOR_LIST to a specific set of 12 (logical) core IDs. When I run the code and watch the core assignments with 'top' in another window, all processes are executing on cores 0-11. I even checked what the MPI debug output said about where it thought things were being placed. It looks like this:

[clay@XXX src]$ export I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,12,13,14,15,16,17
[clay@XXX src]$ mpirun -genv I_MPI_DEBUG=4 -m -n 12 graph500_reference_bfs 22
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
[5] MPI startup(): shm data transfer mode
[6] MPI startup(): shm data transfer mode
[7] MPI startup(): shm data transfer mode
[8] MPI startup(): shm data transfer mode
[9] MPI startup(): shm data transfer mode
[10] MPI startup(): shm data transfer mode
[11] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name              Pin cpu
[0] MPI startup(): 0       27963    qc-2.oda-internal.com  0
[0] MPI startup(): 1       27964    qc-2.oda-internal.com  1
[0] MPI startup(): 2       27965    qc-2.oda-internal.com  2
[0] MPI startup(): 3       27966    qc-2.oda-internal.com  3
[0] MPI startup(): 4       27967    qc-2.oda-internal.com  4
[0] MPI startup(): 5       27968    qc-2.oda-internal.com  5
[0] MPI startup(): 6       27969    qc-2.oda-internal.com  12
[0] MPI startup(): 7       27970    qc-2.oda-internal.com  13
[0] MPI startup(): 8       27971    qc-2.oda-internal.com  14
[0] MPI startup(): 9       27972    qc-2.oda-internal.com  15
[0] MPI startup(): 10      27973    qc-2.oda-internal.com  16
[0] MPI startup(): 11      27974    qc-2.oda-internal.com  17

This looks good, but when I "look behind the curtain" I feel that I've been ZARDOZ'ed and the report is not reflective of reality. I even tried one of the "standard" layouts that come with the environment variables.

[clay@XXX src]$ export I_MPI_PIN_PROCESSOR_LIST=allcores:map=scatter
[clay@XXX src]$ mpirun -genv I_MPI_DEBUG=4 -m -n 12 graph500_reference_bfs 22
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
[5] MPI startup(): shm data transfer mode
[7] MPI startup(): shm data transfer mode
[8] MPI startup(): shm data transfer mode
[9] MPI startup(): shm data transfer mode
[10] MPI startup(): shm data transfer mode
[11] MPI startup(): shm data transfer mode
[6] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name              Pin cpu
[0] MPI startup(): 0       28192    qc-2.oda-internal.com  0
[0] MPI startup(): 1       28193    qc-2.oda-internal.com  6
[0] MPI startup(): 2       28194    qc-2.oda-internal.com  12
[0] MPI startup(): 3       28195    qc-2.oda-internal.com  18
[0] MPI startup(): 4       28196    qc-2.oda-internal.com  1
[0] MPI startup(): 5       28197    qc-2.oda-internal.com  7
[0] MPI startup(): 6       28198    qc-2.oda-internal.com  13
[0] MPI startup(): 7       28199    qc-2.oda-internal.com  19
[0] MPI startup(): 8       28200    qc-2.oda-internal.com  2
[0] MPI startup(): 9       28201    qc-2.oda-internal.com  8
[0] MPI startup(): 10      28202    qc-2.oda-internal.com  14
[0] MPI startup(): 11      28203    qc-2.oda-internal.com  20

This looks pretty good: cores 0 and 6 are on different sockets, while core 12 is on the same socket as core 0 but as "far" as it could be from the other two, and so on (as I learned from here). Again, watching the placement from another window, all processes crowd into cores 0-11. Definitely not scattered across the available 24 cores.
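(For reference, one way to cross-check that socket layout is to read the Linux sysfs topology files. The standalone program below is an editorial sketch, not part of the original post; it assumes a node with up to 24 logical CPUs, as in this thread.)

/* Editorial sketch (not from the original post): print which physical package
 * (socket) each logical CPU belongs to, using the Linux sysfs topology files. */
#include <stdio.h>

int main(void)
{
        char path[128];
        for (int cpu = 0; cpu < 24; ++cpu) {
                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
                FILE *f = fopen(path, "r");
                if (!f)
                        break;                  /* past the last present CPU */
                int pkg;
                if (fscanf(f, "%d", &pkg) == 1)
                        printf("cpu %2d -> socket %d\n", cpu, pkg);
                fclose(f);
        }
        return 0;
}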

I'm wondering if the operating system (Linux kernel 3.10.0-693.11.6.el7.x86_64; CentOS Linux release 7.4.1708) is simply not honoring what the Intel MPI library asks for (just as it seemed to ignore the Open MPI runtime's attempts to pin processes to cores).

Anyone with other thoughts? Am I missing something obvious?

P.S. I've found a way to fix each process to a core, but it requires modifying the application to call sched_setaffinity() with a core mask based on the rank of the process and sent from the rank 0 process. It would be better not to have to modify code like that and to control everything from the command line or environment.
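(For reference, a minimal sketch of that kind of in-code workaround, assuming Linux and a hand-picked core list like the ones used later in this thread. This is not the actual application code; to keep it short, each rank picks its core locally from a hard-coded list rather than receiving a mask from rank 0.)

/* Minimal sketch of an in-code pinning workaround (assumed reconstruction).
 * Each rank pins itself to one core from a hand-picked list; build with mpicc. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

static const int core_list[12] = { 0, 1, 2, 3, 4, 5, 12, 13, 14, 15, 16, 17 };

int main(int argc, char **argv)
{
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core_list[rank % 12], &mask);        /* choose a core by rank */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
                perror("sched_setaffinity");

        /* ... application work would go here ... */

        MPI_Finalize();
        return 0;
}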

McCalpinJohn
Honored Contributor III

We have not seen Intel MPI mess up the core binding in this way -- in fact, we are extremely pleased that the defaults are almost always exactly what we want. But I don't think that I tested it on a node that has the perverse numbering you describe in https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/752318

Have you used sched_getaffinity() inside the code to confirm the masks that MPI is setting for each task?

ClayB
New Contributor I

I've not tried sched_getaffinity() to double check the actual core mappings. I'll give it a go.

I was trusting the output of 'top' to show which cores were executing the processes. When I use sched_setaffinity() to manually place processes, the top output shows the pinning I expect, even with the perverse core numbering. If the Intel MPI debug output declares the mapping I requested, I expect the same visual confirmation from top. While I might have gotten the numbering mixed up and inadvertently chosen the 12 cores that top numbers 0-11, I would really have expected the "map=scatter" alternative, with its wide-ranging selection of cores, not to land on the logical/physical cores numbered 0-11 by the 'top' utility.

As you suggest, John, maybe I need to look into the code itself for confirmation rather than rely on external tools. I'll post an update on what I find.

ClayB
New Contributor I

No soap. I'm getting confirmation from sched_getaffinity() that the processes are being "pinned" to cores equal to their rank. Running with a specific list of cores:

[clay@XXX src]$ mpirun -genv I_MPI_DEBUG=4 -m -host qc-2.oda-internal.com -env I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,18,19,20,21,22,23 -n 12 graph500_reference_bfs 22
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
[5] MPI startup(): shm data transfer mode
[6] MPI startup(): shm data transfer mode
[7] MPI startup(): shm data transfer mode
[8] MPI startup(): shm data transfer mode
[9] MPI startup(): shm data transfer mode
[10] MPI startup(): shm data transfer mode
[11] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name              Pin cpu
[0] MPI startup(): 0       5559     CxxxCxxxxxxxC.com  0
[0] MPI startup(): 1       5560     CxxxCxxxxxxxC.com  1
[0] MPI startup(): 2       5561     CxxxCxxxxxxxC.com  2
[0] MPI startup(): 3       5562     CxxxCxxxxxxxC.com  3
[0] MPI startup(): 4       5563     CxxxCxxxxxxxC.com  4
[0] MPI startup(): 5       5564     CxxxCxxxxxxxC.com  5
[0] MPI startup(): 6       5565     CxxxCxxxxxxxC.com  18
[0] MPI startup(): 7       5566     CxxxCxxxxxxxC.com  19
[0] MPI startup(): 8       5567     CxxxCxxxxxxxC.com  20
[0] MPI startup(): 9       5568     CxxxCxxxxxxxC.com  21
[0] MPI startup(): 10      5569     CxxxCxxxxxxxC.com  22
[0] MPI startup(): 11      5570     CxxxCxxxxxxxC.com  23
...
Rank 0 on core 0
Rank 1 on core 1
Rank 2 on core 2
Rank 3 on core 3
Rank 4 on core 4
Rank 5 on core 5
Rank 6 on core 6
Rank 7 on core 7
Rank 8 on core 8
Rank 9 on core 9
Rank 10 on core 10
Rank 11 on core 11

Or trying with the preset "allcores:map=scatter" gives me the same thing:

[clay@XXX src]$ mpirun -genv I_MPI_DEBUG=4 -m -host qc-2.oda-internal.com -env I_MPI_PIN_PROCESSOR_LIST=allcores:map=scatter -n 12 graph500_reference_bfs 22
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[8] MPI startup(): shm data transfer mode
[10] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[5] MPI startup(): shm data transfer mode
[6] MPI startup(): shm data transfer mode
[7] MPI startup(): shm data transfer mode
[9] MPI startup(): shm data transfer mode
[11] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name              Pin cpu
[0] MPI startup(): 0       5781     Cxxx.CxxxxxxxC.com  0
[0] MPI startup(): 1       5782     Cxxx.CxxxxxxxC.com  6
[0] MPI startup(): 2       5783     Cxxx.CxxxxxxxC.com  12
[0] MPI startup(): 3       5784     Cxxx.CxxxxxxxC.com  18
[0] MPI startup(): 4       5785     Cxxx.CxxxxxxxC.com  1
[0] MPI startup(): 5       5786     Cxxx.CxxxxxxxC.com  7
[0] MPI startup(): 6       5787     Cxxx.CxxxxxxxC.com  13
[0] MPI startup(): 7       5788     Cxxx.CxxxxxxxC.com  19
[0] MPI startup(): 8       5789     Cxxx.CxxxxxxxC.com  2
[0] MPI startup(): 9       5790     Cxxx.CxxxxxxxC.com  8
[0] MPI startup(): 10      5791     Cxxx.CxxxxxxxC.com  14
[0] MPI startup(): 11      5792     Cxxx.CxxxxxxxC.com  20
...
Rank 0 on core 0
Rank 1 on core 1
Rank 2 on core 2
Rank 4 on core 4
Rank 5 on core 5
Rank 7 on core 7
Rank 8 on core 8
Rank 3 on core 3
Rank 6 on core 6
Rank 9 on core 9
Rank 10 on core 10
Rank 11 on core 11

One thing I was asked to try was toggling the I_MPI_PIN_MODE environment variable between its two settings (pm, lib). I did that, but saw no change from whichever one is the default.

So, back to the drawing board?

ClayB
New Contributor I

Just for completeness, here's the code snippet I added right after the MPI_Init() and MPI_Comm_rank() calls to report the core each rank is actually pinned to.

/* CPB - Test core pinning with sched_getaffinity() */
/* Needs _GNU_SOURCE and <sched.h>; 'rank' comes from the MPI_Comm_rank() call above. */
        cpu_set_t mask1, mask2;
        int iMask;
        CPU_ZERO(&mask1);
        sched_getaffinity(0, sizeof(mask1), &mask1);     /* affinity mask of this process */
        for (iMask = 0; iMask < 24; ++iMask) {           /* 24 logical cores on this node */
                CPU_ZERO(&mask2);
                CPU_SET(iMask, &mask2);
                if (CPU_EQUAL(&mask1, &mask2)) {         /* pinned to exactly this core? */
                        fprintf(stderr, "Rank %d on core %d\n", rank, iMask);
                        break;
                }
        }
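(One caveat with the snippet above, as an editorial note rather than part of the original post: it only prints anything when the affinity mask contains exactly one CPU, so a rank left with a wider mask stays silent. A variant that reports every CPU in the mask might look like the hypothetical helper below, under the same assumptions: Linux, _GNU_SOURCE, called right after MPI_Comm_rank().)

/* Hypothetical variant: report the full affinity mask rather than only an
 * exact single-core pin. Drop it into the application and call it after
 * MPI_Comm_rank(). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static void report_affinity(int rank)
{
        cpu_set_t mask;
        CPU_ZERO(&mask);
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
                return;
        fprintf(stderr, "Rank %d allowed on cores:", rank);
        for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
                if (CPU_ISSET(cpu, &mask))
                        fprintf(stderr, " %d", cpu);
        fprintf(stderr, "\n");
}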

 

McCalpinJohn
Honored Contributor III

Hmmm... it looks like your system really likes that particular binding pattern...

Sounds like you need an Intel MPI developer to look at this.

Can you boot your system with HyperThreading enabled to see if the logical processor numbering scheme is more sensible in that case?

ClayB
New Contributor I

STUPID!

 

stupid     stupid

 

stupid     stupid     stupid

 

stupid     stupid     stupid     stupid

 

Not ZARDOZ, it seems, but Penn & Teller. A little bit of misdirection had me looking in the wrong place for a culprit. 

When I tried a simple infinite-loop application and launched multiple copies of it through mpirun with the process pinning variables set, everything worked as expected. If I put MPI_Init() and MPI_Finalize() calls before and after the loop, the processes are scheduled as desired. When I try my application of interest (which I had not written), things revert to the first N cores of the processor regardless of what pinning I ask for.
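(For reference, a minimal test of that kind could be as small as the sketch below. This is a reconstruction, not the original loop program; each rank announces itself and then idles long enough to check its placement in top.)

/* Minimal pinning test (assumed reconstruction): each rank prints its rank and
 * then sleeps for a while so its placement can be inspected with top. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d started\n", rank);
        fflush(stdout);
        for (int i = 0; i < 120; ++i)    /* ~2 minutes; long enough to watch top */
                sleep(1);
        MPI_Finalize();
        return 0;
}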

I finally resorted to one of the tried-and-true debugging methods: I asked someone to look over my shoulder while I explained things and he questioned each assumption. He pointed to a file that was touted as an "Active Message Layer" (AML) library. There, buried in its init() function, was an explicit call to pthread_setaffinity_np() with all the attendant setup lines before it. Thus, whatever initial pinning I wanted or requested from MPI was overridden by the AML.
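(For illustration only, the offending code was presumably something of this shape. This is a hypothetical reconstruction, not the actual AML source, and the core-equals-rank choice is an assumption that happens to match the observed crowding onto cores 0 through N-1.)

/* Hypothetical reconstruction of the kind of call buried in the AML init():
 * it re-pins the calling thread and silently overrides whatever mask the MPI
 * launcher had already set. Not the actual library code. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void aml_style_repin(int rank)            /* hypothetical name */
{
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(rank, &mask);                    /* core == rank, i.e. cores 0..N-1 */
        pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}

int main(void)
{
        aml_style_repin(0);                      /* demo: pin this thread to core 0 */
        return 0;
}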

And I must say that I saw this happen while watching 'top': as the code began execution there was a brief flicker of the processes starting on the MPI-assigned cores before they all migrated to another set of (default) cores. I commented out the affinity lines, recompiled, and all was right again. D'oh!

I should have remembered my Sherlock Holmes: "...when you have eliminated the impossible, whatever remains, however improbable, must be the truth" (Sign of Four, Ch. 6) or "There is nothing more deceptive than an obvious fact" (The Boscombe Valley Mystery).

 

stupid    stupid    stupid     stupid    stupid 

McCalpinJohn
Honored Contributor III

Naughty computers....

I seem to remember falling victim to something similar when I was trying to use LD_PRELOAD to set up the affinity as early as possible in the process. After my code ran, the runtime library looked in the environment and replaced my settings with its own...
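(For anyone curious what that approach looks like, a shim of roughly this shape can be built as a shared object and injected with LD_PRELOAD so its constructor pins the process before main() runs. This is a hypothetical sketch, including the PIN_CORE variable name; as the anecdote above shows, a runtime that re-pins later will still win.)

/* Hypothetical LD_PRELOAD shim: a constructor that pins the process before
 * main() runs.
 * Build: gcc -shared -fPIC -o pinshim.so pinshim.c
 * Use:   LD_PRELOAD=./pinshim.so PIN_CORE=3 ./a.out
 * (PIN_CORE is an invented variable name for this sketch.) */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

__attribute__((constructor))
static void pin_early(void)
{
        const char *env = getenv("PIN_CORE");
        if (!env)
                return;
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(atoi(env), &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
}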

Just goes to show that you really should not run software. Especially not other people's software. Entering the raw binary machine code through front panel toggle switches provides much better control.
