Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Selecting custom victim in job scheduling on NUMA systems


I have a NUMA system. There is a thread for each core in the system. Threads that process similar data are assigned to the same node so they can reuse the data in that node's large L3 cache. I want threads assigned to the same node to steal each other's jobs. If all jobs on a node have finished, those threads should steal jobs assigned to threads on other nodes. How can I implement this with OpenMP?

1 Reply
Honored Contributor III

I do not know of a way to do what you want using OpenMP. You could use pthreads (or other native threads), set affinities, discover the NUMA and cache topology, and then roll your own thread scheduler.

An alternative, assuming you are using C++, is to consider QuickThread, which is now GPL-licensed; search for QuickThread online. I have had reports of slow download times from my website (hosted on

QuickThread provides the programmer with the means of specifying cache locality as well as NUMA locality. The internal scheduler uses a two-dimensional queue system. Enqueuing can specify collections of permitted threads as well as preferred threads. Dequeuing is prioritized:

L0 thread specific queue
L1 HT siblings
L2 Threads that share L2
L3 Threads that share L3
M0 Threads that share NUMA node
M1 Threads within one NUMA hop
M2 Threads within two NUMA hops
M3 Threads within three NUMA hops (i.e., all threads)

Programming with QuickThread is quite different from OpenMP. If you already have a lot of OpenMP code written, I suggest against converting it. Expect some effort in setting up the build environment (paths, options, etc.). Once you have a simple program built and running, the learning curve is easier to manage.

An example of a large matrix operation:


  OneEach_M0$, // across NUMA nodes
  [&](intptr_t iBegin, intptr_t iEnd) { // Lambda function using default references outside scope
    for(intptr_t iRow = iBegin; iRow < iEnd; ++iRow)
    { // slices of rows go to each NUMA node
      // nested, this section entered by 1 thread of each NUMA node
      // with different iBegin:iEnd slice of rows
        M0$,  // only threads within NUMA node of caller
        [&](intptr_t iBegin, intptr_t iEnd) { // Lambda function using default references outside scope
          for(intptr_t iCol=iBegin; iCol < iEnd; ++iCol)
            Array[iRow][iCol] = some work
          } // end for(intptr_t iCol
        }, // end inner Lambda function
        0, nCols // args for range for parallel_for
      ); // end inner parallel for
    } // end for(intptr_t iRow
  }, // end Lambda function of outer parallel_for
  0, nRows // args for range of outer parallel_for
); // end outer parallel_for

Jim Dempsey
