Thread complexion (Multi-threading)

Masood_Ali_M_ · 09-25-2014

Hello everyone,

The other day I was trying to create a thread that could capture the working of an already existing (running) thread and copy it. I would also like to set thread priorities so that threads can capture the work of other threads at the same priority level, and to dynamically increase the thread capacity to handle similar kinds of work.

I would appreciate it if anybody could help with this.

Thanks.

-Ali

SergeyKostrov · 09-26-2014

>>...Setting priority of threads... It seems to me that you want some kind of thread-cloning API. Is that correct? Also, please note that changing thread priority is not recommended for OpenMP-based threaded processing.

Bernard · 01-03-2015

>>>and copy its working.>>>

Are you referring to copying thread context and its stack?

Bernard · 01-03-2015

@Masood

Are you using Windows?

jimdempseyatthecove · 01-04-2015

Ali,

It might be helpful if you described the nature of the problem that you think your technique will help to solve.

Modern systems are hierarchical. Hardware threads reside in cores (1, 2, 4, or more per core), and a processor contains multiple cores. Larger systems have multiple processors per motherboard, still larger systems connect multiple motherboards into one large SMP system, and the largest interconnect multiple such SMP systems as a cluster. As you migrate from:

threads within the same core
to cores within the same CPU
to CPUs on the same motherboard
to SMP interconnected motherboards
to cluster

the latency to access data and to migrate data increases.

Hardware threads within the same core share the core's caches (typically L1 and L2).
Cores within the same CPU (socket) typically share the same L3 cache (this may also be the Last Level Cache).
CPUs on the same motherboard, depending on design, may share a Last Level Cache (LLC), or may be able, at longer latency, to access the L3 of other CPUs on the same motherboard without going through RAM.

Therefore, it is not usually beneficial to context switch the working state of a software thread amongst different hardware threads. In some situations it is beneficial to do so. Each computational problem is different.
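To make the hierarchy above concrete, here is a minimal sketch that classifies how "close" two logical processors are. The HwLocation struct, the Proximity enum, and the proximity() function are hypothetical names invented for this illustration; a real program would have to fill in the location fields from whatever topology information the operating system provides.

```cpp
#include <cassert>

// Hypothetical description of where one logical processor sits in the machine.
struct HwLocation {
    int node;    // SMP node (motherboard)
    int socket;  // CPU (socket) within the node
    int core;    // core within the socket
    int thread;  // hardware thread within the core
};

// Closest level two logical processors have in common,
// ordered from cheapest to most expensive data access/migration.
enum class Proximity { SameThread, SameCore, SameSocket, SameNode, Remote };

Proximity proximity(const HwLocation& a, const HwLocation& b) {
    // Indices are assumed to be non-negative in this sketch.
    assert(a.node >= 0 && b.node >= 0);
    if (a.node != b.node)     return Proximity::Remote;      // different motherboards
    if (a.socket != b.socket) return Proximity::SameNode;    // may reach another CPU's L3
    if (a.core != b.core)     return Proximity::SameSocket;  // typically share L3
    if (a.thread != b.thread) return Proximity::SameCore;    // typically share L1 and L2
    return Proximity::SameThread;
}
```

A scheduler-like component could use such a classification to prefer moving work to a thread at SameCore or SameSocket proximity before considering anything farther away.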

Please note that if your multi-threaded program is written to NOT use affinity pinning of its software threads, most operating systems perform hardware thread migration for you automatically. Different operating systems, when undirected, behave differently. Most operating systems permit an application to specify an arbitrary set of logical processors (within the SMP system) on which it is permitted to run.

For example, at program start, the single-threaded portion of your initialization code could query the system for the least used socket (how you do this depends on the operating system), then instruct the operating system to migrate the current thread to, and restrict it to, the logical processors within that socket. Next, the initialization code would instruct the operating system that all subsequently created threads for the process (application) are also to be restricted to the same logical processors of that socket. The program can then create whatever number of threads it deems necessary within the same socket.
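As a rough illustration of that start-up sequence, the sketch below assumes Linux with glibc and uses sched_getaffinity and sched_setaffinity. The helper names (allowedCpus, restrictTo, cpusSeenByNewThread) are made up for this example, and choosing which logical processors belong to the "least used socket" is deliberately left out, since that part is operating-system specific. On Linux a newly created thread starts with a copy of its creator's affinity mask, so restricting the initial thread before creating workers is enough for them to start on the same set.

```cpp
#include <sched.h>   // sched_getaffinity, sched_setaffinity, cpu_set_t (Linux/glibc)
#include <cassert>
#include <thread>
#include <vector>

// Logical processors the calling thread is currently allowed to run on.
std::vector<int> allowedCpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return cpus;  // empty on failure
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) cpus.push_back(cpu);
    return cpus;
}

// Restrict the calling thread to the given logical processors
// (for example, the processors of one socket). Returns true on success.
bool restrictTo(const std::vector<int>& cpus) {
    assert(!cpus.empty());
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus) CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Create a thread and report the affinity mask it starts with,
// which is a copy of the creating thread's mask.
std::vector<int> cpusSeenByNewThread() {
    std::vector<int> seen;
    std::thread worker([&seen] { seen = allowedCpus(); });
    worker.join();
    return seen;
}
```

In use, the initialization code would call restrictTo() once with the chosen socket's logical processors and only then create its worker threads. Threads that already exist are not affected by this call, which is why it belongs in the single-threaded part of start-up.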

Your actual requirements may expand upon the above principle, resulting in different groupings of threads.

For what it is worth, I created such a threading toolkit (QuickThread), which found few interested parties. This toolkit permitted the parallel_xxx constructs to specify the proximity of the thread team to use. Example:

parallel_for(Waiting_L3$, 0, nObjects, ObjectArray) {

The above would select the CPU (socket) that had the fewest threads running (for its process), and assign a thread team consisting of all the threads of the process that are tied to that CPU.

With relatively simple classification you could construct advantageous thread teaming:

// Distribute tile by tile to each L2 cache
parallel_for_each(OneEach_L2$, 0, nTiles, DoTile);
… // continue processing if desired
WaitTillDone();
…

// Function to distribute work for tile,
// one row at a time
// restricted to threads sharing current thread’s L2 cache
void DoTile(intptr_t iTile)
{
  intptr_t iRow = 0; // signature place holder
  parallel_for_each(L2$, 0, nRowsPerTile, DoRow, iRow, iTile);
}
// Same function to process one row of a tile for both techniques
void DoRow(intptr_t iRow, intptr_t iTile)
{
  // process one row of tile
  …
}

or using C++ Lambda format:

double**  array2D = Array2DAllocate( nRows, nCols);
…
//  slice rows by NUMA node
parallel_for(
  OneEach_M0$,
  0, nRows,
  [&](intptr_t BeginRow, intptr_t EndRow)
  {
    // now slice our slice by threads within our NUMA node
    parallel_for(
      M0$,
      BeginRow, EndRow,
      [&](intptr_t iBegin, intptr_t iEnd)
      {
        // now process our rows (slice of the slice)
        for(intptr_t iRow = iBegin; iRow < iEnd; ++iRow)
        {
          doWork(array2D[iRow]);
        }
      }); // end of inner parallel_for
  }); // end of outer parallel_for
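For readers without QuickThread, the same two-level "slice of the slice" idea can be sketched in standard C++. This is not the QuickThread implementation: sliceRange, forEachSlice, and twoLevelFor are hypothetical helpers written for this illustration, the outer slices merely stand in for NUMA nodes, and no affinity is set, so the operating system is free to place the threads anywhere.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

using Range = std::pair<std::intptr_t, std::intptr_t>;  // half-open [begin, end)
using Body  = std::function<void(std::intptr_t, std::intptr_t)>;

// Split [begin, end) into at most `parts` contiguous, non-empty slices.
std::vector<Range> sliceRange(std::intptr_t begin, std::intptr_t end, int parts) {
    assert(parts > 0 && begin <= end);
    std::vector<Range> slices;
    const std::intptr_t n = end - begin;
    const std::intptr_t base = n / parts, extra = n % parts;
    std::intptr_t lo = begin;
    for (int i = 0; i < parts && lo < end; ++i) {
        const std::intptr_t hi = lo + base + (i < extra ? 1 : 0);
        slices.emplace_back(lo, hi);
        lo = hi;
    }
    return slices;
}

// Run body(lo, hi) on one thread per slice and wait for all of them.
void forEachSlice(const std::vector<Range>& slices, const Body& body) {
    std::vector<std::thread> team;
    for (const Range& s : slices)
        team.emplace_back([&body, s] { body(s.first, s.second); });
    for (std::thread& t : team) t.join();
}

// Outer slices play the role of NUMA nodes, inner slices the role of
// the threads within one node.
void twoLevelFor(std::intptr_t begin, std::intptr_t end,
                 int nodes, int threadsPerNode, const Body& body) {
    forEachSlice(sliceRange(begin, end, nodes),
                 [&](std::intptr_t b, std::intptr_t e) {
                     forEachSlice(sliceRange(b, e, threadsPerNode), body);
                 });
}
```

A call such as twoLevelFor(0, nRows, 2, 4, body) would hand each of up to eight threads a contiguous block of rows. To get the cache and NUMA benefits described above, each outer thread would additionally need to be restricted to the logical processors of one node before it creates its inner team.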

Jim Dempsey