Solved: Core pinning with pthread

Jun_Hyun_S_ · 11-21-2014

Hi all, I did some follow-up on my last topic https://software.intel.com/en-us/forums/topic/533886

Long story short, I'm trying to emulate SIMD behavior in MIC environment,

and I actually decided to implement the offloaded segment in native code without OpenMP pragmas,

which means I'm implementing thread pinning to individual cores with pthread_attr_setaffinity_np.

The logic is all there from start to finish. Here's what I did:

1. I got the number of logical cores using sysconf (_SC_NPROCESSORS_ONLN), which returns 240 (a 60-core 5110P).
2. Based on the parameters [range of cores to utilize] and [number of active threads per core], I pin the necessary number of threads (computed from another user parameter) to the respective cores using pthread_attr_setaffinity_np().
(I even did the rotate-by-one trick to make sure the pinning starts from the first thread of the first core, so when I say logical core 0, it does mean thread 0 of core 0)
3. Start the main loop: each spawned thread runs an infinite loop and synchronizes with the parent thread once per iteration using pthread_barrier_wait()

The current problem lies in step 2. I've created a pool of 'pthread_attr_t's for each pinning to logical core (240 attr variables), and when I try to initialize them with pthread_attr_setaffinity_np(), it returns error code 22: Invalid argument.

It is important to mention that it does not fail from the get go. The pthread_attr_setaffinity_np() call runs smoothly from logical core 0 to logical core 106, but it fails on initialization for logical core 107.

The man page for pthread_attr_setaffinity_np() suggests that EINVAL is returned when the CPU set is out of range and the CONFIG_NR_CPUS is in charge of the range. Another search told me that the variable can be checked in file /boot/config-`uname -r`
AFAIK, however, there was no way I could check that for my 5110P.

Any information as to how I should continue would be greatly appreciated.
Thanks for your attention.

Jun

jimdempseyatthecove · 11-21-2014

It is hard to tell from a distance as to what is going on.

The first thing to verify is that the affinity bitmask your code uses has a sufficient number of bits. Your symptoms may occur if the bitmask is too short (e.g. 64 bits). There is a pthread system call to get the number of bits required for your logical processor bitmask.
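
A rough way to run that check on Linux with glibc is sketched below. It does not use the pthread call mentioned above; it swaps in the glibc CPU_SETSIZE and CPU_ALLOC_SIZE macros from <sched.h>, and the helper names are made up for illustration.

```cpp
#include <sched.h>    // CPU_SETSIZE, CPU_ALLOC_SIZE (glibc extensions)
#include <unistd.h>   // sysconf
#include <cstddef>

// Number of logical processors configured on this system
// (falls back to 1 if sysconf reports an error).
inline long configured_cpus() {
    long n = sysconf(_SC_NPROCESSORS_CONF);
    return n > 0 ? n : 1;
}

// Number of bits held by a dynamically sized mask allocated for `ncpus`.
inline std::size_t mask_bits_for(long ncpus) {
    return CPU_ALLOC_SIZE(ncpus) * 8;
}

// True when a fixed-size cpu_set_t cannot represent `ncpus` processors.
inline bool fixed_set_too_small(long ncpus) {
    return ncpus > CPU_SETSIZE;
}
```

If fixed_set_too_small(configured_cpus()) is true, or if the code uses its own narrower mask type, a dynamically allocated set is the safer choice.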

>>rotate-by-one trick

The O/S can map the logical processors any way it chooses. The fact that you observe it one way at one time does not mean it will be that way the next time. Therefore, do not blindly assume it will be as observed once. A better technique is to use CPUID (and CPUIDEX) to obtain the APIC and/or APIC2 numbers of the thread that issues the instruction. Then construct a thread association table based on each pinned thread. IOW (this may be overkill), the location process is:

a) make pthread call to obtain the affinity bit mask size and make adjustments accordingly
b) obtain the process affinity bit mask *** this may be a subset of system logical processors ***
c) start as many threads as you see logical processors set in the process affinity bit mask
d) each thread assumes a thread ID (sequenced from 0:nThreads-1) in order in which it was started
e) Using its thread ID, each thread searches the process affinity bitmask for the thread ID'th set bit (not necessarily the thread ID'th bit)
f) The thread affinity pins itself to bit found in step e) and assumes ownership of this logical processor
g) After pinning, use CPUID and/or CPUIDEX to obtain the APIC and/or APIC2 numbers and cache associations and save in threadID'th context)
h) Perform barrier until all threads to be started (number of bits set in process bitmask) reach this point
i) Using information from g) construct associations NUMA Node, Socket, LLC, core, L2, L1
j) Perform barrier until all threads to be started (number of bits set in process bitmask) reach this point
k) Using the association tables now constructed, set up teams. Note: do not assume all association groupings have the same number of members. For example, on MIC, decide what action to take should one core see 3 HTs and the remainder see 4 HTs. (You may never see this, but the possibility exists; an example might be an O/S real-time process sequestering one system logical processor from all other processes.) This step may reduce the number of participants in each grouping.

After the above, you will now have a set of (to your process) logical NUMA node numbers/teams, logical Socket numbers/teams, logical LLC numbers/teams, logical core numbers/teams, logical L2 numbers/teams, logical L1 numbers/teams. As to if your application pays attention to anything other than the logical core numbers/teams and logical L2 numbers/teams is up to you. Gathering all the associations now may help you in the future.
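
Step e) is the one that is easy to get wrong, so here is a minimal sketch of it, assuming Linux/glibc and a mask filled in by sched_getaffinity(). The function name nth_allowed_cpu is hypothetical.

```cpp
#include <sched.h>
#include <cassert>
#include <cstddef>

// Return the index of the n-th set CPU (0-based) in `set`,
// or -1 if fewer than n+1 CPUs are set.
// `setsize` is the mask size in bytes, `ncpus` the number of slots to scan.
inline int nth_allowed_cpu(const cpu_set_t* set, std::size_t setsize,
                           int ncpus, int n) {
    assert(n >= 0);
    int seen = 0;
    for (int cpu = 0; cpu < ncpus; ++cpu) {
        if (CPU_ISSET_S(cpu, setsize, set)) {
            if (seen == n) return cpu;  // thread n owns this processor
            ++seen;
        }
    }
    return -1;  // the process mask has fewer than n+1 processors
}
```

Thread number n would then pin itself to the returned index, which is not necessarily processor n when the process mask is a subset of the system.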

***

It is not unusual today for systems to employ virtualization technology where the entire system set of physical CPU(s), together with all its cores/threads, is virtualized into subsets that appear as a virtual system complete with virtual CPU(s), together with all its cores/threads. If running on a virtualized system, the above process should be able to ascertain the associations properly. However, the way the microkernel virtualizes the system may affect the number of threads available to each team within an association. Also, although this may not be practiced today, the microkernel could potentially reorganize the virtualizations on the fly. When this is a concern, each thread should periodically check the APIC/APIC2 numbers to see if they changed, and if so, set an application flag indicating the system has been reorganized. This flag can then be consulted to see if your application should reorganize its teams.

Jim Dempsey

jimdempseyatthecove · 11-21-2014

FWIW a little historical background on virtualization.

In the 1980's I helped design a processor, and I wrote the operating system for it. A system could be constructed with up to 8 of these processors, each with local RAM, connected to a common I/O bus, and independently connected to a shared RAM. IOW a cluster where each CPU node had both local RAM and shared RAM and all CPU nodes had a common I/O bus. One of the nodes would assume ownership of the I/O bus.

The operating system was distributed across all CPU nodes with the knowledge that one of them had the responsibility of performing the I/O for all the CPU nodes. Each CPU node's operating system would construct a Virtual CPU for each process running in the physical CPU (sound familiar?). A process (running in a Virtual CPU) once started had a "sticky" affinity to the physical CPU in which it was started. However, the physical CPU that performed the I/O would also periodically check the processing loads on all the physical CPUs. If the loads seemed imbalanced, the controlling (I/O CPU) O/S would instantiate a process copy from one CPU to another. The running process being moved would not be aware of this action (other than by observing a time latency between sections of code).

On today's systems with Virtualization, this would be an example of the microkernel restructuring the virtual system.

Jim Dempsey

McCalpinJohn · 11-21-2014

I don't think that the Xeon Phi products ever change their physical to logical core mapping, but it would be hard to prove "never"....

I have not used the pthread_setaffinity_np() interface, but I have used the sched_setaffinity() interface with OpenMP programs and it works fine once you learn "the trick". In OpenMP programs all of the OpenMP threads get the same value from "getpid()", so this can't be used for setting affinity. Fortunately you can use a zero as the first argument to sched_getaffinity() or sched_setaffinity() and it will work for the current thread -- even in OpenMP programs.
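
For reference, "the trick" might look roughly like the sketch below on Linux/glibc. pin_self is a hypothetical helper name; the zero passed as the first argument to sched_setaffinity() is what makes the call apply to the calling thread.

```cpp
#include <sched.h>
#include <cerrno>
#include <cstddef>

// Pin the calling thread to logical processor `cpu`, using a mask sized
// for `ncpus` processors. Returns 0 on success or an errno value.
inline int pin_self(int cpu, int ncpus) {
    if (cpu < 0 || cpu >= ncpus) return EINVAL;
    cpu_set_t* set = CPU_ALLOC(ncpus);
    if (!set) return ENOMEM;
    const std::size_t size = CPU_ALLOC_SIZE(ncpus);
    CPU_ZERO_S(size, set);
    CPU_SET_S(cpu, size, set);
    // pid 0 means "the calling thread", so no thread id lookup is needed.
    const int rc = (sched_setaffinity(0, size, set) == 0) ? 0 : errno;
    CPU_FREE(set);
    return rc;
}
```

Each worker would call pin_self() on itself as its first action, for example from inside an OpenMP parallel region or at the top of a pthread start routine.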

James_C_Intel2 · 11-24-2014

>>In OpenMP programs all of the OpenMP threads get the same value from "getpid()", so this can't be used for setting affinity. Fortunately you can use a zero as the first argument to sched_getaffinity() or sched_setaffinity() and it will work for the current thread -- even in OpenMP programs.

This is all standard Linux OS behaviour, which has nothing specifically to do with OpenMP. OpenMP threads are pthreads, which are (in turn) normal Linux threads created by the "clone" system call. Since they are all threads inside the same process, they naturally return the same value from getpid(). You can find out their thread id by using the gettid() system call; however, libc has no support for this, so you have to find the system call number and then use "syscall" (see http://man7.org/linux/man-pages/man2/gettid.2.html for instance).

As you point out, however, there's probably no need to do this, since it's easier just to have the thread set or get its own affinity, for which the magic value of zero as the thread id works just fine.
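
If the kernel thread id is wanted anyway, the raw system call can be wrapped as in this small sketch (Linux only; kernel_tid is a hypothetical name):

```cpp
#include <sys/syscall.h>  // SYS_gettid
#include <sys/types.h>    // pid_t
#include <unistd.h>       // syscall, getpid

// Kernel thread id of the calling thread, obtained through the raw system call.
// In the initial thread of a process this equals getpid(); in every other
// thread it differs, even though getpid() returns the same value everywhere.
inline pid_t kernel_tid() {
    return static_cast<pid_t>(syscall(SYS_gettid));
}
```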

McCalpinJohn · 11-24-2014

Thanks, James! I have done a pretty good job of avoiding having to learn about pthreads, but it is good to know that "the trick" should work there too. (I tried to learn about all of this back when Linux was trying to transition from the earlier thread library to NPTL, with documentation that varied between confused and wrong.)

The last time I tried anything other than "the trick", I found that the value returned by gettid() could not be used for the sched_*affinity() calls, and a colleague of mine had a similar experience recently. On the other hand, the man page for sched_setaffinity() on my RHEL 6.5 (kernel 2.6.32-431) says that it is supposed to work.

The man page for sched_setaffinity on my RHEL 6.5 machine also says that POSIX threads should use pthread_setaffinity_np() rather than sched_setaffinity(), but it does not explain why. The pthread_setaffinity_np() call uses the output of pthread_self() to specify the thread, and the man page for pthread_self() says that what it returns is not the same as what gettid() returns.

Yup -- this is still confusing.

Jun_Hyun_S_ · 12-03-2014

Hi all, Thank you all for providing valuable insights, especially Jim, for posting such comprehensive guidelines.

The problem before was that my cpu_set_t wasn't large enough to hold all 240 logical processors,
so I simply used dynamic allocation with pointer value CPU_ALLOC(N) instead of local declaration
and replaced all set operations with [set_operation]_S, which resolved the issue.
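
A minimal sketch of that fix, assuming Linux/glibc; init_pinned_attr is a hypothetical helper name, and the CPU_ALLOC / CPU_*_S macros are the dynamically sized counterparts of the plain cpu_set_t operations:

```cpp
#include <pthread.h>
#include <sched.h>
#include <cerrno>
#include <cstddef>

// Initialise `attr` so that a thread created with it runs only on `cpu`.
// The mask is allocated for `ncpus` logical processors rather than relying
// on a fixed-size set. Returns 0 on success or an errno-style code.
inline int init_pinned_attr(pthread_attr_t* attr, int cpu, int ncpus) {
    if (cpu < 0 || cpu >= ncpus) return EINVAL;
    cpu_set_t* set = CPU_ALLOC(ncpus);
    if (!set) return ENOMEM;
    const std::size_t size = CPU_ALLOC_SIZE(ncpus);
    CPU_ZERO_S(size, set);
    CPU_SET_S(cpu, size, set);
    int rc = pthread_attr_init(attr);
    if (rc == 0) {
        rc = pthread_attr_setaffinity_np(attr, size, set);
        if (rc != 0) pthread_attr_destroy(attr);  // leave nothing half-initialised
    }
    CPU_FREE(set);  // the attribute object keeps its own copy of the mask
    return rc;
}
```

The resulting attribute is passed to pthread_create() and released with pthread_attr_destroy() once the thread has been created.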

I have spawned N=240 threads pinned in the order of logical processor indices (0~239),
and called cpuid with eax=1, where ((unsigned char*)&ebx)[3] gets you the APIC id.
It turns out the ith element of cpu_set_t corresponds exactly to 'processor: i' in /proc/cpuinfo.
It also confirmed that the logical processor ID is off by one and that jth thread of the ith core is
the [ (i * 4 + j + 1) % N ]th logical processor.
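
The two pieces of arithmetic above, written out as a sketch. The function names are made up for illustration, and the mapping is only what was observed on this one 5110P, not a documented guarantee.

```cpp
#include <cassert>
#include <cstdint>

// After CPUID with EAX=1, the initial APIC ID sits in bits 31..24 of EBX,
// which is byte [3] of the register on a little-endian machine.
inline unsigned apic_id_from_ebx(std::uint32_t ebx) {
    return ebx >> 24;
}

// Logical processor index (as listed in /proc/cpuinfo) for hardware thread
// `ht` of core `core`, under the observed off-by-one layout.
inline int logical_cpu(int core, int ht, int threads_per_core, int n_logical) {
    assert(ht >= 0 && ht < threads_per_core);
    assert(core >= 0 && core * threads_per_core < n_logical);
    return (core * threads_per_core + ht + 1) % n_logical;
}
```

With 4 threads per core and N = 240, thread 0 of core 0 lands on logical processor 1, and the last thread of the last core wraps around to logical processor 0.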

With that resolved, I have run into a new problem.

In order to synchronize the main thread with K worker threads performing partial lookups,
I have used pthread_barrier_wait with K+1 threads. However, pthread_barrier_wait basically employs
sleep operations and reschedules the waiting threads, which is costly performance-wise.
So in order to implement busy waits, I have introduced two atomic variables whose type
I have discovered in the <composer_xe_2015_directory>/compiler/include/mic/atomicint.h

The code is as follows:

1) The worker thread

    goto loop_begin;
    while (true) {
        // Vector processing logic
        for (int index = 0; index < num_packets; index += 16) {
            __m512i     v_daddr =           _mm512_load_epi32 (inputbuf + index);
            __m512i     v_daddr_shift8 =    _mm512_srli_epi32 (v_daddr, 8); 
            __m512i     v_temp_dest =       _mm512_i32extgather_epi32 (v_daddr_shift8, TBL24, 
                                                                    _MM_UPCONV_EPI32_UINT16, sizeof(uint16_t), _MM_HINT_NT);
            __m512i     v_ignored_ip =      _mm512_set_epi32 (REPEAT_16(IGNORED_IP));
            __m512i     v_zero =            _mm512_setzero_epi32 (); 
            __mmask16   m_is_not_ignored =  _mm512_cmp_epu32_mask (v_daddr, v_ignored_ip, _MM_CMPINT_NE);
            __m512i     v_0x8000 =          _mm512_set_epi32 (REPEAT_16(0x8000));
            __m512i     v_0x7fff =          _mm512_set_epi32 (REPEAT_16(0x7fff));
            __m512i     v_0xff =            _mm512_set_epi32 (REPEAT_16(0xff));
            __mmask16   m_top_bit_set =     _mm512_cmp_epu32_mask (_mm512_and_epi32 (v_temp_dest, v_0x8000), 
                                                                    v_zero, _MM_CMPINT_NE);
            __mmask16   m_both_cond_met =   _mm512_kand (m_is_not_ignored, m_top_bit_set);
            __m512i     v_index2 =          _mm512_add_epi32 (_mm512_slli_epi32 (_mm512_and_epi32 (v_temp_dest, v_0x7fff), 8), 
                                                            _mm512_and_epi32 (v_daddr, v_0xff));
            __m512i     v_result =          _mm512_mask_i32extgather_epi32 (v_temp_dest, m_both_cond_met, v_index2, 
                                                            TBLlong, _MM_UPCONV_EPI32_UINT16, sizeof(uint16_t), _MM_HINT_NT);
                                            _mm512_mask_extstore_epi32(resultbuf, m_is_not_ignored, v_result, 
                                                            _MM_DOWNCONV_EPI32_UINT16, _MM_HINT_NT);
        }   
        // ################## BEGIN PROBLEM AREA ###################
        while ( nthreads_running.load() != 0 );  
            // At this point all threads have at least started processing their vectors
            // This loop prevents threads that finished early from incrementing the wait counter 
loop_begin:
        ++nthreads_waiting;
        while ( nthreads_waiting.load() != 0 );
            // prevent threads from starting without main thread's permission
        ++nthreads_running;
        // ################## END PROBLEM AREA #####################
    }

2) The main thread

        // Some profile report
ipv4_mainloop_begin:
        ts_preread = elapsedTime();
        //FIXME: SCIF READ OPERATION
        dt_read = elapsedTime() - ts_preread;

        ts_prekernel = elapsedTime(); // TS: begin kernel exec
        // Wait for all threads to start running


        // ############################ BEGIN PROBLEM AREA ##########################
        nthreads_waiting -= nthreads_used;
            // All threads are supposed to wait for this instruction to be executed
            // after finishing their previous iteration.
            // only after this is executed will the threads begin their next iteration
            // Pf) Before this atomic instruction is executed,
            // nthreads_waiting will represent 
            // [the number of threads awaiting go sign from the main thread]
        while ( !nthreads_running.compare_exchange_strong(nthreads_used, 0,
                    memory_order_seq_cst, memory_order_seq_cst) );
            // Upon main thread's exit of this loop, 
            // all threads have at least started to process their share of payload. 
            // (nthreads_running == nthreads_used => All worker threads are running)
            // while this loop is running, some threads could already be finished 
            // with their iteration and waiting for the beginning of next iteration 
            // after nthreads_waiting++ has completed
            // This is for the main thread (this thread) to wait until
            // all threads begin their iterations.

        while ( nthreads_waiting.load() != nthreads_used );
            // Wait until all threads have finished processing their share
        // ############################ END PROBLEM AREA ############################



        dt_kernel = elapsedTime() - ts_prekernel; // TS: kernel exec finished
        ts_prewrite = elapsedTime();
        //FIXME: SCIF WRITE OPERATION
        dt_write = elapsedTime() - ts_prewrite; //
        acc_read    += dt_read;
        acc_exec    += dt_kernel;
        acc_write   += dt_write;
        bytes_written += inputbuf_sz;
        bytes_read  += resultbuf_sz;
        iter++;

Both nthreads_waiting and nthreads_running are declared as atomic_int variables and initialized to zero before either loop is entered.
The program runs for anywhere from several to several hundred iterations, so it partially works,
but it eventually deadlocks, which suggests some sort of race condition.
(I found that at some point the two atomics hold values between 0 and K.)
I tried gdb-mic with remote debugging and pdbx enabled, and yet it doesn't detect any data races.
I have modeled the logic in Uppaal, still no deadlocks detected.

Is there something I'm missing?

As always, I welcome all insights.
Thanks for your attention

Jun
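
One thing that may be worth checking in the main-thread loop, assuming the atomic type follows C++11 std::atomic semantics and nthreads_used is a plain int: compare_exchange_strong takes its expected value by reference and overwrites it with the current value whenever the exchange fails. The sketch below (helper names are hypothetical) shows the effect and one way to spin without touching the caller's value. Whether this explains the hang described above is not established.

```cpp
#include <atomic>

// Perform one compare-exchange attempt and return what is left in `expected`.
// When the attempt fails, that is the current value of `a`, not the value
// that was passed in.
inline int expected_after_cas(std::atomic<int>& a, int expected) {
    a.compare_exchange_strong(expected, 0);
    return expected;
}

// Spin until `counter` equals `target`, then reset it to 0 atomically.
// `target` is taken by value and re-copied before every attempt, so the
// caller's variable is never modified by a failed exchange.
inline void wait_and_reset(std::atomic<int>& counter, int target) {
    int expected = target;
    while (!counter.compare_exchange_strong(expected, 0)) {
        expected = target;  // undo the overwrite done by the failed attempt
    }
}
```

If a variable such as nthreads_used is passed directly as the expected argument, a single failed attempt can leave it holding a smaller number, and any later arithmetic that uses it would then be off.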

jimdempseyatthecove · 12-03-2014

Jun,

Why do you have a master/slave organization where the master does no work?
Why do your barriers burn CPU cycles that would otherwise be available for working threads in each core?

You might want to read through this series of articles:

https://software.intel.com/en-us/search/site/language/en?query=chronicles

The following is the barrier routine from my QuickThread (TM) programming toolkit. The original code came from a library that did not target MIC (i.e. it worked on IA32, Intel64 and AMD64 CPUs). The MIC CPU does not have the PAUSE instruction (_mm_pause() intrinsic). The snippet below is edited and should work on both Host and MIC:

// in some configuration header
#if defined(__MIC__)
#define WAIT_A_BIT _mm_delay_32(10)
#else
#define WAIT_A_BIT _mm_pause();
#endif
...
// in your tasking atomic class header

struct qtBarrier
{
 volatile intptr_t c1;
 volatile intptr_t c2;
 qtBarrier() { c1 = 0; c2 = 0; }
 void here(intptr_t iThread, intptr_t nThreads)
 {
  if(iThread)
  {
   // indicate this thread reached barrier
   XCHGADD((intptr_t*)&c1, 1);
   while(c1)
    WAIT_A_BIT;
   // here after master detected all other threads at barrier
   // indicate this thread no longer observing c1
   XCHGADD((intptr_t*)&c2, 1);
   // now wait for team master thread
   while(c2)
    WAIT_A_BIT;
  }
  else
  {
   // (iThread==0)
   // wait for all other threads to reached barrier
   while((c1+1)!=nThreads)
    WAIT_A_BIT;
   // here when all other threads of team at barrier
   // release all other threads
   c1 = 0;
   // wait for all other threads to acknowledge release
   // and subsequently, no longer using qtBarrier object
   while((c2+1)!=nThreads)
    WAIT_A_BIT;
   // all threads no longer using qtBarrier object
   c2 = 0;
  }
 }
};
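
For readers without the XCHGADD macro, the same two-counter scheme can be sketched with std::atomic. This is an illustrative rewrite, not the QuickThread code: SpinBarrier and barrier_demo are made-up names, and std::this_thread::yield() stands in for WAIT_A_BIT.

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct SpinBarrier {
    std::atomic<int> c1{0};  // workers that have arrived
    std::atomic<int> c2{0};  // workers that have seen the release
    void here(int iThread, int nThreads) {
        if (iThread != 0) {
            c1.fetch_add(1);                                   // announce arrival
            while (c1.load() != 0) std::this_thread::yield();  // wait for release
            c2.fetch_add(1);                                   // acknowledge release
            while (c2.load() != 0) std::this_thread::yield();  // wait for master
        } else {
            while (c1.load() + 1 != nThreads) std::this_thread::yield();
            c1.store(0);                                       // release the workers
            while (c2.load() + 1 != nThreads) std::this_thread::yield();
            c2.store(0);                                       // barrier reusable again
        }
    }
};

// Small self-check: every thread bumps a counter once per round, then waits
// at the barrier. After round r each thread must observe at least
// r * nThreads increments.
inline bool barrier_demo(int nThreads, int rounds) {
    SpinBarrier barrier;
    std::atomic<int> counter{0};
    std::atomic<bool> ok{true};
    std::vector<std::thread> team;
    for (int t = 0; t < nThreads; ++t) {
        team.emplace_back([&, t] {
            for (int r = 1; r <= rounds; ++r) {
                counter.fetch_add(1);
                barrier.here(t, nThreads);
                if (counter.load() < r * nThreads) ok.store(false);
            }
        });
    }
    for (auto& th : team) th.join();
    return ok.load() && counter.load() == nThreads * rounds;
}
```

The second counter is what makes the object safe to reuse immediately: no worker can re-enter and increment c1 until every worker has stopped watching it.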

The following examples of qtBarrier use assume you've assigned canonical thread numbers from 0:nThreads-1. These can be arbitrary numbers or some meaningful sequence. I prefer a sequence that easily disambiguates cores and hardware threads, such that given the assigned iThread:

nCores=nThreads / nThreadsPerCore;
iCore = iThread / nThreadsPerCore;
iHT = iThread % nThreadsPerCore;

Regardless of your assignment, here is an example of a team of all threads:

qtBarrier FooBarrier;
void Foo(iThread, nThreads, args,...)
{
  doWork1(iThread, nThreads, args,...);
  FooBarrier.here(iThread, nThreads);
  doWork2(iThread, nThreads, args,...);
  FooBarrier.here(iThread, nThreads);
  ..
}

Assume you did the thread numbering to disambiguate the cores and HTs, and that you want to delegate 3 threads per core to task-1 and the 4th thread to task-2, with the two tasks running repeatedly in a loop:

qtBarrier FooBarrier1;
qtBarrier FooBarrier2;
qtBarrier FooBarrierAll;
void Foo(iThread, nThreads, args,...)
{
  assert(nThreadsPerCore == 4);
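  int nCores = nThreads / nThreadsPerCore;
  int iCore = iThread / nThreadsPerCore; // core that owns this thread
  int iHT = iThread % nThreadsPerCore;   // hardware thread within that core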
  int inner_iThread;
  int inner_nThreads;
  if(iHT)
  { // team 1
    inner_nThreads = nCores * ( nThreadsPerCore - 1);
    inner_iThread = iCore * 3 + iHT - 1;
  }
  else
  { // team 2
    inner_nThreads = nCores;
    inner_iThread = iCore;
  }

  for(int i=0; i<Count; ++i)
  {
    if(iHT)
    { // team 1
      doWork1(inner_iThread, inner_nThreads, args,...);
      FooBarrier1.here(inner_iThread, inner_nThreads);
    }
    else
    { // team 2
      doWork2(inner_iThread, inner_nThreads, args,...);
      FooBarrier2.here(inner_iThread, inner_nThreads);
    }
  } // for
  FooBarrierAll.here(iThread, nThreads);
}

The above does not synchronize teams, I will leave that as an exercise for you.
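
The index arithmetic used for the two teams can be isolated into a small function, which makes it easy to check by hand. This is a sketch only: TeamSlot and team_slot are hypothetical names, and it assumes nThreads is a multiple of nThreadsPerCore and that thread numbers follow the core/HT sequence described above.

```cpp
#include <cassert>

struct TeamSlot {
    int team;   // 1 = HTs 1..nThreadsPerCore-1 of each core, 2 = HT 0 of each core
    int index;  // this thread's number within its team
    int size;   // number of threads in the team
};

// Map a canonical thread number onto its team, its index inside that team,
// and the team size.
inline TeamSlot team_slot(int iThread, int nThreads, int nThreadsPerCore) {
    assert(nThreadsPerCore > 1 && nThreads % nThreadsPerCore == 0);
    assert(iThread >= 0 && iThread < nThreads);
    const int nCores = nThreads / nThreadsPerCore;
    const int iCore  = iThread / nThreadsPerCore;
    const int iHT    = iThread % nThreadsPerCore;
    if (iHT != 0) {
        return {1, iCore * (nThreadsPerCore - 1) + iHT - 1,
                nCores * (nThreadsPerCore - 1)};
    }
    return {2, iCore, nCores};
}
```

With 240 threads and 4 per core, team 1 gets indices 0..179 and team 2 gets indices 0..59, so each team has exactly one member with index 0 to act as master in its own barrier.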

Jim Dempsey