Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Problem in a fixed-CPU-affinity multi-threaded deployment: "phantom" iowait on non-I/O CPU cores

bayolau
Beginner
Let me thank you in advance for your time in reading this and for any ideas you may have.

My code uses OpenMP to create 4 threads on a Xeon 5550. The machine has 1 physical socket, 4 physical cores, and 8 logical cores via Hyper-Threading. The shared L3 cache is 8 MB, and each physical core has a 256 kB L2 cache and a 32 kB L1 data cache.

My code is encapsulated in a C++ class, and the OpenMP branching is
done in a virtual member function. I use switch(omp_get_thread_num())
immediately after the "#pragma omp parallel" fork to call 4 distinct functions:

thread0: computation in x86-64 assembly
thread1: computation in x86-64 assembly
thread2: use read() to fetch data from the hard drive
thread3: use read() to fetch data from the hard drive

Basically, thread2/3 fetch data for thread0/1 to use. Synchronization is performed after data is fetched. The buffer used for the "exchange" is <1 MB, so everything should stay in the shared L3 cache. I have set the thread affinity such that each thread takes both hyper-threaded logical processors of one physical core:
----------------------------------------------------------
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,4}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {3,7}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,6}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {1,5}
----------------------------------------------------------
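
In outline, the fork described above looks roughly like this (a stripped-down sketch, not my real code; the function names are placeholders for my actual member routines):

// Sketch of the fork structure only (placeholder function names).
#include <omp.h>

void myclass::run(){
  omp_set_num_threads(4);
  #pragma omp parallel default(shared)
  {
    switch(omp_get_thread_num()){
      case 0: compute_half_a(); break;  // x86-64 assembly kernel
      case 1: compute_half_b(); break;  // x86-64 assembly kernel
      case 2: fetch_half_a();   break;  // read() from the hard drive
      case 3: fetch_half_b();   break;  // read() from the hard drive
    }
    // the threads meet at barriers after each fetched block
  }
}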

I have tested it on a "small scenario" (5 ms / 30 syncs per fork) and it worked perfectly, but when I deployed it on a large scenario (4000 s / ~300000 syncs per fork), performance broke down. So I did the following:

First, I tested the fetching alone by having thread0/1 call an "empty" function (a loop with only synchronization) while thread2/3 did the fetching as planned. A snapshot of mpstat gives:
------------------------------------------------------------------------------------------
08:51:53 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
08:51:54 PM all 30.88 0.00 7.75 11.12 0.12 0.25 0.00 49.88 2784.00
08:51:54 PM 0 96.00 0.00 3.00 0.00 0.00 0.00 0.00 1.00 1001.00
08:51:54 PM 1 0.00 0.00 6.00 0.00 0.00 0.00 0.00 94.00 19.00
08:51:54 PM 2 46.00 0.00 25.00 27.00 0.00 1.00 0.00 1.00 761.00
08:51:54 PM 3 21.00 0.00 13.00 63.00 1.00 2.00 0.00 0.00 997.00
08:51:54 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
08:51:54 PM 5 82.18 0.00 14.85 0.00 0.00 0.00 0.00 2.97 0.00
08:51:54 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 2.00
08:51:54 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 4.00
-------------------------------------------------------------------------------------------
This seems OK, since all the iowait is logged against thread2/3 (CPUs 2 and 3). Now,
if I put the real computation into thread0/1, a snapshot of mpstat gives:
------------------------------------------------------------------------------------------
09:36:16 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
09:36:17 PM all 34.38 0.00 8.50 7.00 0.12 0.62 0.00 49.38 2458.00
09:36:17 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1001.00
09:36:17 PM 1 86.87 0.00 6.06 6.06 0.00 0.00 0.00 1.01 73.00
09:36:17 PM 2 53.54 0.00 27.27 17.17 0.00 2.02 0.00 0.00 665.00
09:36:17 PM 3 50.00 0.00 21.43 25.51 0.00 3.06 0.00 0.00 695.00
09:36:17 PM 4 86.00 0.00 6.00 7.00 0.00 0.00 0.00 1.00 0.00
09:36:17 PM 5 0.00 0.00 1.00 0.00 0.00 0.00 0.00 99.00 0.00
09:36:17 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 11.00
09:36:17 PM 7 0.00 0.00 7.00 0.00 0.00 0.00 0.00 93.00 12.00
---------------------------------------------------------------------------------------------
This is really puzzling, since thread0/1 each logged around ~7% "iowait"
without reading anything from disk. The code is memory-bound and there is
no swap activity at all. Where can such "iowait" come from?

Or did the Linux system run some system processes on those CPUs? If so, is there some way to shield CPUs 0/1 from them?
9 Replies
Dmitry_Vyukov
Valued Contributor I

> or did the linux system run some system process on the cpus? if so is there some way to shield them from cpu0/1?

A quick thought is to bump the priority of your computational threads; Linux is quite respectful of priorities.
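
Something along these lines (just a sketch; the helper name is made up) would raise the priority of only the calling thread, since on Linux nice values are per-thread; negative values need root or CAP_SYS_NICE:

// Sketch: lower the nice value of just the calling thread on Linux.
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>

static void boost_this_thread(int nice_value){         // e.g. nice_value = -10
  const pid_t tid = static_cast<pid_t>(syscall(SYS_gettid));
  if (setpriority(PRIO_PROCESS, tid, nice_value) != 0)  // per-thread on Linux
    perror("setpriority");
}

Calling it at the top of each computational branch would leave the I/O threads at their default priority.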

bayolau
Beginner
Thanks. I've tried using 2 computational threads and 1 I/O thread and launching with sudo nice -n -20, so there should be an empty core for everything else... but no luck :(
Grant_H_Intel
Employee
Are you using OpenMP barriers to synchronize the I/O threads with the computation threads within an OpenMP parallel construct? If so, the iowait time for the computation threads may be because they wait long enough at the barrier that they go to sleep. To prevent that, just set the environment variable KMP_BLOCKTIME=infinite before running your large model. If that doesn't work, then please send the sync code so I can take a look.

Thanks.
bayolau
Beginner
Thanks for the advice.

I've shortened the code and produced the following test case with KMP_BLOCKTIME, and I've attached part of the code below. The code is reorganized such that I/O is done by thread 0 and work is done by threads 1/2. This leaves one physical core empty in case there are other system processes. Here is how I start the run:

-------------------------------------------------------------------------
[]$ export KMP_AFFINITY=verbose,scatter
[]$ export KMP_BLOCKTIME=infinite
[]$ ./program
.
.
.
KMP_AFFINITY: Affinity capable, using global cpuid instr
KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
KMP_AFFINITY: 8 available OS procs - Uniform topology of
KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
KMP_AFFINITY: OS proc 4 maps to package 0 core 0 thread 1
KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0
KMP_AFFINITY: OS proc 5 maps to package 0 core 1 thread 1
KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0
KMP_AFFINITY: OS proc 6 maps to package 0 core 2 thread 1
KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0
KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,4}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {1,5}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,6}
-----------------------------------------------------------------------


mpstat snapshot
---------------------------------------------------------------------------------------
05:29:05 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
05:29:07 PM all 21.80 0.00 11.62 6.12 0.31 1.12 0.00 59.03 4638.00
05:29:07 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1000.50
05:29:07 PM 1 74.13 0.00 9.45 14.93 1.00 0.50 0.00 0.00 186.00
05:29:07 PM 2 0.00 0.00 0.00 0.00 0.50 4.48 0.00 95.02 1702.00
05:29:07 PM 3 0.00 0.00 0.00 0.00 1.00 4.50 0.00 94.50 1747.00
05:29:07 PM 4 17.50 0.00 56.00 26.50 0.00 0.00 0.00 0.00 0.00
05:29:07 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
05:29:07 PM 6 83.00 0.00 10.00 7.00 0.00 0.00 0.00 0.00 1.00
05:29:07 PM 7 0.00 0.00 17.59 0.00 0.00 0.00 0.00 82.41 2.00
------------------------------------------------------------------------------------------

Now, if I replace work_serial() with dummy_serial() but keep the fetching going, mpstat gives:
----------------------------------------------------------------------------------------------
05:52:06 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
05:52:08 PM all 18.24 0.00 15.62 4.93 0.31 1.31 0.00 59.59 6427.14
05:52:08 PM 0 0.00 0.00 0.00 0.50 0.00 0.00 0.00 99.50 1005.53
05:52:08 PM 1 76.00 0.00 23.50 0.00 0.50 0.00 0.00 0.00 65.83
05:52:08 PM 2 70.00 0.00 23.50 0.00 0.50 6.00 0.00 0.00 2864.32
05:52:08 PM 3 0.00 0.00 0.00 0.00 1.01 5.03 0.00 93.97 2488.94
05:52:08 PM 4 0.00 0.00 60.50 39.00 0.00 0.00 0.00 0.50 0.00
05:52:08 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
05:52:08 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1.51
05:52:08 PM 7 0.00 0.00 17.00 0.00 0.00 0.00 0.00 83.00 1.51
---------------------------------------------------------------------------------------------

//-------------------------------------------------------------------
void myclass::work(dcmplx*const dcpOut,const dcmplx*const dcpIn){

  // dcpOut and dcpIn are each ~6GB in size
  // writes to dcpOut are done by SSE streaming stores

  omp_set_num_threads(3);
  const double start_time=omp_get_wtime();

  // both work_serial() and fetch_serial_2() are coded by filling the
  // appropriate code into the sync structure of dummy_serial()
  #pragma omp parallel default(shared)
  {
    const int id=omp_get_thread_num();
    const int nloc=uN/2;
    switch(id){
      case 1:
        work_serial(dcpOut, dcpIn, dcpIn,
                    csrpOSx, csrpTx);
        break;
      case 2:
        work_serial(&dcpOut[nloc], &dcpIn[nloc], dcpIn,
                    csrpOSy, csrpTy);
        break;
      case 0:
        fetch_serial_2();
        break;
      default:
        dummy_serial();
        break;
    }
  }
  const double stop_time=omp_get_wtime();
  fprintf(stderr,"done in %e\n",stop_time-start_time);
}


void myclass::dummy_serial(){
  {
    #pragma omp barrier
    #pragma omp flush
  }
  // uNumBlocks is a stand-in for the member that holds the number of sync rounds
  for(u32 bb=0;bb<uNumBlocks;++bb){
    {
      #pragma omp barrier
    }
    /*
      io threads would update the buffer pointer and a volatile index here
    */
    {
      #pragma omp barrier
      #pragma omp flush
    }
    /*
      here, the other threads work according to x86 assembly (if work_serial() is called)
      or read from disk (if fetch_serial_*() is called)
    */
  }
}
//-------------------------------------------------------------------------

Grant_H_Intel
Employee
This is pretty puzzling. I checked the mpstat man page at http://man-wiki.net/index.php/1:mpstat, where it says:

%iowait: Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

So presumably it doesn't really measure I/O for each CPU; instead, when a CPU is idle while the system has outstanding disk I/O, that idle time gets counted as iowait, as if the idle CPUs were responsible for the I/O. This is probably a deficiency in mpstat's accounting: it cannot reliably attribute the I/O to the proper CPU, except perhaps when there is only one idle CPU during an I/O wait.

The descriptions of the other mpstat columns don't use such language suggesting heuristic attribution of time to CPUs; they appear to be more precise.

I'm not sure why you get that particular, consistent difference in iowait attribution between work_serial and dummy_serial, but given the imprecision described above, it may be quite difficult to figure out what the iowait numbers really mean.

If you want to check whether the KMP_AFFINITY settings are really pinning the threads during the execution of your code, just call kmp_get_affinity() (see omp.h for full prototypes) at many places in your code and compare the result to the expected affinity mask. This function will return the current thread affinity mask by querying the system each time it is called, but only on Linux.
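
For example, something like the following (a rough sketch; it uses the portable pthread_getaffinity_np() call instead of the kmp_* API, and the helper name is made up):

// Sketch: print the calling thread's current affinity mask on Linux.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <omp.h>

static void report_affinity(const char* where){
  cpu_set_t mask;
  CPU_ZERO(&mask);
  if (pthread_getaffinity_np(pthread_self(), sizeof(mask), &mask) == 0){
    fprintf(stderr, "[%s] OpenMP thread %d bound to procs {", where, omp_get_thread_num());
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
      if (CPU_ISSET(cpu, &mask)) fprintf(stderr, " %d", cpu);
    fprintf(stderr, " }\n");
  }
}

Calling report_affinity("inside work_serial") from a few spots in each thread should show whether the masks stay at {0,4}, {1,5}, and {2,6} for the whole run.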

I hope this helps.
Grant_H_Intel
Employee
Here is another possibility I just thought of. You state in the code above that

// dcpOut and dcpIn are each ~6GB in size

That means these buffers are already taking ~12 GB of memory. Unless you have plenty more physical memory than that in the system, perhaps all the threads that touch these buffers are paging virtual memory between the physical memory and the hard drive(s). That could also result in I/O wait time without any explicit I/O calls in your computation threads.

So far, I think this explanation best fits the facts, assuming mpstat is actually returning the correct attribution of I/O wait to each CPU (thread).
bayolau
Beginner
Thanks, I'll look further into mpstat's operations.

Sorry for the confusion - dcpOut and dcpIn are not the I/O buffers; they hold the data being processed according to the content I fetch from the hard drive.

Even in the large case, the "fetching" is still done in 1 MB sections, and thus far there is still 1 GB left free in RAM (it's a 24 GB system) and vmstat hasn't shown any (swap) I/O activity.
bayolau
Beginner
Actually, you're correct! But it wasn't my program. While my code has a resident size of 17 GB, less than the 24 GB of RAM, other idle "things" were also taking memory.

Memory deemed "idle" at Linux's discretion got pushed out to swap, so my worker cores were indeed doing small traces of I/O to the swap drive at random intervals. That was the iowait; it wasn't noticeable because I had set iostat to the MB/s scale to monitor the data hard drives.

Once I put an mlock() after each of my malloc()s, the swapping was done up front, and the phantom iowait disappeared!
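
For reference, the change amounts to something like this (a minimal sketch; the function and variable names are placeholders, not my actual code), and mlockall(MCL_CURRENT | MCL_FUTURE) at startup would be a coarser alternative:

// Sketch: allocate a buffer and pin it into RAM so it can never be swapped out.
// Needs CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK for big allocations.
#include <sys/mman.h>
#include <cstdlib>
#include <cstdio>

void* alloc_locked(size_t nbytes){
  void* p = malloc(nbytes);
  if (p && mlock(p, nbytes) != 0)   // faults the pages in and locks them up front
    perror("mlock");
  return p;
}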

Thank you very much for the help.

12:18:07 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
12:18:08 PM all 29.68 0.00 7.23 1.87 0.12 0.87 0.00 60.22 3477.00
12:18:08 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1001.00
12:18:08 PM 1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:18:08 PM 2 0.00 0.00 0.00 0.00 0.00 2.00 0.00 98.00 1226.00
12:18:08 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 22.00
12:18:08 PM 4 39.00 0.00 46.00 15.00 0.00 0.00 0.00 0.00 0.00
12:18:08 PM 5 0.00 0.00 0.00 0.00 0.00 4.04 0.00 95.96 1220.00
12:18:08 PM 6 99.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 4.00
12:18:08 PM 7 0.00 0.00 10.10 0.00 0.00 0.00 0.00 89.90 5.00
Grant_H_Intel
Employee
I'm glad that you found the problem! And I'm encouraged that this explanation fits the data you've seen very well. I've spent more time diagnosing performance problems that were caused by virtual memory paging than I care to admit. So you are not alone... ;-}

Best regards,