Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Intel C++ compiler 2017 does not appear to respect kmp_set_blocktime(0)

nicpac22
Beginner
820 Views

Hi,

I recently noticed after upgrading from icpc 2016.0.4 to icpc 2017.0.4 that all my OMP-threaded code appeared to use significantly more CPU (e.g. an increase from 250% to 460% CPU when allocated 5 threads on an Intel Xeon E5-2690).  Using perf top on one of the running processes revealed it was spending a majority of its time in:

libiomp5.so _INTERNAL_25_______src_kmp_barrier_cpp_5678b641::__kmp_hyper_barrier_release

which was not occurring with icpc 2016.  As far as I can tell, __kmp_hyper_barrier_release should only tie up CPU while threads spin-wait at the barrier instead of going to sleep, which typically happens when KMP_BLOCKTIME is > 0.  However, I was calling kmp_set_blocktime(0) prior to each OMP section, which should let all threads relinquish their CPUs immediately after the parallel region completes.  In both icpc 2016 and 2017, calling kmp_get_blocktime() after kmp_set_blocktime(0) returned the expected value of 0, however it appears that in 2017 the value was not actually being respected by the compiler.
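
For reference, the pattern I'm describing boils down to something like this (a trimmed-down sketch rather than my actual code):

#include <stdio.h>
#include <omp.h>

int main()
{
  // ask the runtime to put worker threads to sleep immediately after each parallel region
  kmp_set_blocktime(0);

  // reports 0 under both icpc 2016 and 2017, even though 2017 keeps spinning at the barrier
  printf("blocktime = %d\n", kmp_get_blocktime());

  #pragma omp parallel for
  for(int i = 0; i < 1000; ++i)
  {
    // some parallel work
  }
  return 0;
}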

I noticed another user had posted about the same issue here but had not received any feedback.  I verified that their workaround of using kmp_set_defaults("BLOCKTIME=0") restored the performance of icpc v2017.0.4 to the expected levels of icpc v2016.0.4.  Is this a bug in the 2017 compiler suite, or is kmp_set_blocktime deprecated and there is some other way to set the thread block time?
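
Concretely, that workaround is just a single call made once before the first parallel region:

  // session-wide default for the block time; as I understand it this behaves like
  // setting KMP_BLOCKTIME=0 in the environment before launching the program
  kmp_set_defaults("BLOCKTIME=0");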

While the posted workaround helps, in general I'd prefer not to set a session default for the block time on a specific program.  Any help or insight would be appreciated.

Thanks,

Nick

0 Kudos
6 Replies
nicpac22
Beginner
820 Views

In case it matters, the reason the full 5 threads are not being used in the icpc v2016 case is that my software is performing a real-time processing task and the rate of data ingress is not sufficient to keep all the CPUs busy all the time.  The extra threads are allocated to deal with periods where there is a surge of excess data or where other processes on the machine temporarily disrupt the scheduling.  Sharing machine resources with other processes is also the reason I want a solution where the KMP block time is set to 0.

0 Kudos
jimdempseyatthecove
Honored Contributor III
820 Views

Are you using OpenMP 4.0 Tasks?

If so, then is the call to kmp_set_blocktime(0) issued in the context of the spawned task .OR. in the context of the code issuing the task?

Jim Dempsey

0 Kudos
nicpac22
Beginner
820 Views

Hi Jim,

I'm not explicitly using tasks; the kmp_set_blocktime call is in my main program, outside the parallel region, and is made just prior to a "for" loop that is parallelized with #pragma omp parallel for.  The basic program flow is:

while(1)
{
  // acquire some data in a serial fashion into a buffer
  // do some pre-work on the data
  kmp_set_blocktime(0);
  int i;
  #pragma omp parallel for private(i) schedule(dynamic)
  for(i=0; i<numIterations; ++i)
  {
    // do some work in parallel
  }
}

The reason I call kmp_set_blocktime(0) at the top of each loop is that the "do some pre-work" section includes some code I don't control that can *potentially* change the KMP block time via an object constructor.  Prior to running my "for" loop I want to ensure the block time is set to 0 regardless of what the other section of code might need.  I've been assuming that if the block time is already 0, there's no real penalty to calling kmp_set_blocktime(0) over and over again, but please let me know if this is not the case.

Thanks for your help,

Nick

0 Kudos
jimdempseyatthecove
Honored Contributor III
820 Views

>>section includes some code I don't control

What I was trying to get at was whether a change was made from V16 to V17 such that the block time is thread local as opposed to global. Note that the environment variable as well as kmp_set_defaults("BLOCKTIME=0") appear to have a global effect. I can see some benefit to having differing block times per thread.

If there are thread-local block times, then I would suspect that a kmp_set_blocktime(0) issued prior to the first parallel region would still act globally.

This is just an assumption on my part.
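
One quick (untested) way to check it would be to issue the call from inside a parallel region, so that every worker thread executes it before any real work is done:

  #pragma omp parallel
  {
    // hypothetical test: each worker thread sets its own block time,
    // in case the value became thread local in V17
    kmp_set_blocktime(0);
  }

If V17 honors that where the single call before the region does not, it would point to the block time having become a per-thread setting.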

Jim Dempsey

0 Kudos
nicpac22
Beginner
820 Views

I can try calling kmp_set_blocktime(0) inside the "for" loop to test, but I'm away from my server now, so I won't be able to test until tomorrow.  I agree there could be benefit to having different block times per thread.  I was just curious why the behavior would change between v16 and v17 without any other changes to the code.  If memory serves, v15 behaved the same as v16, so it looks like the change didn't occur until v17.  Is this change expected, or could it be related to an underlying change in OMP?

Nick

0 Kudos
nicpac22
Beginner
820 Views

I performed some additional testing today and verified that no matter where the call to kmp_set_blocktime(0) was placed in the code, it did not seem to have an effect.  Here's a simple program to demonstrate the issue:

kmp_block.cc:

#include <unistd.h>
#include <stdlib.h>
#include <vector>
#include <math.h>
#include <omp.h>

using namespace std;

int main()
{
  int N = 1000;
  vector<float> a(N);
  omp_set_num_threads(4);
  kmp_set_blocktime(0);
  //kmp_set_defaults("KMP_BLOCKTIME=0");
  while(1)
  {
    int i;
    //kmp_set_blocktime(0);
    #pragma omp parallel for private(i) schedule(dynamic)
    for(i=0; i<N; ++i)
    {
      //kmp_set_blocktime(0);
      a[i] = sqrtf(float(rand()));
    }
    // sleep for 10ms, > 0 but < default thread block time
    usleep(10000);
  }
  return 0;
}

No matter which call or combination of calls to kmp_set_blocktime(0) is uncommented, the v17 build still uses all 4 allocated CPUs with a majority of time spent in __kmp_hyper_barrier_release, whereas the v16 build uses almost no CPU when either of the first two kmp_set_blocktime calls is enabled.  Curiously, if only the kmp_set_blocktime call inside the inner "for" loop is used, the Intel v16 build uses significantly more CPU than normal (and __kmp_hyper_barrier_release starts to show up in perf top), though still less than with v17.  For both compilers, if the kmp_set_defaults line is uncommented, the program uses very little CPU, indicating the setting is being heeded, and performance is equivalent to the v16 kmp_set_blocktime(0) case.

For reference, I compiled the code with: icpc -O1 -qopenmp -liomp5 kmp_block.cc -o kmp_block

I also noticed that, as expected, when all calls to kmp_set_blocktime and kmp_set_defaults were commented out, both compiled versions used close to 4 full CPUs and perf top showed __kmp_hyper_barrier_release as the primary consumer of cycles.  I did notice that in v16 the symbol that showed up in perf top was __kmp_hyper_barrier_release, while in v17 it was _INTERNAL_25_______src_kmp_barrier_cpp_5678b641::__kmp_hyper_barrier_release.  I'm not sure if this matters, but it does indicate that something changed between the versions in how thread locking/blocking is handled.  I'm not sure whether the change is in the OMP libraries themselves or in the Intel compiler.  Could this be a bug in the compiler, or is there some other way to get v17 to heed the kmp_set_blocktime call?

Any help is much appreciated.

Thanks,

Nick

0 Kudos