Performance with thread pooling in OpenMP

Krzysztof_B_Intel · ‎04-11-2016

I have the following code:

float arr[1000];
double pre = omp_get_wtime();
for(int j=0; j<1000; ++j)
{
  #pragma omp parallel num_threads(t1)
  {
    #pragma omp for
    for(int i=0; i<1000; ++i) arr[i] = std::pow(i,2);
  }
  #pragma omp parallel num_threads(t2)
  {
    #pragma omp for
    for(int i=0; i<1000; ++i) arr[i] = std::pow(i,2);
  }
}
double post = omp_get_wtime();
double diff = post - pre;

I get strange times for t1 and t2:

for t1=1, t2=36 diff is 0.070
for t1=2, t2=36 diff is 1.307
for t1=8, t2=36 diff is 1.023
for t1=18, t2=36 diff is 0.690
for t1=24, t2=36 diff is 0.427
for t1=36, t2=36 diff is 0.076

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, cores per socket: 18, virtualization: VT-x, sockets: 2, L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 46080K, CentOS 7

Is there a problem with thread pooling between sections (teams) in OpenMP?
Thanks in advance.

jimdempseyatthecove · ‎04-14-2016

At the start of your program add

int nThreads;
#pragma omp parallel
{
  #pragma omp single
  nThreads = omp_get_num_threads();
}

The intention is to enter a first parallel region outside of your timed loop, thereby pre-creating the OpenMP thread pool with a full complement of threads. (You will have to expand on this if you use nested parallel regions.) As your program is structured, the pool is created and then grown inside the timed loop, so each increase in thread count adds unnecessary overhead to the measurement.
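
A minimal sketch of that warm-up, assuming a C++11 compiler with OpenMP enabled. The helper names `warm_up_thread_pool` and `time_alternating_regions` are hypothetical, not taken from the original test program, and `std::chrono` stands in for `omp_get_wtime()` so the sketch also builds when OpenMP is switched off. The `single` construct makes exactly one thread store the team size.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: enters one untimed parallel region so the runtime
// can create its worker threads before any measurement starts.
// Returns the team size that region actually received.
int warm_up_thread_pool() {
    int nThreads = 1;  // stays 1 when compiled without OpenMP
#ifdef _OPENMP
    #pragma omp parallel
    {
        #pragma omp single
        nThreads = omp_get_num_threads();
    }
#endif
    return nThreads;
}

// Hypothetical helper: repeats the two alternating parallel regions from the
// original post `reps` times and returns the elapsed wall time in seconds.
double time_alternating_regions(int t1, int t2, int reps, float* arr, int n) {
    assert(t1 > 0 && t2 > 0 && reps >= 0 && n >= 0);
    auto pre = std::chrono::steady_clock::now();
    for (int j = 0; j < reps; ++j) {
        #pragma omp parallel for num_threads(t1)
        for (int i = 0; i < n; ++i)
            arr[i] = std::pow(static_cast<float>(i), 2.0f);
        #pragma omp parallel for num_threads(t2)
        for (int i = 0; i < n; ++i)
            arr[i] = std::pow(static_cast<float>(i), 2.0f);
    }
    auto post = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(post - pre).count();
}
```

Calling `warm_up_thread_pool()` once before `time_alternating_regions(t1, 36, 1000, arr, 1000)` keeps the initial thread creation out of the measured interval. Whether that removes the slowdown depends on the runtime; it only rules out first-time thread creation as the cause.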

Please do this and report back your findings.

Jim Dempsey

Krzysztof_B_Intel · ‎04-15-2016

Unfortunately, it doesn't work. Our team is still trying to find a solution.
Thank you for your answer.

Krzysztof Binias

jimdempseyatthecove · ‎04-17-2016

The worst case adding more than one second to the best case leads me to suspect a program initialization issue. Such a discrepancy can also occur if your system is heavily loaded. Can you upload the entire test program that exhibits this problem? One possible cause: you specify an (obscenely) large stack size and your program does something to "first touch" this stack. As a consequence, each new thread instantiation causes an excessive number of page faults (to allocate from the page file, map to VM, possibly wipe), all in competition with other demands on your storage system. This would occur as a once-only symptom: once OpenMP creates a thread (adds it to a given thread pool), the thread remains available for first and subsequent use. Thereafter, any new "first touch" of your VM would again go through the page-fault hoops.
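
One way to tell a once-only cost from a recurring one is to time every outer iteration separately rather than the loop as a whole. The sketch below is an assumption about how such a probe could look, not part of the original test program; `time_each_iteration` and `head_fraction` are hypothetical names, and `std::chrono` is used in place of `omp_get_wtime()`.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: times each outer iteration of the two-region loop
// separately and returns one elapsed-seconds entry per iteration.
std::vector<double> time_each_iteration(int t1, int t2, int reps,
                                        std::vector<float>& arr) {
    assert(t1 > 0 && t2 > 0 && reps > 0);
    using steady = std::chrono::steady_clock;
    const int n = static_cast<int>(arr.size());
    std::vector<double> seconds(static_cast<std::size_t>(reps));
    for (int j = 0; j < reps; ++j) {
        auto pre = steady::now();
        #pragma omp parallel for num_threads(t1)
        for (int i = 0; i < n; ++i)
            arr[i] = std::pow(static_cast<float>(i), 2.0f);
        #pragma omp parallel for num_threads(t2)
        for (int i = 0; i < n; ++i)
            arr[i] = std::pow(static_cast<float>(i), 2.0f);
        seconds[j] = std::chrono::duration<double>(steady::now() - pre).count();
    }
    return seconds;
}

// Hypothetical helper: fraction of the total time spent in the first
// `head` iterations (0 if the total is zero).
double head_fraction(const std::vector<double>& seconds, std::size_t head) {
    if (head > seconds.size()) head = seconds.size();
    const auto split = seconds.begin() + static_cast<std::ptrdiff_t>(head);
    const double total = std::accumulate(seconds.begin(), seconds.end(), 0.0);
    const double first = std::accumulate(seconds.begin(), split, 0.0);
    return total > 0.0 ? first / total : 0.0;
}
```

If `head_fraction(times, 10)` comes out close to 1 for a 1000-iteration run, the extra time is concentrated at startup, which would fit an initialization cost. A value near 10/1000 would instead suggest the cost is paid again on every iteration.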

Also, as an experimental probe and to gain insight, what happens when you swap t1 and t2 in your num_threads clauses?

Jim Dempsey

Krzysztof_B_Intel · ‎04-18-2016

> Also, as an experimental probe and to gain insight, what happens when you swap t1 and t2 in your num_threads clauses?

The same problem. Test source code attached to this post.

Krzysztof Binias