Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Performance with thread pooling in OpenMP

Krzysztof_B_Intel

I have the following code:

#include <cmath>   // std::pow
#include <omp.h>   // omp_get_wtime

// t1 and t2 are the thread counts under test.
float arr[1000];
double pre = omp_get_wtime();
for (int j = 0; j < 1000; ++j)
{
  #pragma omp parallel num_threads(t1)
  {
    #pragma omp for
    for (int i = 0; i < 1000; ++i) arr[i] = static_cast<float>(std::pow(i, 2));
  }
  #pragma omp parallel num_threads(t2)
  {
    #pragma omp for
    for (int i = 0; i < 1000; ++i) arr[i] = static_cast<float>(std::pow(i, 2));
  }
}
double post = omp_get_wtime();
double diff = post - pre;

I get strange timings for various combinations of t1 and t2:

  • for t1=1, t2=36 diff is 0.070
  • for t1=2, t2=36 diff is 1.307
  • for t1=8, t2=36 diff is 1.023
  • for t1=18, t2=36 diff is 0.690
  • for t1=24, t2=36 diff is 0.427
  • for t1=36, t2=36 diff is 0.076

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, cores per socket: 18, virtualization: VT-x, sockets: 2, L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 46080K, CentOS 7

Is there any problem with thread pooling between consecutive parallel regions (teams) in OpenMP?
Thanks in advance.

jimdempseyatthecove
Honored Contributor III

At the start of your program, add:

int nThreads;
// Entering this region once creates the OpenMP thread pool up front.
#pragma omp parallel
nThreads = omp_get_num_threads();   // every thread writes the same team size

The intention is to enter a first parallel region outside of your timed loop, and thus pre-create the OpenMP thread pool with a full complement of threads. (You will have to expand on this if you use nested parallel regions; a sketch for that case follows.) The way you structured your program, each increase in thread count incurred unnecessary thread-creation overhead inside the timed region.
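A minimal warm-up sketch for the nested case, added here for illustration (warm_up_thread_pools and its outer/inner parameters are hypothetical names, and it assumes nesting is enabled with omp_set_max_active_levels):

#include <omp.h>

// Enter an outer and an inner parallel region once, outside any timed code,
// so that both levels of the thread pool exist before measurement starts.
void warm_up_thread_pools(int outer, int inner)
{
  omp_set_max_active_levels(2);              // allow one level of nesting
  #pragma omp parallel num_threads(outer)
  {
    #pragma omp parallel num_threads(inner)
    {
      // no work: the empty regions only force thread creation
    }
  }
}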

Please do this and report back your findings.

Jim Dempsey

 

Krzysztof_B_Intel

Unfortunately, that doesn't help. Our team is still trying to find a solution.
Thank you for your answer.

Krzysztof Binias

jimdempseyatthecove
Honored Contributor III

With the worst case adding more than one second over the best case, I suspect this is a program-initialization issue. Such a discrepancy can also occur if your system is heavily loaded. Can you upload the entire test program that exhibits this problem? One such example: if you specify an (obscenely) large stack size and your program does something to "first touch" that stack, then each new thread instantiation causes an excessive number of page faults (to allocate from the page file, map into virtual memory, and possibly wipe the pages), all in competition with other demands on your storage system. This would be a once-only symptom: once OpenMP creates the threads (adds them to a given thread pool), they remain available for first and subsequent use. Thereafter, any new "first touch" of your virtual memory would again have to jump through the page-fault hoops.
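To separate a one-time thread-creation or first-touch cost from a steady-state cost, a small probe along the following lines may help (this sketch is added for illustration and is not from the original post; 36 simply matches the core count of the machine described above):

#include <cstdio>
#include <omp.h>

int main()
{
  // Time the same empty parallel region several times: a large first
  // measurement followed by small ones points to one-time setup cost.
  for (int rep = 0; rep < 3; ++rep)
  {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(36)
    {
      // empty region: measures fork/join cost only
    }
    double t1 = omp_get_wtime();
    std::printf("entry %d: %f s\n", rep, t1 - t0);
  }
  return 0;
}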

Also, as an experimental probe and for additional insight, what happens when you swap t1 and t2 in your num_threads clauses?

Jim Dempsey

Krzysztof_B_Intel

> Also, as an experimental probe and for additional insight, what happens when you swap t1 and t2 in your num_threads clauses?

The same problem occurs. The test source code is attached to this post.

Krzysztof Binias
