
OpenMP program hangs with >256 threads

CFR
New Contributor II

While trying to characterize some Phi performance over large numbers of OpenMP threads I've noticed strange behavior where programs hang with >256 threads.

I've pared things down to the following example:

#include <omp.h> 
#include <stdio.h>
int main(int argc, char *argv[]) 
{ 
  int threads = 257;
  // omp_set_dynamic(1); 
#pragma omp parallel 
#pragma omp single
  { 
    printf("single %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  } 
#pragma omp parallel num_threads(threads)
  {
    for (int j=0; j<10; j++) {
      printf("thread %d iteration: %d\n", omp_get_thread_num(), j);
    }
  }
}

If I compile and run this on my Phi host, it works fine. If I compile and run it on the Phi, it prints out the 10 iterations and then hangs. If you set threads <= 256, it works fine on the Phi. If you call omp_set_dynamic(1) and set threads > 256, it also works fine.

The key seems to be having the first default parallel region followed by a second one that uses more than 256 threads. I haven't found a good description of the nuances of dynamic threads, but I can see how omp_set_dynamic might be required (though it didn't seem obvious to me that having different parallel regions with different numbers of threads was all that "dynamic" ;^)). I'm definitely not sure why simply hanging for >256 threads is the appropriate behavior.

6 Replies
jimdempseyatthecove
Honored Contributor III

Please provide a listing of your OpenMP-related environment variables, in particular the number of threads and whether nested parallel regions are enabled.

Barring environment variables, the number of threads would be the number of hardware threads, with nested parallel regions disabled (TimP would be an authority on this). Therefore the default behavior is expected to create a parallel region at line 7 with 240 threads (4x your # cores); one of the threads will grab the single section, which, by the way, is the only statement in the first parallel region.

The second parallel region specifies a number of threads that exceeds "ThreadsAvailable" and, due to the defaults in effect, this leads to the last decision in OpenMP 4.0: "then the behavior is implementation defined".

I don't think that Intel chose crashing as the "implementation defined" action. I will assume this is errant behavior.

Jim Dempsey

TimP
Honored Contributor III

I've never had satisfactory results for such large numbers of threads, so I don't claim expertise here. I'll point out that you may need to set stacksize unlimited even with the default OMP_STACKSIZE, as you won't get a clear symptom like a segfault.

pbkenned1
Employee

We need information on your environment to assist you further, as Jim requested. We also need to know the compiler and MPSS version you are using. I tested with icc-15.0.2 and MPSS-3.4.1. Just using the environment defaults, I get all 257 threads as requested on the Xeon host, and 256 threads on MIC. I can't explain the latter, but I don't get a hang on MIC; I presume that's what you're most concerned about.

 

Why are you trying to use so many threads?  There's no conceivable benefit of vastly oversubscribing the machine.

***ON MIC***

[pbkenned]# grep 'thread 255' U540288.cpp-MIC.out
thread 255 iteration: 0
thread 255 iteration: 1
thread 255 iteration: 2
thread 255 iteration: 3
thread 255 iteration: 4
thread 255 iteration: 5
thread 255 iteration: 6
thread 255 iteration: 7
thread 255 iteration: 8
thread 255 iteration: 9

[pbkenned]# wc -l U540288.cpp-MIC.out
2561 U540288.cpp-MIC.out

[pbkenned]# head U540288.cpp-MIC.out
single 0 of 228
 

***ON XEON***

[U540288]$ grep 256 U540288.cpp-Xeon.out
thread 256 iteration: 0
thread 256 iteration: 1
thread 256 iteration: 2
thread 256 iteration: 3
thread 256 iteration: 4
thread 256 iteration: 5
thread 256 iteration: 6
thread 256 iteration: 7
thread 256 iteration: 8
thread 256 iteration: 9


[U540288]$ wc -l U540288.cpp-Xeon.out
2571 U540288.cpp-Xeon.out


[U540288]$ head U540288.cpp-Xeon.out
single 3 of 32
 

Patrick

 

CFR
New Contributor II

 

First, apologies.  When I typed in the example code, I missed a line (no wonder Patrick's output didn't make sense to me).  There should be an "omp for" at line 14.  Here's the (hopefully now correct) bad example code:

#include <omp.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
  int threads = 257;
  // omp_set_dynamic(1);
#pragma omp parallel
#pragma omp single
  {
    printf("single %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }
#pragma omp parallel num_threads(threads)
  {
#pragma omp for
    for (int j=0; j<10; j++) {
      printf("thread %d iteration: %d\n", omp_get_thread_num(), j);
    }
  }
}

More details....

System environment:

ICC 15.0.1.133 Build 20141023
MPSS 3.4.2
dmesg on the mic says: Linux version 2.6.38.8+mpss3.4.2

Compile:

icc -mmic -std=c99 -o t256-mic t256.c -qopenmp

OpenMP Environment (basically the defaults, but here are the ones that people asked about or that seem relevant):

_OPENMP='201308'
OMP_NUM_THREADS: value not defined
OMP_THREAD_LIMIT='2147483647'
OMP_NESTED='FALSE'
OMP_DYNAMIC='FALSE'
...

My output is (executing in native mode on the Phi):

[user@host-mic0 ~] ./t256-mic
single: 0 of 244
thread 1 iteration: 1
thread 2 iteration: 2
thread 0 iteration: 0
thread 8 iteration: 8
thread 7 iteration: 7
thread 6 iteration: 6
thread 3 iteration: 3
thread 4 iteration: 4
thread 5 iteration: 5
thread 9 iteration: 9
<hangs>

In response to postings:

  1. The parallel regions are not nested (at least that's my understanding/intention)
  2. The default stack size (4M?) should be sufficient for 257 threads with this simple code (my card has 16GB)
  3. I'm interested in many threads to see how applications scale.  (I use 1..300 but it's only at 257 or more that things hang)
  4. I acknowledge it is a strange example, but it's just meant to illustrate the "bug".  While it may be "silly" I believe it is logically correct and shouldn't result in the executable hanging.  The problem here originally bit me on a larger application.

Sorry for the confusion.  My environment is kinda crippled and difficult to work with.

pbkenned1
Employee

Thanks for the update and the environment details.  Your system is using a recent compiler and MPSS, so no issues there.  Indeed, there is no nested parallelism.  I'll investigate now and follow up.

Patrick

pbkenned1
Employee

Yes, this hangs on MIC with >= 257 threads requested, and runs fine at <= 256.  I've reported this to the developers, and I'll pass along updates I receive from them.

Patrick

Internal tracking ID # DPD200366198
