
icc doesn't like nested OpenMP??

Hi everyone,
I am new to this forum, and please forgive me if I put this post in the wrong place.

I am sorry I can't post my entire code here, but the following pseudocode illustrates the structure. Meanwhile, I'll try to explain the issue as clearly as I can.

10: omp_set_nested(1);
11: omp_set_num_threads(omp_get_num_procs());
12: #pragma omp parallel
13: {
14:     for ( int i = ... )      // loop details elided
15:     {
16:         #pragma omp sections
17:         {
18:             #pragma omp section
19:             { fun1(); }      // no parallelization in function body
20:             #pragma omp section
21:             { fun2(); }      // has parallelization in function body
22:         }
23:     }
24: }

The above code does some image processing, where fun1() reads data and fun2() does the processing. But I found that when the program is built with icc it actually runs faster without OpenMP, whereas when built with gcc it runs slower without OpenMP (which is what I believe it should do).

Then I ran the following tests:
1. Commented out line 10 (so, I believe, the parallel region in fun2() runs serially): the program actually runs much faster.
2. Disabled the parallelization in fun2() and compared keeping line 10 vs. commenting it out: having nested OpenMP enabled is still slower (nearly 2 times slower).
3. Changed lines 18 and 20 from section to single and commented out line 16, then compared keeping line 10 vs. commenting it out: with nesting enabled the program gives very poor results, namely accuracy issues.
4. Repeated the above tests on the same machine, but compiling and building with gcc: none of the above problems showed up.

BTW, I am using openSUSE 11.0 and the icc version is 11.0 20090318.

Please let me know should you need any more info, and any reply will be highly appreciated.

Have a nice day!



2 Replies

Quoting - will2km
How are you synchronizing the threads within fun1() and fun2()? Maybe the threads are duplicating work.

-L


As sketched above:
line 10 enables nested parallelism
line 11 sets the number of threads for the next parallel region (line 12) to the number of procs; note, this is the outer nesting-level team
line 14 is NOT a parallel for (is this what you intended?)
line 16 begins a sections construct with 2 sections (2 threads of the outer-level team consumed)
note, the other threads of the outer-level, number-of-procs team are first spinning their wheels in the line 14 to line 22 for loop, and then spinning at the implicit barrier at the end of the parallel region; the spin time at that barrier is governed by KMP_BLOCKTIME
line 21, calling fun2(), calls it with all threads busy; outer-level team members other than 0 and 1 are wasting time

Inside fun2(), when it begins the next nesting level, it forms a new team with the default number of threads. You are now oversubscribed. These extra threads will compete (time-sliced by the O/S) for CPU resources with the thread running fun1() AND with the threads of the outer level performing non-productive work, either in the line 14 for loop (sans sections) or in the implicit KMP_BLOCKTIME spin of those unproductive threads reaching the end of the parallel region.

Suggestions:

1. At line 11, set the number of threads to the number of sections (2).
2. Inside fun2(), set the number of threads to omp_get_num_procs() - 1 (in other words, leave room for the thread running fun1()).
3. If fun1()'s run time is relatively short compared to fun2()'s share of the work, remove the omp sections, put an omp single around the call to fun1(), let all threads fall into the call to fun2(), and set the number of threads for fun2() to the number of procs. You may need to experiment with schedule(dynamic, yourChunk) to get optimal performance.

Jim Dempsey