Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

OpenMP with -O2 optimization level

rafael1234
Beginner
405 Views
I wrote a program with OpenMP directives and compiled it with the Intel compiler several times.
When compiling at the -O0 optimization level, performance is very good.
But when compiling at -O2 I noticed a significant drop in performance. In fact, the program runs slower with OpenMP at -O2 than without OpenMP at -O2.
Does anyone know why this is happening? What exactly does -O2 do that causes such a major drop in performance when using OpenMP?
Thanks!!
10 Replies
Michael_K_Intel2
Employee
Quoting - rafael1234

Hi!

Can you be more specific about what you mean by "drop in performance"? Is it a drop in absolute performance (e.g., -O0 gives a 1-minute runtime, -O2 gives 2 minutes)? Or is it a drop in relative performance, e.g., the speed-up with different numbers of threads?

Can you post some timing data for the numbers of threads used, or can you even publish the program so that somebody can tell where the problem might be?

Cheers,
-michael
TimP
Black Belt
Quoting - rafael1234
The one situation of this nature that I know of occurs on NUMA platforms, where you use an initial first-touch loop to let the OpenMP runtime assign shared data in accordance with a fixed static schedule. If the first-touch loop appears to be a do-nothing loop, optimization levels above -O0/-Od will eliminate it, destroying your effort to obtain locality.
A similar effect has been seen with the -parallel option, where the compiler sees no significant work in a first-touch loop and decides not to parallelize it, thus ruining NUMA data locality. I expect to see this remedied on account of SPECfp 2006, but I would not advise you to change from OpenMP; instead, try something like the following:
For several years we had been using Makefiles with the first-touch loops segregated into their own functions, and those functions compiled at -O0/-Od.
Since this may be particularly cumbersome in GUI IDE builds, ifort has for several years provided a source-code directive that sets a specific optimization level function by function; it appears this would be useful in Intel C as well. By analogy, for C or C++ it would be
#pragma nooptimize
or
#pragma optimize:0
which would lower the optimization one function at a time.
There may also be a favorable cache-locality effect from the first-touch loop on platforms like Xeon 54xx/74xx, where there are many levels of cache, with latencies in data transfer if locality isn't maintained.
This first-touch localization is most effective when all OpenMP loops use the same number of threads, with the (default) static schedule.
Thanks for reminding me of this; I'll put it in my articles about OpenMP.
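To make the idea concrete, here is a minimal sketch of a first-touch initialization loop (hypothetical code, not from the poster's program; the assumption is that the later compute loops use the same static schedule and thread count). Because the loop performs real stores, -O2 cannot legally remove it:

```cpp
#include <cstddef>
#include <vector>

// First-touch initialization: with a static schedule, each thread writes
// (and thereby places) the same slice of the array that it will later
// compute on, so each slice lands on that thread's NUMA node.
void first_touch_init(std::vector<double>& a)
{
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(a.size()); ++i)
        a[i] = 0.0;   // a store, not a no-op: the compiler must keep the loop
}
```

Compiled without OpenMP support the pragma is simply ignored and the loop runs serially, so the function stays correct either way.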
rafael1234
Beginner

I added another flag to the compiler to generate alternate code paths (-axSSSE3), and the problem went away!
Not only do I get better performance now, but I also see an improvement in OpenMP+O2 code vs. OpenMP+O0 code.
Unfortunately I can't post my full code, since my computer isn't connected to the internet...
This is some of my code:
#pragma omp parallel num_threads(2)
{
    #pragma omp sections nowait
    {
        #pragma omp section
        sort1(theArr, ARR_SIZE);
        #pragma omp section
        sort2(theArr, ARR_SIZE);
    }
}

sort1 and sort2 are heavy sort functions (with OpenMP directives inside).

The times I measured are as follows (averages over many runs):
O0+omp ~ 200ms
O2+omp ~ 800ms
O0+omp+axSSSE3 ~ 200ms
O2+omp+axSSSE3 ~ 130ms

Om_S_Intel
Employee

The run time of the application is small. I would use Intel VTune and the Thread Profiler to investigate the issue. It would be nice if you could share the test case.

rafael1234
Beginner

OK, I changed the measuring method, and apparently the readings were wrong. Now everything works as expected.
Thanks for all your help!
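A common cause of misleading OpenMP timings is measuring with clock(), which on many systems accumulates CPU time across all threads rather than wall-clock time. A minimal wall-clock timing helper (hypothetical code, assuming C++11's std::chrono; omp_get_wtime() would serve equally well) looks like this:

```cpp
#include <chrono>

// Runs f() once and returns the elapsed wall-clock time in milliseconds.
// steady_clock is monotonic, so the result is unaffected by clock
// adjustments and, unlike clock(), does not sum time across threads.
template <typename F>
double time_ms(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Usage: `double ms = time_ms([&]{ run_parallel_sorts(); });`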

Michael_K_Intel2
Employee
Quoting - rafael1234



You're welcome.

After looking at your sort example, I have only one small comment left to make. From the snippet above, I assume that you use nested parallelism in the sort routines, i.e., each sort routine contains another parallel region with two threads. From an OpenMP standpoint this is correct, but OpenMP 3.0 now offers tasks, which serve your needs much better when parallelizing a recursive sort function. Have a look at them; they may give your app some additional speedup.

Cheers,
-michael
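A sketch of the task-based approach suggested above (hypothetical code, not the poster's actual sort; the nth_element split and the CUTOFF value are assumptions for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Below this size, task overhead outweighs the benefit: sort serially.
const std::ptrdiff_t CUTOFF = 1024;

// Recursive sort of v[lo, hi) using OpenMP 3.0 tasks instead of nested
// parallel regions.  nth_element places the middle element in its final
// position; each remaining half becomes an independent task.
void task_sort(std::vector<int>& v, std::ptrdiff_t lo, std::ptrdiff_t hi)
{
    if (hi - lo <= CUTOFF) {
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    std::ptrdiff_t m = lo + (hi - lo) / 2;
    std::nth_element(v.begin() + lo, v.begin() + m, v.begin() + hi);
    #pragma omp task shared(v)
    task_sort(v, lo, m);          // v[m] is already in its final place
    #pragma omp task shared(v)
    task_sort(v, m + 1, hi);
    #pragma omp taskwait
}
```

It would be called once from a single thread inside a parallel region, e.g. `#pragma omp parallel` / `#pragma omp single` around `task_sort(theArr, 0, ARR_SIZE);`. Compiled without OpenMP, the pragmas are ignored and the sort runs serially but correctly.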
Mike_Rezny
Beginner
Quoting - tim18
Hi Tim,
could you please point me to your OpenMP articles? Any pointers relating to memory placement for variables with Intel C or Fortran compilers on NUMA nodes using Nehalem/Core i7 would be much appreciated.

regards
Mike
TimP
Black Belt
Quoting - Mike Rezny
Memory placement isn't a critical issue for a single-socket Core i7. For dual- and 4-socket systems, with 6 and more cores per socket, it's increasingly important. Thanks for reminding me of my promise to expand my treatment of OpenMP. Maybe tomorrow. The early version is at http://sites.google.com/site/tprincesite/fortran-programming-hints
The companion examples show a rudimentary case of first-touch data placement.
Mike_Rezny
Beginner
Quoting - tim18

Hi Tim,
Many thanks. I am heading off to look at what you have written...

My interest is in dual-socket nodes, so memory placement is indeed an issue.

I am interested in OpenMP, so any information on finding out where variables are located in physical memory, tricks and tips on coercing correct placement, and any tools (such as using VTune to analyse hardware counters) that can report on good/bad memory placement would be appreciated.

Also, are you aware of any enhancements planned for the Intel compilers to assist in optimally mapping cores to memory for OpenMP? I suspect this would also be a topic of interest for Cluster OpenMP.

regards
Mike
TimP
Black Belt
VTune counters for local and remote memory references were discussed briefly at http://software.intel.com/en-us/forums/showthread.php?t=68432
The Performance Tuning Utility has a more detailed facility for analyzing memory locality.
The only change I was aware of was to make -parallel work effectively for SPEC fp by applying the same static OpenMP scheduling to first-touch loops and data-intensive loops, when using a KMP_AFFINITY setting that assigns one thread per core. Unfortunately, it appears necessary to examine the current state of Hyper-Threading enablement and the BIOS mapping of logical processors before that KMP_AFFINITY setting can be determined.
Intel MPI's I_MPI_PIN_DOMAIN is designed to facilitate memory locality in cluster runs with mixed OpenMP/MPI.
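As a concrete illustration of the one-thread-per-core setting mentioned above, a run script might look like this (a sketch assuming 8 physical cores and the Intel OpenMP runtime; as noted, the exact logical-processor numbering depends on BIOS and Hyper-Threading state, so verify the mapping on your own machine):

```shell
# One OpenMP thread per physical core, so first-touch placement lines up
# with the compute loops.  The trailing ",1,0" permute/offset fields of
# compact make consecutive threads skip Hyper-Thread siblings.
export OMP_NUM_THREADS=8
export KMP_AFFINITY=granularity=fine,compact,1,0
# ./myapp        # launch the application with this pinning in effect
```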