
Joao_B_ · ‎02-08-2018

Hello,

We are doing some experiments comparing the execution time of C code compiled with and without the option -parallel (icc 18.0.0). We noticed that when we compile code that has no OpenMP pragmas with the flag -O3 and also add the flag -qopenmp, the execution time is longer than when we do not add it. If we use -O2 instead of -O3, there is no performance difference in the code we tested.

Could you give some insight on what might be happening? Is -qopenmp preventing optimizations that would otherwise be applied when using -O3?

Thanks,

João Bispo

jimdempseyatthecove · ‎02-11-2018

Please provide some information as to the run times observed and the nature of your parallel region.

Note that the amount of work performed in the loop/parallel region must be sufficient to warrant the overhead of entering and exiting a parallel region.

Jim Dempsey

Joao_B_ · ‎02-12-2018

Hello,

There are no parallel regions; it is serial code without OpenMP pragmas. I am testing with the benchmark 'adi' from PolyBench/C 4.2. The benchmark contains some OpenMP pragmas in the file polybench.c, which are used during initialization and which I have manually removed. When compiling using the following command:

icc -Iutilities -Istencils/adi ./stencils/adi/adi.c ./utilities/polybench.c -DPOLYBENCH_TIME -DLARGE_DATASET -O3

I get an execution time of around 18s. If I compile with the exact same arguments and add -qopenmp, the execution time increases to 22s. If I compile with -O2, the execution time is the same, with and without -qopenmp (22s).
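For anyone wanting to reproduce this kind of comparison outside PolyBench, here is a minimal timing sketch. It is not the PolyBench timer: `toy_kernel`, `best_time`, and `relative_slowdown` are hypothetical names, and the kernel is only a stand-in for the real adi kernel. It takes the fastest of several runs and expresses the difference between two builds as a fraction of the baseline.

```c
#include <stdio.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get (not a monotonic clock). */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the kernel under test (NOT the real adi kernel):
 * sums the first n terms of the harmonic series. */
static double toy_kernel(int n)
{
    double acc = 0.0;
    for (int i = 1; i <= n; i++)
        acc += 1.0 / (double)i;
    return acc;
}

/* Run the kernel `reps` times and return the fastest run in seconds.
 * The kernel result is stored in *result so the work is not discarded. */
static double best_time(int n, int reps, double *result)
{
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        double t0 = wall_seconds();
        *result = toy_kernel(n);
        double dt = wall_seconds() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Slowdown of `other` relative to `baseline`, as a fraction of the baseline. */
static double relative_slowdown(double baseline, double other)
{
    return (other - baseline) / baseline;
}
```

With the times reported above, `relative_slowdown(18.0, 22.0)` is about 0.22, so the -qopenmp build is roughly 22% slower than the plain -O3 build.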

I am using a PC with two Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, with turbo and NUMA disabled.

jimdempseyatthecove · ‎02-12-2018

You can run VTune on the two images, then see where the differences are located.

Jim Dempsey

Olga_M_Intel · ‎02-13-2018

Can you please clarify which compiler option you actually use on your serial code: -parallel or -qopenmp?

And how is this post related to the other one - https://software.intel.com/en-us/comment/1918452 ? Is it about the same problem?

Does the serial code (as I understand it, you removed #pragma omp parallel from polybench.c, right?) still have OpenMP API calls like omp_get_thread_num()?

Suppose you removed parallel pragmas from polybench.c, added -parallel and generated opt-report. What exactly did you get in the report? Which loops were auto-parallelized?

It would be useful to have your original ("serial") code, opt reports and the code with changes and also compiler options you use in all cases.

Joao_B_ · ‎02-19-2018

Thank you for answering.

In this case, we are not using the flag -parallel, just -qopenmp. We are sorry, the question could have been framed more clearly. This question is not related to the other post https://software.intel.com/en-us/comment/1918452; it is a different problem that appeared when testing the same code.

In this question, I am only using serial code; there is no code with OpenMP pragmas. I tested the flags indicated above, with the only difference being whether the flag -qopenmp is added. When I manually removed the OpenMP pragmas, I also removed OpenMP API calls such as omp_get_thread_num(). To remove the code manually, I commented out all #ifdef _OPENMP / #endif sections in the file polybench.c.
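As a quick sanity check for readers, the sketch below shows how the `_OPENMP` macro behaves. OpenMP-compliant compilers define it (as a yyyymm date) only when OpenMP is enabled at compile time, which is why the guarded sections in polybench.c are compiled in only with a flag such as -qopenmp. The function names `openmp_macro_value` and `report_openmp_build` are hypothetical and not part of PolyBench.

```c
#include <stdio.h>

/* Returns the value of _OPENMP if this file was compiled with OpenMP
 * enabled, and 0 otherwise. */
static long openmp_macro_value(void)
{
#ifdef _OPENMP
    return (long)_OPENMP;
#else
    return 0L;
#endif
}

/* Prints whether OpenMP was enabled when this file was compiled. */
static void report_openmp_build(void)
{
    long v = openmp_macro_value();
    if (v != 0)
        printf("OpenMP enabled at compile time (_OPENMP = %ld)\n", v);
    else
        printf("OpenMP not enabled at compile time\n");
}
```

Calling `report_openmp_build()` from both builds confirms which one the compiler treated as an OpenMP build, even when the source contains no pragmas or API calls.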

To be clearer, what I am asking in this post is whether it is possible to give some insight into why compiling a serial version of the benchmark 'adi' from PolyBench with the flags -O3 and -qopenmp results in a slower binary than compiling with -O3 alone. The performance seems to be the same as with -O2, so one possibility is that -qopenmp implicitly disables some -O3 optimizations.

The other post - https://software.intel.com/en-us/comment/1918452 - is about loops that apparently are being parallelized with OpenMP but that do not appear in the report. I will write a comment with the information you asked.

Flag -qopenmp makes serial code slower when using -O3