We are running some experiments comparing the execution time of C code compiled with and without the option -parallel (icc 18.0.0). We noticed that when we compile code that has no OpenMP pragmas with the flag -O3, adding the flag -qopenmp makes the execution time longer than when we do not use it. If we use -O2 instead of -O3, there is no performance difference in the code we tested.
Could you give some insight into what might be happening? Is -qopenmp preventing optimizations that would otherwise be applied when using -O3?
Please provide some information as to the run times observed and the nature of your parallel region.
Note that the amount of work performed in the loop/parallel region must be sufficient to warrant the overhead of entering and exiting a parallel region.
There are no parallel regions; it is serial code without OpenMP pragmas. I am testing with the benchmark 'adi' from PolyBench/C 4.2. The benchmark contains some OpenMP pragmas in the file polybench.c, which are used during initialization and which I have manually removed. When compiling using the following command:
icc -Iutilities -Istencils/adi ./stencils/adi/adi.c ./utilities/polybench.c -DPOLYBENCH_TIME -DLARGE_DATASET -O3
I get an execution time of around 18s. If I compile with the exact same arguments and add -qopenmp, the execution time increases to 22s. If I compile with -O2, the execution time is the same, with and without -qopenmp (22s).
I am using a PC with two Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, with turbo and NUMA disabled.
Can you please clarify which compiler option you actually use on your serial code: -parallel or -qopenmp?
And how is this post related to the other one - https://software.intel.com/en-us/comment/1918452 ? Is it about the same problem?
Does the serial code (as I understand it, you removed #pragma omp parallel from polybench.c, right?) still have OpenMP API calls like omp_get_thread_num()?
Suppose you removed parallel pragmas from polybench.c, added -parallel and generated opt-report. What exactly did you get in the report? Which loops were auto-parallelized?
It would be useful to have your original ("serial") code, opt reports and the code with changes and also compiler options you use in all cases.
Thank you for answering.
In this case, we are not using the flag -parallel, just -qopenmp. We are sorry, the question could have been framed more clearly. This question is not related to the other post https://software.intel.com/en-us/comment/1918452; it is a different problem that appeared when testing the same code.
In this question, I am only using serial code. There is no code with OpenMP pragmas, and I tested the flags indicated above, with the only difference being adding or removing the flag -qopenmp. When I manually removed the OpenMP pragmas, I also removed OpenMP API calls such as omp_get_thread_num(). To remove the code manually, I commented out all #ifdef _OPENMP / #endif sections in the file polybench.c.
To be clearer, what I am asking in this post is whether it is possible to give some insight into why compiling a serial version of the benchmark 'adi' from PolyBench with the flags -O3 and -qopenmp results in a slower binary than compiling with -O3 alone. The performance seems to be the same as with -O2, so one possibility is that -qopenmp implicitly disables some -O3 optimizations.
The other post - https://software.intel.com/en-us/comment/1918452 - is about loops that apparently are being parallelized with OpenMP but that do not appear in the report. I will write a comment with the information you asked.