Hello,
we are doing some experiments comparing the execution time of C code compiled with and without the option -parallel (icc 18.0.0), and we noticed that when we compile code that has no OpenMP pragmas with the flag -O3 and also use the flag -qopenmp, execution is slower than when we do not use it. If we use -O2 instead of -O3, there is no performance difference in the code we tested.
Could you give some insight on what might be happening? Is -qopenmp preventing optimizations that would otherwise be applied when using -O3?
Thanks,
João Bispo
- Tags:
- CC++
- Development Tools
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization
Please provide some information as to the run times observed and the nature of your parallel region.
Note that the amount of work performed in the loop/parallel region must be sufficient to warrant the overhead of entering and exiting a parallel region.
Jim Dempsey
Hello,
there are no parallel regions; it is serial code without OpenMP pragmas. I am testing with the benchmark 'adi' from PolyBench/C 4.2. The benchmark contains some OpenMP pragmas in the file polybench.c, used during initialization, which I have manually removed. When compiling with the following command:
icc -Iutilities -Istencils/adi ./stencils/adi/adi.c ./utilities/polybench.c -DPOLYBENCH_TIME -DLARGE_DATASET -O3
I get an execution time of around 18s. If I compile with the exact same arguments and add -qopenmp, the execution time increases to 22s. If I compile with -O2, the execution time is the same, with and without -qopenmp (22s).
I am using a PC with two Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, with turbo and NUMA disabled.
You can run VTune on the two images, then see where the differences are located.
Jim Dempsey
Can you please clarify which compiler option you actually use on your serial code: -parallel or -qopenmp?
And how is this post related to the other one - https://software.intel.com/en-us/comment/1918452 ? Is it about the same problem?
Does the serial code (as I understand it, you removed #pragma omp parallel from polybench.c, right?) still have OpenMP API calls like omp_get_thread_num()?
Suppose you removed parallel pragmas from polybench.c, added -parallel and generated opt-report. What exactly did you get in the report? Which loops were auto-parallelized?
It would be useful to have your original ("serial") code, opt reports and the code with changes and also compiler options you use in all cases.
Thank you for answering.
In this case, we are not using the flag -parallel, just -qopenmp. We are sorry, the question could have been framed more clearly. This question is not related to the other post https://software.intel.com/en-us/comment/1918452; it is a different problem that appeared when testing the same code.
In this question, I am only using serial code. There is no code with OpenMP pragmas, and I tested the flags indicated above, with the only difference being the presence or absence of the flag -qopenmp. When I manually removed the OpenMP pragmas, I also removed OpenMP API calls such as omp_get_thread_num(). To remove the code, I commented out all #ifdef _OPENMP / #endif sections in the file polybench.c.
To be more clear, what I am asking in this post is if it is possible to give some insight on why compiling a serial version of the benchmark 'adi' from PolyBench with flags -O3 and -qopenmp results in a slower binary when compared with using the flag -O3 and not using the flag -qopenmp. The performance seems to be the same as -O2, so a possibility is that -qopenmp might implicitly disable some -O3 optimizations.
The other post - https://software.intel.com/en-us/comment/1918452 - is about loops that apparently are being parallelized with OpenMP but that do not appear in the report. I will write a comment with the information you asked.
