I have a C++ source file which has the following structure:
[bash]vector<string> fileNames;
// fill in the fileNames vector
for (size_t i = 0; i < fileNames.size(); i++) {
    // process fileNames[i]
}[/bash]
Each file above can be processed independently.
After reading the following explanation in the Intel C++ Compiler User Guide, I thought the -parallel option seemed like the first thing I should try to improve the performance of my application:
[bash]The auto-parallelization feature of the Intel compiler automatically translates serial portions of the input program into equivalent multithreaded code. The auto-parallelizer analyzes the dataflow of the loops in the application source code and generates multithreaded code for those loops which can safely and efficiently be executed in parallel. [/bash]
The hardware configuration of the system on which I am running my application is:
[bash]Intel Core i7 950 CPU, 3.06 GHz, 8 MB cache, DDR3-1600 MHz RAM[/bash]
However, when I used the -parallel option together with -O2 (the default optimization level), I could not observe any performance gain compared to using -O2 alone, even though my source code seems to have a desirable structure for auto-parallelization.
I also tried the -axSSE4.2 option, but again it did not give me any performance gain; on the contrary, it made things worse.
What else can I do to observe a performance gain in terms of speed?
Thanks.
By default, you should get a message like "remark: LOOP WAS AUTO-PARALLELIZED" for every loop that was successfully parallelized.
It helps to use the -par-report compiler option to see the diagnostic messages reported by the auto-parallelizer.
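For example, on Linux (file names illustrative):
[bash]icpc -O2 -parallel -par-report3 main.cpp -o app[/bash]
At level 3, the report includes the reasons why a loop was not parallelized, not just the successes.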
Also, you need to ensure that the processing function is thread-safe (e.g., it does not write to global variables), and that the loop structure is simple, with no branching or jumps out of it, as stated in the documentation.
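For instance, a hypothetical loop body that is not thread-safe, because every iteration updates the same global:
[bash]long totalBytes = 0;  // shared global state

void processFile(long fileSize)
{
    // Unsynchronized read-modify-write: a data race when
    // called from concurrent loop iterations.
    totalBytes += fileSize;
}[/bash]
Such an update needs a reduction, an atomic, or per-thread accumulation before the loop can safely run in parallel.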
You could also try whether inlining the function with -ip or -ipo helps; in some cases inlining simplifies auto-parallelization, particularly when it removes calls to large functions. Note that -ipo performs many optimizations besides inlining.
Or, add explicit OpenMP directives.
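For the loop in the original post, a minimal sketch, assuming the per-file work sits in a thread-safe function (processFile() here is a hypothetical name):
[bash]#include <string>
#include <vector>

void processFile(const std::string &name);  // must be thread-safe

void processAll(const std::vector<std::string> &fileNames)
{
    // Each iteration handles a different file, so iterations are independent.
    #pragma omp parallel for
    for (int i = 0; i < (int)fileNames.size(); i++)
        processFile(fileNames[i]);
}[/bash]
Compile with -openmp (or /Qopenmp on Windows); the signed int index keeps the loop in the canonical form older OpenMP implementations require.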
The reports can tell you a lot:
/Qopenmp-report{0|1|2} control the OpenMP parallelizer diagnostic level
/Qpar-report{0|1|2|3} control the auto-parallelizer diagnostic level
You may also try -O3 and go through the resulting reports.
So, if the compiler assumes dependences (e.g., because of possible memory aliasing) where there are none, you can inform it using #pragma ivdep or restrict qualifiers.
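A minimal sketch of both forms (function and array names illustrative; with the Intel compiler the plain restrict keyword in C++ needs the -restrict option, so __restrict is used here):
[bash]// restrict: promise the compiler that dst and src do not alias.
void scale(float *__restrict dst, const float *__restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}

// ivdep: assert that the loop carries no dependences.
void shift(float *a, int n, int k)
{
    #pragma ivdep
    for (int i = 0; i < n; i++)
        a[i] = a[i + k];  // safe if k >= 0, which the compiler cannot prove here
}[/bash]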
I also added a "#pragma omp parallel for" directive before the "for" loop, which seems to be a good worksharing candidate.
However, at compile time I get a large number of remarks like the following, which apparently complain about the C++ STL usage, especially the vectors used in many places in my for loop:
[bash]/usr/include/c++/4.4.1/bits/vector.tcc(339): (col. 3) remark: parallel dependence: assumed ANTI dependence between __first line 339 and __cur line 339.
/usr/include/c++/4.4.1/bits/stl_tree.h(944): (col. 25) remark: parallel dependence: assumed FLOW dependence between __p line 944 and __x line 944.
/usr/include/c++/4.4.1/bits/stl_tree.h(944): (col. 25) remark: parallel dependence: assumed ANTI dependence between __x line 944 and __p line 944.[/bash]
After these remarks, I get a message saying:
[bash]/usr/include/c++/4.4.1/bits/vector.tcc(345): (col. 3) remark: loop was not parallelized: loop is not a parallelization candidate.[/bash]
Should I avoid using the C++ STL to make the for loop suitable for auto-parallelization?
[bash]vector<float> Cr(m);
reverse_copy(&c__[1], &c__[m] + 1, Cr.begin());
#pragma omp parallel for if(m > 103)
for (i__ = 1; i__ <= m; ++i__)
    a[i__] += inner_product(Cr.begin(), Cr.end(), &b[i__], 0.f);[/bash]
Auto-vectorization of reverse_copy() is of minor importance, but should be achieved with SSE4 or LRB.
I haven't seen auto-parallelization work with the STL, but if you have a good candidate, it might be worth showing it. Simply quoting random diagnostic messages doesn't help unless you can show the actual code you have identified as a hot spot worth optimizing.
transform() together with appropriate restrict qualifiers sometimes facilitates vectorization, but I find it a hindrance to readability; some probably consider that an advantage.
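For instance, a sketch of the transform() form (identifiers illustrative):
[bash]#include <algorithm>

static float twice(float x) { return 2.0f * x; }

void scaleAll(const float *__restrict src, float *__restrict dst, int n)
{
    // A simple elementwise operation the vectorizer can recognize.
    std::transform(src, src + n, dst, twice);
}[/bash]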
Ideally, min_element() or max_element() auto-vectorization would be desirable, but compilers haven't overcome the obstacles.
STL in general was clearly not designed to facilitate optimization. Even this inner_product required specialized effort on the part of compiler developers. Examples can be found which will break, so cblas_sdot may be preferred for reliability when the vectors are long enough to overcome the library function call overhead.
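For illustration, a sketch of the cblas_sdot() replacement for the inner_product() above, assuming you link against MKL or another CBLAS implementation:
[bash]#include <mkl_cblas.h>  // or <cblas.h> with a generic CBLAS

// Equivalent of inner_product(cr, cr + m, b, 0.f):
float dotProduct(const float *cr, const float *b, int m)
{
    return cblas_sdot(m, cr, 1, b, 1);  // single-precision dot product
}[/bash]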
The proponents of C++0x, Ct, Cilk++, and TBB have agreed to promote those over OpenMP and auto-parallel, so we are likely to be in for a period of increased confusion.
OpenMP support is not going away. Each iteration of the loop needs to be independent of the others for the code to be parallelized. The STL is meant for serial code, so you will certainly need to fix those dependences.
You may be able to use #pragma omp sections to distribute independent blocks of code to different threads, though that may require suitable code changes.
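A minimal sketch (taskA() and taskB() are placeholders for independent pieces of work):
[bash]void taskA();
void taskB();

void runBoth()
{
    #pragma omp parallel sections
    {
        #pragma omp section
        taskA();   // one thread runs this...

        #pragma omp section
        taskB();   // ...while another runs this concurrently
    }
}[/bash]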
