Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

regarding auto-parallelization feature

Hi All,

I have a C++ source file which has the following structure:

[bash]vector<string> fileNames;
// fill in the fileNames vector
for (size_t i = 0; i < fileNames.size(); i++)
    process(fileNames[i]);  // process each file[/bash]

Each file above can be processed independently.

After reading the following explanation in the Intel C++ Compiler User Guide, I thought the -parallel option would be the first thing to try to improve the performance of my application:

[bash]The auto-parallelization feature of the Intel compiler automatically translates serial portions of the
input program into equivalent multithreaded code. The auto-parallelizer analyzes the dataflow of
the loops in the application source code and generates multithreaded code for those loops which
can safely and efficiently be executed in parallel.[/bash]

The hardware configuration of the system on which I am running my application is:

[bash]Intel CPU Core i7 950 3.06GHz, 8MB cache
DDR3-1600MHz RAM[/bash]

However, when I used the -parallel option together with -O2 (the default optimization level), I could not observe any performance gain compared to using -O2 by itself, even though my source code seems to have a desirable structure for auto-parallelization.

I also tried the -axSSE4.2 option, but again it did not provide any performance gain. (On the contrary, it made things worse.)

What else can I do to obtain a speedup?

0 Kudos
4 Replies
New Contributor II

By default, you should get a message like this for every loop that was successfully parallelized:

remark: LOOP WAS AUTO-PARALLELIZED

It will help to add -par-report to the compiler options to see the diagnostic messages reported by the auto-parallelizer.

Also, you need to ensure that the function process(...) is thread-safe (for example, that it does not modify global variables). Its structure should also be simple, with no branching or jumps, as stated in the documentation.

You could also check whether inlining the function with -ip or -ipo helps, since in some cases inlining simplifies the analysis for auto-parallelization. It is more likely to help when there are fewer large functions. Note that -ipo performs many other optimizations besides inlining.

Or, put explicit OpenMP directives around the loop after ensuring that the called function is thread-safe.
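
As a rough sketch of the explicit OpenMP route for the file loop from the question: the names `process` and `processAll` and the checksum body are hypothetical stand-ins, not the original poster's code. The point is only that each iteration reads its own input and writes its own output slot, with no shared mutable state.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-file work. It uses only its argument
// and local variables, so concurrent calls do not interfere.
std::size_t process(const std::string& fileName) {
    std::size_t checksum = 0;
    for (char c : fileName)
        checksum = checksum * 31 + static_cast<unsigned char>(c);
    return checksum;
}

// Processes every file; iteration i writes only results[i].
std::vector<std::size_t> processAll(const std::vector<std::string>& fileNames) {
    std::vector<std::size_t> results(fileNames.size());
    // Signed loop index: older OpenMP versions require one.
    const long n = static_cast<long>(fileNames.size());
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i)
        results[i] = process(fileNames[i]);
    return results;
}
```

Built without an OpenMP option the pragma is ignored and the loop runs serially with the same result; with OpenMP enabled (for example -openmp on the Intel compiler discussed here), iterations are shared among threads.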

The reports can tell you a lot:

/Qopenmp-report{0|1|2} control the OpenMP parallelizer diagnostic level

/Qpar-report{0|1|2|3} control the auto-parallelizer diagnostic level

You may also try -O3 and go through the compiler guide to check for data dependencies, so as to get the benefits of both parallelized and vectorized code at run time. -xSSE4.2 enables vectorization for SSE4.2; also try -vec-report to see the vectorizer's dependence and data analysis report.
If the compiler assumes a dependence, for example because of possible memory aliasing, where there is none, you can tell the compiler so with #pragma ivdep, the restrict qualifier, etc.
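
A minimal sketch of those two hints, assuming a hypothetical kernel `axpy` (not from the original code). `__restrict` is a widely supported compiler extension in C++, and `#pragma ivdep` is guarded here so that only the Intel compiler sees it.

```cpp
#include <cstddef>

// Hypothetical kernel: y[i] += a * x[i].
// __restrict promises the compiler that x and y never overlap, so it does
// not have to assume aliasing between the loads and the stores.
void axpy(float a, const float* __restrict x, float* __restrict y,
          std::size_t n) {
    // Intel-specific hint to ignore assumed (unproven) dependences.
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Both hints are promises made by the programmer: if the arrays do overlap, the result is undefined, so they belong only where the independence is actually known.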

0 Kudos
I am currently using the -parallel -openmp -par-report3 compiler options to compile my source code.

I also added a "#pragma omp parallel for" statement before the "for" loop, which seems to be a good worksharing candidate.

However, at compile time I get a large number of remarks like the following, which apparently complain about C++ STL usage, in particular the C++ vectors used in many places in my for loop:

[bash]/usr/include/c++/4.4.1/bits/vector.tcc(339): (col. 3) remark: parallel dependence: assumed ANTI dependence between __first line 339 and __cur line 339.

/usr/include/c++/4.4.1/bits/stl_tree.h(944): (col. 25) remark: parallel dependence: assumed FLOW dependence between __p line 944 and __x line 944.

/usr/include/c++/4.4.1/bits/stl_tree.h(944): (col. 25) remark: parallel dependence: assumed ANTI dependence between __x line 944 and __p line 944.[/bash]

After these remarks, I get the following message:

[bash]/usr/include/c++/4.4.1/bits/vector.tcc(345): (col. 3) remark: loop was not parallelized: loop is not a parallelization candidate.[/bash]

Should I avoid using the C++ STL to make the for loop suitable for auto-parallelization?

0 Kudos
Honored Contributor III
Do you have any particular STL algorithm in mind for optimization? inner_product() is one of my favorites for auto-vectorization, often inside parallel regions:
[bash]vector<float> Cr(m);
#pragma omp parallel for if(m > 103)
for (i__ = 1; i__ <= m; ++i__)
    a[i__] += inner_product(Cr.begin(), Cr.end(), &b[i__], 0.f);[/bash]
Auto-vectorization of reverse_copy() is of minor importance, but should be achieved with SSE4 or LRB.
I haven't seen auto-parallel work with STL, but, if you have a good candidate, it might be worth showing. Simply quoting random diagnostic messages doesn't help, unless you can show actual code which you have identified as a hot spot for considering optimization.
transform() together with appropriate restrict qualifiers sometimes facilitates vectorization, but I find it a hindrance to readability; some probably consider that an advantage.
Ideally, min_element() or max_element() auto-vectorization would be desirable, but compilers haven't overcome the obstacles.
STL in general was clearly not designed to facilitate optimization. Even this inner_product required specialized effort on the part of compiler developers. Examples can be found which will break, so cblas_sdot may be preferred for reliability when the vectors are long enough to overcome the library function call overhead.
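
For anyone who wants to experiment with the inner_product() pattern, here is a self-contained variant as a sketch. The function name `accumulateDots`, the 0-based indexing, and the assumption that `b` holds at least `a.size() + Cr.size() - 1` elements are mine, chosen for illustration; whether the inner call actually vectorizes depends on the compiler and options.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// For each i, adds the dot product of Cr with the window of b that starts
// at index i. Assumes b.size() >= a.size() + Cr.size() - 1 (not checked).
void accumulateDots(std::vector<float>& a,
                    const std::vector<float>& Cr,
                    const std::vector<float>& b) {
    const long m = static_cast<long>(a.size());
    // The if clause keeps short loops serial to avoid threading overhead.
    #pragma omp parallel for if(m > 103)
    for (long i = 0; i < m; ++i)
        a[i] += std::inner_product(Cr.begin(), Cr.end(), b.begin() + i, 0.f);
}
```

Each iteration only reads Cr and b and writes its own a[i], which is what makes the outer loop safe to share among threads.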

The proponents of C++0x, Ct, Cilk++, and TBB have agreed to promote those over OpenMP and auto-parallel, so we are likely to be in for a period of increased confusion.
0 Kudos

OpenMP support is not going away. Each iteration of the loop needs to be independent of the others for the code to be parallelized. The STL is meant for serial code. You certainly need to fix this.

You may be able to use #pragma omp sections to distribute code to different threads, but this may need suitable code changes.
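
To illustrate the sections idea, a small sketch under assumptions: the struct `Stats` and the function `summarize` are hypothetical, and the two tasks are chosen only because they are independent of each other. Each section reads the shared input and writes a different member, so the sections do not conflict.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type for two independent computations.
struct Stats {
    long sum;
    int maxValue;
};

Stats summarize(const std::vector<int>& data) {
    Stats s = {0, 0};
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            // Task 1: total of all elements; writes only s.sum.
            s.sum = std::accumulate(data.begin(), data.end(), 0L);
        }
        #pragma omp section
        {
            // Task 2: largest element (0 for empty input); writes only s.maxValue.
            int m = data.empty() ? 0 : data[0];
            for (std::size_t i = 1; i < data.size(); ++i)
                if (data[i] > m) m = data[i];
            s.maxValue = m;
        }
    }
    return s;
}
```

Sections suit a fixed, small number of unrelated tasks; for a loop over many files, a parallel for is usually the more natural fit.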

0 Kudos