
amit_b_ · 11-16-2014

I have been compiling some benchmarks with ICC. I am seeing results where the optimized version, i.e. the executable generated with -O3, takes more time than the executable generated with -O0. However, when I generate the vectorization report with the flag -vec-report5, I see that the compiler chooses to vectorize because

scalar loop cost : 28

vector loop cost : 7.680

estimated potential speedup: 3.630

But when I run the executables, the vectorized version takes more time than the non-vectorized one; the difference is about 10 seconds. Is this really possible in the case described above, or am I overlooking something?

TimP · 11-17-2014

We would need an actual example of the situation you are asking about. One of icc's weaknesses is that it does not resolve a misaligned vector store followed by a reload, except when targeting MIC.

jimdempseyatthecove · 11-17-2014

When the loop trip count is unknown at compile time, the compiler may choose to vectorize code that will eventually use small trip counts.

When you know the representative loop counts, consider using

#pragma loop_count min(yourGuessAtMinimum), max(yourGuessAtMaximum), avg(YourGuessAtAverage)
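As a minimal sketch of where the hint goes, the pragma sits directly above the loop it describes. The function name `scale` and the trip-count values below are made up for illustration; real values should come from profiling your own workload. Compilers that do not recognize the pragma will typically ignore it, possibly with a warning.

```c
#include <stddef.h>

/* Hypothetical kernel: scale an array in place. */
void scale(float *a, size_t n, float s)
{
    /* Assumed trip counts for illustration only; replace them
       with values measured on representative input. */
#pragma loop_count min(4), max(64), avg(16)
    for (size_t i = 0; i < n; ++i)
        a[i] *= s;
}
```

The hint does not change what the loop computes; it only gives the compiler's cost model a better idea of how many iterations to expect.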

Otherwise, when you have a mix of small and large loops then use

if (n < YourCutoff) {
#pragma nosimd
    for (...) {...}
} else {
#pragma simd
    for (...) {...}
}
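Filled in with a concrete loop, the cutoff dispatch might look like the sketch below. The function name `add_arrays` and the threshold `SIMD_CUTOFF` are hypothetical, and the threshold needs to be tuned by measurement. This sketch uses `#pragma novector` on the short-loop branch, which is the Intel compiler's documented pragma for suppressing vectorization of a loop, swapped in here for the `nosimd` spelling above.

```c
#include <stddef.h>

/* Hypothetical threshold; tune by measurement. */
#define SIMD_CUTOFF 32

void add_arrays(float *dst, const float *src, size_t n)
{
    if (n < SIMD_CUTOFF) {
        /* Short loops: ask the compiler not to vectorize. */
#pragma novector
        for (size_t i = 0; i < n; ++i)
            dst[i] += src[i];
    } else {
        /* Long loops: request vectorization. The caller must
           ensure dst and src do not overlap. */
#pragma simd
        for (size_t i = 0; i < n; ++i)
            dst[i] += src[i];
    }
}
```

Both branches compute the same result; only the code the compiler generates for each differs.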

Jim Dempsey

jimdempseyatthecove · 11-17-2014

Perhaps it would be a good feature extension to have:

#pragma simd if(n > Cutoff)

Then you would not need to write the same code statements twice.
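For comparison, OpenMP 5.0 later added an `if` clause to its `simd` construct, which expresses much the same idea. The sketch below assumes a compiler with OpenMP 5.0 SIMD support and the matching option enabled (for example `-fopenmp-simd` with GCC or `-qopenmp-simd` with the Intel compiler); the function name `add_arrays_omp` and the threshold are hypothetical. It is not the Intel `#pragma simd` extension proposed above.

```c
#include <stddef.h>

/* Hypothetical threshold; tune by measurement. */
#define OMP_SIMD_CUTOFF 32

void add_arrays_omp(float *dst, const float *src, size_t n)
{
    /* The loop is treated as a SIMD loop only when the condition
       holds; the loop body is written once. The caller must
       ensure dst and src do not overlap. */
#pragma omp simd if(n > OMP_SIMD_CUTOFF)
    for (size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}
```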

Jim Dempsey

QIAOMIN_Q_ · 11-18-2014

Hello,

It would be clearer if you could provide your sample code, or attach a screenshot of the hotspot area and the corresponding assembly if you have VTune at hand.

Thank you.
--
QIAOMIN.Q
Intel Developer Support

TimP · 11-19-2014

The 2015 compiler improved the handling of many short vectorized loops.

ICC -O3 generates code which takes more time than -O0