I have been compiling some benchmarks with ICC. I am seeing results where the optimized version, i.e. the executable generated with -O3, takes more time than the executable generated with -O0. However, when I generate the vectorization report with the flag -vec-report5, I see that the compiler chooses to vectorize because of these estimates:
scalar loop cost : 28
vector loop cost : 7.680
estimated potential speedup: 3.630
But when I run the executables, the vectorized version takes more time than the non-vectorized one; the difference is about 10 seconds. Is this really possible in the case described above, or am I overlooking something?
We would need an actual example of the situation you are asking about. One known weakness of icc is that it does not resolve a misaligned vector store followed by a reload, except when targeting MIC.
When the loop trip count is unknown at compile time, the compiler may choose to vectorize a loop that in practice runs with small trip counts, where the vector setup and remainder handling can outweigh the gain.
When you know the representative trip counts, consider using
#pragma loop_count min(yourGuessAtMinimum), max(yourGuessAtMaximum), avg(yourGuessAtAverage)
Otherwise, when you have a mix of short and long loops, use
if (n < yourCutoff) {
    #pragma novector
    for (...) {...}
} else {
    #pragma simd
    for (...) {...}
}
Jim Dempsey
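As a rough illustration of the cutoff dispatch described above, here is a minimal sketch in C. The kernel (`axpy`), the helper `axpy_elem`, the macros `LOOP_NOVECTOR` and `LOOP_SIMD`, and the cutoff value of 16 are all hypothetical names and numbers chosen for the example; a real cutoff would have to be found by measurement. The Intel pragmas are wrapped in macros so the code still compiles cleanly with other compilers, where they expand to nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Emit the Intel-specific pragmas only when building with icc;
   elsewhere the macros expand to nothing. */
#if defined(__INTEL_COMPILER)
#define LOOP_NOVECTOR _Pragma("novector")
#define LOOP_SIMD     _Pragma("simd")
#else
#define LOOP_NOVECTOR
#define LOOP_SIMD
#endif

/* Hypothetical cutoff: below this trip count the scalar loop is assumed
   to be cheaper. Tune it by timing your own workload. */
#define SIMD_CUTOFF 16u

/* Loop body kept in one place so both branches share the same code. */
static inline float axpy_elem(float a, float x, float y)
{
    return a * x + y;
}

/* y[i] = a * x[i] + y[i]; x and y must not overlap. */
void axpy(float a, const float *x, float *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));

    if (n < SIMD_CUTOFF) {
        /* Short loop: ask the compiler not to vectorize. */
        LOOP_NOVECTOR
        for (size_t i = 0; i < n; ++i)
            y[i] = axpy_elem(a, x[i], y[i]);
    } else {
        /* Long loop: request vectorization. */
        LOOP_SIMD
        for (size_t i = 0; i < n; ++i)
            y[i] = axpy_elem(a, x[i], y[i]);
    }
}
```

Putting the loop body in a small inline function is one way to limit the duplication between the two branches; the loop headers are still written twice.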
Perhaps it would be a good feature extension to have:
#pragma simd if(n > Cutoff)
Then you would not need to write the same code statements twice.
Jim Dempsey
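For comparison, the OpenMP 5.0 specification added an `if` clause to the `simd` construct, which is close in spirit to the extension proposed above. The sketch below assumes a compiler that reports OpenMP 5.0 support through the `_OPENMP` macro; whether a particular icc release accepts the clause is not something this thread establishes. The function `scale`, the macros `PRAGMA_` and `SIMD_IF`, and the cutoff of 32 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Use "omp simd if(...)" only when the compiler advertises OpenMP 5.0
   (_OPENMP == 201811) or later; otherwise the macro expands to nothing
   and the loop is left to the compiler's own heuristics. */
#define PRAGMA_(x) _Pragma(#x)
#if defined(_OPENMP) && _OPENMP >= 201811
#define SIMD_IF(cond) PRAGMA_(omp simd if(cond))
#else
#define SIMD_IF(cond)
#endif

/* Hypothetical cutoff; choose it from measurements. */
#define SIMD_CUTOFF 32u

/* dst[i] = k * src[i]; dst and src must not overlap. */
void scale(float *dst, const float *src, float k, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    /* One copy of the loop; SIMD is requested only for long trip counts. */
    SIMD_IF(n > SIMD_CUTOFF)
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}
```

With this form the loop is written once, and the run-time condition decides whether the SIMD request applies.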
Hello,
It would be clearer if you could provide your sample code, or attach a screenshot of the hotspot area and the corresponding assembly if you have VTune at hand.
Thank you.
--
QIAOMIN.Q
Intel Developer Support
Please participate in our redesigned community support web site:
User forums: http://software.intel.com/en-us/forums/
The 2015 compiler improved the handling of many short vectorized loops.
