Experimenting with vectorization, I've come across some unexpected behavior. For example, the following demo gets 2x slower when #pragma simd is used.
```cpp
// Vectorization simd slowdown demo
// #pragma simd makes this 2x slower on AVX2 (Haswell) CPU
// Build with: icl /nologo /Qstd=c++11 /Qcxx-features /Wall /QxHost /DNOMINMAX /DWIN32_LEAN_AND_MEAN -D__builtin_huge_val()=HUGE_VAL -D__builtin_huge_valf()=HUGE_VALF -D__builtin_nan=nan -D__builtin_nanf=nanf -D__builtin_nans=nan -D__builtin_nansf=nanf /DNDEBUG /Qansi-alias /O3 /fp:fast=2 /Qprec-div- /Qip /Qopt-report /Qopt-report-phase:vec simd_slowdown.cc

#include <ctime>
#include <iostream>

using namespace std;

// Length Squared
int length_squared( int * a, int N )
{
    int length_sq( 0 );
#pragma simd // 2x slower with this!
#pragma vector aligned
    for ( int i = 0; i < N; ++i ) {
        length_sq += a[ i ] * a[ i ];
    }
    return length_sq;
}

int main()
{
    int const N( 4096 ), R( 32*1024*1024 );
    alignas( 32 ) int a[ N ];
#pragma novector
    for ( int i = 0; i < N; ++i ) {
        a[ i ] = 1;
    }
    int s( 0 );
    double const time1 = (double)clock()/CLOCKS_PER_SEC;
#pragma novector
    for ( int r = 1; r <= R; ++r ) {
        s += length_squared( a, N );
    }
    double const time2 = (double)clock()/CLOCKS_PER_SEC;
    cout << time2 - time1 << " s " << s << endl;
}
```
This occurs with Intel C++ 2016 on a Haswell system using an AVX2 build. The vectorization reports are similar both ways. For another twist, if you change the array type to float, then #pragma simd makes it run 40% faster. Is this just exposing weaknesses in the vectorization engine, or is there a rational explanation for this?
Thanks!
Hi Stuart,
I don't have a HSW Windows box at hand, so I started with a HSW Linux box.
I found that icc 16.0 update 1 automatically vectorizes the loop at line 16 even when #pragma simd and #pragma vector aligned are commented out; see the opt report below. I also tried disabling vectorization for that loop with #pragma novector. The vectorized version took 10.72 seconds, while the unvectorized version took 49.06 seconds. Do you know if the simd slowdown issue happens on Windows only?
```
LOOP BEGIN at s.cc(16,2)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at s.cc(16,2)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at s.cc(16,2)
<Remainder loop for vectorization>
LOOP END
```
Regarding the second problem, I tried it but unfortunately failed to reproduce it on Linux; I got the same execution time as the int length_sq version. Here is my code:
```cpp
float length_sq( 0 );
#pragma simd // 2x slower with this!
#pragma vector aligned
for ( int i = 0; i < N; ++i ) {
    length_sq += a[ i ] * a[ i ];
}
return (int)length_sq;
```
I'll try to find a HSW Windows box.
When you set #pragma simd you must include the relevant reduction (and firstprivate and lastprivate) clauses:
```cpp
#pragma omp simd reduction(+:length_sq)
```
I haven't heard of any syntax checking; you're lucky if poor performance is your only problem with incorrect syntax.
I don't find #pragma simd or #pragma omp simd satisfactory for cases where auto-vectorization occurs without them. As the /Qprotect-parens option becomes better implemented, there is less reason to set /fp:source or the like, where you might want the explicit pragma-controlled vectorization feature of the Intel compilers (which isn't portable). If you're not into pragma-directed vectorization, I'd go so far as to recommend accumulate(), fill(), and inner_product() (or their CEAN equivalents, if you don't need portability) over #pragma [omp] simd for the cases where those are available. I would never recommend the legacy pragma where #pragma omp is available.
I've asked (on the Fortran forum), but got no answer, whether it is OK to use the current compilers with #pragma omp simd in cases where OpenMP 4.5 suggests using the new ordered pragma, or whether that may be an exception to the dangers of under-specified #pragma simd usage. The legacy Intel pragmas are more dangerous (or surprising, as you put it) than the standard omp ones with respect to missing clauses.
I've filed some IPS problem reports on cases where the legacy Intel syntax optimizes but the standard omp syntax doesn't. When Intel has gone further than anyone else in usefully implementing the OpenMP standard, it doesn't make sense to me to leave cases where the non-portable legacy syntax is needed.
Feilong, I only have the Windows compilers, so that is what I tested on. You are right that it auto-vectorizes without the pragma; I'm just doing some experiments to better understand how and when I might benefit from the pragmas.
Tim, thanks for the info on reduction; I understand that better now. Unfortunately, adding the reduction clause didn't change the behavior: I still get a 2x slowdown compared to the auto-vectorization I get without #pragma simd. I understand that for such simple loops auto-vectorization is probably sufficient; I'm just trying a series of loops to see what effect the various pragmas have. Seeing it run 2x slower is surprising, and I'm hoping to understand why.
You seem to have a lot of practical experience with vectorization that I would love to acquire; most of the materials I've found don't go into the realities of when to use approach A vs. B. If you have a good reference for some of this hard-won knowledge, I'd love to see it.
Thanks!
Did you notice the announcement of my webinar tomorrow (repeated the next day)?
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/604643
Those examples are drawn from those I posted at https://github.com/tprince/lcd showing comparison of optimization for gnu and Intel Fortran/c/c++/cilk.
Thanks Tim. I will try to join the webinar. Much to learn!
Tim,
>>#pragma omp simd reduction(+: length_sq)
In the sample code in post #1, OpenMP is not used, therefore the compiler option to enable processing of OpenMP directives is likely not specified.
So what is the (standard) policy for "#pragma omp ..." when processing of OpenMP directives is not enabled?
IMHO "omp simd" should appear together with "for" so that the loop partitioning is favorable to SIMD data.
This then means there should be a non-"omp simd" "simd" to apply to loops that are not also partitioned by OpenMP loop directives.
Jim Dempsey
I did try
```cpp
#pragma omp simd reduction(+:length_sq)
```
using the /Qopenmp switch, and it was still almost exactly 2x slower than without any simd pragma. Something seems wrong if the simd spec makes the loop run 2x slower than the auto-vectorized version.
Stuart
Some cases where omp simd reduction didn't work have been corrected in the latest update, and further fixes may be expected in the next.
