Stuart_M
Beginner

Surprising simd behavior

Experimenting with vectorization, I've come across some unexpected behavior. For example, the following demo runs 2x slower when #pragma simd is used.

// Vectorization simd slowdown demo
// #pragma simd makes this 2x slower on AVX2 (Haswell) CPU
// Build with: icl /nologo /Qstd=c++11 /Qcxx-features /Wall /QxHost /DNOMINMAX /DWIN32_LEAN_AND_MEAN -D__builtin_huge_val()=HUGE_VAL -D__builtin_huge_valf()=HUGE_VALF -D__builtin_nan=nan -D__builtin_nanf=nanf -D__builtin_nans=nan -D__builtin_nansf=nanf /DNDEBUG /Qansi-alias /O3 /fp:fast=2 /Qprec-div- /Qip /Qopt-report /Qopt-report-phase:vec simd_slowdown.cc

#include <ctime>
#include <iostream>
using namespace std;

// Length Squared
int
length_squared( int * a, int N )
{
	int length_sq( 0 );
#pragma simd // 2x slower with this!
#pragma vector aligned
	for ( int i = 0; i < N; ++i ) {
		length_sq += a[ i ] * a[ i ];
	}
	return length_sq;
}

int
main()
{
	int const N( 4096 ), R( 32*1024*1024 );
	alignas( 32 ) int a[ N ];
#pragma novector
	for ( int i = 0; i < N; ++i ) {
		a[ i ] = 1;
	}
	int s( 0 );
	double const time1 = (double)clock()/CLOCKS_PER_SEC;
#pragma novector
	for ( int r = 1; r <= R; ++r ) {
		s += length_squared( a, N );
	}
	double const time2 = (double)clock()/CLOCKS_PER_SEC;
	cout << time2 - time1 << " s  " << s << endl;
}

This occurs with Intel C++ 2016 on a Haswell system using an AVX2 build. The vectorization reports are similar both ways. For another twist, if you change the array type to float, #pragma simd makes it run 40% faster. Is this just exposing weaknesses in the vectorization engine, or is there a rational explanation for this?

Thanks!

8 Replies
Feilong_H_Intel
Employee

Hi Stuart,

I don't have a HSW Windows box at hand, so I started with a HSW Linux box.

I found that icc 16.0 update 1 automatically vectorizes the loop at line 16 even when #pragma simd and #pragma vector aligned are commented out; see the opt report below. I then tried disabling vectorization for that loop with #pragma novector. The vectorized version took 10.72 seconds, while the unvectorized version took 49.06 seconds. Do you know if the simd slowdown issue happens on Windows only?

LOOP BEGIN at s.cc(16,2)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at s.cc(16,2)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at s.cc(16,2)
<Remainder loop for vectorization>
LOOP END

Regarding the second problem, I tried it but unfortunately failed to reproduce it on Linux: I got the same execution time as the int length_sq version. Here is my code.

        float length_sq( 0 );
#pragma simd // 2x slower with this!
#pragma vector aligned
        for ( int i = 0; i < N; ++i ) {
                length_sq += a[ i ] * a[ i ];
        }
        return (int)length_sq;

I'll try to find a HSW Windows box.

 

TimP
Black Belt

When you set #pragma simd, you must include the relevant reduction (and firstprivate and lastprivate) clauses:

#pragma omp simd reduction(+: length_sq)

I haven't heard of any syntax checking. You're lucky if poor performance is your only problem with the wrong syntax.

I don't find #pragma simd or #pragma omp simd satisfactory for cases where auto-vectorization occurs without them. As the /Qprotect-parens option becomes better implemented, there is less reason for setting /fp:source or the like, where you might otherwise want the explicit pragma-controlled vectorization feature of Intel compilers (which isn't portable). If you're not into pragma-directed vectorization, I'd go so far as to recommend accumulate(), fill(), and inner_product() (or CEAN equivalents, if you don't need portability) over #pragma [omp] simd for these cases, where those are available. I would never recommend the legacy pragma where #pragma omp is available.

I've asked (on the Fortran forum), but got no answer, whether it is OK to use the current compilers with #pragma omp simd in cases where OpenMP 4.5 suggests using the new ordered pragma, or whether that may be an exception to the dangers of under-specified #pragma simd usage. The legacy Intel pragmas are more dangerous (or surprising, as you put it) than the standard omp ones with respect to missing clauses.

I've filed some IPS problem reports on cases where the legacy Intel syntax optimizes but the standard omp syntax doesn't. Since Intel has gone further than anyone else in usefully implementing the OpenMP standard, it doesn't make sense to me to leave cases where the non-portable legacy syntax is needed.

Stuart_M
Beginner

Feilong, I only have the Windows compilers, so that is what I tested on. You are right that it auto-vectorizes without the pragma -- I'm just doing some experiments to better understand how and when I might benefit from the pragmas.

 

Tim, thanks for the info on reduction: I understand that better now. Unfortunately, adding the reduction clause didn't change the behavior: I still get a 2x slowdown compared to the auto-vectorization I get without #pragma simd. I understand that for such simple loops auto-vectorization is probably sufficient; I'm just trying a series of loops to see what effect the various pragmas have. Seeing it run 2x slower is surprising, and I'm hoping to understand why.

You seem to have a lot of practical experience with vectorization that I would love to acquire: most of the materials I've found don't go into the realities of when to use approach A vs. B. If you have a good reference for some of this hard-won knowledge, I'd love to see it.

Thanks!

TimP
Black Belt

Did you notice the announcement of my webinar tomorrow (repeated the next day)?

https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/604643

Those examples are drawn from the ones I posted at https://github.com/tprince/lcd, comparing optimization for gnu and Intel Fortran/C/C++/Cilk.

 

Stuart_M
Beginner

Thanks Tim. I will try to join the webinar. Much to learn!

jimdempseyatthecove
Black Belt

Tim,

>>#pragma omp simd reduction(+: length_sq)

In the sample code in post #1, OpenMP is not used, so the compiler option to enable processing of OpenMP directives is likely not specified...

So what is the (standard) policy for "#pragma omp ..." when processing of OpenMP directives is not enabled?

IMHO, "omp simd" should appear together with "for" so that the loop partitioning is favorable to SIMD data.

This then means there should also be a non-OpenMP "simd" pragma to apply to loops that are not also partitioned by OpenMP loop directives.

Jim Dempsey

Stuart_M
Beginner

I did try

#pragma omp simd reduction(+: length_sq)

using the /Qopenmp switch, and it was still almost exactly 2x slower than without any simd pragma. It seems like something is wrong if the simd spec makes the loop run 2x slower than the auto-vectorized version.

Stuart

TimP
Black Belt

Some cases where omp simd reduction didn't work have been corrected in the latest update, and more fixes may be expected in the next.