Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Question about Performance of Intrinsic Function

paul_shu
Beginner
I implemented a dot product of a real vector with a complex vector using C++, SSE2, F32vec4, and IPP separately. When I compared their performance, I found a weird problem.
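The original source isn't attached to this post; for reference, a minimal scalar sketch of the kernel being benchmarked (assuming an interleaved re/im layout for the complex vector; the function and parameter names are my own) might look like:

```cpp
#include <cstddef>

// Dot product of a real vector with an interleaved complex vector
// (re, im, re, im, ...). pResult[0] receives the real part of the
// result, pResult[1] the imaginary part.
void DotProdRealComplex(const float* pArray1,   // real vector, len floats
                        const float* pArray2,   // complex vector, 2*len floats
                        std::size_t len,
                        float* pResult)
{
    float re = 0.0f, im = 0.0f;
    for (std::size_t i = 0; i < len; ++i) {
        re += pArray1[i] * pArray2[2 * i];
        im += pArray1[i] * pArray2[2 * i + 1];
    }
    pResult[0] = re;
    pResult[1] = im;
}
```

The SSE2, F32vec4, and IPP (ippsDotProd-style) versions in the benchmark would compute the same result by different means.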

When the vector length is 32, the output is:

C++ computation: 426.763611 ms
SSE computation: 87.550690 ms
Vec computation: 8.865436 ms
IPP computation: 223.702408 ms

When the vector length is 34, the output is:
C++ computation: 402.208557 ms
SSE computation: 217.374542 ms
Vec computation: 220.603912 ms
IPP computation: 246.478622 ms

Build command is
icl DotProd.cpp /MD /O2 /D NDEBUG /TP /EHsc /W3 /nologo /D _CONSOLE /D _MBCS /D WIN32 /D _CRT_SECURE_NO_DEPRECATE /link ipps.lib /INCREMENTAL:NO /NOLOGO /DEBUG /FIXED:NO /OUT:DotProd.exe

The version of the Intel C++ compiler is 12.1.

My question is: why does the performance of the SSE and Vec versions degrade so dramatically, while the IPP library seems much more stable? Did I miss some compiler optimization options?

Thanks,
Paul

11 Replies
TimP
Honored Contributor III
In your auto-vectorized version, icl unrolls the loop so as to perform several iterations per block; the remainder iterations are not only performed in scalar fashion but also incur more looping overhead than in a non-vector compilation. According to my understanding, Intel compilers by default assume a vectorizable loop has a trip count of 100 when optimizing. Sometimes you will get better performance in a case such as yours by disabling unrolling (the /Qunroll0 compiler option, or a pragma on the loop). This would be expected to reduce the performance of long loops but improve the performance of moderate, odd-length loops.
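A sketch of the per-loop form of that hint (the pragma is Intel-specific, so it is guarded here; the function name is my own):

```cpp
#include <cstddef>

// Dot product with loop unrolling disabled for the Intel compiler.
// #pragma unroll(0) is an Intel-specific hint on the following loop;
// icl's /Qunroll0 switch applies the same setting file-wide. Other
// compilers would reject or ignore the pragma, hence the guard.
float DotProdNoUnroll(const float* a, const float* b, std::size_t n)
{
    float sum = 0.0f;
#if defined(__INTEL_COMPILER)
#pragma unroll(0)
#endif
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```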
paul_shu
Beginner

You said it is caused by the compiler's auto-vectorization. But when I disable this optimization by adding the option /Qvec-, the result doesn't change.

Om_S_Intel
Employee

You may use /Qvec-report2 to see which loops got vectorized.

c:\forum\U105196>icl DotProd.cpp /MD /O2 /QxHost /Qvec-report2 /D NDEBUG /TP /EHsc /W3 /nologo /D _CONSOLE /D _MBCS /D WIN32 /D_CRT_SECURE_NO_DEPRECATE /link ipps.lib /INCREMENTAL:NO /NOLOGO /DEBUG /FIXED:NO /OUT:DotProd.exe

DotProd.cpp

c:\forum\U105196\DotProd.cpp(128): (col. 2) remark: PARTIAL LOOP WAS VECTORIZED.

c:\forum\U105196\DotProd.cpp(139): (col. 3) remark: loop was not vectorized: existence of vector dependence.

c:\forum\U105196\DotProd.cpp(137): (col. 2) remark: loop was not vectorized: not inner loop.

c:\forum\U105196\DotProd.cpp(152): (col. 3) remark: loop was not vectorized: existence of vector dependence.

c:\forum\U105196\DotProd.cpp(150): (col. 2) remark: loop was not vectorized: not inner loop.

c:\forum\U105196\DotProd.cpp(165): (col. 3) remark: loop was not vectorized: existence of vector dependence.

c:\forum\U105196\DotProd.cpp(163): (col. 2) remark: loop was not vectorized: not inner loop.

c:\forum\U105196\DotProd.cpp(176): (col. 2) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

c:\forum\U105196\DotProd.cpp(72): (col. 3) remark: loop was not vectorized: existence of vector dependence.

c:\forum\U105196\DotProd.cpp(70): (col. 2) remark: loop was not vectorized: not inner loop.

c:\forum\U105196\DotProd.cpp(83): (col. 2) remark: loop was not vectorized: existence of vector dependence.

c:\forum\U105196\DotProd.cpp(105): (col. 2) remark: loop was not vectorized: existence of vector dependence.

c:\forum\U105196\DotProd.cpp(57): (col. 2) remark: PARTIAL LOOP WAS VECTORIZED.

c:\forum\U105196\DotProd.cpp(57): (col. 2) remark: loop skipped: multiversioned.

Om_S_Intel
Employee

You need unit-stride access to the array elements for the code to vectorize. Currently your code does not have that, as you can see in the following statement (line 73), where the pArray2 elements are accessed with the pattern
0, 2, 4, 6, 8, ... etc.

pResult[0] += pArray1[i] * pArray2[2*i];
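For illustration, one way to make every memory access advance by one element is to walk the interleaved array linearly with separate real/imaginary accumulators (a scalar sketch; the names are my own):

```cpp
#include <cstddef>

// Unit-stride rewrite: instead of indexing pArray2 with 2*i, traverse
// it linearly two floats per iteration, keeping separate accumulators
// for the real and imaginary parts.
void DotProdLinear(const float* pArray1, const float* pArray2,
                   std::size_t len, float* pResult)
{
    float re = 0.0f, im = 0.0f;
    const float* p = pArray2;
    for (std::size_t i = 0; i < len; ++i) {
        re += pArray1[i] * *p++;   // real part
        im += pArray1[i] * *p++;   // imaginary part
    }
    pResult[0] = re;
    pResult[1] = im;
}
```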

paul_shu
Beginner
Thanks for your explanations.

But the /Qvec-report2 option only works when vectorization is enabled. When I use the /Qvec- option to disable vectorization, /Qvec-report2 doesn't print any information. If loop vectorization is disabled, why do the SSE and Vec computations still run so much faster than the IPP function?
levicki
Valued Contributor I
First rule of proper compiler-generated code performance comparison is:

Do not use constants for array size, number of iterations, etc.

To be sure that you are comparing the same code, you have to pass those values as a program argument so that they stay unknown to compiler until runtime.

Rationale:
Compiler may generate totally different code with different array sizes and iteration counts.
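For example, the benchmark length can be taken from the command line so its value stays unknown to the compiler until runtime (a sketch; the helper name is my own):

```cpp
#include <cstddef>
#include <cstdlib>

// Read the benchmark array length from argv so the compiler cannot
// specialize (e.g. fully unroll) the loops for a known constant size.
// Falls back to 32 when no argument is given.
std::size_t BenchLength(int argc, char** argv)
{
    return (argc > 1) ? std::strtoul(argv[1], nullptr, 10)
                      : std::size_t(32);
}
```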

And that is exactly what happens here -- with an ARRAY_SIZE of 32 and default inlining (which is /Ob2, or "inline anything suitable", if I remember correctly), the compiler generates completely unrolled code for the SSE and Vec versions of the inner loop. You can add /FAs to your compiler options and compare the resulting assembler listings, and you will see the extent of the change yourself.

I did not analyze the code too closely, but from a brief look it seems to me that it is also rearranging the data layout before SSE computation. Furthermore, it seems to be using that rearranged data for Vec computation which explains why Vec has the fastest time with ARRAY_SIZE 32 (data is rearranged and pre-cached in L1).

If you add /Ob1, the compiler will generate identical code for SSE and Vec and the times will not change with ARRAY_SIZE.

Hope that answers your question :)

I have attached a hand-written assembler version of SSE code which beats Composer 2011 Update 10 by ~32.6%. IPP still has better performance for me though, most likely due to AVX being used on my CPU.

Edit:
After compiling this code with beta compiler version 13.0.0.041, my advice is not to waste time on writing intrinsics- and Vec-based code. The reason is twofold:

1. You need to write and maintain two versions (SSE and AVX)
2. The latest compiler generates code for the C++ version that is 67% faster than my assembler example (which I didn't bother to optimize much), and only 15% slower than IPP. If AVX is enabled (the /QxAVX switch), it comes within 3.5% of the IPP code.

What I am curious about right now is why the speedup from SSE to AVX is not bigger.

paul_shu
Beginner
Very good explanation.
Thanks,
paul_shu
Beginner
Hi Igor,

When I compared your hand-written assembler version with IPP, I found the assembler code became slower than the IPP function when the array size is larger than 36. On my PC, AVX is not supported. I want to know what makes IPP run faster than your hand-written assembler code. What advanced optimization techniques does IPP use besides SSE?

Thanks,
Paul
jimdempseyatthecove
Honored Contributor III
Paul,

IMHO, Intel hires and trains exceptionally good people for optimizing the code in IPP and MKL. I would venture to guess they have several hundred man-years of experience backing them up in this area. I consider my own programming ability to be at the high end for optimization (over 40 years of system-level programming, including writing program optimization tools). In the case of the MKL matrix multiply, I have been unable to produce better code, even after looking at the disassembly and counting instruction cycles.

In several attempts where my instruction cycle counts were lower, my code still took longer to execute. On highly tuned code, after you have optimally tuned for the L1 cache, one has to look further at internal issues of the specific processor design to avoid pipeline stalls. The documentation on how to do this effectively is vague at best. The Intel developers are able to do it because (IMHO) they are a tight-knit group, able to build on each other's experience (they may also have a few "loners" who keep their special skills to themselves).

Jim Dempsey

levicki
Valued Contributor I
Hi Paul,

My hand-written code was written in 30 minutes or so as a quick demonstration for you; I think that pretty much explains why it is not a very good performer :)

There is a possibility that they are repacking the complex array internally into a temporary buffer prior to computation, which would also have the side effect of pre-caching the data for the actual computation in L1.

It is possible to split this loop into three smaller loops: one doing the shuffling and storing into a temporary buffer, another doing the multiplication, and a final loop doing the additions. If you are interested in achieving high performance, you could try implementing that and see what results you get with some profiling in VTune.
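A scalar sketch of that three-loop split (the names are my own; in a real implementation each pass would use SSE shuffles, multiplies, and adds respectively):

```cpp
#include <cstddef>
#include <vector>

// Three-pass dot product of a real vector with an interleaved complex
// vector. Pass 1 deinterleaves into temporary buffers (which also warms
// L1), pass 2 multiplies, pass 3 does the final additions. Each pass is
// a simple unit-stride loop, a good vectorization candidate on its own.
void DotProdThreePass(const float* pArray1, const float* pArray2,
                      std::size_t len, float* pResult)
{
    std::vector<float> re(len), im(len);
    // Pass 1: shuffle/deinterleave into the temporary buffers.
    for (std::size_t i = 0; i < len; ++i) {
        re[i] = pArray2[2 * i];
        im[i] = pArray2[2 * i + 1];
    }
    // Pass 2: elementwise multiplication by the real vector.
    for (std::size_t i = 0; i < len; ++i) {
        re[i] *= pArray1[i];
        im[i] *= pArray1[i];
    }
    // Pass 3: final additions (the reduction).
    float sr = 0.0f, si = 0.0f;
    for (std::size_t i = 0; i < len; ++i) { sr += re[i]; si += im[i]; }
    pResult[0] = sr;
    pResult[1] = si;
}
```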

Once you have maximum single-core performance, you might consider threading if the algorithm allows it; of course, that would require further tuning, since memory bandwidth might become a bottleneck with more cores.
paul_shu
Beginner

Thanks for your reply.
