Hi,
I'm writing some AVX example code like the snippet below:
/* requires <immintrin.h>, <sys/time.h>, <stdlib.h>, <time.h>, <math.h> */
double a[SIZE] __attribute__((aligned(32)));
double b[SIZE] __attribute__((aligned(32)));
double c[SIZE] __attribute__((aligned(32)));
struct timeval t0, t1;

srand(time(NULL));
for (int i = 0; i < SIZE; i++) {
    a[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
    b[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
}

__m256d ymm0, ymm1, ymm2;
gettimeofday(&t0, NULL);
for (int i = 0; i < SIZE; i += 4) {
    ymm0 = _mm256_load_pd(a+i);
    ymm1 = _mm256_load_pd(b+i);
    ymm2 = _mm256_mul_pd(ymm0, ymm1);
    _mm256_store_pd(c+i, ymm2);
}
gettimeofday(&t1, NULL);

double time1 = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec)*1.0E-6;

double sum = 0.0;
for (int i = 0; i < SIZE; i++) {
    sum += c[i];   /* use the results so the loop is not optimized away */
}
With this code, time1 came out to 6.750000e-04 s.
That is slower than the scalar version, which took around 5.0e-04 s.
I then found that if I comment out the store (_mm256_store_pd(c+i, ymm2);), the loop gets much faster: time1 drops to 1.9300e-04 s.
Based on these results, I think that storing data from the ymm registers to memory is the bottleneck... but is that right?
Is there a good way to store the data without increasing the execution time?
(The actual code was attached.)
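For reference, the scalar version I compared against is essentially the loop below, timed the same way (just a sketch here; the exact code is in the attachment):

gettimeofday(&t0, NULL);
for (int i = 0; i < SIZE; i++) {
    c[i] = a[i] * b[i];   /* plain scalar multiply, no intrinsics */
}
gettimeofday(&t1, NULL);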
OS: Mac OS X 10.8.2
CPU: 2GHz Intel Core i7
Compiler: gcc 4.8
Compiler-options: -mavx (AVX version only)
Thanks.
Hello,
I have some ideas.
First, you might want to increase SIZE or repeat the test several times to get more meaningful timings; in my opinion the measured times are too short to draw reliable conclusions from.
Second, the Intel Intrinsics Guide gave me a hint.
_mm256_mul_pd has a latency of 5 cycles, which is a lot inside such a tight loop. So you might want to unroll the loop yourself: do the loads and the multiply for the first iteration, then the same for the second iteration, and only then issue the two stores. Or, even better, unroll four loop iterations. I think this should hide the latency and thus improve performance, as sketched below.
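A minimal sketch of that two-way unrolling (assuming SIZE is a multiple of 8 and the arrays are 32-byte aligned; the variable names are just illustrative):

for (int i = 0; i < SIZE; i += 8) {
    __m256d x0 = _mm256_load_pd(a + i);
    __m256d y0 = _mm256_load_pd(b + i);
    __m256d x1 = _mm256_load_pd(a + i + 4);
    __m256d y1 = _mm256_load_pd(b + i + 4);
    __m256d p0 = _mm256_mul_pd(x0, y0);   /* second multiply can be issued */
    __m256d p1 = _mm256_mul_pd(x1, y1);   /* while the first is still in flight */
    _mm256_store_pd(c + i,     p0);
    _mm256_store_pd(c + i + 4, p1);
}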
Sergey Kostrov wrote:
Here is a summary...
>>...Storing data is bottleneck?
No. It is an overhead of 400,000 calls to AVX intrinsic functions.
The reciprocal throughput of the call instruction is 2 cycles, so multiplying the 4 intrinsic calls per iteration by the loop count (400,000) gives about 3.2e6 cycles spent just on the function calls. That is a lot of wasted cycles.
Sergey Kostrov wrote:
Here is a summary...
>>...Storing data is bottleneck?
No. It is an overhead of 400,000 calls to AVX intrinsic functions.
I think this is interesting. Visual Studio 2010 inlines the intrinsics. Generally, one might try the /Oi option; if I am not mistaken, it tells the compiler to expand intrinsics inline. This should cut down the overhead.
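For example, a build along these lines keeps that option enabled (the file name is just a placeholder, and /arch:AVX requires VS 2010 SP1 or later):

cl /O2 /Oi /arch:AVX avx_mul.c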
I compiled the following code in VS 2010
>> for (int i = 0; i < SIZE; i += 4)
>> {
>> ymm0 = _mm256_load_pd(a+i);
>> ymm1 = _mm256_load_pd(b+i);
>> ymm2 = _mm256_mul_pd(ymm0, ymm1);
>> _mm256_stream_pd(c+i, ymm2);
>> }
In the normal release build configuration you directly get the assembler instructions; I checked the disassembly (using a breakpoint).
The thing is that VS 2010 has /Oi enabled in the standard release configuration, but removing this option generated the same code (for this loop).
iliyapolak wrote:
Quote:
Sergey Kostrov wrote: Here is a summary...
>>...Storing data is bottleneck?
No. It is an overhead of 400,000 calls to AVX intrinsic functions.
The reciprocal throughput of the call instruction is 2 cycles, so multiplying the 4 intrinsic calls per iteration by the loop count (400,000) gives about 3.2e6 cycles spent just on the function calls. That is a lot of wasted cycles.
I forgot to add the overhead of the ret instruction.
Sergey Kostrov wrote:
I'm experiencing a similar problem, and I see that when intrinsic functions are not inlined, performance is really affected (slower by ~4 times!).
oughhh! that's scary
/O2 /fp:fast /arch:SSE2|AVX in VS2012 is roughly equivalent to ICL /O2 /fp:source /Qansi-alias /arch:... These will auto-vectorize some of the simpler situations without requiring intrinsics. I hate to comment again about /fp:fast having different meanings among these compilers.
ICL will auto-vectorize more situations with /O3 and /fp:fast, or with substitution of CEAN or pragmas. CEAN inherently includes the effect of /Qansi-alias and of the pragmas vector always and ivdep.
ICL /Qansi-alias /Qcomplex-limited-range /arch:SSE4.1 is roughly equivalent to gcc -O3 -ffast-math -march=corei7.
I could believe that /Oi- or /Od (or a debug build) disables in-line expansion of intrinsics in one or more compilers, but I haven't studied this. I'm not clear whether that is what was meant in this thread.
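To illustrate the auto-vectorization point, a plain loop like the sketch below (the same elementwise multiply written without intrinsics; the function name is just an example) is the kind of simple case these compilers can vectorize on their own, e.g. with gcc -O3 -ffast-math -march=corei7 as mentioned above:

/* Scalar source; the compiler is left to generate the AVX loads,
   multiplies and stores itself. */
void mul_arrays(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}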