Hi,
I'm writing some AVX example code like the snippet below:
/* requires <immintrin.h>, <sys/time.h>, <stdlib.h>, <time.h>, <math.h> */
double a[SIZE] __attribute__((aligned(32)));
double b[SIZE] __attribute__((aligned(32)));
double c[SIZE] __attribute__((aligned(32)));
struct timeval t0, t1;

srand(time(NULL));
for (int i = 0; i < SIZE; i++) {
    a[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
    b[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
}

__m256d ymm0, ymm1, ymm2;
gettimeofday(&t0, NULL);
for (int i = 0; i < SIZE; i += 4) {
    ymm0 = _mm256_load_pd(a+i);
    ymm1 = _mm256_load_pd(b+i);
    ymm2 = _mm256_mul_pd(ymm0, ymm1);
    _mm256_store_pd(c+i, ymm2);
}
gettimeofday(&t1, NULL);

double time1 = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec)*1.0E-6;

double sum = 0.0;
for (int i = 0; i < SIZE; i++) {
    sum += c[i];   /* use the results so the loop is not optimized away */
}
With this code, time1 came out to 6.750000e-04 s.
That is slower than the scalar version, which took around 5.0e-04 s.
I then found that if I comment out the store (_mm256_store_pd(c+i, ymm2);), the loop gets much faster: time1 drops to 1.9300e-04 s.
Based on these results, I think that storing data from the ymm registers to memory is the bottleneck... but is that right?
Is there a good way to store the data without increasing the execution time?
(The actual code was attached.)
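For reference, the scalar version I compared against is essentially the loop below, timed the same way (just a sketch here; the exact code is in the attachment):

gettimeofday(&t0, NULL);
for (int i = 0; i < SIZE; i++) {
    c[i] = a[i] * b[i];   /* plain scalar multiply, no intrinsics */
}
gettimeofday(&t1, NULL);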
OS: Mac OS X 10.8.2
CPU: 2GHz Intel Core i7
Compiler: gcc 4.8
Compiler-options: -mavx (AVX version only)
Thanks.
Hello,
I have some ideas.
First, you might want to increase SIZE or repeat the test several times to get more meaningful timings; in my opinion the measured times are too short to draw reliable conclusions from.
Second, the Intel Intrinsics Guide gave me a hint.
_mm256_mul_pd has a latency of 5 cycles, which is a lot inside such a tight loop. So you might want to unroll the loop yourself: do the loads and the multiply for the first iteration, then the same for the second iteration, and only then issue the two stores. Or, even better, unroll four loop iterations. I think this should hide the latency and thus improve performance, as sketched below.
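A minimal sketch of that two-way unrolling (assuming SIZE is a multiple of 8 and the arrays are 32-byte aligned; the variable names are just illustrative):

for (int i = 0; i < SIZE; i += 8) {
    __m256d x0 = _mm256_load_pd(a + i);
    __m256d y0 = _mm256_load_pd(b + i);
    __m256d x1 = _mm256_load_pd(a + i + 4);
    __m256d y1 = _mm256_load_pd(b + i + 4);
    __m256d p0 = _mm256_mul_pd(x0, y0);   /* second multiply can be issued */
    __m256d p1 = _mm256_mul_pd(x1, y1);   /* while the first is still in flight */
    _mm256_store_pd(c + i,     p0);
    _mm256_store_pd(c + i + 4, p1);
}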
Sergey Kostrov wrote:
Here is a summary...
>>...Storing data is bottleneck?
No. It is an overhead of 400,000 calls to AVX intrinsic functions.
The reciprocal throughput of the call instruction is 2 cycles, so multiplying the 4 intrinsic calls per iteration by the loop count (400,000) gives about 3.2e6 cycles spent just on the function calls. That is a lot of wasted cycles.
Sergey Kostrov wrote:
Here is a summary...
>>...Storing data is bottleneck?
No. It is an overhead of 400,000 calls to AVX intrinsic functions.
I think this is interesting. Visual Studio 2010 inlines the intrinsics. Generally, one might try the /Oi option; if I am not mistaken, it tells the compiler to expand intrinsics inline. This should cut down the overhead.
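For example, a build along these lines keeps that option enabled (the file name is just a placeholder, and /arch:AVX requires VS 2010 SP1 or later):

cl /O2 /Oi /arch:AVX avx_mul.c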
I compiled the following code in VS 2010
>> for (int i = 0; i < SIZE; i += 4)
>> {
>> ymm0 = _mm256_load_pd(a+i);
>> ymm1 = _mm256_load_pd(b+i);
>> ymm2 = _mm256_mul_pd(ymm0, ymm1);
>> _mm256_stream_pd(c+i, ymm2);
>> }
In the normal release build configuration you directly get the assembler instructions; I checked the disassembly (using a breakpoint).
The thing is that VS 2010 has /Oi enabled in the standard release configuration, but removing this option generated the same code (for this loop).
iliyapolak wrote:
Quote:
Sergey Kostrov wrote: Here is a summary...
>>...Storing data is bottleneck?
No. It is an overhead of 400,000 calls to AVX intrinsic functions.
The reciprocal throughput of the call instruction is 2 cycles, so multiplying the 4 intrinsic calls per iteration by the loop count (400,000) gives about 3.2e6 cycles spent just on the function calls. That is a lot of wasted cycles.
I forgot to add the overhead of the ret instruction.
Sergey Kostrov wrote:
I'm experiencing a similar problem, and I see that when intrinsic functions are not inlined, performance is really affected (slower by ~4 times!).
oughhh! that's scary
/O2 /fp:fast /arch:SSE2|AVX in VS2012 is roughly equivalent to ICL /O2 /fp:source /Qansi-alias /arch:... These will auto-vectorize some of the simpler situations without requiring intrinsics. I hate to comment again about /fp:fast having different meanings among these compilers.
ICL will auto-vectorize more situations with /O3 and /fp:fast, or with substitution of CEAN or pragmas. CEAN inherently includes the effect of /Qansi-alias and of the pragmas vector always and ivdep.
ICL /Qansi-alias /Qcomplex-limited-range /arch:SSE4.1 is roughly equivalent to gcc -O3 -ffast-math -march=corei7.
I could believe that /Oi- or /Od (or a debug build) disables in-line expansion of intrinsics in one or more compilers, but I haven't studied this. I'm not clear whether that is what was meant in this thread.
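To illustrate the auto-vectorization point, a plain loop like the sketch below (the same elementwise multiply written without intrinsics; the function name is just an example) is the kind of simple case these compilers can vectorize on their own, e.g. with gcc -O3 -ffast-math -march=corei7 as mentioned above:

/* Scalar source; the compiler is left to generate the AVX loads,
   multiplies and stores itself. */
void mul_arrays(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}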