Hi,
I'm writing some AVX example code, shown below:
/* SIZE and the struct timeval variables t0, t1 are declared in the attached code. */
double a[SIZE] __attribute__((aligned(32)));
double b[SIZE] __attribute__((aligned(32)));
double c[SIZE] __attribute__((aligned(32)));
srand(time(NULL));
for (int i = 0; i < SIZE; i++) {
    a[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
    b[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
}
__m256d ymm0, ymm1, ymm2;
gettimeofday(&t0, NULL);
for (int i = 0; i < SIZE; i += 4) {
    ymm0 = _mm256_load_pd(a+i);        /* aligned load of 4 doubles */
    ymm1 = _mm256_load_pd(b+i);
    ymm2 = _mm256_mul_pd(ymm0, ymm1);  /* element-wise multiply */
    _mm256_store_pd(c+i, ymm2);        /* aligned store of 4 doubles */
}
gettimeofday(&t1, NULL);
double time1;
time1 = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec)*1.0E-6;
double sum = 0.0;
for (int i = 0; i < SIZE; i++) {
    sum += c[i];
}
The measured time1 for this code was 6.750000e-04 s.
That is slower than the scalar version, which took around 5.0e-04 s.
I then found that if I comment out the store (_mm256_store_pd(c+i, ymm2);), the code gets much faster (time1 drops to 1.9300e-04 s).
From these results, I think that storing data from a ymm register to memory is the bottleneck. Is that right?
Is there a good way to store the data without increasing the execution time?
(The actual code was attached.)
OS: Mac OS X 10.8.2
CPU: 2GHz Intel Core i7
Compiler: gcc 4.8
Compiler-options: -mavx (AVX version only)
Thanks.
Hello,
I have some ideas.
First, you might want to increase the size or repeat the test several times to get more meaningful timings. In my opinion, the times are too short to draw conclusions from.
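One way to repeat the test, as a minimal sketch: time the same kernel several times and keep the fastest run. The names best_time, sample_kernel and kernel_calls are hypothetical and not from the attached code; sample_kernel only stands in for the AVX loop.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, via gettimeofday as in the original post. */
static double now_sec(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + (double)t.tv_usec * 1.0e-6;
}

/* Hypothetical helper: run kernel() reps times and return the
 * fastest single run, in seconds. */
double best_time(void (*kernel)(void), int reps)
{
    assert(kernel != NULL && reps > 0);
    double best = DBL_MAX;
    for (int r = 0; r < reps; r++) {
        double t0 = now_sec();
        kernel();
        double dt = now_sec() - t0;
        if (dt < best)
            best = dt;
    }
    return best;
}

/* Hypothetical stand-in for the loop under test; it only counts calls. */
int kernel_calls = 0;

void sample_kernel(void)
{
    kernel_calls++;
}
```

Calling best_time(sample_kernel, 5) runs the kernel five times. Taking the minimum rather than a single run tends to reduce the influence of one-off costs such as cold caches on a sub-millisecond measurement.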
Secondly, the Intel Intrinsics Guide gave me a hint.
_mm256_mul_pd has a latency of 5 cycles. That is a lot for such a short loop, so you might want to try unrolling the loop by hand: do the loads and the multiply for the first iteration, then the same for the second iteration, and only then do the two stores. Even better, unroll four iterations. I think this should hide the latency and thus improve performance.
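The two-way unrolling described above could look roughly like this. It is a sketch only: the function name mul_avx_unrolled2 and the asserts are assumptions added for illustration, and whether it is actually faster has to be measured. The target attribute lets GCC 4.9+ or Clang build the function without -mavx; with older GCC, compile with -mavx as in the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* c[i] = a[i] * b[i], two AVX vectors (8 doubles) per loop iteration.
 * Assumes 32-byte aligned pointers and n a multiple of 8. */
__attribute__((target("avx")))
void mul_avx_unrolled2(const double *a, const double *b, double *c, size_t n)
{
    assert(n % 8 == 0);
    assert((uintptr_t)a % 32 == 0);
    assert((uintptr_t)b % 32 == 0);
    assert((uintptr_t)c % 32 == 0);

    for (size_t i = 0; i < n; i += 8) {
        /* Loads and multiply for the first vector. */
        __m256d a0 = _mm256_load_pd(a + i);
        __m256d b0 = _mm256_load_pd(b + i);
        __m256d p0 = _mm256_mul_pd(a0, b0);

        /* Loads and multiply for the second vector; independent of p0. */
        __m256d a1 = _mm256_load_pd(a + i + 4);
        __m256d b1 = _mm256_load_pd(b + i + 4);
        __m256d p1 = _mm256_mul_pd(a1, b1);

        /* Both stores come last. */
        _mm256_store_pd(c + i, p0);
        _mm256_store_pd(c + i + 4, p1);
    }
}
```

The two multiplies do not depend on each other, so the second can be issued while the first is still in flight. An optimizing compiler may reorder or unroll the original loop in a similar way on its own.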
Sergey Kostrov wrote:
Here is a summary...
>>...Storing data is bottleneck?
No. It is an overhead of 400,000 calls to AVX intrinsic functions.
The reciprocal throughput of the call instruction is 2 cycles per instruction. Multiplying that by the 4 function calls per iteration and by the loop counter value (400,000) gives a total of 3.2e6 cycles spent on function calls. That is a lot of wasted cycles.
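To put the cycle estimate on the same scale as the measured times, it can be converted to seconds. The sketch below assumes the nominal 2 GHz clock from the original post (turbo and other frequency changes ignored) and assumes the calls are really emitted, which depends on the compiler and optimization level. The function name call_overhead_seconds is hypothetical.

```c
#include <assert.h>

/* Estimated time spent on call overhead, in seconds:
 * calls * cycles_per_call / clock_hz. */
double call_overhead_seconds(double calls, double cycles_per_call,
                             double clock_hz)
{
    assert(clock_hz > 0.0);
    return calls * cycles_per_call / clock_hz;
}
```

With the figures quoted above, call_overhead_seconds(4.0 * 400000.0, 2.0, 2.0e9) is 3.2e6 cycles / 2e9 Hz, about 1.6e-3 s. The original post measured 6.75e-4 s for the whole loop, so the estimate is best read as a rough model rather than a measurement.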
Sergey Kostrov wrote:
Here is a summary...
>>...Storing data is bottleneck?
No. It is an overhead of 400,000 calls to AVX intrinsic functions.
I think this is interesting. Visual Studio 2010 inlines the intrinsics. In general, one might try the /Oi option. If I am not mistaken, it tells the compiler to inline intrinsics. This should cut down the overhead.
I compiled the following code in VS 2010:
>> for (int i=0; i<SIZE; i+=4)
>> {
>> ymm0 = _mm256_load_pd(a+i);
>> ymm1 = _mm256_load_pd(b+i);
>> ymm2 = _mm256_mul_pd(ymm0, ymm1);
>> _mm256_stream_pd(c+i, ymm2);
>> }
In a normal release build configuration, you get assembler instructions directly, with no calls. I checked the disassembly (using a breakpoint).
The thing is that VS 2010 has the /Oi option turned on in the standard release configuration. Removing this option generated the same code (for this loop).
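The loop quoted above uses _mm256_stream_pd, a non-temporal store, in place of _mm256_store_pd. A minimal sketch of that variant as a standalone function follows. The function name mul_avx_stream, the asserts, and the trailing _mm_sfence are assumptions added for illustration, not taken from the thread, and the attribute syntax is for GCC or Clang rather than Visual Studio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* c[i] = a[i] * b[i] using non-temporal (streaming) stores.
 * Assumes 32-byte aligned pointers and n a multiple of 4. */
__attribute__((target("avx")))
void mul_avx_stream(const double *a, const double *b, double *c, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)a % 32 == 0);
    assert((uintptr_t)b % 32 == 0);
    assert((uintptr_t)c % 32 == 0);

    for (size_t i = 0; i < n; i += 4) {
        __m256d va = _mm256_load_pd(a + i);
        __m256d vb = _mm256_load_pd(b + i);
        /* Streaming store: writes the result with a non-temporal hint. */
        _mm256_stream_pd(c + i, _mm256_mul_pd(va, vb));
    }
    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();
}
```

Whether streaming stores help depends on the array size relative to the caches and on whether c is read again soon afterwards, so this variant should be timed against the plain store version.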
iliyapolak wrote:
Quote:
Sergey Kostrov wrote:
Here is a summary...
>>...Storing data is bottleneck?
No. It is an overhead of 400,000 calls to AVX intrinsic functions.
The reciprocal throughput of the call instruction is 2 cycles per instruction. Multiplying that by the 4 function calls per iteration and by the loop counter value (400,000) gives a total of 3.2e6 cycles spent on function calls. That is a lot of wasted cycles.
I forgot to add the overhead of the ret instruction.
Sergey Kostrov wrote:
I'm experiencing a similar problem, and I see that when intrinsic functions are not inlined, performance is really affected (slower by ~4 times!).
Ouch! That's scary.
/O2 /fp:fast /arch:SSE2|AVX in VS2012 is roughly equivalent to ICL /O2 /fp:source /Qansi-alias /arch:... These will auto-vectorize some of the simpler situations without requiring intrinsics. I hate to comment again about /fp:fast having different meanings among these compilers.
ICL will auto-vectorize more situations with /O3 and /fp:fast, or when the loop is rewritten with CEAN or annotated with pragmas. CEAN inherently includes the effect of /Qansi-alias and of the pragmas vector always and ivdep.
ICL /Qansi-alias /Qcomplex-limited-range /arch:SSE4.1 is roughly equivalent to gcc -O3 -ffast-math -march=corei7.
I could believe that /Oi- or /Od (or debug build mode) disables in-line expansion of intrinsics in one or more compilers, but I haven't studied this. It's not clear to me whether that was what was meant in this thread.
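For reference, the kind of simple loop that auto-vectorization targets can be written without intrinsics. This is a sketch under assumptions: the function name mul_plain is hypothetical, and whether a given compiler vectorizes it depends on the options discussed above (for example gcc -O3 with a suitable -march).

```c
#include <assert.h>
#include <stddef.h>

/* c[i] = a[i] * b[i] in plain C. The restrict qualifiers promise the
 * compiler that the three arrays do not overlap, which removes one
 * common obstacle to vectorization. */
void mul_plain(const double *restrict a, const double *restrict b,
               double *restrict c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}
```

Unlike the intrinsic versions, this form has no alignment or length requirements, so it can also serve as the scalar baseline when comparing timings.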