Small AVX speedup

magicfoot · ‎09-08-2011

My code conversion to avx from SSE with VS2010 Sp1 only gives a 15% speedup
for the simple code fragments shown below. SSE code runs 20seconds, AVX code runs 17 seconds.

Have I missed something obvious? I expected a speedup of maybe 50% at least.
Is this a realistic expectation? I expect my AVX code to run in about 11 to 12 seconds. Am I expecting too much ?

SSE CODE:

VS2010 /arch:sse

#include
__m128 *oh0;// same for all sse,single float for other variables defined by pointer below
pointer = (float **)_aligned_malloc(imax * sizeof(float *),32);

for (i = 1; i < ie; i++) {
for (j = 0; j < je; j+=4) {
oh0=(__m128 *)&ey;
oh1=(__m128 *)&caey;
oh2=(__m128 *)&cbey;
oh3=(__m128 *)&hz[i-1];
oh4=(__m128 *)&hz;
m1 = _mm_mul_ps(*oh0,*oh1);
m2 = _mm_sub_ps(*oh3,*oh4);
m3 = _mm_mul_ps(*oh2,m2);
m4 = _mm_add_ps(m1,m3);
_mm_store_ps(&ey,m4);
}
}

AVX CODE:

VS2010 SP1 /arch:avx

#include
__m256 *aoh0;// all avx defined this way. other variables as single float allocation in 2d arrays below
pointer = (float **)_aligned_malloc(imax * sizeof(float *),32);
for (i = 1; i < ie; i++) {
for (j = 0; j < je; j+=8) {
aoh0=(__m256 *)&ey;
aoh1=(__m256 *)&caey;
aoh2=(__m256 *)&cbey;
aoh3=(__m256 *)&hz[i-1];
aoh4=(__m256 *)&hz;
am1 = _mm256_mul_ps(*aoh0,*aoh1);
am2 = _mm256_sub_ps(*aoh3,*aoh4);
am3 = _mm256_mul_ps(*aoh2,am2);
am4 = _mm256_add_ps(am1,am3);
_mm256_store_ps(&ey,am4);
}
}

Brijender_B_Intel · ‎09-08-2011

It looks like your code is load limited. AVX will give you double processing but it may not help much until data comes to the processor. Also execution has a dependency chain. You may want to make sure:
1. ey is 32byte aligned
2. check if hz load is hurting as it i-1, may be chache line split . You may want to do two loads and shuffle up. Just a suggestion.
3. Load ey[j+8] also much before store (may be unrolling)