- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
My code conversion to avx from SSE with VS2010 Sp1 only gives a 15% speedup
for the simple code fragments shown below. SSE code runs 20seconds, AVX code runs 17 seconds.
Have I missed something obvious? I expected a speedup of maybe 50% at least.
Is this a realistic expectation? I expect my AVX code to run in about 11 to 12 seconds. Am I expecting too much ?
SSE CODE:
VS2010 /arch:sse
#include
__m128 *oh0;// same for all sse,single float for other variables defined by pointer below
pointer = (float **)_aligned_malloc(imax * sizeof(float *),32);
for (i = 1; i < ie; i++) {
for (j = 0; j < je; j+=4) {
oh0=(__m128 *)&ey
oh1=(__m128 *)&caey
oh2=(__m128 *)&cbey
oh3=(__m128 *)&hz[i-1]
oh4=(__m128 *)&hz
m1 = _mm_mul_ps(*oh0,*oh1);
m2 = _mm_sub_ps(*oh3,*oh4);
m3 = _mm_mul_ps(*oh2,m2);
m4 = _mm_add_ps(m1,m3);
_mm_store_ps(&ey
}
}
AVX CODE:
VS2010 SP1 /arch:avx
#include
__m256 *aoh0;// all avx defined this way. other variables as single float allocation in 2d arrays below
pointer = (float **)_aligned_malloc(imax * sizeof(float *),32);
for (i = 1; i < ie; i++) {
for (j = 0; j < je; j+=8) {
aoh0=(__m256 *)&ey
aoh1=(__m256 *)&caey
aoh2=(__m256 *)&cbey
aoh3=(__m256 *)&hz[i-1]
aoh4=(__m256 *)&hz
am1 = _mm256_mul_ps(*aoh0,*aoh1);
am2 = _mm256_sub_ps(*aoh3,*aoh4);
am3 = _mm256_mul_ps(*aoh2,am2);
am4 = _mm256_add_ps(am1,am3);
_mm256_store_ps(&ey
}
}
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
1. ey is 32byte aligned
2. check if hz load is hurting as it i-1, may be chache line split . You may want to do two loads and shuffle up. Just a suggestion.
3. Load ey[j+8] also much before store (may be unrolling)

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page