Hi,
I am trying to figure out which compiler flags to use to get correct and optimal code from AVX intrinsics. I have put together a small example below showing the problems I am having. Basically, using unaligned AVX loads/stores does not produce optimal code, and I can't find the right combination of flags to get the compiler to do the right thing.
I have included three sample outputs from a very simple test function which illustrate the problems.
Any help or insight would be greatly appreciated. Thank you.
Mike
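For reference, here is a reconstruction of the test function, pieced together from the ### source lines in the listings below. The prototype and the use of float pointers are assumptions; cases 1 and 3 use the unaligned intrinsics as shown, while case 2 swaps in _mm256_load_ps/_mm256_store_ps.

#include <immintrin.h>

/* Reconstructed test function; the exact prototype is an assumption. */
void vavx_test( float *a, float *b, float *c )
{
    __m256 va0, vb0, vc0;
    __m128 vc128;

    va0 = _mm256_loadu_ps( a );
    vb0 = _mm256_loadu_ps( b );
    vc0 = _mm256_add_ps( va0, vb0 );
    _mm256_storeu_ps( c, vc0 );

    vc128 = _mm256_extractf128_ps( vc0, 0 );
    _mm_store_ss( c, vc128 );

    _mm256_zeroupper();

    return;
}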
//-----------------------------------------------------------------------------
// Below shows that using unaligned loads/stores produces suboptimal code.
// Why do these not simply translate to a single "vmovups" to/from a ymm register?
//
Using:
icc -S -use-msasm -xavx -fsource-asm main.c
..___tag_value_vavx_test.32: #17.1
### __m256 va0, vb0, vc0;
### __m128 vc128;
###
### va0 = _mm256_loadu_ps( a );
vmovups (%rdi), %xmm0 #21.28
### vb0 = _mm256_loadu_ps( b );
vmovups (%rsi), %xmm1 #22.28
vinsertf128 $1, 16(%rsi), %ymm1, %ymm3 #22.28
vinsertf128 $1, 16(%rdi), %ymm0, %ymm2 #21.28
### vc0 = _mm256_add_ps( va0, vb0 );
vaddps %ymm3, %ymm2, %ymm4 #23.11
### _mm256_storeu_ps( c, vc0 );
vmovups %xmm4, (%rdx) #24.23
vextractf128 $1, %ymm4, 16(%rdx) #24.23
###
### vc128 = _mm256_extractf128_ps( vc0, 0 );
### _mm_store_ss( c, vc128 );
vmovss %xmm4, (%rdx) #27.5
###
### _mm256_zeroupper();
vzeroupper #29.2
###
###
### return;
ret #32.5
//-----------------------------------------------------------------------------
// Below demonstrates that using aligned loads/stores produces unaligned
// load/store instructions in the assembler output. This is fine, but it is a concern
// when switching compilers or flags, since other settings produce aligned instructions.
//
Using:
icc -S -use-msasm -xavx -fsource-asm main.c
..___tag_value_vavx_test.32: #17.1
### __m256 va0, vb0, vc0;
### __m128 vc128;
###
### va0 = _mm256_load_ps( a );
vmovups (%rdi), %ymm0 #21.27
### vb0 = _mm256_load_ps( b );
### vc0 = _mm256_add_ps( va0, vb0 );
vaddps (%rsi), %ymm0, %ymm1 #23.11
### _mm256_store_ps( c, vc0 );
vmovups %ymm1, (%rdx) #24.22
###
### vc128 = _mm256_extractf128_ps( vc0, 0 );
### _mm_store_ss( c, vc128 );
vmovss %xmm1, (%rdx) #27.5
###
### _mm256_zeroupper();
vzeroupper #29.2
###
###
### return;
ret #32.5
//-----------------------------------------------------------------------------
// Below demonstrates that using -xhost with unaligned loads/stores produces the correct
// unaligned load/stores in the assembler output. But the compiler also mixes legacy SSE
// instructions in with the AVX code and does not issue a vzeroupper before them,
// causing severe AVX-SSE transition penalties.
// (note: this is compiled on a Sandy Bridge system running RHEL 6.1)
//
Using:
icc -S -use-msasm -xhost -fsource-asm main.c
..___tag_value_vavx_test.32: #17.1
### __m256 va0, vb0, vc0;
### __m128 vc128;
###
### va0 = _mm256_loadu_ps( a );
vmovups (%rdi), %ymm0 #21.28
### vb0 = _mm256_loadu_ps( b );
### vc0 = _mm256_add_ps( va0, vb0 );
vaddps (%rsi), %ymm0, %ymm1 #23.11
### _mm256_storeu_ps( c, vc0 );
vmovups %ymm1, (%rdx) #24.23
###
### vc128 = _mm256_extractf128_ps( vc0, 0 );
### _mm_store_ss( c, vc128 );
movss %xmm1, (%rdx) #27.5
###
### _mm256_zeroupper();
vzeroupper #29.2
###
###
### return;
ret #32.5
For case 1:
The compiler is trying to avoid split cache lines, which cause a performance decrease. That is why it splits the 256-bit load into two 128-bit loads. If you can get the compiler to recognize that the pointers are 32-byte aligned, it won't do that, but then that's exactly why you're using unaligned load intrinsics in the first place: you don't know that for sure.
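If the buffers really are 32-byte aligned, one way to tell icc so is the __assume_aligned hint. This is a minimal sketch, and whether it changes the code generated for the intrinsics is worth verifying on your compiler version; the hint is only legal if the alignment guarantee actually holds.

#include <immintrin.h>

void vavx_test( float *a, float *b, float *c )
{
    /* Promise the compiler the pointers are 32-byte aligned.
       Only valid if the caller really guarantees that alignment. */
    __assume_aligned( a, 32 );
    __assume_aligned( b, 32 );
    __assume_aligned( c, 32 );

    __m256 va0 = _mm256_loadu_ps( a );
    __m256 vb0 = _mm256_loadu_ps( b );
    _mm256_storeu_ps( c, _mm256_add_ps( va0, vb0 ) );
    _mm256_zeroupper();
}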
For case 2:
Since you're using aligned load intrinsics, the compiler knows there's no risk of a split cache line and can do the entire 256-bit load at once. It still uses vmovups because there is no significant cost to using vmovups vs. vmovaps on aligned data, and using vmovups avoids a fault in case the data really is unaligned after all.
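For completeness, a minimal sketch of setting up buffers so the aligned intrinsics from case 2 are actually safe; _mm_malloc/_mm_free with 32-byte alignment is one option (posix_memalign or C11 aligned_alloc would do as well).

#include <stdio.h>
#include <immintrin.h>

int main( void )
{
    /* 32-byte alignment means _mm256_load_ps/_mm256_store_ps
       (and vmovaps, if the compiler emits it) cannot fault. */
    float *a = _mm_malloc( 8 * sizeof(float), 32 );
    float *b = _mm_malloc( 8 * sizeof(float), 32 );
    float *c = _mm_malloc( 8 * sizeof(float), 32 );
    if( !a || !b || !c )
        return 1;

    for( int i = 0; i < 8; i++ ) { a[i] = (float)i; b[i] = 1.0f; }

    __m256 vc0 = _mm256_add_ps( _mm256_load_ps( a ), _mm256_load_ps( b ) );
    _mm256_store_ps( c, vc0 );
    _mm256_zeroupper();

    printf( "c[0] = %f\n", c[0] );

    _mm_free( a );
    _mm_free( b );
    _mm_free( c );
    return 0;
}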
For case 3:
I've just confirmed that -xhost maps to -xavx on my 2nd-generation Core i7 system here. Which exact compiler version are you using (icc -V will show this), and can you send me the output you get after adding the -dryrun option?
