Why does optimization transform the code so heavily that it changes the outcome, and how can I stop it?
If the results change entirely, it's probably because optimization makes the result more sensitive to latent bugs, e.g. undefined data, pointer overruns, etc.
However, you should also investigate safe options such as /fp:source, which disables floating-point optimizations that fall outside the C and C++ standards.
Besides turning on options that may flag uninitialized data and the like, other investigative tools include enabling vectorization loop by loop.
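For example, the Intel compiler accepts per-loop pragmas that let you turn vectorization off one loop at a time and bisect which loop changes the results. A minimal sketch, with placeholder loop bodies:

void Process(float* a, float* b, int n) {
#pragma novector // suppress vectorization for this loop only
    for (int i = 0; i < n; ++i)
        a[i] += b[i];

#pragma vector always // ask the compiler to vectorize this loop
    for (int i = 0; i < n; ++i)
        b[i] *= 2.0f;
}

If the results only go wrong when a particular loop is vectorized, that loop is the place to look for alignment or dependency bugs.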
Why did you set the unrolling value to 32?
If you have a Haswell CPU, it can perform two load/store address operations per clock, so I am not sure that unrolling 32x will be helpful in your case. You can try setting the unroll factor to 2 and measuring the memory read performance:
movaps xmm0, xmmword ptr [esi]    // esi == input pointer
movaps xmm1, xmmword ptr [esi+16]
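If you would rather let the compiler handle it, Intel C++ also accepts a per-loop unroll request, so you can compare factors directly. A minimal sketch, with placeholder names:

#pragma unroll(2) // request a 2x unroll instead of 32x
for (int i = 0; i < count; ++i)
    output[i] = input[i];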
The bool Read() function will also execute more slowly because of the bitwise XOR operator: each value of the data array is XORed into the errxor variable, which creates a serial dependency. If I am not wrong, there could also be an issue of loads blocked by store forwarding on errxor = data ^ errxor;
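One common way to reduce that serial dependency, assuming Read() folds every element into a single errxor variable, is to accumulate into several independent variables and combine them once after the loop. A sketch with placeholder names (remainder handling omitted for brevity):

unsigned long long Checksum(const unsigned long long* data, size_t count) {
    unsigned long long x0 = 0, x1 = 0, x2 = 0, x3 = 0;
    for (size_t i = 0; i + 3 < count; i += 4) {
        x0 ^= data[i];     // the four chains have no dependency on
        x1 ^= data[i + 1]; // each other, so they can run in parallel
        x2 ^= data[i + 2];
        x3 ^= data[i + 3];
    }
    return x0 ^ x1 ^ x2 ^ x3; // XOR is associative, so the order doesn't matter
}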
Can you run a VTune analysis on your code?
The loop logic can be executed in parallel: Port 0 can increment the int variable and Port 5 can calculate the branch, although execution of the loop body must of course wait for the branch result. If the loop was vectorized successfully, then Port 1 could handle the vector integer arithmetic.
A VTune analysis could shed more light on the performance issues.
By the way, if you want you can also measure the loop overhead alone, without executing the code inside the loop. Moreover, unrolling by such a large factor could increase register pressure.
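A minimal sketch of measuring empty-loop overhead with the timestamp counter (the volatile counter keeps the optimizer from deleting the loop, at the cost of forcing it through memory, so treat the result as an upper bound):

#include <intrin.h> // __rdtsc on MSVC/Intel C++
#include <cstdio>

int main() {
    const unsigned long long iters = 100000000ULL;
    unsigned long long start = __rdtsc();
    for (volatile unsigned long long i = 0; i < iters; ++i) {
        // empty body: only the increment/compare/branch remain
    }
    unsigned long long cycles = __rdtsc() - start;
    printf("~%.2f cycles per iteration\n", (double)cycles / (double)iters);
    return 0;
}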
Without modifying the Read or Write methods, write performance is slower with a vector: down from about 16 GB/s to about 10 GB/s.
After inspecting the disassembly, the vector version has no vector instructions applied and is not being unrolled. With loop unrolling set to 16, this is the disassembly of the Write loop with a vector:
0FC5126C mov esi,eax
0FC5126E inc eax
0FC5126F shl esi,4
0FC51272 mov edi,dword ptr [data (0FC54980h)]
0FC51278 mov dword ptr [edi+esi],ecx
0FC5127B mov dword ptr [edi+esi+4],edx
0FC5127F mov edi,dword ptr [data (0FC54980h)]
0FC51285 mov dword ptr [edi+esi+8],ecx
0FC51289 mov dword ptr [edi+esi+0Ch],edx
0FC5128D cmp eax,4000000h
0FC51292 jb Write+0Ch (0FC5126Ch)
and this is with a normal array:
0F621080 vmovntdq xmmword ptr [ecx+esi*8],xmm0
0F621085 vmovntdq xmmword ptr [ecx+esi*8+10h],xmm0
0F62108B vmovntdq xmmword ptr [ecx+esi*8+20h],xmm0
0F621091 vmovntdq xmmword ptr [ecx+esi*8+30h],xmm0
0F621097 vmovntdq xmmword ptr [ecx+esi*8+40h],xmm0
0F62109D vmovntdq xmmword ptr [ecx+esi*8+50h],xmm0
0F6210A3 vmovntdq xmmword ptr [ecx+esi*8+60h],xmm0
0F6210A9 vmovntdq xmmword ptr [ecx+esi*8+70h],xmm0
0F6210AF vmovntdq xmmword ptr [ecx+esi*8+80h],xmm0
0F6210B8 vmovntdq xmmword ptr [ecx+esi*8+90h],xmm0
0F6210C1 vmovntdq xmmword ptr [ecx+esi*8+0A0h],xmm0
0F6210CA vmovntdq xmmword ptr [ecx+esi*8+0B0h],xmm0
0F6210D3 vmovntdq xmmword ptr [ecx+esi*8+0C0h],xmm0
0F6210DC vmovntdq xmmword ptr [ecx+esi*8+0D0h],xmm0
0F6210E5 vmovntdq xmmword ptr [ecx+esi*8+0E0h],xmm0
0F6210EE vmovntdq xmmword ptr [ecx+esi*8+0F0h],xmm0
0F6210F7 add esi,20h
0F6210FA cmp esi,eax
0F6210FC jb Write+50h (0F621080h)
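One clue in the vector listing is that the buffer pointer is reloaded inside the loop (mov edi,dword ptr [data] appears twice per unrolled iteration): the compiler cannot prove the stores don't modify the vector object itself. Assuming the Write loop stores through the vector, a common workaround is to cache the data pointer in a local; a sketch with placeholder names:

#include <vector>

static std::vector<unsigned long long> data(0x4000000);

void Write(unsigned long long value) {
    unsigned long long* p = data.data(); // hoist the pointer out of the loop
    const size_t n = data.size();
    for (size_t i = 0; i < n; ++i)
        p[i] = value; // a store can't change the local p, so no reload is needed
}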
The compiler decided to use only one XMM register (XMM0) to hold the function argument, and I think CPU Ports 2 and 3 can still issue two memory stores per clock. I do not know why the compiler did not precalculate the pointer offsets in advance, which would probably reduce the load on the AGUs:
vmovntdq xmmword ptr [ecx+esi],xmm0
vmovntdq xmmword ptr [ecx+esi+16],xmm0
vmovntdq xmmword ptr [ecx+esi+32],xmm0
// unrolling code continues
add esi,256
cmp esi,eax
So I need to find out why it's not using the other registers...
The best option is to perform a VTune analysis and post the results.
What type of analysis in VTune?
That might explain why I'm getting a higher read speed:
77B611F0 vpxor xmm0,xmm0,xmmword ptr [edi+eax*8]
77B611F5 vpxor xmm1,xmm0,xmmword ptr [edi+eax*8+10h]
77B611FB vpxor xmm2,xmm1,xmmword ptr [edi+eax*8+20h]
77B61201 vpxor xmm3,xmm2,xmmword ptr [edi+eax*8+30h]
77B61207 vpxor xmm4,xmm3,xmmword ptr [edi+eax*8+40h]
77B6120D vpxor xmm5,xmm4,xmmword ptr [edi+eax*8+50h]
77B61213 vpxor xmm6,xmm5,xmmword ptr [edi+eax*8+60h]
77B61219 vpxor xmm7,xmm6,xmmword ptr [edi+eax*8+70h]
77B6121F vpxor xmm0,xmm7,xmmword ptr [edi+eax*8+80h]
77B61228 vpxor xmm1,xmm0,xmmword ptr [edi+eax*8+90h]
77B61231 vpxor xmm2,xmm1,xmmword ptr [edi+eax*8+0A0h]
77B6123A vpxor xmm3,xmm2,xmmword ptr [edi+eax*8+0B0h]
77B61243 vpxor xmm4,xmm3,xmmword ptr [edi+eax*8+0C0h]
77B6124C vpxor xmm5,xmm4,xmmword ptr [edi+eax*8+0D0h]
77B61255 vpxor xmm6,xmm5,xmmword ptr [edi+eax*8+0E0h]
77B6125E vpxor xmm0,xmm6,xmmword ptr [edi+eax*8+0F0h]
77B61267 add eax,20h
77B6126A cmp eax,ecx
77B6126C jb Read+50h (77B611F0h)
It's using all of the XMM registers, but 256-bit AVX uses YMM registers, each of which spans two XMM-sized lanes, so wouldn't it use YMM if it were generating AVX instructions?
Taking a guess about what you may have done or plan to do: you may need to specify and assert 32-byte alignment to get YMM moves for data which aren't defined as __m256 types. Your code excerpt in #3 above would produce 16-byte alignment (you have no control beyond that, although it may happen to be 32-byte aligned). On Sandy Bridge (Core i7-2xxx), splitting moves down to 128 bits is a big advantage when there are misalignments, and it may still prove better on Ivy Bridge (Core i7-3xxx).
In my intrinsics code I make the switch to AVX instructions conditional on building for AVX2, as I find the SSE2 intrinsics run faster when 50% of the data is unaligned on early AVX platforms. In turn, I make the use of SSE2 intrinsics conditional on building for SSE3, since early SSE2 platforms performed poorly on unaligned data. Intel C++ promotes those SSE2 intrinsics to AVX-128 such as you show when AVX is set, and drops the vzeroupper which otherwise has to be inserted in order for MSVC or gcc to run SSE2 with adequate performance on an AVX platform.
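A minimal sketch of specifying and asserting 32-byte alignment with Intel C++, using hypothetical names (the allocation and the assertion have to agree, or the pragma becomes a lie and aligned vector moves may fault):

#include <immintrin.h> // _mm_malloc / _mm_free

void Fill(double* buf, size_t n, double v) {
    __assume_aligned(buf, 32); // Intel C++: promise 32-byte alignment
#pragma vector aligned // allow aligned vector moves in this loop
    for (size_t i = 0; i < n; ++i)
        buf[i] = v;
}

// The allocation must actually deliver that alignment:
// double* buf = (double*)_mm_malloc(n * sizeof(double), 32);
// Fill(buf, n, 0.0);
// _mm_free(buf);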
My code is still the same as it is in post #3; could you show me how to do this alignment?
I tried "#pragma vector aligned" but it made no difference.
Any improvements to my code are quite welcome.
I set the vectorizer diagnostic level to 6 and this is what it said:
warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : vectorization support: streaming store was generated for data
warning : LOOP WAS VECTORIZED
warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : vectorization support: unroll factor set to 4
warning : LOOP WAS VECTORIZED
warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : REMAINDER LOOP WAS VECTORIZED
What does it want me to do?
CommanderLake wrote:
That might explain why I'm getting a higher read speed: [...] It's using all of the XMM registers, but 256-bit AVX uses YMM registers, each of which spans two XMM-sized lanes, so wouldn't it use YMM if it were generating AVX instructions?
Actually, in this case two loads can probably be executed in parallel, but only one of them will be XORed at a time because of the dependency between the registers. I suppose this kind of unrolling puts a lot of pressure on register usage, but it could transfer more data into the cache lines (probably via hardware prefetching) while the CPU is busy executing vpxor instructions. The vpxor instruction also probably sits in the decoded-uop cache given how frequently it is used.
Regarding VTune, start with a Bandwidth analysis.
CommanderLake wrote:
My code is still the same as it is in post #3; could you show me how to do this alignment? [...] What does it want me to do?
That means the loop was vectorized using XMM registers, and streaming stores were used to transfer the data.
If your data are defined without alignment where the compiler can see it, #pragma vector aligned may be ignored. Of course, if the compiler doesn't see the lack of alignment and applies the pragma anyway, your code may break. You're asking us to suggest improvements without showing us what you are currently using.
As I said, my code is still pretty much what it was in post #3, but I just tried splitting the array into 2 and even 4 arrays and got a performance improvement; it doesn't seem to want to go over 20 GB/s, though.
What I was originally trying to get at with this thread was the error checking: the compiler assumes errxor is of no significance, so the optimization makes it harder to do the error checking.
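If the worry is that the compiler treats errxor as dead and strips out the checksum work, one sketch of a workaround, assuming a Read() shaped roughly like the one discussed earlier, is to give the value an observable use:

static volatile unsigned long long g_sink; // hypothetical volatile sink

unsigned long long Read(const unsigned long long* data, size_t count) {
    unsigned long long errxor = 0;
    for (size_t i = 0; i < count; ++i)
        errxor ^= data[i];
    g_sink = errxor; // volatile store: the XOR chain must be kept
    return errxor;   // or compare against the expected pattern here
}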
"//data0=(unsigned long long*)_mm_malloc(268435456,16);"
You can pass 64 as the alignment argument to _mm_malloc(), but the memory should then be allocated in multiples of one page (4 KB).
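A minimal sketch of that suggestion (same 256 MB size as the commented-out line above, but with 64-byte, i.e. cache-line, alignment):

#include <immintrin.h> // _mm_malloc / _mm_free

unsigned long long* data0 =
    (unsigned long long*)_mm_malloc(268435456, 64); // alignment must be a power of two
// ... use data0 ...
_mm_free(data0); // memory from _mm_malloc must be freed with _mm_free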
