Community
cancel
Showing results for 
Search instead for 
Did you mean: 
DLake1
New Contributor I
87 Views

Corruption with the optimization compiler option

Why does the optimization crush the code so much it changes the outcome of the code and how can I stop it?

0 Kudos
35 Replies
TimP
Black Belt
64 Views

If the results change entirely, it's probably because optimization makes the result more sensitive to latent bugs, e.g. undefined data, pointer over-runs, ...  

However, you should also investigate safe options such as including /fp:source, which will disable optimizations which are outside the C and C++ standards.

Besides turning on options which may flag uninitialized data and the like, Other investigative tools include enabling vectorization loop by loop,...

 

DLake1
New Contributor I
64 Views

I managed to get my code working with optimization on. Now if I may continue on this thread what I'm trying to do is create a simple RAM bandwidth test, I'm using a C# application for a basic interface and a very small dll with the following methods that I PInvoke and time with the Stopwatch in .net 4.0: unsigned long long* data; bool Create(){ data=new unsigned long long[134217728]; } bool Destroy(){ delete[] data; } bool Write(unsigned long long dval){ for(auto i=0;i<134217728;++i){data=dval;} return true; } bool Read(unsigned long long dval){ auto errxor=dval; for(auto i=0;i<134217728;++i){errxor=data^errxor;} return errxor!=dval; } The write speed is pretty good but the read performance is lacking, I have AVX intrinsics and parallelization enabled and loop unrolling set to 32, on inspection of the assembly code it appears to be using 128bit instructions. I am unsure of the best way to test the read performance, could you suggest any improvements?
Bernard
Black Belt
64 Views

Why did you set unrolling value to 32?

If you have Haswell CPU it can perform two load/store addresses operations per clock.I am not sure if unrolling 32x will be helpful in your case.You can try to set unrolling by 2 and measure the memory reading performance.

movaps xmm0,xmmword ptr[esi] //esi == input pointer

movaps xmm1,xmmword ptr[esi+16]

Bernard
Black Belt
64 Views

bool Read() function will also execute slower because of bitwise xor operator when each value of data array will be xored with errxor variable.If I am not wrong there could be also an issue of loads blocked by store forwarding errxor = data^errxor;

Can you run VTune analysis on your code?

DLake1
New Contributor I
64 Views

I have a Sandy Bridge E i7 3820 4.3GHz with quad channel 2133MHz RAM. A loop does other stuff like increment the loop variable that creates overhead so the more you can do per loop the less impact the overhead has that's why I'm unrolling it, it doesn't do a whole lot but high numbers do squeeze out a few hundred more MB/s, I'm only trying to learn about this stuff its nothing important. I want it to detect errors too but cant think of a better way to do it, the XOR is faster than "if(data!=dval)error=true;" by about 18GB/s to 16GB/s(with Qparallel off), I know my ram will go about 50GB/s. What information would you like from the VTune analysis?
Bernard
Black Belt
64 Views

Loop logic can be executed in parallel.Port0 can  increment int variable and Port5 can calculate branch of course execution of the loop code must wait for the branch result.If loop was vectorised successfully then Port1 could handle  vector int arithmetics.

VTune analysis could shed more light on performance issues.

Bernard
Black Belt
64 Views

Btw if you want you can also measure loop overhead only without executing code inside the loop.Moreover unrolling by such large number could increase register pressure.

DLake1
New Contributor I
64 Views

Without modifying the Read or Write methods the write performance is slower with a vector down from about 16 to about 10GB/s.

After inspecting the disassembly the vector array has no intrinsics applied and is not being unrolled, with loop unrolling set to 16 this is the disassembly of the Write loop with a vector:

0FC5126C  mov         esi,eax  
0FC5126E  inc         eax  
0FC5126F  shl         esi,4  
0FC51272  mov         edi,dword ptr [data (0FC54980h)]  
0FC51278  mov         dword ptr [edi+esi],ecx  
0FC5127B  mov         dword ptr [edi+esi+4],edx  
0FC5127F  mov         edi,dword ptr [data (0FC54980h)]  
0FC51285  mov         dword ptr [edi+esi+8],ecx  
0FC51289  mov         dword ptr [edi+esi+0Ch],edx  
0FC5128D  cmp         eax,4000000h  
0FC51292  jb          Write+0Ch (0FC5126Ch)

and this is a normal array:

0F621080  vmovntdq    xmmword ptr [ecx+esi*8],xmm0  
0F621085  vmovntdq    xmmword ptr [ecx+esi*8+10h],xmm0  
0F62108B  vmovntdq    xmmword ptr [ecx+esi*8+20h],xmm0  
0F621091  vmovntdq    xmmword ptr [ecx+esi*8+30h],xmm0  
0F621097  vmovntdq    xmmword ptr [ecx+esi*8+40h],xmm0  
0F62109D  vmovntdq    xmmword ptr [ecx+esi*8+50h],xmm0  
0F6210A3  vmovntdq    xmmword ptr [ecx+esi*8+60h],xmm0  
0F6210A9  vmovntdq    xmmword ptr [ecx+esi*8+70h],xmm0  
0F6210AF  vmovntdq    xmmword ptr [ecx+esi*8+80h],xmm0  
0F6210B8  vmovntdq    xmmword ptr [ecx+esi*8+90h],xmm0  
0F6210C1  vmovntdq    xmmword ptr [ecx+esi*8+0A0h],xmm0  
0F6210CA  vmovntdq    xmmword ptr [ecx+esi*8+0B0h],xmm0  
0F6210D3  vmovntdq    xmmword ptr [ecx+esi*8+0C0h],xmm0  
0F6210DC  vmovntdq    xmmword ptr [ecx+esi*8+0D0h],xmm0  
0F6210E5  vmovntdq    xmmword ptr [ecx+esi*8+0E0h],xmm0  
0F6210EE  vmovntdq    xmmword ptr [ecx+esi*8+0F0h],xmm0  
0F6210F7  add         esi,20h  
0F6210FA  cmp         esi,eax  
0F6210FC  jb          Write+50h (0F621080h) 

Bernard
Black Belt
64 Views

Compiler decided to use only one XMM register(XMM0)  to hold the function argument and I think that CPU Port2 and Port3 can still issue two memory stores per clock.I do not know why compiler did not precalculated the pointer offset in advance thus probably diminishing the load on the AGU.

 

vmovntdq  xmmword ptr [ecx+esi],xmm0

vmovntdq xmmword ptr [ecx+esi+16],xmm0

vmovntdq xmmword ptr [ecx+esi+32],xmm0

// unrolling code continues

add esi,256

cmp esi,eax

 

DLake1
New Contributor I
64 Views

So I need to find out why its not using the other registers...

Bernard
Black Belt
64 Views

The best option is to perform VTune analysis and post the results.

DLake1
New Contributor I
64 Views

What type of analysis in VTune?

DLake1
New Contributor I
64 Views

That might explain why I'm getting a higher read speed:

77B611F0  vpxor       xmm0,xmm0,xmmword ptr [edi+eax*8]  
77B611F5  vpxor       xmm1,xmm0,xmmword ptr [edi+eax*8+10h]  
77B611FB  vpxor       xmm2,xmm1,xmmword ptr [edi+eax*8+20h]  
77B61201  vpxor       xmm3,xmm2,xmmword ptr [edi+eax*8+30h]  
77B61207  vpxor       xmm4,xmm3,xmmword ptr [edi+eax*8+40h]  
77B6120D  vpxor       xmm5,xmm4,xmmword ptr [edi+eax*8+50h]  
77B61213  vpxor       xmm6,xmm5,xmmword ptr [edi+eax*8+60h]  
77B61219  vpxor       xmm7,xmm6,xmmword ptr [edi+eax*8+70h]  
77B6121F  vpxor       xmm0,xmm7,xmmword ptr [edi+eax*8+80h]  
77B61228  vpxor       xmm1,xmm0,xmmword ptr [edi+eax*8+90h]  
77B61231  vpxor       xmm2,xmm1,xmmword ptr [edi+eax*8+0A0h]  
77B6123A  vpxor       xmm3,xmm2,xmmword ptr [edi+eax*8+0B0h]  
77B61243  vpxor       xmm4,xmm3,xmmword ptr [edi+eax*8+0C0h]  
77B6124C  vpxor       xmm5,xmm4,xmmword ptr [edi+eax*8+0D0h]  
77B61255  vpxor       xmm6,xmm5,xmmword ptr [edi+eax*8+0E0h]  
77B6125E  vpxor       xmm0,xmm6,xmmword ptr [edi+eax*8+0F0h]  
77B61267  add         eax,20h  
77B6126A  cmp         eax,ecx  
77B6126C  jb          Read+50h (77B611F0h) 

its using all of the xmm registers, but 256bit AVX uses ymm registers which have 2 xmm registers each so would it not use ymm if it were using AVX intrinsics?

TimP
Black Belt
64 Views

Taking a guess about what you may have done or plan to do, you may need to specify and assert 32-byte alignment to get ymm moves for data which aren't defined as m256 types.  Your code excerpt in #3 above would produce 16-byte alignment (you have no control although it may happen to be 32-byte aligned). On Sandy Bridge corei7-2, splitting moves down to 128-bits is a big advantage when there are misalignments, and may still prove better on Ivy Bridge corei7-3. 

In my intrinsics code I make the switch to AVX instructions conditional on building for AVX2, as I find the SSE2 intrinsics running faster with 50% of data not aligned on early AVX platforms. In turn, I make the use of SSE2 intrinsics conditional on building for SSE3, since early SSE2 platforms performed poorly on unaligned data.  Intel C++ promotes those SSE2 intrinsics to AVX-128 such as you show when AVX is set, and drops the vzeroupper which has to be inserted afterwards in order to use MSVC or gcc to run SSE2 with adequate performance on an AVX platform.

DLake1
New Contributor I
64 Views

My code is still the same as it is in post #3, could you show me how do this alignment?

I tried "#pragma vector aligned" but it made no difference.

Any improvements to my code are quite welcome.

I set the vectorizer diagnostic level to 6 and this is what it said:

warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : vectorization support: streaming store was generated for data
warning : LOOP WAS VECTORIZED
warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : vectorization support: unroll factor set to 4
warning : LOOP WAS VECTORIZED
warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : REMAINDER LOOP WAS VECTORIZED

what does it want me to do?

Bernard
Black Belt
64 Views

CommanderLake wrote:

That might explain why I'm getting a higher read speed:

77B611F0  vpxor       xmm0,xmm0,xmmword ptr [edi+eax*8]  
77B611F5  vpxor       xmm1,xmm0,xmmword ptr [edi+eax*8+10h]  
77B611FB  vpxor       xmm2,xmm1,xmmword ptr [edi+eax*8+20h]  
77B61201  vpxor       xmm3,xmm2,xmmword ptr [edi+eax*8+30h]  
77B61207  vpxor       xmm4,xmm3,xmmword ptr [edi+eax*8+40h]  
77B6120D  vpxor       xmm5,xmm4,xmmword ptr [edi+eax*8+50h]  
77B61213  vpxor       xmm6,xmm5,xmmword ptr [edi+eax*8+60h]  
77B61219  vpxor       xmm7,xmm6,xmmword ptr [edi+eax*8+70h]  
77B6121F  vpxor       xmm0,xmm7,xmmword ptr [edi+eax*8+80h]  
77B61228  vpxor       xmm1,xmm0,xmmword ptr [edi+eax*8+90h]  
77B61231  vpxor       xmm2,xmm1,xmmword ptr [edi+eax*8+0A0h]  
77B6123A  vpxor       xmm3,xmm2,xmmword ptr [edi+eax*8+0B0h]  
77B61243  vpxor       xmm4,xmm3,xmmword ptr [edi+eax*8+0C0h]  
77B6124C  vpxor       xmm5,xmm4,xmmword ptr [edi+eax*8+0D0h]  
77B61255  vpxor       xmm6,xmm5,xmmword ptr [edi+eax*8+0E0h]  
77B6125E  vpxor       xmm0,xmm6,xmmword ptr [edi+eax*8+0F0h]  
77B61267  add         eax,20h  
77B6126A  cmp         eax,ecx  
77B6126C  jb          Read+50h (77B611F0h)

its using all of the xmm registers, but 256bit AVX uses ymm registers which have 2 xmm registers each so would it not use ymm if it were using AVX intrinsics?

Actually in this case probably two loads can be executed in parallel,but only one of them will be xored at the same time because of variable interdependency.I suppose that this kind unrolling puts a lot of pressure on register usage , but could transfer more data into cache lines (probably by hardware prefetching) when the CPU is busy executing vpxor instructions. Also vpxor is probably cached because its exhibits high frequency of usage.

Regarding VTune perform at the beginning Bandwidth Analysis.

http://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/GUID-96...

Bernard
Black Belt
64 Views

CommanderLake wrote:

My code is still the same as it is in post #3, could you show me how do this alignment?

I tried "#pragma vector aligned" but it made no difference.

Any improvements to my code are quite welcome.

I set the vectorizer diagnostic level to 6 and this is what it said:

warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : vectorization support: streaming store was generated for data
warning : LOOP WAS VECTORIZED
warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : vectorization support: unroll factor set to 4
warning : LOOP WAS VECTORIZED
warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : REMAINDER LOOP WAS VECTORIZED

what does it want me to do?

That means that loop was vectorized by using xmm registers and streaming stores was used to transfer data.

TimP
Black Belt
64 Views

If your data are defined without alignment where the compiler can see it, the pragma vector aligned may be ignored.  Of course, if the compiler doesn't see the lack of alignment and applies the pragma, your code may break.  You ask us to suggest improvements without showing us what you are currently using.

DLake1
New Contributor I
64 Views

I said my code is still pretty much what it was in post #3 but I just tried splitting the array into 2 and even 4 arrays and I got a performance improvement but it doesn't seem to want to go over 20GB/s:

What I was trying to get at with this thread originally was to do with the error checking, the compiler assumes errxor is of no significance so the optimization makes it harder to do the error checking.