I would like to know what could be the reason for the compiler to generate such assembly code.
I am using the Intel(R) Composer XE 2013 Update 3 (package 171) under Windows 7 64 bit.
Flags : /EHa /GR /O3 /Oi /Ot /Oa- /Qip /QxSSE4.2 /Qstd=c++0x /Z7
Why the hell is it reading the _ptData variable so many times? (First into rax, then into r8, r9, r10 and finally back into rax!)
I do have the impression that it is falling back to debug code here.
And just in advance: I can't provide special code to reproduce it (I don't have the time to try to reproduce the issue). I just want some general advice, if possible, about things I could try to fix the issue.
000007FEECB621CE 48 89 85 A8 00 00 00 mov qword ptr [_ptData],rax
_ptData.common_data = &localData->common_data;;
000007FEECB621D5 48 8B 95 A8 01 00 00 mov rdx,qword ptr [localData]
000007FEECB621DC 48 8D 4A 70 lea rcx,[rdx+70h]
000007FEECB621E0 4C 8B 85 A8 00 00 00 mov r8,qword ptr [_ptData]
000007FEECB621E7 49 89 48 10 mov qword ptr [r8+10h],rcx
_ptData.x = x;
000007FEECB621EB 4C 8B 8D A8 00 00 00 mov r9,qword ptr [_ptData]
000007FEECB621F2 66 45 89 61 28 mov word ptr [r9+28h],r12w
_ptData.y = y;
000007FEECB621F7 4C 8B 95 A8 00 00 00 mov r10,qword ptr [_ptData]
000007FEECB621FE 66 41 89 72 2A mov word ptr [r10+2Ah],si
_ptData.frameBuffer = frameBufferOrigin +y *bpl + x * bpp;
000007FEECB62203 44 8B 5D 0C mov r11d,dword ptr [bpl]
000007FEECB62207 41 0F AF F3 imul esi,r11d
000007FEECB6220B 48 63 F6 movsxd rsi,esi
000007FEECB6220E 48 03 B5 90 00 00 00 add rsi,qword ptr [pos_delta]
000007FEECB62215 8B 45 10 mov eax,dword ptr [bpp]
000007FEECB62218 44 0F AF E0 imul r12d,eax
000007FEECB6221C 4D 63 E4 movsxd r12,r12d
000007FEECB6221F 49 03 F4 add rsi,r12
000007FEECB62222 48 8B 85 A8 00 00 00 mov rax,qword ptr [_ptData]
000007FEECB62229 48 89 70 20 mov qword ptr [rax+20h],rsi
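For readers following the listing, here is a minimal sketch of the address arithmetic in the frameBuffer line. It assumes frameBufferOrigin is a byte pointer and that bpl and bpp are plain ints (neither declaration is shown in this thread). pixel_address is a hypothetical helper name used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper; parameter names follow the statement in the listing.
// Assumption: frameBufferOrigin points to bytes, bpl and bpp are ints.
inline std::uint8_t* pixel_address(std::uint8_t* frameBufferOrigin,
                                   std::uint16_t x, std::uint16_t y,
                                   int bpl, int bpp)
{
    // Both products are evaluated as 32-bit ints (the two imul instructions).
    const int rowOffset = y * bpl;
    const int colOffset = x * bpp;

    // Each product is then widened to 64 bits before being added to the
    // pointer (the two movsxd instructions followed by the adds).
    return frameBufferOrigin
         + static_cast<std::ptrdiff_t>(rowOffset)
         + static_cast<std::ptrdiff_t>(colOffset);
}
```

This part of the listing looks like ordinary optimized code. Only the repeated loads of _ptData stand out.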
The code is in a loop but the loop is not unrolled as far as I can see. Since the number of steps of the loop is computed just before I don't think that it will unroll it anyway.
I have also put that code inside its own block to make clear the variable lifetime.
// for loop here
MyDataStruct *_ptData = cached_points_ptr+cache_index;
_ptData->common_data = &localData->common_data; //pointer
_ptData->x = x; //ushort
_ptData->y = y; //ushort
_ptData->frameBuffer = frameBufferOrigin +y *bpl + x * bpp; //pointer
_ptData->zbuffer = zbuffer; //pointer
_ptData->color = col1; //F32vec4
_ptData->z0 = z0; //float
if ( use_texture )
_ptData->u0 = ((const float *)&uv1); //uv1 is a F32vec4
_ptData->v0 = ((const float *)&uv1);
We are not calling icl directly but via an ant script, so I wonder if the order of my options has something to do with that.
I am going to try to remove the Z7 and see what code it produces without it.
Ok it has nothing to do with Z7 being on the command line. I removed it and the code is still the same.
At least that is a relief because I would have hated having to remove the debug info right now.
Actually I have tried with /O2 or /O3 and without /Ot for example and the generated code didn't change at all.
Even with O1 I would expect the multiple reads of the structure pointer to go away; this is definitely not an advanced optimization. (Some code that I have posted has not appeared yet, apparently.)
I really wonder if there isn't something else there somewhere obvious that I am missing...
Now here is an interesting development.
If I put my simple code into a function then the code writing in my struct gets optimized correctly.
I have extra code because of the parameters being passed and read in the function but the code setting the data is not reading the data pointer for every member write. How strange...
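A minimal sketch of what moving the member writes into a function could look like. The struct layouts here are simplified, hypothetical stand-ins (the real MyDataStruct is not shown in this thread), and fill_point is an invented name. Field types follow the comments in the snippet above.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-ins for the real types.
struct CommonData { int dummy; };
struct LocalData  { CommonData common_data; };

struct MyDataStruct {
    const CommonData* common_data;  // pointer
    std::uint8_t*     frameBuffer;  // pointer
    std::uint16_t     x;            // ushort
    std::uint16_t     y;            // ushort
    float             z0;           // float
};

// All member writes go through a single reference parameter, so the
// compiler can keep the struct address in one register for every store.
inline void fill_point(MyDataStruct& pt, const LocalData* localData,
                       std::uint8_t* frameBuffer,
                       std::uint16_t x, std::uint16_t y, float z0)
{
    pt.common_data = &localData->common_data;
    pt.x = x;
    pt.y = y;
    pt.frameBuffer = frameBuffer;
    pt.z0 = z0;
}
```

In the loop, the call would then be something like fill_point(cached_points_ptr[cache_index], localData, fb, x, y, z0), with fb computed beforehand. Whether the stores stay optimized once the function is inlined is the open question in the posts that follow.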
Trying to inline the same function to see if it makes a difference now.
Ok I have found the single line that toggles the optimization off.
No idea why it does it, I am trying to write some code to reproduce it. I had almost the same code except that line and it was compiling fine.
That line is just using a class storing 2 ints (Size2 class) and I can't see anything that could cause the issue.
20% speed improvement just by fixing that. I wonder how often that occurs in our code now.
I will try to prepare something tomorrow.
Just to be clear, the issue happens with O2 too. And the code works in both cases, it is just slower :)
The interesting part is that the first two lines of my function have almost no relation to the rest of it, and the variable causing the issue is not even used afterwards in the function, yet it affects all the generated code.
I should be able to extract something that compiles and shows the issue but it won't run. The code extract is part of something much bigger that would take too long to separate from the full code.
More tomorrow I hope.
So icl does fine with a simple test case. If you could send the preprocessed .i file, that would be great. Please note the line that is causing the problem.
Or report it to Intel Premier Support.
Ok I have been able to create a small project showing the issue.
I will make an archive of it. To whom should I send it? Since it contains code that I don't really want to publish in the forum I would prefer if someone could send me a private message with the email address.
Or should I report it to Intel Premier Support instead?