
Inefficient assembler code generated for a simple while-loop?

tj1
Beginner

Hi,
I have the following piece of code to execute a simple loop.

void greyscale_iterator2(Image32& image32, Image8& image8)
{
    Image32::iterator begin = image32.row_major_begin();
    Image32::iterator end = image32.row_major_end();
    Image8::iterator it8 = image8.row_major_begin();
    GreyScaleFunctor gsf;
    while (begin != end)
    {
        *it8 = gsf(*begin);
        ++begin;
        ++it8;
    }
}

I noticed that when this code is compiled and run with the Intel C++ compiler, it doesn't perform as well as I expected, so I took a look at the assembler code in the debugger:

while(begin != end)
01392D31 push eax
01392D32 lea eax,[begin]
01392D35 lea edx,[end]
01392D38 mov dword ptr [esp],edx
01392D3B mov ecx,eax
01392D3D call ImageExperiments::Image32Iterator::operator!= (139103Ch)
01392D42 mov byte ptr [ebp-74h],al
01392D45 movzx eax,byte ptr [ebp-74h]
01392D49 movzx eax,al
01392D4C test eax,eax
01392D4E je ImageExperiments::greyscale_iterator2+0BCh (1392DACh)
{
*it8 = gsf(*begin);
01392D50 lea eax,[begin]
01392D53 mov ecx,eax
01392D55 call ImageExperiments::Image32Iterator::operator* (13910A5h)
01392D5A mov dword ptr [ebp-10h],eax
01392D5D push eax
01392D5E lea eax,[gsf]
01392D61 mov edx,dword ptr [ebp-10h]
01392D64 mov edx,dword ptr [edx]
01392D66 mov dword ptr [esp],edx
01392D69 mov ecx,eax
01392D6B call ImageExperiments::GreyScaleFunctor::operator() (139101Eh)
01392D70 mov byte ptr [ebp-72h],al
01392D73 movzx eax,byte ptr [ebp-72h]
01392D77 mov byte ptr [ebp-71h],al
01392D7A lea eax,[it8]
01392D7D mov ecx,eax
01392D7F call ImageExperiments::Image8Iterator::operator* (1391050h)
01392D84 mov dword ptr [ebp-0Ch],eax
01392D87 mov eax,dword ptr [ebp-0Ch]
01392D8A movzx edx,byte ptr [ebp-71h]
01392D8E mov byte ptr [eax],dl
++begin;
01392D90 lea eax,[begin]
01392D93 mov ecx,eax
01392D95 call ImageExperiments::Image32Iterator::operator++ (1391028h)
01392D9A mov dword ptr [ebp-8],eax
++it8;
01392D9D lea eax,[it8]
01392DA0 mov ecx,eax
01392DA2 call ImageExperiments::Image8Iterator::operator++ (1391014h)
01392DA7 mov dword ptr [ebp-4],eax
01392DAA jmp ImageExperiments::greyscale_iterator2+41h (1392D31h)
}
}
00CA2DAC leave
00CA2DAD ret


and then compared this with the output of the MSVC10 compiler:

while(begin != end)
010316E0 lea eax,[end]
010316E3 push eax
010316E4 lea ecx,[begin]
010316E7 call ImageExperiments::Image32Iterator::operator!= (1031096h)
010316EC movzx ecx,al
010316EF test ecx,ecx
010316F1 je ImageExperiments::greyscale_iterator2+74h (1031724h)
{
*it8 = gsf(*begin);
010316F3 lea ecx,[begin]
010316F6 call ImageExperiments::Image32Iterator::operator* (10311EAh)
010316FB mov eax,dword ptr [eax]
010316FD push eax
010316FE lea ecx,[gsf]
01031701 call ImageExperiments::GreyScaleFunctor::operator() (1031032h)
01031706 mov bl,al
01031708 lea ecx,[it8]
0103170B call ImageExperiments::Image8Iterator::operator* (1031118h)
01031710 mov byte ptr [eax],bl
++begin;
01031712 lea ecx,[begin]
01031715 call ImageExperiments::Image32Iterator::operator++ (1031041h)
++it8;
0103171A lea ecx,[it8]
0103171D call ImageExperiments::Image8Iterator::operator++ (103101Eh)
}
01031722 jmp ImageExperiments::greyscale_iterator2+30h (10316E0h)
}
01031724 pop edi
01031725 pop esi
01031726 pop ebx
01031727 mov esp,ebp
01031729 pop ebp
0103172A ret


It appears that much more code (~50% more) is being generated for this simple loop by the Intel C++ compiler, and I believe this is what is causing the poor relative performance. (I examined and compared the code generated for the various functions used in this snippet, i.e. the iterator dereference, increment, not-equals and constructor, and the Intel-generated code looked okay; if anything it is slightly more concise than the MSVC++ code.)

I would appreciate it very much if someone could explain why the Intel-generated code is so much bulkier, and what I can do about it. Note that this is a follow-on from a previous thread, http://software.intel.com/en-us/forums/showthread.php?t=106290, and as stated there, I want to retain the iterator interface.
TimP
Honored Contributor III
Without a compilable sample and a specification of which compilers you consider to be "the" compilers (e.g. 32- vs. 64-bit mode, /arch specification, ...), it's impossible to give anything like a complete answer.
Intel compilers tend to be more strongly oriented toward countable loops, yet they lack the machinery to split a trivial example such as this into a branch between the countable case (presumably the normal one) and the ill-formed case (e.g. end - begin < 0).
A trivial translation, discarding the ugly case:
for (ptrdiff_t count = end - begin; count > 0; --count) {
    .....
}
For some, the tradition of handling the ugly case is stronger in 32-bit mode (the ugly 64-bit case will hang whether it is done "correctly" or not).
For reasons of practicality, it was necessary to take a sane interpretation of some of the STL linear iterators which are vectorizable aside from the ugly case, and each compiler takes its own path there. As the VS2012 compiler has been advertised as supporting auto-vectorization but has not been unveiled to many of us, it may turn out quite different from VS2010.
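To make the countable-loop translation above concrete, here is a minimal, self-contained sketch. It is not the code from the uploaded project: the GreyScaleFunctor body (a plain average of three colour bytes) and the function names greyscale_iterator and greyscale_counted are assumptions made up for illustration, and the counted version assumes random-access iterators so that end - begin is defined. Whether this form actually helps a given compiler would have to be checked in an optimized build.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the thread's GreyScaleFunctor, whose real body
// is not shown: averages the three low bytes of a packed 32-bit pixel.
struct GreyScaleFunctor {
    std::uint8_t operator()(std::uint32_t p) const {
        const std::uint32_t r = (p >> 16) & 0xFF;
        const std::uint32_t g = (p >> 8) & 0xFF;
        const std::uint32_t b = p & 0xFF;
        return static_cast<std::uint8_t>((r + g + b) / 3);
    }
};

// Original shape: the loop is controlled by an iterator comparison.
template <typename InIt, typename OutIt>
void greyscale_iterator(InIt begin, InIt end, OutIt out) {
    GreyScaleFunctor gsf;
    while (begin != end) {
        *out = gsf(*begin);
        ++begin;
        ++out;
    }
}

// Counted shape: the trip count is computed once, before the loop starts.
// Requires random-access iterators; the "ugly" case end - begin < 0 simply
// runs zero iterations.
template <typename InIt, typename OutIt>
void greyscale_counted(InIt begin, InIt end, OutIt out) {
    GreyScaleFunctor gsf;
    for (std::ptrdiff_t count = end - begin; count > 0; --count) {
        *out = gsf(*begin);
        ++begin;
        ++out;
    }
}
```

Both functions keep the iterator interface, so either can be called with std::vector iterators or raw pointers; only the loop control differs.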
tj1
Beginner

I have uploaded the compilable source in a zip file, along with the Visual Studio 10 project files.
The Intel compiler is Intel C++ Compiler XE on IA-32, version 12.1.5, Package ID: w_ccompxe_2011.11.344. The processor is an Intel Core i5-2520M CPU @ 2.50 GHz, with 4 GB RAM and a 64-bit OS. The MS compiler/platform is Microsoft Visual C++ 2010 (v100).
