Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Bad code generation on indirect loops

Bert_Jonson
Beginner
423 Views

Simple test-case:

[cpp]#include <stdio.h>

int main() {
    for(int i = 100; i--; )
        puts("hello world\n");
}[/cpp]

ICC 13 update 1 with /O3 generates this:

loc_401037:
    push offset aHelloWorld ; "hello world\n"
    call sub_401060
    add esp, 4
    dec esi
    cmp esi, 0FFFFFFFFh
    jnz short loc_401037

The cmp does nothing useful. GCC and MSVC generate better code, without the cmp.

BTW: when will there be a new (maybe beta) release of ICC with delegating constructors and other new C++ features?

9 Replies
bernaske
New Contributor I
#include <stdio.h>
int main() {
for(int i = 100; i--; )
puts("hello worldn");
}

1.) The semicolon after i-- is wrong.
2.) puts(" Hello world \n"); is the correct form.
3.) The variable i is not defined; for( int i .... ) is only allowed in the newer C / C++ standards (C99 and later).

#include <stdio.h>
int main() {
int i;
for( i = 0; i < 101; i-- )
puts("hello world \n");
}

This test case works with option -O3 and Parallel Studio XE 2013 under Linux without problems.
SergeyKostrov
Valued Contributor II
I'd like to ask you a couple of questions.

How many times do you want to display the 'hello world' phrase when the for loop is implemented as follows?

...
int i;
for( i = 0; i < 101; i-- )
puts( "hello world" );
...

[ Note ] In the above case the phrase will be displayed ( 2^32 ) times. Is that what you wanted?

I understood that you simply wanted to verify the quality of code generation of the Intel C++ compiler. Is that correct?

Best regards,
Sergey

PS: The following code will display the phrase 100 times:

...
int i;
for( i = 100; i > 0; i-- )
puts( "hello world" );
...
TimP
Honored Contributor III
The originally quoted syntax of the for should be OK under C++ (as originally implied) or C99. Without fixing the worldn typo, you might overflow a buffer. As you imply Windows, the name of the compiler would be ICL, and its decision to use C++ normally would be based on the file name. Like the others, I don't see how you can prove whether one instruction sequence or another is faster for controlling a loop which involves a function call and i/o. I assume that /O3 has little effect when the loop is executing a non-inline function call. ICL does have a bias against downward counting for loops, although I can't see it making a difference in this case. gcc has an automatic transformation to implement upward counting loops with downward count in certain situations (not where vectorization is a possibility). So it's hard to make a case that writing your loop with a downward count will optimize it.
SergeyKostrov
Valued Contributor II
>>...gcc has an automatic transformation to implement upward counting loops with downward count in certain situations...

Tim, where did you read about it? That sounds very interesting, and I think it could change the logic of some algorithms when there is a break statement inside a for loop. It means that a different number of iterations would be needed to hit the break condition. Is that important? Yes, because if a developer implemented a for loop with a downward count, something forced the developer to do it.

Best regards,
Sergey
Bert_Jonson
Beginner
No, my code is correct and puts will be executed 100 times. We can trace it:

for(int i = 2; i--; ) {}

1: i == 2, so the loop condition is true (the loop body executes); i is decremented to 1.
2: i == 1, so the condition is true (the loop body executes); now i == 0.
3: i == 0, so the loop exits, but i is still decremented after the test, so after the loop i == -1. There is no problem with that, because we don't use i after the loop.

So the loop body is executed 2 times, as expected.

MSVC generates this code for the loop:

loc_401006:
    push offset aHelloWorld ; "hello world\n"
    call sub_40101A
    add esp, 4
    dec esi
    jnz short loc_401006

And GCC:

loc_4077B4:
    mov [esp+14h+var_14], offset aHelloWorld ; "hello world\n"
    call puts
    sub ebx, 1
    jnz short loc_4077B4

I'm not talking about only this code. It seems that ICC doesn't know that any cmp with 0 after inc/dec/sub/add is useless, because dec/inc/sub/add already set the Z flag. It simply wastes CPU ticks.
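The trace above can be checked mechanically. This is only a minimal sketch, assuming a non-negative start value; the names LoopTrace and trace_post_decrement are made up for illustration and do not come from the thread.

```cpp
// Result of running a post-decrement countdown loop.
struct LoopTrace {
    int iterations;  // how many times the loop body ran
    int final_i;     // value of i after the loop has exited
};

// Runs `for (; i--; )` starting from `start` and records what happened.
// Assumption: start >= 0; a negative start would keep counting down
// until the int overflows.
inline LoopTrace trace_post_decrement(int start) {
    LoopTrace t = {0, 0};
    int i = start;
    for (; i--; ) {
        ++t.iterations;  // stands in for the puts() call
    }
    t.final_i = i;
    return t;
}
```

Called with 2, the helper reports 2 iterations and a final i of -1, matching the trace; called with 100, it reports 100 iterations.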
Georg_Z_Intel
Employee
Hello,

that's a good finding! I forwarded it to compiler engineering (DPD200239520) and will let you know about the progress.

A pre-decrement can be used as a workaround, as it does not show the superfluous "cmp":

[cpp]
#include <stdio.h>

int main() {
    for(int i = 101; --i; )
        puts("hello world\n");
}
[/cpp]

[plain]
..B1.2:
        movl $.L_2__STRING.0, %edi
        call puts
..B1.3:
        decl %r12d
        jne ..B1.2
[/plain]

%r12d starts with 100 here and does not need the "cmp" for underflow checking. In your original example the initial value was 99, thus requiring the inefficient handling of underflow.

I'd recommend pre-decrement/increment operators in general, as they also have other advantages when it comes to OOP.

Best regards,
Georg Zitzlsberger
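To see why shifting the bounds is harmless when the body ignores i, the two loop forms can be compared side by side. This is a sketch only: the helper names are hypothetical, and it assumes 0 <= n < INT_MAX so that n + 1 does not overflow.

```cpp
#include <vector>

// Values of i seen by the body of `for (int i = n; i--; )`.
inline std::vector<int> seen_post_decrement(int n) {
    std::vector<int> seen;
    for (int i = n; i--; ) {
        seen.push_back(i);
    }
    return seen;
}

// Values of i seen by the body of `for (int i = n + 1; --i; )`.
inline std::vector<int> seen_pre_decrement(int n) {
    std::vector<int> seen;
    for (int i = n + 1; --i; ) {
        seen.push_back(i);
    }
    return seen;
}
```

For n = 100 both loops run the body 100 times, but the post-decrement form sees i = 99 down to 0 while the pre-decrement form sees i = 100 down to 1. That is consistent with the register starting at 99 in one listing and at 100 in the other, and it is why the rewrite is only a drop-in replacement when the body does not depend on the value of i.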
jimdempseyatthecove
Honored Contributor III
The code generated is correct, and not superfluous, though it could be coded differently.

for(int i = 100; i--; )

is postfix, meaning the for loop body executes when i prior to -- is non-zero, and thus the body will execute with i==0. The test for -1 is correct; however, you could potentially use jge immediately following the dec (without the cmp).

*** Note, though, that the code change (removal of cmp and use of jge) would not necessarily be faster. You can test for this with in-line assembly. If you do, be sure you align the code so that cache line issues do not skew the results to favor one technique over the other.

Jim Dempsey
Georg_Z_Intel
Employee
Hello Jim,

"superfluous" in terms of performance, not in terms of semantics. Yes, the generated code is correct and makes sense once the induction variable "i" is used. But here it is not used, and shifting the bounds can rid us of the "cmp". The initial request is still valid, admittedly only for such rare cases.

Best regards,
Georg Zitzlsberger
jimdempseyatthecove
Honored Contributor III
Georg,

>>"superfluous" in terms of performance, not in terms of semantics.

I have not run a proper test to verify whether "dec; jge" is faster than, slower than, or the same as "dec; cmp; jg". Compiler optimizations are not about "elegant semantics" but about "ultimate performance". Often fewer instructions are faster, but not always; I am pointing this out to the readers.

Besides, in Bert's second post:

>>I'm not talking about only this code. It seems that ICC doesn't know that any cmp with 0 after inc/dec/sub/add is useless, because dec/inc/sub/add already set the Z flag... It simply wastes CPU ticks.<<

Testing for Z (jnz) would have been an error. Because the code uses esi to represent the value of i, it would need to test the sign of the result (jge). GCC correctly uses jnz because it recognized that the loop control variable was not being used, and thus the compiler substituted an iteration count for the loop control variable.

As to which is better (faster): run a proper test and measure the results (they will vary with CPU architecture).

Jim Dempsey
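The difference between testing Z and testing the sign can be illustrated with a small model of the loop tail. This is a hedged sketch, not a reproduction of any compiler's output: ExitTest and body_runs are hypothetical names, and the model assumes a bottom-tested loop whose register starts at 1 or more and never overflows.

```cpp
// Which instruction sequence closes the loop after the body.
enum class ExitTest {
    DecCmpMinusOneJnz,  // dec reg; cmp reg, -1; jnz  (as in the ICC listing)
    DecJnz,             // dec reg; jnz               (as in the MSVC/GCC listings)
    DecJge              // dec reg; jge               (the alternative suggested above)
};

// Models a bottom-tested loop: run the body, decrement the register,
// then decide whether to jump back. Returns how many times the body ran.
// Assumption: reg_start >= 1 and the decrement never overflows, so
// "jge after dec" behaves like "result >= 0".
inline int body_runs(int reg_start, ExitTest test) {
    int reg = reg_start;
    int runs = 0;
    bool again = false;
    do {
        ++runs;  // loop body (the call to puts)
        --reg;   // dec
        switch (test) {
            case ExitTest::DecCmpMinusOneJnz: again = (reg != -1); break;
            case ExitTest::DecJnz:            again = (reg != 0);  break;
            case ExitTest::DecJge:            again = (reg >= 0);  break;
        }
    } while (again);
    return runs;
}
```

With the register starting at 99 (the value of i inside the body), the cmp/jnz and jge variants both run the body 100 times, while a bare jnz would stop after 99. A bare jnz only gives 100 runs when the register starts at 100, that is, when it holds an iteration count rather than the value of i.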