When I execute the following code, which consists of fn1() and fn2(), on my Intel Penryn ULV 1.4 GHz Core 2 Duo:
fn1() is visibly slower than fn2(). Inspecting the .s assembly produced by gcc -S, I noticed that fn1() basically loops a decl instruction ~64 times, while fn2() consists of ~23 instructions, including 2 mul instructions, which are repeated 10 times in this example. Despite this, fn1() runs ~3 times slower. (I compiled without -O, because otherwise gcc applies optimizations that alter the nature of fn1().)
Would someone be so kind as to explain why fn1() executes more slowly?
Looking at the source code, it seems that fn2() should be slower because of the modulo and division-assignment operations. On closer inspection, the variable i in fn2() is not used, so an optimizing compiler can exclude that line of code from the compilation. The first function has 64 decrement operations and backward conditional jumps.
I suppose that when both functions are executed in a loop inside main(), fn2() could be further optimized by the compiler once it realizes that fn2() performs the same operation on every loop iteration.
Are you trying to verify past research about Penryn partial flag stalls?
Do you remember how Intel worked to get compilers changed to use addl -1 in place of decl, and the world refused to use special options to handle this?
Are you tied to some specific combination of gcc version and -mtune options?
Hi Iiya & Tim,
I hadn't heard about partial flag stalls, but I tried it on a Core i7 and the results aren't much different: both functions execute faster, but there is still a noticeable difference between the execution times of fn1() and fn2().
I have included the relevant parts. gcc seems to optimize /10 and %10 by using mul instead of div.
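To illustrate the mul-instead-of-div observation, here is a minimal sketch of how a division by 10 can be replaced by a multiply and a shift. It assumes unsigned 32-bit operands, and the function names are made up for this post. The constants gcc actually picks for signed int differ, so treat this as the general idea rather than the exact code gcc emitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x / 10 for unsigned 32-bit x, using a multiply and a shift.
   0xCCCCCCCD is ceil(2^35 / 10); the 64-bit product keeps the high bits. */
uint32_t div10_by_mul(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* x % 10 derived from the quotient: one more multiply and a subtract. */
uint32_t mod10_by_mul(uint32_t x)
{
    return x - div10_by_mul(x) * 10u;
}

/* Spot-check the shortcuts against the ordinary / and % operators. */
void check_div10(void)
{
    static const uint32_t samples[] = {0u, 9u, 10u, 99u, 1234567u, 4294967295u};
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(div10_by_mul(samples[i]) == samples[i] / 10u);
        assert(mod10_by_mul(samples[i]) == samples[i] % 10u);
    }
}
```

This would account for seeing two mul instructions per iteration and no div: one multiply for the quotient and one to rebuild the remainder.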
Are these 2 sections of code important to a real application or are you just curious?
When I run with optimization turned on in VC12, both routines get optimized away, since they don't return a value and don't change any non-local variable.
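One common way to keep a routine from being optimized away is to make its result observable, for example by storing it in a volatile variable. The sketch below uses a hypothetical stand-in (count_digits, not the original fn2()) that divides by 10 in a loop. The names are assumptions made for illustration.

```c
/* Hypothetical stand-in for a routine built around "/= 10":
   counts the decimal digits of n. */
unsigned count_digits(unsigned n)
{
    unsigned digits = 0;
    do {
        n /= 10;
        digits++;
    } while (n != 0);
    return digits;
}

/* Writes to a volatile object are observable side effects, so the
   compiler must still produce the stored value even with optimization on. */
volatile unsigned sink;

void run_count_digits(void)
{
    sink = count_digits(1234567890u);
}
```

Note that with a constant argument the compiler may still fold the computation at compile time and store only the result. Feeding the routine an input that is not known at compile time avoids that.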
Assuming this is not just idle curiosity or a homework assignment: you don't really have any timer info around the routines, so it is hard to say how many instructions per clock tick each function executes.
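A minimal sketch of what timer info around a routine could look like, using the standard C clock() function, which reports CPU time. The workload countdown64 is a hypothetical stand-in for a 64-step decrement loop, not the original fn1(). The helper name and the repetition count are likewise assumptions.

```c
#include <time.h>

/* Incremented once per call so the calls have an observable effect. */
volatile unsigned long timing_sink;

/* Hypothetical workload: a 64-step countdown. The volatile counter
   keeps the loop from being removed by the optimizer. */
void countdown64(void)
{
    volatile unsigned i = 64;
    while (i != 0) {
        i--;
    }
    timing_sink += 1;
}

/* Average CPU seconds per call of fn over reps calls.
   Many repetitions are needed because clock() has coarse resolution. */
double seconds_per_call(void (*fn)(void), long reps)
{
    clock_t start = clock();
    for (long r = 0; r < reps; r++) {
        fn();
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / (double)reps;
}
```

Calling seconds_per_call(countdown64, 10000000L) and multiplying the result by the core clock frequency gives a rough cycles-per-call figure, which can then be compared against the instruction counts taken from the assembly listing.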
Actually, the cmp/jmp branch instructions can be executed in parallel with the variable decrement instruction, although the dec uop probably must wait for the result of the branch instruction.