Software Tuning, Performance Optimization & Platform Monitoring

Slow code execution

mlf_c_ — Sat, 01 Feb 2014 23:18:20 GMT

Hi,

When I execute the following code, which consists of fn1() and fn2(), on my Intel Penryn ULV 1.4 GHz Core 2 Duo:

fn1() is visibly slower than fn2(). Inspecting the .s assembly output from gcc -S, I noticed that fn1() basically loops a decl instruction ~64 times, while fn2() seems to consist of ~23 instructions, including 2 mul instructions, which need to be repeated 10 times in this example. Despite this, fn1() executes ~3 times slower. (I compiled without -O, because otherwise gcc applies optimizations that alter the nature of fn1().)

Would someone be so kind as to explain what causes fn1()'s slower execution?

Thanks,

Bernard — Sun, 02 Feb 2014 10:43:01 GMT

Looking at the source code, it seems that fn2() should be slower because of the modulo and division-assignment operations. Upon closer inspection of fn2(), the variable i is not used, and an optimizing compiler can exclude that line of code from compilation. The first function has 64 decrement operations and backward conditional jumps.

I suppose that during the looped execution of both functions inside main(), fn2() could be further optimized by the compiler once it realizes that fn2() performs the same operation on every loop iteration.

Bernard — Sun, 02 Feb 2014 10:44:52 GMT

@mlf.c

Can you post the disassembled code?

TimP — Sun, 02 Feb 2014 13:58:35 GMT

Are you trying to verify past research about Penryn partial flag stalls?

Do you remember how Intel worked to get compilers changed to use addl -1 in place of decl, and the world refused to use special options to handle this?

Are you tied to some specific combination of gcc version and -mtune options?

Bernard — Mon, 03 Feb 2014 08:11:06 GMT

@Tim

Do you mean partial flag merge stalls?

mlf_c_ — Mon, 03 Feb 2014 15:59:59 GMT

Hi Iiya & Tim,

I haven't heard about partial flag stalls, but I tried it on a Core i7 and the results aren't much different, other than both functions executing faster. There's still a noticeable difference between the execution times of fn1() and fn2().

http://pastie.org/8694561

I have included the relevant parts. gcc seems to optimize /10 and %10 and uses mul instead of div.
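For readers unfamiliar with the mul-for-div substitution mentioned above, here is a minimal sketch of the idea in C. It assumes 32-bit unsigned operands; the function names are made up for illustration, and the constant gcc actually emitted in the pasted assembly may differ (for signed int it normally uses a different multiplier plus a sign correction).

```c
#include <stdint.h>

/* n / 10 for a 32-bit unsigned n, done with one multiply and one shift.
   0xCCCCCCCD is ceil(2^35 / 10); the 64-bit product cannot overflow. */
static uint32_t div10_by_mul(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* n % 10 derived from the quotient: n - 10 * (n / 10). */
static uint32_t mod10_by_mul(uint32_t n)
{
    return n - 10u * div10_by_mul(n);
}

/* Compare against the plain / and % operators on a strided sample of the
   32-bit range plus the maximum value; returns the number of mismatches. */
static unsigned long count_mismatches(void)
{
    unsigned long bad = 0;
    uint64_t v;

    for (v = 0; v <= UINT32_MAX; v += 65521u) {
        uint32_t n = (uint32_t)v;
        if (div10_by_mul(n) != n / 10u || mod10_by_mul(n) != n % 10u)
            bad++;
    }
    if (div10_by_mul(UINT32_MAX) != UINT32_MAX / 10u)
        bad++;
    return bad;
}
```

This is why the fn2() listing can contain two multiplies per iteration and no div at all: one multiply produces the quotient, and the remainder is rebuilt from it.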

Patrick_F_Intel1 — Mon, 03 Feb 2014 17:09:55 GMT

Hello mlf,

Are these 2 sections of code important to a real application or are you just curious?

When I build with optimization turned on in VC12, both routines get optimized away, since they don't return a value and don't change any non-local variable.
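A minimal sketch of the effect Pat describes, in C. The functions are hypothetical stand-ins, not the code from the original post, and whether a particular compiler deletes or folds a given loop is up to that compiler.

```c
/* Hypothetical stand-in: a countdown with no observable effect.
   It returns nothing and touches no non-local state, so an optimizing
   compiler is free to delete the whole loop. */
static void countdown_discarded(void)
{
    int i;
    for (i = 64; i >= 0; i--) {
        /* empty body */
    }
}

/* Same countdown, but the number of iterations is returned to the caller. */
static int countdown_returned(void)
{
    int i;
    int steps = 0;
    for (i = 64; i >= 0; i--)
        steps++;
    return steps;
}

/* A store to a volatile object is observable, so the value must be produced. */
static volatile int sink;

static void run_and_keep(void)
{
    sink = countdown_returned();
}
```

Note that keeping the result alive only guarantees that the value is produced. A compiler may still compute the constant at build time instead of running the loop, so a benchmark that wants every decrement executed usually also makes the loop counter volatile.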

Assuming this is not just idle curiosity or a homework assignment: you don't really have any timer info around the routines, so it is hard to say how many instructions per clock tick each function executes.

Pat

mlf_c_ — Mon, 03 Feb 2014 19:38:35 GMT

Hi Pat,

I removed all parts of the code that didn't seem to affect execution speed in order to pinpoint the problem, and ended up with this simple piece of code. Using clock() does show that fn1() is much slower, although it's not very precise. From looking at the assembly code posted above, I assume movl, addl, subl, sall, shrl, cmp and jumps are still one-clock instructions (I haven't been coding for a while :), so there are 22 instructions + 2 muls repeated 10 times, as opposed to the slower subl, cmp, jns repeated 65 times.
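Since clock() is coarse, one common way to get usable numbers from it is to time many repetitions and divide. The sketch below assumes hypothetical stand-ins fn1_like() and fn2_like(), modeled only on the description in this thread (a 65-step countdown, and a 10-digit /10 and %10 loop); they are not the original functions, and the helper names are made up for illustration.

```c
#include <time.h>

typedef int (*work_fn)(void);

/* Results are stored here so the calls cannot be discarded as dead code. */
static volatile int sink;

/* Stand-in for fn1(): the volatile counter forces every decrement to run. */
static int fn1_like(void)
{
    volatile int i;
    int steps = 0;
    for (i = 64; i >= 0; i--)
        steps++;
    return steps;
}

/* Stand-in for fn2(): walks the 10 decimal digits of n using / and %. */
static int fn2_like(void)
{
    volatile unsigned int n = 1234567890u;
    int digit_sum = 0;
    while (n != 0) {
        unsigned int v = n;
        digit_sum += (int)(v % 10u);
        n = v / 10u;
    }
    return digit_sum;
}

/* Average CPU seconds per call over reps repetitions, measured with clock(). */
static double seconds_per_call(work_fn fn, long reps)
{
    clock_t start = clock();
    long r;
    for (r = 0; r < reps; r++)
        sink = fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC / (double)reps;
}
```

Calling seconds_per_call(fn1_like, 10000000L) and seconds_per_call(fn2_like, 10000000L) from main() and printing both values gives a per-call figure for each routine. Dividing by the instruction counts from the assembly listing then gives a rough cost per instruction, which is the number the one-clock assumption above needs to be checked against.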

Bernard — Wed, 05 Feb 2014 05:40:51 GMT

@mlf.c

Maybe the presence of the shrl instruction causes the aforementioned flag merge stalls?

Can you run VTune analysis on your code?

Bernard — Thu, 06 Feb 2014 04:53:53 GMT

Actually, the cmp/jmp branch instruction can be executed in parallel with the variable decrement instruction, although the dec instruction's uop probably must wait for the result of the branch instruction.