Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Slow code execution

mlf_c_
Beginner
558 Views

Hi,

When I execute the following code, which consists of fn1() and fn2(), on my Intel Penryn ULV 1.4 GHz Core 2 Duo:

http://paste.org/70232

fn1() is visibly slower than fn2(). Inspecting the .s assembly output from gcc -S, I noticed that fn1() basically loops a decl instruction ~64 times, while fn2() seems to consist of ~23 instructions, including 2 mul instructions, which are repeated 10 times in this example. Despite this, fn1() takes ~3 times longer to execute. (I compiled without -O, because otherwise gcc applies optimizations that alter the nature of fn1().)

Would someone be so kind as to explain why fn1() executes more slowly?

 

thanks,

 

M

0 Kudos
9 Replies
Bernard
Valued Contributor I
558 Views

Looking at the source code, it seems that fn2() should be slower because of the modulo operation and the division-assignment operation. On closer inspection of fn2(), the variable i is not used, so an optimizing compiler can exclude that line of code from compilation. The first function has 64 decrement operations and backward conditional jumps.

I suppose that when both functions are executed in a loop inside main(), fn2() could be further optimized by the compiler once it realizes that fn2() performs the same operation on every loop iteration.

0 Kudos
Bernard
Valued Contributor I
558 Views

@mlf.c

Can you post disassembled code?

0 Kudos
TimP
Honored Contributor III
558 Views

Are you trying to verify past research about Penryn partial flag stalls?

Do you remember how Intel worked to get compilers changed to use addl -1 in place of decl, and the world refused to use special options to handle this?

Are you tied to some specific combination of gcc version and -mtune options?

0 Kudos
Bernard
Valued Contributor I
558 Views

@Tim

Do you mean partial flag merge stalls?

0 Kudos
mlf_c_
Beginner
558 Views

Hi Iiya & Tim,

I haven't heard about partial flag stalls, but I tried it on a Core i7 and the results aren't much different, other than both functions executing faster. There's still a noticeable difference between the execution times of fn1() and fn2():

http://pastie.org/8694561

I have included the relevant parts. gcc seems to optimize /10 and %10 and uses mul instead of div.

0 Kudos
Patrick_F_Intel1
Employee
558 Views

Hello mlf,

Are these 2 sections of code important to a real application or are you just curious?

When I build with optimization turned on in VC12, both routines get optimized away, since they don't return a value and don't change any non-local variable.

Assuming this is not just idle curiosity or a homework assignment: you don't really have any timer info around the routines, so it is hard to say how many instructions per clock tick each function executes.

Pat

0 Kudos
mlf_c_
Beginner
558 Views

Hi Pat,

 
I removed all the parts of the code that didn't seem to affect the speed of execution in order to pinpoint the problem, and ended up with this simple piece of code. Using clock() does show that fn1() is much slower, although it's not very precise. From looking at the assembly code posted above, I assume movl, addl, subl, sall, shrl, cmp and jumps are still one-clock instructions (I haven't been coding for a while :), so there are 22 instructions + 2 mull instructions repeated 10 times, as opposed to the slower subl, cmp, jns sequence repeated 65 times.
0 Kudos
Bernard
Valued Contributor I
558 Views

@mlf.c

Maybe the presence of the shrl instruction causes the aforementioned flag merge stalls?

Can you run VTune analysis on your code?

0 Kudos
Bernard
Valued Contributor I
558 Views

Actually, the cmp/jmp branch pair can be executed in parallel with the variable decrement instruction, although the dec uop probably must wait for the result of the branch instruction.

 

0 Kudos