I'm profiling my application (Linux x86_64, Sandy Bridge, using perf), and one of its functions takes about 30% of the runtime. Within the function, saving registers to the stack appears to consume the most time - this is the annotation of the function start:
0.39 push %r15
95.35 push %r14
1.27 push %r13
0.16 push %r12
0.15 mov %rdi,%r12
0.25 push %rbp
0.14 mov %rsi,%rbp
0.12 push %rbx
0.18 sub $0x8,%rsp
I tried the RESOURCE_STALLS counter, and RESOURCE_STALLS.SB is high on this instruction as well; in addition, I didn't see high cache misses on this instruction. What can I do to continue investigating? Are there additional counters I should examine?
Can you profile with VTune, or post a VTune front-end/back-end pipeline stalls analysis?
Can you also post the function's disassembly? Regarding the push %r14 instruction, it is interesting what is being loaded (pushed) onto the stack. It seems that 95% of the function's prologue is spent waiting on a resource becoming available.
In the absence of further evidence, I'd guess the time may be associated with allocating a fill buffer, due to previous code having left the buffers full of data pending flush to L1, as might happen when writing to multiple cache lines in a loop.
If that is so, sometimes forcing the function to inline, or at least taking advantage of some IPO (interprocedural optimization), can alleviate it.
Can you post the function's call stack? Does your function receive any input? Such a great number of RESOURCE_STALLS.ANY events could indicate that the pipeline is stalled, for example by previous dependent store instructions or by branch misprediction. Can you perform a front-end pipeline stalls analysis and post the results?
The function has no input, and it is called from main() after other functions have been called.
Front-end analysis showed no branch mispredictions (BR_MISP_RETIRED.** and BACLEARS.ANY were 0). However, I see a very large count on LD_BLOCKS_PARTIAL.ADDRESS_ALIAS - about 45% of the total program count (see attachment). Does it mean that another instruction is trying to load/store from the same page offset? How can I find/eliminate the conflict?
Looking at the description of the LD_BLOCKS_PARTIAL.ADDRESS_ALIAS event, it seems to measure a false dependency in the Memory Order Buffer. I suppose it could be related to reordering of loads and stores. I think that address aliasing has been detected in the store buffer, which prevents further in-order memory operations. It could also be what you are suggesting. It would be interesting if you could post the full disassembly; I would like to look at the previous operations which involved %r14.
Thanks for the suggestions; unfortunately I can't post the full disassembly due to legal issues.
So, I've tried using gdb to track down an address which may alias my stack, but no luck. Then I took the advice of TimP and ilyapolak above, and tried to use mfence to narrow down the problem. Finally, I found a piece of code which performs a small PCI write followed by an sfence, and which runs before the function above is called:
asm volatile ("sfence":::"memory");
Putting the mfence *before* this sfence had almost no effect. However, putting my mfence right *after* this sfence makes the mfence consume many cycles, instead of the push %r14 which comes sometime after it. Also, removing all memory serialization (both sfence and mfence) improved the performance of the push %r14 and of the whole application.
So my conclusion was: the PCI write is slow, and the sfence made subsequent writes wait for its completion. The unlucky instruction which filled the store buffer took the hit and actually waited for the PCI write to complete. Does that sound reasonable?
Also, it looks like all performance counters related to memory stores are asynchronous, in the sense that they indicate a problem some time after its root cause, so one needs to use mfence to narrow it down. It also looks like mfence is "synchronous" while sfence/lfence are not - is this true?
If you cannot post the code disassembly, maybe you can track by yourself the memory operations which used the %r14 register prior to the function prologue, and post your findings. I suspect there could be some kind of WAR (write-after-read) issue.
>>> Also - looks like all performance counters related to memory stores are asynchronous >>>
I think that, internally, a performance counter for a specific event is incremented when microcode and internal logic detect an occurrence of that event.
I guess the major difference is that I have a slow PCI write, while your algorithms probably have only RAM accesses. So you are saying SFENCE provided some kind of hint to the CPU to do better memory ordering for your algorithm?