I'm profiling my application (Linux x86_64, Sandy Bridge, using perf), and one of its functions takes about 30% of the runtime. Within the function, saving registers to the stack appears to consume the most time - this is the annotation of the function start:
0.39 push %r15
95.35 push %r14
1.27 push %r13
0.16 push %r12
0.15 mov %rdi,%r12
0.25 push %rbp
0.14 mov %rsi,%rbp
0.12 push %rbx
0.18 sub $0x8,%rsp
I tried the RESOURCE_STALLS counter, and RESOURCE_STALLS.SB is high on this instruction as well; in addition, I didn't see high cache misses on this instruction. What can I do to continue investigating? Are there additional counters I should examine?
Can you profile with VTune, or post a VTune front-end/back-end pipeline stalls analysis?
Can you also post the function's disassembly? Regarding the push %r14 instruction, it is interesting what is being loaded (pushed) onto the stack. It seems that 95% of the function's prologue is spent waiting on a resource becoming available.
In the absence of further evidence, I'd guess the time may be associated with allocating a fill buffer, due to previous code having left the buffers full of data pending flush to L1, as might happen when writing to multiple cache lines in a loop.
If that is so, sometimes forcing the function to inline, or at least taking advantage of some IPO (interprocedural optimization), can alleviate it.
Can you post the function's call stack? Does your function receive any input? Such a great number of RESOURCE_STALLS.ANY events could indicate that the pipeline is stalled, for example by previous dependent store instructions or by branch misprediction. Can you perform a front-end pipeline stalls analysis and post the results?
The function has no input, and it is called from main() after other functions have been called.
Front-end analysis showed no branch mispredictions (BR_MISP_RETIRED.** and BACLEARS.ANY were 0). However, I see a very large count on LD_BLOCKS_PARTIAL.ADDRESS_ALIAS - about 45% of the total program count (see attachment). Does it mean that another instruction is trying to load/store from the same page offset? How can I find/eliminate the conflict?
Looking at the description of the LD_BLOCKS_PARTIAL.ADDRESS_ALIAS event, it seems to measure a false dependency in the Memory Order Buffer. I suppose it could be related to reordering of loads and stores. I think that address aliasing has been detected in the store buffer, which prevents further in-order memory operations. It could also be what you are suggesting. It would be interesting if you could post the full disassembly; I would like to look at the previous operations which involved %r14.
Thanks for the suggestions; unfortunately I can't post the full disassembly due to legal issues.
So, I've tried using gdb to track down an address which may alias my stack, but no luck. Then I took the advice of TimP and ilyapolak above, and tried to use mfence to narrow down the problem. Finally, I found a piece of code which performs a small PCI write followed by an sfence, and which runs before the function above is called:
asm volatile ("sfence":::"memory");
Putting the mfence *before* this sfence had almost no effect. However, putting my mfence right *after* this sfence makes the mfence consume many cycles, instead of the push %r14 which comes sometime after it. Also, removing all memory serialization (both sfence and mfence) improved the performance of the push %r14 and of the whole application.
So my conclusion was: the PCI write is slow, and the sfence made subsequent writes wait for its completion. The unlucky instruction which filled the store buffer took the hit and actually waited for the PCI write to complete. Does that sound reasonable?
Also, it looks like all performance counters related to memory stores are asynchronous, in the sense that they indicate a problem some time after its root cause, so one needs to use mfence to narrow it down. It also looks like mfence is "synchronous" while sfence/lfence are not - is this true?
If you cannot post the code disassembly, maybe you can track by yourself the memory operations which used the %r14 register prior to the function prologue, and post your findings. I suspect there could be some kind of WAR (write-after-read) issue.
>>> Also - looks like all performance counters related to memory stores are asynchronous >>>
I think that, internally, a performance counter for a specific event is incremented when microcode and internal logic detect an occurrence of that event.
I guess the major difference is that I have a slow PCI write, while your algorithms probably have only RAM accesses. So you are saying SFENCE provided some kind of hint to the CPU to do better memory ordering for your algorithm?