topic @developer in Software Tuning, Performance Optimization & Platform Monitoring

function prologue consumes many cycles

developer1 — Mon, 18 Nov 2013 16:18:33 GMT

Hi,

I'm profiling my application (Linux x86_64, Sandy Bridge, using perf), and one of its functions takes about 30% of the runtime. Within that function, saving registers to the stack appears to consume most of the time. This is the perf annotation of the function start:

0.39 push %r15
95.35 push %r14
1.27 push %r13
0.16 push %r12
0.15 mov %rdi,%r12
0.25 push %rbp
0.14 mov %rsi,%rbp
0.12 push %rbx
0.18 sub $0x8,%rsp

I tried the RESOURCE_STALLS counters, and RESOURCE_STALLS.SB is high on this instruction as well. I didn't see high cache misses on this instruction. What can I do to continue investigating this? Are there additional counters which I should examine?

Thanks

Bernard — Mon, 18 Nov 2013 17:28:18 GMT

Can you profile with VTune, or post the VTune front-end and back-end pipeline stall analysis?

Can you also post the function's disassembly? Regarding the push %r14 instruction, it would be interesting to know what is being pushed onto the stack. It seems that 95% of the function's prologue time is spent waiting for a resource to become available.

developer1 — Mon, 18 Nov 2013 18:20:20 GMT

Hi,

I've uploaded the VTune sampling results. I hope these are the correct counters; I selected General Exploration. I see there are first-level TLB misses, but other places in the code have even more dTLB misses and far fewer cycles.

--Yossi

TimP — Mon, 18 Nov 2013 19:41:30 GMT

In the absence of further evidence, I'd guess the time may be associated with allocating a fill buffer, due to previous code having left the buffers full of data pending flush to L1, as might happen when writing to multiple cache lines in a loop.

If that is so, forcing the function to inline, or at least letting it take advantage of some IPO (interprocedural optimization), can sometimes alleviate it.

developer1 — Mon, 18 Nov 2013 20:18:39 GMT

Thanks for the reply. Is there a performance counter which could indicate this situation, or show which code stores too many cache lines to L1?

SergeyKostrov — Tue, 19 Nov 2013 02:42:24 GMT

>>...In the function, looks like saving registers to the stack consumes the most time...

Try passing the parameters to the function in a structure; in that case only one parameter will be saved on the stack. I have no idea what is wrong with your code, but it is possible that there is a problem with the stack alignment.

Bernard — Tue, 19 Nov 2013 09:46:59 GMT

Can you post the function's call stack? Does your function receive any input? Such a great number of RESOURCE_STALLS.ANY could indicate that the pipeline is being stalled, for example by previous dependent store instructions or by branch misprediction. Can you perform a front-end pipeline stall analysis and post the results?

developer1 — Tue, 19 Nov 2013 13:36:39 GMT

Hi,

The function has no input, and it is called from main() after other functions have been called.

Front-end analysis showed no branch mispredictions (BR_MISP_RETIRED.** and BACLEARS.ANY were 0). However, I see a very large count for LD_BLOCKS_PARTIAL.ADDRESS_ALIAS, about 45% of the total program count (see attachment). Does it mean that another instruction is trying to load/store from the same page offset? How can I find and eliminate the conflict?

Thanks

developer1 — Tue, 19 Nov 2013 13:38:47 GMT

And this is the file..

SergeyKostrov — Tue, 19 Nov 2013 14:44:12 GMT

>>...Does it mean that another instruction is trying to load/store from the same page offset?..

It is not clear, and so far I can only say that it is a really strange problem. Could you create a simple reproducer?

SergeyKostrov — Tue, 19 Nov 2013 14:58:03 GMT

>>...I'm profiling my application ( linux x86_64...

I just realized that you have an interesting system. Is that a 32-bit operating system with 64-bit memory address space extensions?

Bernard — Tue, 19 Nov 2013 15:18:08 GMT

Looking at the description of the LD_BLOCKS_PARTIAL.ADDRESS_ALIAS event, it seems to measure false dependencies in the Memory Order Buffer. I suppose it could be related to the reordering of loads and stores. I think address aliasing has been detected in the store buffer, which prevents further in-order memory operations. It could also be what you are suggesting. It would be interesting if you could post the full disassembly; I would like to look at the previous operations which involved r14.
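To make the page-offset question concrete, here is a minimal sketch of a check that could be run on candidate address pairs (for example a stack slot and a buffer written shortly before). It assumes the alias condition is "same low 12 bits, different full address", which is only a reading of the event description and not confirmed here; the helper name `is_4k_alias` is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when two distinct addresses share the same
 * offset within a 4 KiB page (low 12 address bits are equal).
 * Assumption: this is the partial address match the event refers to. */
static bool is_4k_alias(const void *store_addr, const void *load_addr)
{
    uintptr_t s = (uintptr_t)store_addr;
    uintptr_t l = (uintptr_t)load_addr;
    return s != l && ((s ^ l) & 0xFFFu) == 0;
}
```

Feeding it addresses collected in gdb (the stack pointer at the push, plus the targets of recent stores) would at least show whether any pair matches in the low 12 bits.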

developer1 — Tue, 19 Nov 2013 17:28:00 GMT

Hi,

Thanks for the suggestions. Unfortunately I can't post the full disassembly due to legal issues.

So, I tried using gdb to track down an address which may alias my stack, but no luck. Then I took the advice of TimP and ilyapolak above and tried using "mfence" to narrow down the problem. Finally, I found a piece of code which performs a small PCI write followed by an sfence, and which runs before the function above is called:

[cpp]

write_pci(&pci_address, data);

asm volatile ("sfence":::"memory");

[/cpp]

Putting the mfence *before* this sfence had almost no effect. However, putting my mfence right *after* this sfence, makes the "mfence" consume many cycles, instead of the "push r14" which comes sometime after it. Also, removing all memory serializations (sfence and mfence) improved the performance of "push r14" and the whole application.

So my conclusion was - the PCI write is slow, and sfence made subsequent writes wait for its completion. So the unlucky instruction which filled the store buffer got the hit and actually waited for this PCI write to complete. Does that sound reasonable?

Also, it looks like all performance counters related to memory stores are asynchronous, in the sense that they indicate a problem some time after its root cause, so one needs to use mfence to narrow it down. It also looks like mfence is "synchronous" while sfence/lfence are not. Is this true?
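The mfence bisection described above can be sketched roughly as follows. This is an illustration, not the actual code: `fence_probe`, `MAX_SITES` and the two tables are hypothetical names, and it assumes GCC or Clang on x86_64 (for `x86intrin.h`).

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_mfence, _mm_lfence (GCC/Clang, x86) */

#define MAX_SITES 8      /* hypothetical number of probe points */

static uint64_t site_ticks[MAX_SITES];  /* TSC ticks spent in the fence at each site */
static uint64_t site_hits[MAX_SITES];   /* how many times each site was reached */

/* Hypothetical probe: force pending memory operations to complete here
 * and charge the time that takes to the given site. */
static inline void fence_probe(unsigned site)
{
    if (site >= MAX_SITES)
        return;
    uint64_t start = __rdtsc();
    _mm_mfence();   /* wait until earlier loads and stores are globally visible */
    _mm_lfence();   /* keep the second rdtsc from executing early */
    site_ticks[site] += __rdtsc() - start;
    site_hits[site]++;
}
```

Placing `fence_probe(0)` before the suspected code and `fence_probe(1)` after it, then comparing `site_ticks[i] / site_hits[i]`, should show at which point the pending stores pile up. The probe itself changes the timing, so it is only useful for narrowing down the location.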

Bernard — Tue, 19 Nov 2013 17:37:38 GMT

@developer

If you cannot post the code disassembly, maybe you can track down by yourself the memory operations which used the r14 register prior to the function prologue and post your findings. I suspect that there could be some kind of WAR (write-after-read) issue.

Bernard — Tue, 19 Nov 2013 17:42:05 GMT

>>>Also - looks like all performance counters related to memory stores as asynchronous >>>

I think that, internally, the performance counters related to a specific event could be incremented when microcode and internal logic detect an occurrence of that event.

SergeyKostrov — Wed, 20 Nov 2013 05:22:00 GMT

>>...Also, removing all memory serializations (sfence and mfence) improved the performance of "push r14" and the whole application.

It is a very interesting result, because in several linear algebra algorithms I've implemented, using SFENCE improves processing performance by ~5% on 32-bit and 64-bit Windows systems with Pentium 4, Atom and Ivy Bridge processors. That is, we have completely different results. I've also experimented with SFENCE and found that it needs to be placed at the right point in the processing, and that point depends on the algorithm.

developer1 — Wed, 20 Nov 2013 16:59:58 GMT

@Sergey

I guess the major difference is that I have a slow PCI write, while your algorithms probably have only RAM accesses. So you are saying SFENCE provided some kind of hint to the CPU to do better memory ordering for your algorithm?

SergeyKostrov — Thu, 21 Nov 2013 05:45:37 GMT

>>...So you are saying SFENCE provided some kind of hint to the CPU to do better memory ordering for your algorithm?..

Yes, exactly! If you're interested in running an experiment, try a 3-loop matrix multiplication algorithm (classic form or transposed form) to verify whether it works on your computer.
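A minimal sketch of such an experiment, in classic form, could look like the code below. The function name `matmul3`, the `use_sfence` switch and above all the fence placement (once after each finished row of the result) are assumptions made for illustration; the thread does not say where the SFENCE went in the original algorithms. It assumes an x86 compiler that provides `xmmintrin.h`.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_sfence (x86 only) */

/* Classic three-loop product C = A * B for n x n row-major matrices.
 * When use_sfence is non-zero, an SFENCE is issued after each finished
 * row of C. That placement is an assumption, not the original code. */
static void matmul3(size_t n, const float *a, const float *b, float *c,
                    int use_sfence)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
        if (use_sfence)
            _mm_sfence();   /* order the row's stores before later stores */
    }
}
```

Timing the same call with `use_sfence` set to 0 and to 1 on large matrices would show whether the fence makes any measurable difference on a given machine.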