
seemingly bogus mem-store hardware performance counter data

gostanian__richard
New Contributor I
I'm experimenting with perf on a 4-socket Ivy Bridge system running CentOS 7 and am having a hard time understanding some of the output. A typical case is the following really simple C program:

    main() { while (1) { } }

The following is the assembly code:

    main():
      push   %rbp
      mov    %rsp,%rbp
      jmp    4

I name the executable sf and run perf for 10 seconds with the following performance counters:

    perf stat -e cpu-clock,major-faults,minor-faults,context-switches,cpu-migrations,instructions,cycles,stalled-cycles-frontend,mem-stores sf

and get the following results:

       9064.316086      cpu-clock (msec)          #    1.000 CPUs utilized
                 0      major-faults              #    0.000 K/sec
               109      minor-faults              #    0.012 K/sec
                 5      context-switches          #    0.001 K/sec
                 1      cpu-migrations            #    0.000 K/sec
    25,983,671,864      instructions              #    1.00  insn per cycle
                                                  #    0.50  stalled cycles per insn
    26,018,167,454      cycles                    #    2.870 GHz
    13,024,243,875      stalled-cycles-frontend   #   50.06% frontend cycles idle
         6,171,784      mem-stores                #    0.681 M/sec

       9.065339662 seconds time elapsed

How can there be so many mem-stores, and why does the count increase the longer I run? How can the IPC be only 1.00, and why are we stalling in the frontend when all it's doing is some register activity and, in the worst case, maybe going to the L1? I would expect the IPC to be at least 2 and closer to 3, and I wouldn't expect any memory stores. By the way, I had the whole machine to myself when I ran this.

My conclusion is that perf can't be trusted. Is this correct, and if so, are there alternatives to perf for getting correct hardware performance data?
McCalpinJohn
Honored Contributor III

According to https://www.agner.org/optimize/instruction_tables.pdf (page 193), the "push" instruction on an Ivy Bridge processor has a latency of three cycles.  Page 195 reports that the "jmp" instruction has a maximum throughput of one instruction every 2 cycles on Ivy Bridge.  From those values an IPC of 1.00 seems reasonable enough.

For test cases like this -- single thread, 10-second runtime -- "perf stat" usually manages to report reasonable counter values. 

Estimating the "expected" IPC can be tricky, especially when there are multiple uops in an instruction or multiple instructions that fuse into a single uop. Agner's Instruction Tables provide good information about these cases. Agner's Microarchitecture document is very helpful for understanding the historical progression of the implementations: https://www.agner.org/optimize/microarchitecture.pdf
