Are all stall cycles due to memory access?

Shouvik_B_ · 10-01-2012

Apologies if this is obvious or has been answered before. I ran perf on Linux (Ubuntu) and have pasted the output below. By my calculation, 48% of the cycles are stalled. Does that mean the program spends almost half its time fetching data from memory? (There is no I/O in my code.) BTW, it is a simple Java program doing some CPU work. Thanks for any help.

       5659.824477 task-clock                #    0.998 CPUs utilized
             2,633 context-switches          #    0.000 M/sec
                 4 CPU-migrations            #    0.000 M/sec
            83,941 page-faults               #    0.015 M/sec
    17,757,980,175 cycles                    #    3.138 GHz                     [83.24%]
     8,646,314,253 stalled-cycles-frontend   #   48.69% frontend cycles idle    [83.44%]
     3,526,296,727 stalled-cycles-backend    #   19.86% backend cycles idle    [66.95%]
    24,624,807,823 instructions              #    1.39 insns per cycle
                                             #    0.35 stalled cycles per insn [83.39%]
     3,117,168,198 branches                  # 550.754 M/sec                   [83.28%]
        90,353,461 branch-misses             #    2.90% of all branches         [83.13%]

       5.673470329 seconds time elapsed
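
As a sanity check on the 48% figure, the derived columns can be reproduced from the raw counts. This is a minimal Python sketch; the variable names are mine, and the "stalled cycles per insn" formula (larger of the two stall counts divided by instructions) is an assumption about how perf derives that column, not something stated in the output itself.

```python
# Raw counter values copied from the perf stat output above.
cycles = 17_757_980_175
stalled_frontend = 8_646_314_253
stalled_backend = 3_526_296_727
instructions = 24_624_807_823


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


frontend_idle = pct(stalled_frontend, cycles)  # "frontend cycles idle" column
backend_idle = pct(stalled_backend, cycles)    # "backend cycles idle" column
ipc = instructions / cycles                    # "insns per cycle" column

# Assumption: perf uses the larger stall count here, not the sum of both.
stalled_per_insn = max(stalled_frontend, stalled_backend) / instructions

print(f"frontend idle:    {frontend_idle:.2f}%")
print(f"backend idle:     {backend_idle:.2f}%")
print(f"insns per cycle:  {ipc:.2f}")
print(f"stalls per insn:  {stalled_per_insn:.2f}")
```

Note that the 48% comes from the frontend counter alone, and that the frontend and backend counts can cover the same cycles, so the two percentages should not simply be added together.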

Shouvik_B_ · 10-03-2012

I have found some good info, albeit for Itanium, here: http://software.intel.com/en-us/articles/characterize-application-performance-with-stall-events-on-64-bit-architecture Now I need to find similar info for Sandy Bridge, which is my processor.

McCalpinJohn · 10-04-2012

The big problem in this sort of analysis is attributing stalls to specific causes when there are multiple underlying stall conditions. There are also multiple functional units in any modern processor, so you have to decide whether a stall on one or more units is really a stall if other units are able to get work done (or initiate new work) in the same cycle.

A good performance analysis overview for processors using the Nehalem/Westmere cores is at http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf Most of the analysis is applicable to Sandy Bridge cores, though many of the specific performance counter events have changed. The Intel Arch SW Developer's Guide, Volume 3, lists all of the performance monitor events for the various processor families.

For Sandy Bridge, performance monitor Event 0Eh, Mask 01h, counts the number of uops issued each cycle. The notes for that entry in Table 19-3 tell which bits to set to get the event to count cycles for which zero uops are *issued*. Similarly, Event C2h, Mask 01h can be used to count the number of cycles in which no uops are *retired*. Either of these cases can be considered "stall" conditions, but there will be a lot of overlap between the two events, so you want to count one or the other, but not the sum. (Perhaps the larger of the two?)

Also on Sandy Bridge, Events 59h, 5Bh, 87h, A2h count some specific stall conditions. The most useful things to look at are probably Event A2h with Mask 02h to count stalls due to lack of free load buffers and Event A2h with Mask 08h to count stalls due to lack of free store buffers.

It can require a lot of knowledge of the microarchitecture to understand what these events mean in detail. Intel provides some of this information, but it is spread across a lot of documents. The best source of general information on how to optimize to eliminate stalls (including info for Sandy Bridge) is probably the Intel SW Optimization guide (document 248966, I use revision 026 from April 2012).
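
To make the event/mask numbers above concrete, here is a rough Python sketch of how they could be packed into a raw event code of the kind perf accepts as `rNNNN`. The helper names `raw_event` and `perf_raw` are hypothetical. The bit positions follow the performance event select register layout (event in bits 0-7, unit mask in bits 8-15, invert in bit 23, counter mask in bits 24-31), and the choice of CMask=1 with Inv=1 to count "zero uops" cycles is my reading of the notes mentioned above, so check it against Table 19-3 before relying on it.

```python
def raw_event(event, umask, cmask=0, inv=False):
    """Pack event fields into a raw config value (event select register layout)."""
    config = (event & 0xFF) | ((umask & 0xFF) << 8)
    if inv:
        config |= 1 << 23            # invert the counter-mask comparison
    config |= (cmask & 0xFF) << 24   # counter mask (threshold per cycle)
    return config


def perf_raw(config):
    """Format a config value as a perf raw event string."""
    return f"r{config:x}"


# Assumption: CMask=1 plus Inv=1 turns "uops per cycle" into "cycles with zero uops".
issue_stalls = raw_event(0x0E, 0x01, cmask=1, inv=True)   # no uops issued
retire_stalls = raw_event(0xC2, 0x01, cmask=1, inv=True)  # no uops retired
load_buf_stalls = raw_event(0xA2, 0x02)                   # no free load buffers
store_buf_stalls = raw_event(0xA2, 0x08)                  # no free store buffers

for name, cfg in [
    ("issue stalls", issue_stalls),
    ("retire stalls", retire_stalls),
    ("load buffer stalls", load_buf_stalls),
    ("store buffer stalls", store_buf_stalls),
]:
    print(f"{name:20s} {perf_raw(cfg)}")
```

The resulting strings could then be given to `perf stat -e` as a comma-separated list. Since the issue and retire stall counts overlap heavily, compare them rather than adding them.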

Shouvik_B_ · 10-04-2012

Thanks a lot for the detailed info. I will try out your suggestions. My main goal right now is to determine how much time is lost to stalls caused by memory contention (when multiple cores are working). Thanks again.