I am trying to analyze some benchmarks and see how many of their stall cycles are related to memory access. I looked at two documents: "Intel 64 and IA-32 Architectures Optimization Reference Manual" and "Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors".
Could anyone suggest how to identify memory related stalls?
Please refer to this article.
To measure the impact of memory stalls, compute the average duration (in cycles) of each stall:
UOPS_EXECUTED.CORE_STALL_CYCLES / UOPS_EXECUTED.CORE_STALL_COUNT
Counts cycles in which no uops were executed that were issued on any of the ports. This event must be a core count, as the port 2, 3 and 4 events are core counts.
Counts cycles in which one or more uops were executed that were issued on any of the ports. This event must be a core count, as the port 2, 3 and 4 events are core counts.
Hope it helps.
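As a rough illustration of the ratio above, here is a minimal Python sketch. The function name and the counter values are hypothetical; in practice the two numbers would come from whatever profiler you use to collect UOPS_EXECUTED.CORE_STALL_CYCLES and UOPS_EXECUTED.CORE_STALL_COUNT.

```python
def average_stall_length(stall_cycles: int, stall_count: int) -> float:
    """Average duration, in cycles, of one execution stall.

    stall_cycles: value read for UOPS_EXECUTED.CORE_STALL_CYCLES
    stall_count:  value read for UOPS_EXECUTED.CORE_STALL_COUNT
    (the function name and argument names are illustrative assumptions)
    """
    if stall_count == 0:
        # No stalls were recorded, so there is no average to report.
        return 0.0
    return stall_cycles / stall_count


# Made-up counter values, not measurements from a real run.
print(average_stall_length(1200, 100))
```

A large average suggests long stalls, but this ratio alone does not say whether the stalls were caused by memory access.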
I have looked at the document by Dr. Levinthal, but I still have some confusion regarding my initial question. Let me ask a few more questions. Does UOPS_EXECUTED.CORE_STALL_CYCLES/COUNT consider only the memory-related ports 2, 3 and 4? Or does it consider all the ports, including 0, 1 and 5, which are ALU related?
It makes sense to me that CPU_CYCLES_UNHALTED.THREAD = UOPS_EXECUTED.CORE_STALL_CYCLES + UOPS_EXECUTED.CORE_ACTIVE_CYCLES, as mentioned in the document.
But how can UOPS_EXECUTED.PORT015_STALL_CYCLES be greater than UOPS_EXECUTED.CORE_STALL_CYCLES?
My aim is to segregate memory-related stalls from the total stalls, which include both memory-related and ALU-related stalls.
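The cycle identity mentioned above can be used to express execution stalls as a fraction of all unhalted cycles. The sketch below assumes the identity holds exactly; the function name and the sample values are hypothetical, and the result covers all execution stalls, not only memory-related ones.

```python
def stall_fraction(stall_cycles: int, active_cycles: int) -> float:
    """Fraction of unhalted cycles in which no uops were executed.

    Assumes: unhalted cycles = CORE_STALL_CYCLES + CORE_ACTIVE_CYCLES,
    as stated in the post above. Names here are illustrative only.
    """
    total_cycles = stall_cycles + active_cycles
    if total_cycles == 0:
        raise ValueError("no cycles were counted")
    return stall_cycles / total_cycles


# Made-up counter values, not measurements from a real run.
print(stall_fraction(250, 750))
```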
It doesn't make sense to use UOPS_EXECUTED.PORT234_CORE, because Dr. Levinthal's document says:
"The signals used to count the memory access uops executed (ports 2, 3 and 4) are the only core events which cannot be counted on a logical core or HT basis...the ALU ports (0,1,5) count on a per thread basis"
"Thus in the case where HT is enabled we have the following inequality
UOPS_EXECUTED.CORE_STALL_CYCLES <= True execution stalls per thread <= UOPS_EXECUTED.PORT015_STALL_CYCLES
Of course with HT disabled then
UOPS_EXECUTED.CORE_STALL_CYCLES = True execution stalls per thread = UOPS_EXECUTED.PORT015_STALL_CYCLES"
In most cases HT is enabled on the system, so to reduce complexity simply use UOPS_EXECUTED.CORE_STALL_CYCLES whether HT is enabled or not.
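The inequality quoted above can be read as a lower and an upper bound on the true per-thread execution stalls. Here is a minimal Python sketch of that reading; the function name, its arguments, and the sample values are hypothetical, and it only restates the quoted relation rather than anything measured.

```python
from typing import Tuple


def execution_stall_bounds(core_stall_cycles: int,
                           port015_stall_cycles: int,
                           ht_enabled: bool) -> Tuple[int, int]:
    """Return (lower, upper) bounds on true execution stalls per thread.

    With HT enabled, the quoted inequality is
        CORE_STALL_CYCLES <= true stalls <= PORT015_STALL_CYCLES.
    With HT disabled, the quote says all three quantities are equal,
    so both bounds collapse to CORE_STALL_CYCLES.
    """
    if not ht_enabled:
        return (core_stall_cycles, core_stall_cycles)
    if core_stall_cycles > port015_stall_cycles:
        # Contradicts the quoted inequality; the counters may have been
        # collected over different intervals.
        raise ValueError("CORE_STALL_CYCLES exceeds PORT015_STALL_CYCLES")
    return (core_stall_cycles, port015_stall_cycles)


# Made-up counter values, not measurements from a real run.
print(execution_stall_bounds(400, 650, ht_enabled=True))
```

Using only UOPS_EXECUTED.CORE_STALL_CYCLES, as suggested above, corresponds to taking the lower bound.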