I am trying to analyze some benchmarks and see how many of their stall cycles are related to memory access. I have looked at two documents: "Intel 64 and IA-32 Architectures Optimization Reference Manual" and "Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors".
Could anyone suggest how to identify memory-related stalls?
Please refer to this article.
To measure the impact of stalls, compute the average length of a stall in cycles:
UOPS_EXECUTED.CORE_STALL_CYCLES / UOPS_EXECUTED.CORE_STALL_COUNT
UOPS_EXECUTED.CORE_STALL_CYCLES:
Counts cycles in which no uops were executed on any port. This event must be a core count, because the port 2, 3 and 4 events are core counts.
UOPS_EXECUTED.CORE_STALL_COUNT:
Counts the number of times execution enters a stall, that is, goes from a cycle in which 1 or more uops were executed on any port to a cycle in which none were. This event must be a core count, because the port 2, 3 and 4 events are core counts.
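As a rough sketch of the ratio above, assuming the two counter values have already been collected with a profiler: the function name and the sample numbers below are made up for illustration and do not come from any tool.

```python
def average_stall_cycles(stall_cycles: int, stall_count: int) -> float:
    """Average length, in cycles, of one execution stall.

    stall_cycles: raw value of the stall-cycles event (cycles with no uops executed)
    stall_count:  raw value of the stall-count event (number of separate stalls)
    """
    if stall_count == 0:
        # No stalls were recorded, so there is no average to report.
        return 0.0
    return stall_cycles / stall_count


# Hypothetical counter values, for illustration only.
print(average_stall_cycles(1_200_000, 40_000))  # 30.0
```

A large average points to a few long stalls (typical of cache or memory misses), while a small average with a high count points to many short ones.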
Hope it helps.
Regards, Peter
I have looked at the document by Dr. Levinthal, but I am still confused about my initial question, so let me ask a few more. Do UOPS_EXECUTED.CORE_STALL_CYCLES and UOPS_EXECUTED.CORE_STALL_COUNT consider only the memory-related ports 2, 3 and 4? Or do they consider all the ports, including the ALU-related ports 0, 1 and 5?
It makes sense to me that CPU_CLK_UNHALTED.THREAD = UOPS_EXECUTED.CORE_STALL_CYCLES + UOPS_EXECUTED.CORE_ACTIVE_CYCLES, as mentioned in the document.
But how can UOPS_EXECUTED.PORT015_STALL_CYCLES be greater than UOPS_EXECUTED.CORE_STALL_CYCLES?
My aim is to separate the memory-related stalls from the total stalls, which include both memory-related and ALU-related stalls.
Thanks again,
Hi Vineeth,
It doesn't make sense to use UOPS_EXECUTED.PORT234_CORE for this, because, as Dr. Levinthal's document says:
"The signals used to count the memory access uops executed (ports 2, 3 and 4) are the only core events which cannot be counted on a logical core or HT basis...the ALU ports (0,1,5) count on a per thread basis"
"Thus in the case where HT is enabled we have the following inequality
UOPS_EXECUTED.CORE_STALL_CYCLES <= True execution stalls per thread <= UOPS_EXECUTED.PORT015_STALL_CYCLES
Of course with HT disabled then
UOPS_EXECUTED.CORE_STALL_CYCLES = True execution stalls per thread = UOPS_EXECUTED.PORT015_STALL_CYCLES"
That inequality is why PORT015_STALL_CYCLES can be larger than CORE_STALL_CYCLES when HT is enabled. Since HT is enabled on most systems, the simplest approach is to use UOPS_EXECUTED.CORE_STALL_CYCLES whether HT is enabled or not.
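To make the inequality concrete, here is a minimal sketch that turns the two stall counters into a lower and an upper bound on the fraction of a thread's unhalted cycles spent stalled. It assumes the raw counts were already collected; the function name, parameter names, and sample values are hypothetical.

```python
def stall_fraction_bounds(core_stall_cycles: int,
                          port015_stall_cycles: int,
                          unhalted_cycles: int) -> tuple[float, float]:
    """Bound the true per-thread execution stall fraction.

    With HT enabled: core_stall_cycles <= true stalls <= port015_stall_cycles.
    With HT disabled the two counters should agree, so both bounds coincide.
    """
    if unhalted_cycles <= 0:
        raise ValueError("unhalted_cycles must be positive")
    lower = core_stall_cycles / unhalted_cycles
    upper = port015_stall_cycles / unhalted_cycles
    return lower, upper


# Hypothetical counter values, for illustration only.
low, high = stall_fraction_bounds(2_000_000, 3_000_000, 10_000_000)
print(f"true stall fraction is between {low:.0%} and {high:.0%}")
```

Note that these bounds cover all execution stalls; neither counter on its own isolates the memory-related share.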
Regards, Peter
Please read this article for optimization guidelines for Intel Core i7 processors.
Hope it helps.
Regards, Peter