I want to count the number of retired uops for a benchmark on an Intel Core i7 processor (Haswell), and I am confused about a few of the available counters. I have figured out that I can use event code 0xc2 with a umask of 0x01 for this purpose. My first question: do these event codes depend on the actual processor? Second, does the event UOPS.RETIRED count a fused micro-op as two simple uops (in other words, is this counter incremented by 2 when a fused uop retires)?
Thanks for your time.
Chapter 19 of Volume 3 of the Intel Architecture Software Developer's Manual (document 325384) includes tables of the performance counter events for each processor family. Event 0xC2, Umask 0x01 is not one of the "architectural" events listed in Table 19-1, but that event is listed with a name consistent with "UOPS_RETIRED" in most of the sections of Chapter 19. It looks like when you get back as far as the "Core" architecture, the details of the event definition start to change.
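As a rough illustration of how Event 0xC2, Umask 0x01 is typically handed to a tool, here is a minimal Python sketch that packs the two fields the way the IA32_PERFEVTSELx registers lay them out (event select in bits 0-7, unit mask in bits 8-15) and formats the result as a Linux perf raw event string. The function names are made up for this example, and the other control bits (USR, OS, edge, cmask, and so on) are left at zero on the assumption that the tool sets them.

```python
def raw_event_config(event_select: int, umask: int) -> int:
    """Pack an event select (bits 0-7) and unit mask (bits 8-15).

    Hypothetical helper for illustration; all other control bits stay zero.
    """
    if not 0 <= event_select <= 0xFF:
        raise ValueError("event_select must fit in 8 bits")
    if not 0 <= umask <= 0xFF:
        raise ValueError("umask must fit in 8 bits")
    return (umask << 8) | event_select


def perf_raw_name(event_select: int, umask: int) -> str:
    """Format the packed value as a perf-style raw event, e.g. for `perf stat -e`."""
    return f"r{raw_event_config(event_select, umask):04x}"


if __name__ == "__main__":
    # Event 0xC2, Umask 0x01 from the question above.
    print(perf_raw_name(0xC2, 0x01))  # prints r01c2
```

Whether that encoding means "uops retired" still has to be checked against the Chapter 19 table for the specific processor family, since the event is not architectural.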
The documentation is not clear about how the various fused uops are handled. If you go back as far as Nehalem (Table 19-17 of Volume 3 of the SWDM), the notes say that "macro-fused" uops increment the counter once, "micro-fused" uops increment the counter twice, and all other cases increment the counter once. It is not clear whether this implementation detail applies to newer processor cores, but this is fairly easy to measure.
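One way to do that measurement, sketched here in Python under stated assumptions: run a loop containing a known number of copies of the instruction under test, run the same loop without them as a baseline, and divide the difference in the counter by the number of instruction instances. The helper name and the input numbers below are placeholders for illustration only, not measured values.

```python
def uops_per_instruction(count_with: int, count_baseline: int,
                         iterations: int, copies_per_iteration: int) -> float:
    """Estimate the counter increment per instance of the instruction under test.

    count_with     -- counter reading for the loop containing the test instruction
    count_baseline -- counter reading for the same loop without it
    """
    instances = iterations * copies_per_iteration
    if instances <= 0:
        raise ValueError("need at least one instruction instance")
    return (count_with - count_baseline) / instances


if __name__ == "__main__":
    # Placeholder readings (not real data): a result near 2 would suggest the
    # counter increments twice per instruction, a result near 1 would suggest once.
    print(uops_per_instruction(2_100_000, 100_000, 1_000_000, 1))
```

Using a large iteration count and repeating the run helps separate a systematic increment of 1 versus 2 from loop overhead and counting noise.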
The Haswell core has a bug that causes the "INSTRUCTIONS_RETIRED" event to sometimes overcount or undercount. If I recall correctly, this same bug applies to UOPS_RETIRED as well, but I can't find my notes on it right now. Fortunately in my test cases the errors were repeatable, so they did not cause me much trouble, and I only saw the errors in a few fairly small pieces of code. (This might mean that the overcounting and undercounting canceled out most of the time?)
The "Intel® Xeon Processor E3-1200 v3 Product Family Specification Update, October 2016" lists only UOPS.EXECUTED as possibly undercounting (HSW30), with further inaccuracies when HT is enabled (HSW144): http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spe...
The "Intel® Xeon® Processor E5 v3 Product Family Processor Specification Update September 2016" lists Instructions_Retired as not counting consistently (HSE71): http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-up...
The "Desktop 4th Generation Intel® Core™ Processor Family, Desktop Intel® Pentium® Processor Family, and Desktop Intel® Celeron® Processor Family Specification Update January 2017" also lists UOPS.EXECUTED as possibly undercounting (HSD30), with further inaccuracies when HT is enabled (HSD144): http://www.intel.com/content/www/us/en/processors/core/4th-gen-core-family-desktop-specification-upd...
Welcome to the wonderful world of inconsistent product documentation!
I found my notes on the Haswell overcounting bug at http://www.agner.org/optimize/blog/read.php?i=452&v=t
Other similar loops had no (net) counting errors. (They might have had multiple offsetting errors.)
I was not able to identify any particular instruction type or instruction combination that caused this overcounting, but I did not look particularly hard.