Hi, I'm using PAPI counters to measure my C++ code. Most results are as expected, but the CPU cycle count (e.g., CPU_CLK_UNHALTED.THREAD_P) is always about 4 times smaller than the retired uop count (e.g., UOPS_RETIRED:ANY). Is something wrong? As far as I know, an Intel Xeon E5 can retire around 4 uops per cycle, so this would mean that almost all CPU cycles are spent on useful computation, which is obviously wrong.
In addition, the number of retired uops (e.g., UOPS_RETIRED:ANY) comes out larger than the number of issued uops (e.g., UOPS_ISSUED:ANY), which does not make sense, right?
Background: my code should be a memory-bound application, but the computation time implied by UOPS_RETIRED:ANY is already over 80% of the running time. Could you please help me find the reason?
There are many ways to get confusing results with performance counters....
Some Intel processors have bugs in performance counter events that can lead to overcounting and/or undercounting. On the Xeon E5 v3 (Haswell) cores, there is a published erratum for the INSTRUCTIONS_RETIRED.ANY event, and I have seen its effect in one of my codes: a loop reported 20% more instructions retired than were actually present. I don't see a similar erratum for UOPS_RETIRED, but I am pretty sure that I saw the same overcounting in UOPS that I saw in INSTRUCTIONS.
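One way to catch this kind of miscounting early is to test the raw counter deltas against simple invariants before deriving anything from them. The sketch below is only an illustration: the struct, the function name, and the default limit of 4 uops per cycle (taken from the question above) are assumptions, not part of PAPI or any Intel tool.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container for "after minus before" counter differences.
struct CounterDeltas {
    std::uint64_t cycles;        // e.g., CPU_CLK_UNHALTED.THREAD_P
    std::uint64_t uops_issued;   // e.g., UOPS_ISSUED:ANY
    std::uint64_t uops_retired;  // e.g., UOPS_RETIRED:ANY
};

// Returns one warning per violated invariant; an empty result means the
// counts are at least self-consistent (it does not prove they are correct).
// max_uops_per_cycle is an assumed retirement width, not a measured value.
std::vector<std::string> sanity_check(const CounterDeltas& d,
                                      double max_uops_per_cycle = 4.0) {
    std::vector<std::string> warnings;
    if (d.cycles == 0) {
        warnings.push_back("zero cycles: is the counter running?");
        return warnings;
    }
    const double uops_per_cycle =
        static_cast<double>(d.uops_retired) / static_cast<double>(d.cycles);
    if (uops_per_cycle > max_uops_per_cycle) {
        warnings.push_back("retired uops per cycle exceeds the assumed width");
    }
    if (d.uops_retired > d.uops_issued) {
        warnings.push_back("retired uops exceed issued uops");
    }
    return warnings;
}
```

A warning here does not say which counter is wrong, only that the set of values cannot all be right at the same time.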
PAPI is a convenient interface, but there is also a lot of overhead associated with the counter virtualization. I prefer to do my testing with each execution thread bound to a single logical processor and then use inline assembly to execute RDPMC instructions to read the counters (e.g., https://github.com/jdmccalpin/low-overhead-timers). Even in this case, I still get confusing results sometimes because it is not possible to read multiple counters atomically. For example, if an interrupt occurs in the middle of reading a set of counters, the differences between the "before" and "after" counts may not be consistent.
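A minimal sketch of this approach follows. It is not the code from the repository linked above. The `rdpmc` wrapper assumes x86-64, GCC/Clang inline-assembly syntax, and an OS that permits user-mode RDPMC (otherwise the instruction faults). The helper `read_counters_consistent` is a hypothetical mitigation for the non-atomic read problem: it brackets the counter reads with two clock reads and retries if the bracket took suspiciously long, which suggests an interrupt landed in the middle. The threshold `max_ticks` has to be tuned by hand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__)
// Read programmable performance counter `idx` with RDPMC.
// Faults unless the OS has enabled user-mode RDPMC for this thread.
inline std::uint64_t rdpmc(std::uint32_t idx) {
    std::uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#endif

// Hypothetical helper: read N counters between two clock reads. If the whole
// set took more than max_ticks, assume the thread was interrupted and retry.
// Returns false if no attempt finished within the limit.
template <std::size_t N, typename ReadCounter, typename ReadClock>
bool read_counters_consistent(std::array<std::uint64_t, N>& out,
                              ReadCounter read_counter,
                              ReadClock read_clock,
                              std::uint64_t max_ticks,
                              int max_tries = 8) {
    assert(max_tries > 0);
    for (int attempt = 0; attempt < max_tries; ++attempt) {
        const std::uint64_t t0 = read_clock();
        for (std::size_t i = 0; i < N; ++i) {
            out[i] = read_counter(static_cast<std::uint32_t>(i));
        }
        const std::uint64_t t1 = read_clock();
        if (t1 - t0 <= max_ticks) {
            return true;  // short bracket: the reads were probably not interrupted
        }
    }
    return false;
}
```

On real hardware one would pass `rdpmc` as `read_counter` and a timestamp source such as a TSC read as `read_clock`. Taking both as callables keeps the retry logic testable without access to the counters. This reduces the chance of an inconsistent set but does not make the reads atomic.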