Hi, I'm using PAPI counters to measure my C++ code. Most results are as expected, but the CPU cycle count (e.g., CPU_CLK_UNHALTED.THREAD_P) is always about 4 times smaller than the retired uop count (e.g., UOPS_RETIRED:ANY). Is something wrong? As far as I know, an Intel Xeon E5 retires around 4 uops per cycle, so this would mean that almost all CPU cycles are spent on useful computation, which is obviously wrong.
In addition, the number of retired uops (e.g., UOPS_RETIRED:ANY) comes out larger than the number of issued uops (e.g., UOPS_ISSUED:ANY), which does not make sense, right?
For background: my code should be a memory-bound application, but the computation time implied by UOPS_RETIRED:ANY is already over 80% of the running time. Could you please help me find the reason?
There are many ways to get confusing results with performance counters....
Some Intel processors have bugs in performance counter events that can lead to overcounting and/or undercounting. On the Xeon E5 v3 (Haswell) cores, there is a published erratum for the INSTRUCTIONS_RETIRED.ANY event that I have seen in one of my codes. (A loop reported 20% more instructions retired than were actually present.) I don't see a similar erratum for UOPS_RETIRED, but I am pretty sure that I saw the same overcounting in UOPS that I saw in INSTRUCTIONS.
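To compare counter readings against expectations, it can help to reduce the raw counts to a few ratios first. The following is only a sketch: the `CounterDeltas` struct and the helper names are hypothetical, and the values are assumed to be "after minus before" differences collected by whatever tool is in use.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for one measurement interval (after - before).
struct CounterDeltas {
    std::uint64_t cycles;        // e.g. CPU_CLK_UNHALTED.THREAD_P
    std::uint64_t uops_issued;   // e.g. UOPS_ISSUED:ANY
    std::uint64_t uops_retired;  // e.g. UOPS_RETIRED:ANY
};

// Retired uops per unhalted cycle; returns 0 if no cycles were counted.
double uops_per_cycle(const CounterDeltas& d) {
    if (d.cycles == 0) return 0.0;
    return static_cast<double>(d.uops_retired) / static_cast<double>(d.cycles);
}

// Flags the case the question found surprising: more uops retired than issued.
bool retired_exceeds_issued(const CounterDeltas& d) {
    return d.uops_retired > d.uops_issued;
}

// Relative overcount of a measured event versus a count known from the code,
// e.g. 0.20 when a loop reports 20% more instructions than are present.
// Negative values indicate undercounting.
double relative_overcount(std::uint64_t measured, std::uint64_t expected) {
    assert(expected > 0);
    return (static_cast<double>(measured) - static_cast<double>(expected)) /
           static_cast<double>(expected);
}
```

Running this on a small loop whose instruction count is known by construction gives a baseline for how far a given event can be trusted before applying it to the full application.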
PAPI is a convenient interface, but there is also a lot of overhead associated with the counter virtualization. I prefer to do my testing with each execution thread bound to a single logical processor and then use inline assembly to execute RDPMC instructions to read the counters (e.g., https://github.com/jdmccalpin/low-overhead-timers). Even in this case, I still get confusing results sometimes because it is not possible to read multiple counters atomically. E.g., if an interrupt occurs in the middle of reading a set of counters, the differences between the "before" and "after" counts may not be consistent.
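As a rough illustration of the RDPMC approach (this is not code from the repository linked above), a minimal sketch might look like the following. It assumes GCC or Clang on x86, that the counters have already been programmed, that user-mode RDPMC is enabled (otherwise the instruction faults), and that the thread is pinned to one logical processor. The helper names and the 48-bit counter width are assumptions; the actual width is reported by CPUID.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
// Read programmable performance counter `index` with RDPMC.
// Faults unless user-mode RDPMC has been enabled for this process.
static inline std::uint64_t rdpmc(std::uint32_t index) {
    std::uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#endif

// Read N counters back to back through a caller-supplied reader
// (pass `rdpmc` on real hardware). The reads are NOT atomic as a group:
// an interrupt between two reads can make the set inconsistent.
template <std::size_t N, typename Reader>
std::array<std::uint64_t, N> read_all(Reader&& read) {
    std::array<std::uint64_t, N> values{};
    for (std::size_t i = 0; i < N; ++i) {
        values[i] = read(static_cast<std::uint32_t>(i));
    }
    return values;
}

// Element-wise (after - before), masked to the counter width so that a
// single wraparound between the two reads is handled.
// Assumption: counters are 48 bits wide.
template <std::size_t N>
std::array<std::uint64_t, N> deltas(const std::array<std::uint64_t, N>& before,
                                    const std::array<std::uint64_t, N>& after,
                                    unsigned width_bits = 48) {
    assert(width_bits > 0 && width_bits < 64);
    const std::uint64_t mask = (std::uint64_t{1} << width_bits) - 1;
    std::array<std::uint64_t, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = (after[i] - before[i]) & mask;
    }
    return out;
}
```

A typical use would call `read_all<N>(rdpmc)` immediately before and after the region of interest and pass both snapshots to `deltas`. Taking the reader as a parameter keeps the bookkeeping testable on machines where RDPMC is not available from user space.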