I am working on a tool that provides access to hardware events through the performance monitoring counters (PMCs). The tool works well; I have tested it on several Intel processors: Sandy Bridge, Haswell and Haswell-EP. Now I am working with a Broadwell processor that has some new cache monitoring features I need to use.
Trying my tool on this processor, I found that the events LLC Reference (2EH, Umask 4FH) and LLC Misses (2EH, Umask 41H), described in Table 19.1 of 64-ia-32-architectures-software-developer-manual-325462.pdf, report the same number.
I thought this could be an error in my tool, so I tried perf and got the same result. Also, I can use only 4 programmable PMCs, although the processor is supposed to have 8; if I try to use a 5th PMC it returns zero. The same happens with perf.
My processor is:
Intel(R) Xeon(R) CPU D-1540 @ 2.00GHz
Vendor : GenuineIntel
Family : 6
Model : 6
Type : OEM
The perf output is:
$ perf stat -I 1000 -e instructions:u,cycles:u,cache-references:u,cache-misses:u ./benchmarks/spec2006/mcf06
# time counts unit events
1.000170323 1.513.051.613 instructions:u
1.000170323 1.966.105.738 cycles:u
1.000170323 82.844.475 cache-references:u
1.000170323 82.845.042 cache-misses:u
2.000390792 985.271.999 instructions:u
2.000390792 2.597.388.127 cycles:u
2.000390792 77.201.120 cache-references:u
2.000390792 77.200.636 cache-misses:u
3.000546036 928.783.029 instructions:u
3.000546036 2.597.151.649 cycles:u
3.000546036 73.133.856 cache-references:u
3.000546036 73.133.954 cache-misses:u
4.000699910 906.864.990 instructions:u
4.000699910 2.597.354.693 cycles:u
4.000699910 73.593.433 cache-references:u
4.000699910 73.593.252 cache-misses:u
In my tool LLC_misses is exactly the same as LLC_references; in perf there is a small difference because of the way perf works. I think this is a bug in the processor or an error in the manual. Does anybody know about this error? Thanks in advance for your suggestions and comments.
I talked with the vendor and he thinks my processor is damaged. I have to return my server and he will replace it; because the processor is soldered onto the motherboard, he has to replace the whole system.
With event code 0x2e, no matter which umask is used (0x4f, 0x41, 0x71, 0xf, 0x1 or 0x7f), the counter always returns the same value: not a fixed value, but the same value for all umasks.
Replacing my server will take some weeks, since I live in Spain and the processor has to be imported from the USA. If any of you has access to this processor, could you confirm whether my problems are hardware bugs or the result of a faulty processor?
My problems are:
1. Event 0x2e always returns the same value no matter which umask is used.
2. I have access to only 4 programmable counters (PMCs).
Thanks for your feedback.
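For anyone who wants to reproduce the first problem with perf, here is a minimal sketch that builds the raw event strings for event 0x2e with each of the umasks listed above. It assumes perf's usual raw encoding (`r` followed by the hex value of umask << 8 | event); the `./benchmark` path is only a placeholder.

```python
# Build perf raw-event strings for event 0x2E with several umasks.
# Assumption: perf's raw format is "r" + hex((umask << 8) | event).

LLC_EVENT = 0x2E
UMASKS = [0x4F, 0x41, 0x71, 0x0F, 0x01, 0x7F]


def raw_event(event, umask):
    """Return the perf raw-event string for an event code and unit mask."""
    return "r{:x}".format((umask << 8) | event)


events = [raw_event(LLC_EVENT, u) for u in UMASKS]

# "./benchmark" is a placeholder for the program being measured.
print("perf stat -e " + ",".join(e + ":u" for e in events) + " ./benchmark")
```

Note that six events on four usable counters will make perf multiplex them, so for an exact comparison it is probably better to measure them in groups of four or fewer.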
Volume 3 of the Intel SW Developer's Manual (Table 18-37) says that you can only use all 8 counters per core if the system is booted with HyperThreading (sometimes called "Logical Processors") disabled. With HyperThreading enabled, each "Logical Processor" can only access 4 of the 8 programmable core performance counters.
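One way to check which case applies on a Linux system is to look at the thread siblings of a core. The sketch below is only an illustration: it assumes the standard sysfs topology file is available, and the 8-versus-4 rule it encodes is just the one quoted above, not something read from the hardware.

```python
# Guess how many programmable counters a logical CPU can use, based on
# whether its core has more than one hardware thread (HyperThreading on).

SYSFS = "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Expand a Linux CPU list such as "0,8" or "0-1" into CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def usable_counters(siblings):
    """Apply the rule from the manual: 8 with one thread per core, else 4."""
    return 8 if len(parse_cpu_list(siblings)) == 1 else 4


try:
    with open(SYSFS) as f:
        print("expected programmable counters:", usable_counters(f.read()))
except OSError:
    print("topology information not available on this system")
```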
Some estimates of the expected cache miss rate values can be obtained by looking at the extensive measurements of the SPEC 2006 benchmarks reported at http://www.jaleels.org/ajaleel/workload/
For the mcf benchmark running the reference input set, the chart above suggests a cache miss rate of about 60 misses per 1000 instructions for a 256 KiB unified L2 cache and a miss rate of about 12 misses per 1000 instructions for a 12 MiB unified L3 cache. Looking at the results above, the first trial gave just under 83M misses in 1.5B instructions, which is a rate of about 55 misses per 1000 instructions. This is close to what you would expect without an L3 cache, so I would be concerned about the hardware. Unfortunately these processors are so new that there is not much documentation for other counters (such as memory controller counters) that might be useful for checking to see if the L3 is working correctly.
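The arithmetic can be checked with a few lines of Python. This is just a sketch that takes the instruction and cache-miss counts from the four perf intervals posted above and converts them to misses per 1000 instructions (MPKI).

```python
# Misses per 1000 instructions (MPKI) for the four perf intervals above.
# Each tuple is (instructions:u, cache-misses:u) copied from the perf output.

samples = [
    (1513051613, 82845042),
    (985271999, 77200636),
    (928783029, 73133954),
    (906864990, 73593252),
]


def mpki(instructions, misses):
    """Return misses per 1000 instructions."""
    return 1000.0 * misses / instructions


for second, (instructions, misses) in enumerate(samples, start=1):
    print("interval {}: {:.1f} misses per 1000 instructions".format(
        second, mpki(instructions, misses)))
```

The first interval comes out at roughly 55 misses per 1000 instructions, and the later intervals are higher still, so all of them are in the range expected for the L2 rather than for a 12 MiB L3.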
An alternative check would be to use the SPEC CPU2006 cache hit rate charts above to pick a code and input set that are expected to have a modest miss rate for the 256 KiB L2, but (very close to a) zero miss rate for a 12 MiB L3 cache. Examples include any of the input sets for 401.bzip2, any of the input sets for 464.h264ref, either of the input sets for 456.hmmer, the "checkspam" or "splitmail" inputs for 400.perlbench, and the 482.sphinx3 benchmark.
I would also run these tests with the hardware prefetchers disabled to see if that changes the counts in a useful way. This is described at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
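As a rough sketch of what that involves on Linux: the linked article describes a control register (MSR 0x1A4) in which bits 0 to 3 each disable one prefetcher when set. The code below assumes that layout applies to this processor, that the `msr` kernel module is loaded, and that it runs as root; the helper names are made up for the example.

```python
# Sketch: disable hardware prefetchers through MSR 0x1A4 on Linux.
# Assumptions: bit layout as in the linked Intel article, the msr kernel
# module is loaded (/dev/cpu/N/msr exists), and the caller is root.
import os
import struct

MSR_PREFETCH_CONTROL = 0x1A4

# Setting a bit disables the corresponding prefetcher.
PREFETCHER_BITS = {
    "l2_hw": 0,             # L2 hardware prefetcher
    "l2_adjacent_line": 1,  # L2 adjacent cache line prefetcher
    "dcu_streamer": 2,      # L1 data cache (DCU) prefetcher
    "dcu_ip": 3,            # L1 data cache (DCU) IP prefetcher
}


def disable_mask(names):
    """Return the MSR bits that disable the named prefetchers."""
    mask = 0
    for name in names:
        mask |= 1 << PREFETCHER_BITS[name]
    return mask


def read_msr(cpu, reg):
    """Read a 64-bit MSR on one logical CPU."""
    fd = os.open("/dev/cpu/{}/msr".format(cpu), os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu, reg, value):
    """Write a 64-bit MSR on one logical CPU."""
    fd = os.open("/dev/cpu/{}/msr".format(cpu), os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


def disable_prefetchers(cpu, names):
    """Set the disable bits, keeping other bits; return the old value."""
    old = read_msr(cpu, MSR_PREFETCH_CONTROL)
    write_msr(cpu, MSR_PREFETCH_CONTROL, old | disable_mask(names))
    return old


# Only computes the mask; nothing is written unless disable_prefetchers
# is called explicitly.
print(hex(disable_mask(PREFETCHER_BITS)))
```

The register is per core, so `disable_prefetchers` would have to be called for every core the benchmark can run on, and the returned old value written back afterwards to restore the original setting.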
John thanks for your answer.
I actually have HT disabled:
$ lscpu | grep -i -E "^CPU\(s\):|core|socket"
Thread(s) per core: 1
Core(s) per socket: 8
I tried 464.h264ref because it is the one with the smallest working set. I took a sample every second.
nsample pid Instr Cycles LLC_requests LLC_Misses L2_requests L2_misses LLC_Usage (Bytes)
1 1042 5493409057 2111392416 12256433 12256433 39790416 12257650 10125312
2 1042 7325193416 2586067847 19203738 19203738 50093591 19204266 10846208
3 1042 7401938326 2576543380 20886234 20886234 49512747 20886660 11862016
4 1042 7473959529 2576339358 21857777 21857777 49560115 21858200 12255232
The LLC seems to work correctly. I am showing the LLC_usage result from Intel CMT, and the benchmark uses the whole cache. I can modify the way allocation through Intel CAT, and that also seems to work well: when I change the ways assigned to the running benchmark, LLC_usage changes. Of course, it would be great to have access to an off-chip event to check this.
I need LLC_misses to estimate the MPKI curves, but I don't know why that event is not working when everything else seems to work. Thanks again.
These results are certainly suggestive of a broken performance counter event.... :-(
The Broadwell processors should support the "Offcore Response Counter" events (0xB7 and 0xBB), which can be programmed to count LLC misses. It requires a fair investment in time to understand how to program the counters (and some combinations that seem reasonable don't appear to work), but it may provide an alternative source of information.
From the 464.h264ref results, the L2 access rate is definitely higher than Jaleel's results -- you are seeing about 7 L2 accesses per 1000 instructions, while Jaleel's charts show only about 2.6 L1 misses per 1000 instructions. The difference could be due to L1 prefetches (it is very hard to tell if the performance counters differentiate between L1 demand misses and L1 hardware prefetch misses), or perhaps due to significantly different code generation. (I don't know the h264ref code well enough to guess whether it is amenable to significant variability in code generation. Looking at instructions is always a bit risky, since compilers may optimize for execution time in ways that give very different instruction counts.) The L2 miss rate is also higher than Jaleel's results, but not by such a large ratio.
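For reference, the per-1000-instruction rates can be recomputed from the four samples posted above. This is only a sketch over the posted numbers; it also flags the samples in which LLC_requests and LLC_Misses are identical.

```python
# Per-1000-instruction rates from the 464.h264ref samples posted above.
# Columns: Instr, LLC_requests, LLC_Misses, L2_requests, L2_misses.

rows = [
    (5493409057, 12256433, 12256433, 39790416, 12257650),
    (7325193416, 19203738, 19203738, 50093591, 19204266),
    (7401938326, 20886234, 20886234, 49512747, 20886660),
    (7473959529, 21857777, 21857777, 49560115, 21858200),
]


def per_kilo_instr(count, instructions):
    """Return events per 1000 instructions."""
    return 1000.0 * count / instructions


for n, (instr, llc_req, llc_miss, l2_req, l2_miss) in enumerate(rows, 1):
    print("sample {}: L2 accesses/1000 instr = {:.2f}, "
          "L2 misses/1000 instr = {:.2f}, LLC requests == LLC misses: {}".format(
              n,
              per_kilo_instr(l2_req, instr),
              per_kilo_instr(l2_miss, instr),
              llc_req == llc_miss))
```

The L2 access rate stays near 7 per 1000 instructions in every sample, and the LLC request and miss counts match exactly in all four.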
John thanks again for your answer.
Our hardware vendor found that this problem also appears on identical machines they have for testing (same CPU, same motherboard). We were wondering whether you have observed this performance-counter issue on other Broadwell processors at Intel.
We have not solved the problem. We contacted our hardware vendor; they tested some machines they have for testing, and the error is also present on those machines. It seems to be a hardware error.