Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Error reading llc_misses event in Xeon D-1540

Roberto_R_
Beginner
1,293 Views

Hello everyone, 

I am working on a tool that provides access to hardware events through the performance monitoring counters (PMCs). The tool works well; I have tested it on several Intel processors: Sandy Bridge, Haswell and Haswell-EP. Now I am working with a Broadwell processor that has some new cache monitoring features I need to use.

When I tried my tool on this processor, I found that two events described in Table 19.1 of 64-ia-32-architectures-software-developer-manual-325462.pdf, LLC Reference (event 2EH, umask 4FH) and LLC Misses (event 2EH, umask 41H), report the same count.
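For anyone who wants to reproduce this with perf's raw-event syntax, here is a minimal sketch of how the two event codes above map to raw event strings. It assumes the usual layout where the event select is the low byte and the umask is the next byte; the helper name raw_event is only illustrative and is not part of perf.

```python
def raw_event(event_select: int, umask: int) -> str:
    """Build a perf raw-event string (rUUEE) from an event select and a umask.

    Assumes the event select occupies bits 0-7 and the umask bits 8-15.
    """
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event select and umask must each fit in one byte")
    config = (umask << 8) | event_select
    return f"r{config:04x}"


# The two architectural events from Table 19.1 quoted above.
llc_reference = raw_event(0x2E, 0x4F)
llc_misses = raw_event(0x2E, 0x41)
print(llc_reference, llc_misses)  # prints: r4f2e r412e
```

The resulting strings could then be passed to perf, for example as -e r4f2e:u,r412e:u, to request the events by code instead of by the cache-references and cache-misses aliases.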

I thought this could be an error in my tool, so I tried perf and got the same result. Also, I can use only 4 programmable PMCs, although the processor is supposed to have 8. If I try to use a 5th PMC it returns zero; the same happens with perf.

My processor is:

Intel(R) Xeon(R) CPU D-1540 @ 2.00GHz
Vendor    : GenuineIntel
Family    : 6
Model    : 6
Stepping: 2
Type    : OEM

The perf output is:
$ perf stat -I 1000 -e instructions:u,cycles:u,cache-references:u,cache-misses:u  ./benchmarks/spec2006/mcf06 
#           time             counts unit events
     1.000170323      1.513.051.613      instructions:u           
     1.000170323      1.966.105.738      cycles:u                 
     1.000170323         82.844.475      cache-references:u       
     1.000170323         82.845.042      cache-misses:u           
     2.000390792        985.271.999      instructions:u           
     2.000390792      2.597.388.127      cycles:u                 
     2.000390792         77.201.120      cache-references:u       
     2.000390792         77.200.636      cache-misses:u           
     3.000546036        928.783.029      instructions:u           
     3.000546036      2.597.151.649      cycles:u                 
     3.000546036         73.133.856      cache-references:u       
     3.000546036         73.133.954      cache-misses:u           
     4.000699910        906.864.990      instructions:u           
     4.000699910      2.597.354.693      cycles:u                 
     4.000699910         73.593.433      cache-references:u       
     4.000699910         73.593.252      cache-misses:u     

In my tool, LLC_misses is exactly equal to LLC_references; in perf there is a small difference because of the way perf works. I think this is either a bug in the processor or an error in the manual. Does anybody know about this error? Thanks in advance for your suggestions and comments.

 

7 Replies
Roberto_R_
Beginner

Hello Everyone,

I talked with the vendor, and he thinks my processor is damaged. I have to return my server and he will replace it; because the processor is soldered onto the motherboard, he has to replace the whole system.

With event code 0x2e, no matter which umask is used (0x4f, 0x41, 0x71, 0xf, 0x1 or 0x7f), the counter always returns the same value: not a fixed value, but the same count for every umask.

Replacing my server will take some weeks; I live in Spain and the processor has to be imported from the USA. Does any of you have access to this processor, and could you confirm whether my problems are hardware bugs caused by a faulty processor?

My problems are:

1. Event 0x2e always returns the same value, no matter which umask is used.

2. I have access to only 4 programmable counters (PMCs).

Thanks for your feedback.

 

McCalpinJohn
Honored Contributor III

Volume 3 of the Intel SW Developer's Manual (Table 18-37) says that you can only use all 8 counters per core if the system is booted with HyperThreading (sometimes called "Logical Processors") disabled.  With HyperThreading enabled, each "Logical Processor" can only access 4 of the 8 programmable core performance counters.

Some estimates of the expected cache miss rate values can be obtained by looking at the extensive measurements of the SPEC 2006 benchmarks reported at http://www.jaleels.org/ajaleel/workload/

For the mcf benchmark running the reference input set, the chart above suggests a cache miss rate of about 60 misses per 1000 instructions for a 256 KiB unified L2 cache and a miss rate of about 12 misses per 1000 instructions for a 12 MiB unified L3 cache. Looking at the results above, the first trial gave just under 83M misses in 1.5B instructions, which is a rate of about 55 misses per 1000 instructions (roughly one miss per 18 instructions). This is close to what you would expect without an L3 cache, so I would be concerned about the hardware. Unfortunately these processors are so new that there is not much documentation for other counters (such as memory controller counters) that might be useful for checking to see if the L3 is working correctly.
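The rate for each one-second interval can be checked with a few lines of arithmetic. This is only a sketch using the instructions:u and cache-misses:u counts copied from the perf output earlier in the thread; the function name mpki is illustrative.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per 1000 instructions."""
    return 1000.0 * misses / instructions


# (instructions:u, cache-misses:u) for the four intervals in the perf output.
intervals = [
    (1_513_051_613, 82_845_042),
    (985_271_999, 77_200_636),
    (928_783_029, 73_133_954),
    (906_864_990, 73_593_252),
]

for second, (instructions, misses) in enumerate(intervals, start=1):
    print(f"interval {second}: {mpki(misses, instructions):.1f} misses per 1000 instructions")
```

The first interval comes out at roughly 55 misses per 1000 instructions, far closer to the L2 figure of about 60 than to the figure of about 12 expected for a 12 MiB L3.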

An alternative check would be to use the SPEC CPU2006 cache hit rate charts above to pick a code and input set that are expected to have a modest miss rate for the 256 KiB L2, but (very close to a) zero miss rate for a 12 MiB L3 cache.   Examples include any of the input sets for 401.bzip2, any of the input sets for 464.h264ref, either of the input sets for 456.hmmer, the "checkspam" or "splitmail" inputs for 400.perlbench, and the 482.sphinx3 benchmark.

I would also run these tests with the hardware prefetchers disabled to see if that changes the counts in a useful way.  This is described at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

 

Roberto_R_
Beginner

John, thanks for your answer.

I actually have HT disabled:

$ lscpu | grep -i -E  "^CPU\(s\):|core|socket"
CPU(s):                8
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             1

I tried 464.h264ref because it is the one with the smallest working set. I took a sample every second.


nsample   pid       Instr      Cycles  LLC_requests  LLC_Misses  L2_requests  L2_misses  LLC_Usage (Bytes)
      1  1042  5493409057  2111392416      12256433    12256433     39790416   12257650           10125312
      2  1042  7325193416  2586067847      19203738    19203738     50093591   19204266           10846208
      3  1042  7401938326  2576543380      20886234    20886234     49512747   20886660           11862016
      4  1042  7473959529  2576339358      21857777    21857777     49560115   21858200           12255232

The LLC seems to work correctly. I am showing the LLC_usage result from Intel CMT, and the benchmark uses the whole cache. I can modify the way allocation through Intel CAT, and that also seems to work well: when I change the ways assigned to the running benchmark, LLC_usage changes. Of course, it would be great to have access to an off-chip event to check it.

I need LLC_misses to estimate the MPKI curves, but I don't know why that event is not working when everything else seems to work fine. Thanks again.
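To make the symptom explicit, here is a small sketch that derives the ratios from the four h264ref samples above. The numbers are copied from the table; the tuple layout and variable names are only for illustration.

```python
# (instructions, llc_requests, llc_misses, l2_requests, l2_misses) per sample.
samples = [
    (5493409057, 12256433, 12256433, 39790416, 12257650),
    (7325193416, 19203738, 19203738, 50093591, 19204266),
    (7401938326, 20886234, 20886234, 49512747, 20886660),
    (7473959529, 21857777, 21857777, 49560115, 21858200),
]

for n, (instr, llc_req, llc_miss, l2_req, l2_miss) in enumerate(samples, start=1):
    llc_miss_ratio = llc_miss / llc_req        # fraction of LLC requests counted as misses
    l2_access_pki = 1000.0 * l2_req / instr    # L2 accesses per 1000 instructions
    l2_mpki = 1000.0 * l2_miss / instr         # L2 misses per 1000 instructions
    print(f"sample {n}: LLC miss ratio {llc_miss_ratio:.3f}, "
          f"L2 accesses/1000 instr {l2_access_pki:.2f}, L2 MPKI {l2_mpki:.2f}")
```

Every sample gives an LLC miss ratio of exactly 1, and the LLC request count tracks the L2 miss count to within a few hundred events, which is what you would see if the umask were being ignored for event 0x2e.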

 

 

McCalpinJohn
Honored Contributor III

These results are certainly suggestive of a broken performance counter event.... :-(

The Broadwell processors should support the "Offcore Response Counter" events (0xB7 and 0xBB), which can be programmed to count LLC misses.  It requires a fair investment in time to understand how to program the counters (and some combinations that seem reasonable don't appear to work), but it may provide an alternative source of information.

From the 464.h264ref results, the L2 access rate is definitely higher than Jaleel's results: you are seeing about 7 L2 accesses per 1000 instructions, while Jaleel's charts show only about 2.6 L1 misses per 1000 instructions. The difference could be due to L1 prefetches (it is very hard to tell if the performance counters differentiate between L1 demand misses and L1 hardware prefetch misses), or perhaps due to significantly different code generation. (I don't know the h264ref code well enough to guess whether it is amenable to significant variability in code generation. Looking at instructions is always a bit risky, since compilers may optimize for execution time in ways that give very different instruction counts.) The L2 miss rate is also higher than Jaleel's results, but not by such a large ratio.

Roberto_R_
Beginner

John, thanks again for your answer.

Our hardware vendor found that this problem also appears on identical machines they have for testing (same CPU, same motherboard). We were wondering whether you have observed this performance-counter issue in other Broadwell processors at Intel.

Ji_X_
Beginner

Hi everyone,

I ran into this problem too. Is there a fix for it now?

Thanks,

Roberto_R_
Beginner

Hello Xu,

We have not solved the problem. We contacted our hardware vendor; they tested some machines they have for testing, and the error is also present on those machines. It seems to be a hardware error.

Roberto
