Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Bug? pcm-numa reports all DRAM access as remote on E5-2689 v4

Florian_H_
Beginner

I'm running the Intel Memory Latency Checker on a set of servers.
pcm-numa reports all memory accesses that occur as part of the test as "Remote DRAM Accesses",
but the latency matrix output makes it obvious that both local and remote accesses took place, so the output of pcm-numa is incorrect.

This occurs only on servers with E5-2689 v4 CPUs. Does anyone have an idea what could be causing it?

Running the latency checker:

# ./mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.1a
Command line parameters: --latency_matrix

Using buffer size of 200.000MB
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1
       0          87.1   127.2
       1         126.4    86.6
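
(The diagonal entries are local-node latencies, ~87 ns; the off-diagonal entries are remote, ~127 ns, a ratio of roughly 127.2 / 87.1 ≈ 1.46. Since the matrix test measures every node pair, it necessarily generates both local and remote accesses.)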

Output of reading performance counters in parallel:

# ./pcm-numa.x 30

 Intel(r) Performance Counter Monitor: NUMA monitoring utility
 Copyright (c) 2009-2016 Intel Corporation

Number of physical cores: 20
Number of logical cores: 20
Number of online logical cores: 20
Threads (logical cores) per physical core: 1
Num sockets: 2
Physical cores per socket: 10
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 8
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 3100000000 Hz
Package thermal spec power: 165 Watt; Package minimum power: 60 Watt; Package maximum power: 309 Watt;
Socket 0: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Socket 1: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Trying to use Linux perf events...
Can not use Linux perf because OffcoreResponse counter usage requested. Falling-back to direct PMU programming.
Socket 0
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)
Socket 1
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)

Detected Intel(R) Xeon(R) CPU E5-2689 v4 @ 3.10GHz "Intel(r) microarchitecture codename Broadwell-EP"
Update every 30.0 seconds
Time elapsed: 30743 ms
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses
   0   0.87         98 G      113 G         0                 837 K
   1   0.21         23 G      113 G         0                  77 M
   2   0.29         32 G      113 G         0                 209 K
   3   0.29         32 G      113 G         0                 113 K
   4   0.29         32 G      113 G         0                 239 K
   5   0.29         32 G      113 G         0                 275 K
   6   0.29         32 G      113 G         0                 400 K
   7   0.29         32 G      113 G         0                 234 K
   8   0.29         32 G      113 G         0                 216 K
   9   0.28         32 G      113 G         0                  82 K
  10   0.88        100 G      113 G         0                 264 K
  11   0.21         24 G      113 G         0                  76 M
  12   0.28         32 G      113 G         0                3097
  13   0.28         32 G      113 G         0                2115
  14   0.28         32 G      113 G         0                1888
  15   0.28         32 G      113 G         0                1880
  16   0.28         32 G      113 G         0                1898
  17   0.28         32 G      113 G         0                1861
  18   0.28         32 G      113 G         0                1878
  19   0.28         32 G      113 G         0                2134
-------------------------------------------------------------------------------------------------------------------
   *   0.34        765 G     2274 G         0                 157 M
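
(Note that cores 1 and 11, presumably the MLC measurement threads, each show ≈77 M "remote" accesses, while the Local DRAM column reads zero for every core across the whole 30 s interval, even though the latency matrix above includes local-node measurements.)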

 

Roman_D_Intel
Employee

Hi Florian,

thanks for the report. I am trying to reproduce it. Could you send/attach your /proc/cpuinfo file? Did you disable or offline the cores on the second socket? I see only 20 logical cores in the pcm output.

Thank you,

Roman

Florian_H_
Beginner

Hi Roman,

there are 20 cores because Hyper-Threading is disabled. /proc/cpuinfo is attached.
I'm using Intel® PCM version 2.11.

Let me know if you need any additional information.

Regards,

Florian

Roman_D_Intel
Employee

Hi Florian,

I successfully reproduced the issue with PCM V2.11. Unfortunately, this version uses the wrong CPU event codes for Broadwell-EP. The next (unreleased) PCM version fixes this. As an immediate workaround, in pcm-numa.cpp change the OffcoreResponse MSR values for the PCM::BDX case to:

conf.OffcoreResponseMsrValue[0] = 0x0604008FFF;
conf.OffcoreResponseMsrValue[1] = 0x067BC08FFF;

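For reference, these assignments live in the CPU-model switch in pcm-numa.cpp. A rough sketch of the patched section (abbreviated; cpu_model stands in for however PCM identifies the processor, and the surrounding cases are omitted):

switch (cpu_model)
{
// ... cases for other microarchitectures ...
case PCM::BDX:
    // Corrected OFFCORE_RESPONSE event encodings for Broadwell-EP:
    conf.OffcoreResponseMsrValue[0] = 0x0604008FFF; // responses served from local DRAM
    conf.OffcoreResponseMsrValue[1] = 0x067BC08FFF; // responses served from remote DRAM
    break;
// ...
}

pcm-numa programs two OFFCORE_RESPONSE events, one counting responses served from local DRAM and one from remote DRAM; with the old encodings the local-DRAM filter apparently never matched on Broadwell-EP, which is why the Local column stayed at zero.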

Please let me know if the fix works for you.

Thanks,

Roman

Florian_H_
Beginner

Fixed, thank you!

Florian
