Software Tuning, Performance Optimization & Platform Monitoring

Reading IMC uncore counters, surprising output

Jordi_V_ — Fri, 12 May 2017 08:20:01 GMT

Hi,

I'm trying to measure the memory bandwidth of some applications on a dual-socket Xeon E5 v3 system, but I'm getting some confusing results. These are the steps I follow to read the counters:

1) I read these devices and functions (device IDs in parentheses):
7f:14.0 and 7f:14.1 (0x2FB0 and 0x2FB1)
7f:17.0 and 7f:17.1 (0x2FD0 and 0x2FD1)
ff:14.0 and ff:14.1 (0x2FB0 and 0x2FB1)
ff:17.0 and ff:17.1 (0x2FD0 and 0x2FD1)

There are 2 devices per bus (one bus per socket, I suppose) and 2 functions per device, for a total of 4 channels per socket, I suppose.

2) I freeze each box and reset its controls and counters. Then, when I want to start the measurements, I set the event and mask in the CTL registers and enable them.

3) I freeze every BOX and read the CTR counters (* 64).
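
The "* 64" in step 3 can be sketched as below. This is only an illustration, assuming the programmed event counts CAS commands and that each CAS moves one 64-byte cache line; the function name `cas_to_bandwidth` and the elapsed-time argument are hypothetical and not part of the original measurement code.

```python
CACHE_LINE_BYTES = 64  # assumption: one DRAM CAS read or write moves one 64-byte line


def cas_to_bandwidth(cas_count, elapsed_seconds):
    """Convert a raw CAS count into (total bytes, bytes per second)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_bytes = cas_count * CACHE_LINE_BYTES
    return total_bytes, total_bytes / elapsed_seconds
```

Summing the byte totals over every channel of a socket would then give that socket's traffic for the measured interval.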

It seems logical, but when I try to interpret the results, something does not add up. I'm testing with a dummy work function (a simple matrix multiplication). The output looks something like this:

ff/14.0: 727
ff/14.1: 1341
ff/17.0: 763
ff/17.1: 1336
7f/14.0: 67062
7f/14.1: 3382
7f/17.0: 66813
7f/17.1: 3463

As you can see, two IMCs (on the same socket, I suppose) are transferring data, but each is using only its channel 0, not its channel 1. As far as I understand, a CPU tries to take advantage of all of its channels. In my case, I have 4 channels per CPU (socket), but only two are moving data to and from main memory.

If I use a third-party library such as PAPI for the same measurements, the results are completely different: all 4 channels are active for the same dummy work function.

The results differ for stress applications, where I want to reach the maximum bandwidth. There, my measurements and the third-party measurements are very similar, which surprises me even more.

So the question is: what could be happening? Am I reading the counters incorrectly? Or is the third-party library getting wrong results, even though they look more plausible?

Thank you.

McCalpinJohn — Fri, 12 May 2017 20:02:32 GMT

Those are very small numbers...

The IMC counters will count traffic from all sources (user code, OS code, and IO), so there is always some amount of background activity to be dealt with.

I recommend using something like STREAM to test memory controller counters. If compiled with streaming stores (and an adequately large array size), the number of DRAM reads is about:

(6 arrays read * 8 Bytes per array element * STREAM_ARRAY_SIZE elements per array * NTIMES repetitions) / 64 Bytes per DRAM CAS read

The number of DRAM writes is 4/6 of the number of reads.

If the code is compiled without streaming stores, the number of DRAM reads is increased from 6 to 10 (because the store targets must be read from memory before being overwritten), while the number of DRAM writes is unchanged.
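
As a rough sketch of the arithmetic above, the expected CAS counts can be computed as follows. The function name `stream_expected_cas` and its parameter names are hypothetical; the factors (6 or 10 arrays read, 4 arrays written, 8 bytes per element, 64 bytes per CAS) are the ones given in this post.

```python
def stream_expected_cas(array_size, ntimes, streaming_stores=True,
                        bytes_per_element=8, line_bytes=64):
    """Return the approximate (reads, writes) in DRAM CAS operations for STREAM.

    Excludes initialization and validation traffic.
    """
    # With streaming stores only the source arrays are read (6 per iteration);
    # without them the 4 store targets are also read before being overwritten.
    arrays_read = 6 if streaming_stores else 10
    arrays_written = 4
    bytes_per_array_pass = array_size * bytes_per_element
    reads = arrays_read * bytes_per_array_pass * ntimes // line_bytes
    writes = arrays_written * bytes_per_array_pass * ntimes // line_bytes
    return reads, writes
```

For example, `stream_expected_cas(80_000_000, 10)` gives 600,000,000 reads and 400,000,000 writes, and passing `streaming_stores=False` raises the reads to 1,000,000,000 while leaving the writes unchanged.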

There is some additional DRAM traffic associated with both the initialization of the arrays (which is hard to summarize) and the validation of the results (requiring 1 more read of each of the 3 arrays). If you run with NTIMES=10 and NTIMES=20 and take the difference in the DRAM counts, it should be very close to the traffic for 10 iterations (i.e., this should cancel out the initialization and validation traffic).

With data on small pages, there is typically a small excess in read traffic for loading page table entries -- 3% is typical.

With streaming stores, there is typically a small increase in writes due to prematurely flushed write combining buffers -- usually less than 1%.
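
One way to apply the differencing idea and the expected overheads is sketched below. The helper names and all counter totals in the example are hypothetical placeholders chosen for illustration, not measurements from any system.

```python
def steady_state_delta(count_short_run, count_long_run):
    """Subtract two runs so the fixed initialization/validation traffic cancels."""
    return count_long_run - count_short_run


def relative_excess(measured, expected):
    """Fraction by which the measured count exceeds the model's prediction."""
    return (measured - expected) / expected


# Hypothetical summed read CAS counts for NTIMES=10 and NTIMES=20 (placeholders).
reads_ntimes_10 = 655_000_000
reads_ntimes_20 = 1_273_000_000

# Model for 10 iterations with streaming stores and an assumed
# STREAM_ARRAY_SIZE of 80,000,000: 6 * 8 * 80_000_000 * 10 / 64.
expected_reads = 600_000_000

measured_reads = steady_state_delta(reads_ntimes_10, reads_ntimes_20)
read_excess = relative_excess(measured_reads, expected_reads)
print(f"read excess: {read_excess:.1%}")
```

An excess of a few percent on reads (page table loads) or under 1% on writes (flushed write-combining buffers) would be in line with the figures above; a much larger gap suggests the counters are not programmed or read as intended.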

Jordi_V_ — Tue, 16 May 2017 11:20:07 GMT

Sorry for the delay, I had login problems.

I had used STREAM to test before, and it worked as expected. The problem appears only in small test applications like this "dummy work function". Although the bandwidth used is small, it makes no sense to me that only 2 of the 4 channels are in use. Maybe I'm missing some theory, but as I understand it, the CPU will try to spread that data across all 4 channels.

What could be happening in low-bandwidth applications? Why do high-bandwidth applications behave as expected (using all the expected channels)?

Thank you.

McCalpinJohn — Wed, 17 May 2017 15:43:56 GMT

It is very hard to interpret performance counter results unless you have a model of what to expect...

With numbers this small, it is entirely possible that the small values are the "signal" and the larger values are "noise" due to OS polling on a small number of specific addresses (each of which will map to a single memory controller, of course).

One way to test this is to run a few benchmarks that are expected to have no memory traffic (other than loading the program text) and monitor the range of DRAM accesses as a function of runtime and the number of cores used.
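
A minimal sketch of that kind of background characterization follows, assuming you have already collected per-channel CAS counts and runtimes from benchmarks that should generate no memory traffic. The function names, the data layout, and the sample values are all hypothetical.

```python
def background_rates(samples):
    """Return {channel: (min_rate, max_rate)} in CAS operations per second.

    samples maps a channel label to a list of (runtime_seconds, cas_count)
    pairs taken from runs that should not touch DRAM.
    """
    rates = {}
    for channel, runs in samples.items():
        per_second = [count / runtime for runtime, count in runs]
        rates[channel] = (min(per_second), max(per_second))
    return rates


def exceeds_background(cas_count, runtime_seconds, rate_range):
    """True if a measurement's rate is above the highest observed background rate."""
    _, max_rate = rate_range
    return cas_count / runtime_seconds > max_rate


# Placeholder numbers for two channels, not real measurements.
idle_samples = {
    "ch0": [(1.0, 60000), (2.0, 130000)],
    "ch1": [(1.0, 3000), (2.0, 7000)],
}
noise = background_rates(idle_samples)
```

A count from the real workload is only worth interpreting on channels where `exceeds_background` returns True; repeating the idle runs with different runtimes and core counts shows how stable the background range is.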