Link between Intel PCM and SDM Volume 3 for memory bandwidth monitoring

Manuel_S_ · 06-07-2016

Hi all,

On a single-socket Haswell processor (06_60) I can get memory read and write bandwidth by calling the getSocketCounterState() method with socket id = 0. Running pcm.x also reports these memory read and write bandwidths at the socket level.

I now want to understand which events are used to get these values. More precisely, I want to know which events in SDM Volume 3, chapter 19.4, are used (one for memory writes and one for memory reads). Unfortunately, looking at the PCM source code, I was not able to find this information. The deepest place in the code I reached is the following, in client_bw.h:

#define PCM_CLIENT_IMC_BAR_OFFSET       (0x0048)
#define PCM_CLIENT_IMC_DRAM_IO_REQESTS  (0x5048)
#define PCM_CLIENT_IMC_DRAM_DATA_READS  (0x5050)
#define PCM_CLIENT_IMC_DRAM_DATA_WRITES (0x5054)
#define PCM_CLIENT_IMC_MMAP_SIZE        (0x6000)

I guess these definitions are related to my question, but I can't yet understand where those numbers come from. Can anyone give hints on how memory bandwidth monitoring is implemented on a single-socket Haswell?

Thanks for your help,

Manu

Note: I am not able to run the ./pcm-memory.x binary, while I can run ./pcm.x. I guess it's because I have a single socket and pcm-memory seems to be for multi-socket systems only, but the error message is not clear on that.

Patrick_L_Intel · 06-07-2016

Hi Manuel,

Please refer to https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel for how to use the client memory performance counters.

PCM-memory only supports the server uncore, so you will not be able to use it with a client system.

Sincerely,

Patrick

Manuel_S_ · 06-07-2016

Hi Patrick,

Thank you very much for the accurate and fast answer. If my understanding is correct, it means accessing these counters is totally different from accessing all the other "PMU ones" such as those described in chapters 18 and 19 of the SDM. It seems that these memory counters are not documented at all in the SDM. Is this true, and if so, why?

Best regards and thanks again,

Manu

McCalpinJohn · 06-07-2016

The "Uncore" performance counters in the server parts are documented in separate volumes, not in the SWDM.

For Xeon E5 v3 (Haswell EP) the document is called the "Intel Xeon Processor E5 and E7 v3 Family Uncore Performance Monitoring Reference Manual". It is Intel document 331051 and right now it is available at http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-uncore-performance-monitoring.html

Manuel_S_ · 06-07-2016

Hi John,

Thank you for the link. Nevertheless, I still have questions, and I think I also need to clarify the terminology Intel uses for performance counters.

My machine is a single-socket Haswell with family and model numbers equal to 06_60 in decimal. As such, I guess it is not a Haswell EP and thus is not included in what you call "server parts". Is that correct?

My machine is thus called a "client"; is that the terminology used by Intel?

Thank you again all for your help,

Manu

McCalpinJohn · 06-08-2016

The terminology is sometimes tricky and sometimes changes from generation to generation.

You are using the DisplayFamily_DisplayModel to figure out which part you have -- this is exactly the right way to do it.

There are at least two different "uncore" implementations in use, and these are what lead to the difference in "uncore" counters. One of the "uncore" implementations is typically associated with "client" parts (e.g., Core i3/i5/i7), but is also used for the Xeon E3 server processors. The other major "uncore" implementation is typically associated with "server" parts (e.g., Xeon E5), but is also used for a few high end "Core i7" products (e.g., Core i7-58xx and Core i7-59xx). (The easiest way to distinguish between the two is that the "client" uncore supports 2 DRAM channels and the "server" uncore supports 4 DRAM channels, but these details may change in the future.)

A good reference for the different models is Table 35-1 in Volume 3 of the SWDM. Translating your "06_60" from decimal to hex, we get 06_3CH, which is described in Table 35-1 as "4th Generation Intel Core processor and Intel Xeon processor E3-1200 v3 product family based on Haswell microarchitecture". Based on this 06_3CH, the uncore performance counter events in Table 19-9 will apply to this processor.
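The decimal-to-hex translation above follows from how the DisplayFamily_DisplayModel pair is derived from the CPUID signature. A minimal sketch of that decoding, assuming the raw CPUID.01H:EAX value is already in hand (the struct and function names are illustrative, not from PCM or the SDM):

```c
#include <stdint.h>

/* Decoded fields of the CPUID.01H:EAX processor signature. */
struct cpu_signature {
    unsigned display_family;
    unsigned display_model;
    unsigned stepping;
};

/* Apply the DisplayFamily / DisplayModel rules given with the CPUID
 * instruction description: the extended fields only take part for
 * family 0x6 (model) and family 0xF (family and model). */
struct cpu_signature decode_signature(uint32_t eax)
{
    unsigned stepping   = eax & 0xF;
    unsigned model      = (eax >> 4) & 0xF;
    unsigned family     = (eax >> 8) & 0xF;
    unsigned ext_model  = (eax >> 16) & 0xF;
    unsigned ext_family = (eax >> 20) & 0xFF;

    struct cpu_signature sig;
    sig.stepping = stepping;
    sig.display_family = (family == 0xF) ? family + ext_family : family;
    sig.display_model = (family == 0x6 || family == 0xF)
                            ? (ext_model << 4) + model
                            : model;
    return sig;
}
```

For a signature such as 0x000306C3 this gives family 0x06 and model 0x3C, which is 60 in decimal, i.e. the "06_60" quoted earlier in the thread.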

Unfortunately, the memory controller performance counters described at https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel are much more difficult to use. When I tested them on an earlier processor (Xeon E3-1270 -- a Sandy Bridge single-socket server based on the "client" uncore) they appeared to work correctly. They are only 32 bits wide, so they have to be read fairly frequently to prevent loss of information due to wrap-around, but the bigger problem is with software support. My test code opened the Linux /dev/mem device driver and used an "mmap()" call to get a pointer to the beginning of the IMC BAR region. I could then read the counter values by loading 32-bit integer values using offsets from this pointer. The only good news is that these counters don't need to be programmed, so I can open /dev/mem in read-only mode, but this is still meddling with the hardware at a level where mistakes can cause serious trouble. So the counters exist and they can be used, but software support is limited -- at least I presume that some software product can use these counters (maybe VTune?) otherwise they would probably not have been documented.....
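The /dev/mem approach described above can be sketched roughly as follows. This is not the test code referred to in the post; it is a minimal illustration that reuses the offsets quoted from client_bw.h earlier in the thread. The function names are hypothetical, and `bar_phys` is assumed to be the page-aligned physical base of the IMC BAR, which the caller must already have obtained (presumably via the register that PCM_CLIENT_IMC_BAR_OFFSET refers to; the linked Intel article is the place to confirm that step).

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Offsets and size as quoted from PCM's client_bw.h in this thread. */
#define IMC_DRAM_DATA_READS   0x5050
#define IMC_DRAM_DATA_WRITES  0x5054
#define IMC_MMAP_SIZE         0x6000

/* Map the IMC BAR region read-only through /dev/mem.
 * bar_phys: page-aligned physical base address (assumed known to the caller).
 * Returns NULL on failure; typically needs root. */
const volatile uint8_t *map_imc_bar(uint64_t bar_phys)
{
    int fd = open("/dev/mem", O_RDONLY);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, IMC_MMAP_SIZE, PROT_READ, MAP_SHARED,
                   fd, (off_t)bar_phys);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    return (p == MAP_FAILED) ? NULL : (const volatile uint8_t *)p;
}

/* Read one counter with a single aligned 32-bit load at base + offset. */
uint32_t read_counter32(const volatile uint8_t *base, size_t offset)
{
    return *(const volatile uint32_t *)(base + offset);
}
```

A caller would map once, then sample `read_counter32(base, IMC_DRAM_DATA_READS)` and `read_counter32(base, IMC_DRAM_DATA_WRITES)` periodically. Because nothing is written, the mapping can stay read-only, as noted in the post.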

Manuel_S_ · 06-08-2016

Thank you again, John, for sharing your experience in such a complete answer. I have already spent *a lot* of time in this SDM and never noticed Table 35-1; it's now printed and pinned to my office wall!

My initial goal (the one that initiated this post) was to count the number of bytes read from and written to the memory of my 06_60 based workstation. According to your answer *and* to the code I saw in the Intel PCM library (doing exactly what you described above: opening /dev/mem and calling mmap), it should theoretically be possible to read the number of memory requests (and thus compute effective bandwidth) using these "memory-mapped counters", which don't need to be programmed, as opposed to the "classic" PMU model-specific registers.

Now, following your advice that this solution might not be reliable in all cases (I am not sure I understood why from your previous answer), I am wondering how I can use "classic" PMU MSRs to measure the bandwidth. I can't see any events in Table 19-9 (which you mentioned) related to the number of requests served by the integrated memory controller. Am I missing something?
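As a worked example of turning request counts into bandwidth, here is a minimal sketch. It assumes each counter increment corresponds to one 64-byte cache-line transfer; that unit is an assumption to check against the Intel article linked earlier, and the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one counter increment = one 64-byte cache line moved. */
#define BYTES_PER_REQUEST 64u

/* Convert two samples of the same 32-bit request counter, taken
 * elapsed_seconds apart, into bytes per second. */
double bandwidth_bytes_per_sec(uint32_t before, uint32_t after,
                               double elapsed_seconds)
{
    assert(elapsed_seconds > 0.0);
    /* Unsigned subtraction is modulo 2^32, so a single wrap between
     * the two samples still yields the right difference. */
    uint32_t requests = after - before;
    return (double)requests * BYTES_PER_REQUEST / elapsed_seconds;
}
```

For instance, 15,625,000 read requests observed over one second would correspond to 15,625,000 x 64 = 1,000,000,000 bytes/s, or 1 GB/s of read bandwidth under that assumption.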

Finally, I know there are also "core classic PMU MSRs" that can count the number of read and write requests from the core with a particular kind of RESPONSE. Maybe the memory bandwidth could be computed from those as well. But this would require programming the PMU MSRs on all the cores and then summing the results. Also, this would only account for memory requests explicitly coming from the cores, and not from the cores' prefetchers nor from other devices such as the GPU or I/O devices. Is that correct?

-Manu

McCalpinJohn · 06-09-2016

I have not seen any evidence that the memory-mapped DRAM counters in the "client" uncore can give bad answers, but you do have to be more careful to read the counters frequently with these narrower counters. A common approach is to read the counters with another thread and keep the accumulated deltas in a 64-bit data structure. It is also convenient to include this as part of the kernel scheduler interrupt, since that can be used to save/restore performance counters on interrupts and maintain independent 64-bit "virtual counters" for each process that you want to measure. Some of my monitoring codes use a separate user-space process that reads the counters, sleeps for a while, reads the counters, etc., and then dumps the accumulated counts at the end of the job. This works fine, but requires extra thought and extra code, and (most importantly) does not allow me to do my "tricks" to get very low-overhead counter reads.
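The "accumulated deltas in a 64-bit data structure" idea can be sketched as below. The type and function names are made up for illustration. The result is only correct if the counter advances by fewer than 2^32 counts between two consecutive samples; if one count is a 64-byte line (an assumption), that is 2^32 x 64 B = 256 GiB of traffic, or roughly 14 seconds at a sustained 20 GB/s.

```c
#include <stdint.h>

/* A 64-bit "virtual counter" layered over a 32-bit hardware counter. */
struct counter64 {
    uint32_t last_raw;  /* most recent raw 32-bit sample */
    uint64_t total;     /* counts accumulated since counter64_init() */
};

/* Start accumulating from the current raw value. */
void counter64_init(struct counter64 *c, uint32_t raw_now)
{
    c->last_raw = raw_now;
    c->total = 0;
}

/* Fold a new raw sample into the running total and return it.
 * Must be called often enough that the hardware counter wraps
 * at most once between calls. */
uint64_t counter64_update(struct counter64 *c, uint32_t raw_now)
{
    uint32_t delta = raw_now - c->last_raw;  /* wraps modulo 2^32 */
    c->last_raw = raw_now;
    c->total += delta;
    return c->total;
}
```

A sampling thread or helper process would call `counter64_update()` in its read/sleep loop and report `total` at the end of the job.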

I can get very low overhead with the core counters if the kernel sets the CR4.PCE bit (which is set in recent Linux kernels). This allows user-space execution of the RDPMC instruction, which has an overhead of ~25-40 cycles (depending on the particular processor) vs 2500 or more cycles to read the counters through the Linux "perf events" API. I can get low overhead for memory-mapped IO counters (in either PCI configuration space or in the memory-mapped BARs) by opening /dev/mem and doing an mmap() to get a user-space pointer to the region. When I execute 32-bit loads using offsets from that pointer, the hardware turns them into uncached loads, so they are not blindingly fast, but they are still something like 10x faster than going through the kernel APIs. The only performance counters that can't be read using these low-overhead tricks are the MSR-only counters, since the RDMSR instruction can only be executed in kernel mode. The architecture of the Linux /dev/cpu/*/msr device drivers is not helpful here, since you can only do one MSR read per call. (There is a "length" field, but if it is greater than 8 Bytes the driver just reads the same MSR repeatedly and returns the result from the final read -- a completely useless "feature"). The only way to reduce the overhead for these MSR performance counter reads is to build another device driver that can return the results of many reads in a single call. This is not hard if you don't do any error checking, but requires a lot of experience to get right if you want it to be robust enough to be deployable. (My primary concerns are about the safety of having the kernel copy blocks of data back to user space -- if the user provides a pointer to the wrong location, the elevated privileges of the kernel could allow overwriting important things. I have some personal experience at hanging a system by making errors of this class....)
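For reference, the one-MSR-per-call pattern with the stock Linux msr driver looks roughly like the sketch below: the file offset selects the MSR number and each read returns one 8-byte value. The helper names are illustrative; the code assumes the msr kernel module is loaded and the caller has sufficient privileges.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/dev/cpu/<cpu>/msr" into buf; returns what snprintf() returns. */
int msr_device_path(char *buf, size_t len, int cpu)
{
    return snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
}

/* Read one 64-bit MSR on one logical CPU.
 * Returns 0 on success, -1 on any failure (missing device,
 * insufficient privileges, or an MSR the CPU rejects). */
int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    msr_device_path(path, sizeof path, cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* One system call per MSR: the offset is the MSR address. */
    ssize_t n = pread(fd, value, sizeof *value, (off_t)msr);
    close(fd);
    return (n == (ssize_t)sizeof *value) ? 0 : -1;
}
```

Each call costs an open, a pread, and a close; keeping the descriptor open across reads saves two of the three, but it is still one kernel round trip per MSR, which is the overhead discussed above.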

I agree that the set of "uncore" events in Table 19-9 does not look helpful for getting memory bandwidth.

The best way to get the bandwidth is via the memory-mapped counters, but you can get approximations using the "offcore response" events in the core performance counters. Some of the control bits are broken on Haswell (i.e., compare Tables 18-36 and 18-47 in Vol 3 of the SWDM), and some transactions (like writebacks from L3 to memory) don't really fit into the model of "request"/"response" used by the offcore response events. Disabling the hardware prefetchers (https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors) helps the offcore response events make more sense.
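As an illustration of the prefetcher control mentioned above, the sketch below only computes the new register value; it does not write it. The MSR address (0x1A4) and the bit assignments are stated here from memory of the linked Intel article and should be verified there before use. The names are hypothetical.

```c
#include <stdint.h>

/* Assumption: prefetcher control MSR address per the linked article. */
#define MSR_PREFETCH_CONTROL 0x1A4u

/* Assumption: a set bit DISABLES the corresponding prefetcher. */
enum prefetcher_bit {
    PF_L2_HW       = 1u << 0,  /* L2 hardware prefetcher */
    PF_L2_ADJACENT = 1u << 1,  /* L2 adjacent cache line prefetcher */
    PF_DCU         = 1u << 2,  /* L1 data cache (DCU) prefetcher */
    PF_DCU_IP      = 1u << 3,  /* L1 data cache IP-based prefetcher */
    PF_ALL         = 0xFu
};

/* Return the MSR value with the selected prefetchers disabled,
 * leaving every other bit of the current value untouched. */
uint64_t prefetch_disable(uint64_t current, unsigned mask)
{
    return current | (uint64_t)(mask & PF_ALL);
}

/* Return the MSR value with the selected prefetchers re-enabled. */
uint64_t prefetch_enable(uint64_t current, unsigned mask)
{
    return current & ~(uint64_t)(mask & PF_ALL);
}
```

The computed value would then have to be written to the MSR on every core being measured, and the original value restored afterwards.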

You are correct that even if these offcore response events could measure all types of core-related memory traffic and you measured the events using all of the cores, you would still not see DRAM traffic from any type of IO. The memory-mapped DRAM counters can count all traffic. There are also counters that report CPU, GPU, and other IO separately, but I have not looked at these to see if the results make sense. Based on where they are and what distinctions they are trying to make, I would guess that they are probably good, but I have often been wrong about what is easy to get right and what is difficult to get right. In any case it should not be that hard to test them -- at least qualitatively.

Manuel_S_ · 06-09-2016

Thank you very much, the picture is (almost) perfectly clear for me now.

One last question (maybe this is not the correct place to ask, because it is no longer related to the original question, so let me know if I should move it): what do you mean by "the MSR-only counters"? In other words, from the list of events in the Intel SDM, how can I identify which of them can be accessed through RDPMC in user space and which cannot?

McCalpinJohn · 06-09-2016

The MSR-only counters are the uncore counters accessed by MSRs, such as those described in Table 19-9 of volume 3 of the SWDM.

The other uncore counters use memory-mapped access (either in the PCI config space address range or in PCI BARs). These are usually accessed through a kernel device driver, but the hardware does allow user-mode access to these addresses if the kernel driver allows a user-mode mmap() to the corresponding address range.

The core performance counters (both fixed-function and programmable) can be accessed by MSRs (kernel only) or by RDPMC (kernel or user space if CR4.PCE=1).
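A minimal sketch of a user-space RDPMC read, assuming GCC or Clang inline assembly on x86 (the function names are illustrative). ECX selects the counter: programmable counters use their index, and setting bit 30 selects the fixed-function counters instead.

```c
#include <stdint.h>

/* ECX bit 30 set selects the fixed-function counter bank. */
#define RDPMC_FIXED_FLAG (1u << 30)

/* ECX selector for programmable (general-purpose) counter n. */
uint32_t rdpmc_select_programmable(unsigned n)
{
    return n;
}

/* ECX selector for fixed-function counter n. */
uint32_t rdpmc_select_fixed(unsigned n)
{
    return RDPMC_FIXED_FLAG | n;
}

#if defined(__x86_64__) || defined(__i386__)
/* Execute RDPMC and combine EDX:EAX into one 64-bit value.
 * In user space this faults (SIGSEGV on Linux) if CR4.PCE is 0. */
uint64_t rdpmc(uint32_t selector)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(selector));
    return ((uint64_t)hi << 32) | lo;
}
#endif
```

For example, `rdpmc(rdpmc_select_fixed(1))` would read the fixed-function core-cycles counter, provided the kernel has enabled that counter and user-mode RDPMC.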