memory bandwidth in vtune

oliver_cscs · ‎08-24-2012

Hello,
I am using Intel VTune Amplifier XE 2011 to measure the performance of our HPC applications on a cluster of Intel Xeon E5-2670 processors. One of the most important metrics for us is memory bandwidth, and it would be interesting to know how it is measured. Which performance counters are involved, and what formula is used to calculate this derived metric?

Thank you in advance,

Oliver

perfwise · ‎08-27-2012

Not directly answering your question.. but you could monitor PMC 0x34 and tabulate the total # of:

misses in the L3: Read and Write Invalidate requests

writebacks from L3: not instantaneously attributable to Memory bandwidth but you can monitor the # of writes to the L3 which are modified.

I believe by monitoring this.. you're close to the memory bandwidth in most applications. I don't know how you'd detect non-temporal store traffic though.. but that's not very common.

Perfwise

Roman_D_Intel · ‎08-27-2012

Hi Oliver,

Intel Xeon E5-2670 has a hardware performance monitor in the integrated memory controller enabling direct measurement of consumed memory bandwidth. For details you can consult the uncore performance monitoring manual. Table 2-65 "Metrics derived from iMC Events" contains formulas for memory bandwidth computation.
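To illustrate the kind of formula involved, here is a minimal sketch of the arithmetic. It assumes the iMC read and write CAS counters each count one 64-byte cache-line transfer per event, which is how I understand the derived metrics in the uncore guide; check the table itself before relying on it. The function name and the sample counter values are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # assumption: each CAS command moves one 64-byte cache line


def memory_bandwidth_gbs(cas_reads, cas_writes, seconds):
    """Return (read, write, total) bandwidth in GB/s from CAS count deltas.

    cas_reads and cas_writes are the increases in the read and write CAS
    counters over the sampling interval, summed over all memory channels.
    """
    if seconds <= 0:
        raise ValueError("interval must be positive")
    read_gbs = cas_reads * CACHE_LINE_BYTES / seconds / 1e9
    write_gbs = cas_writes * CACHE_LINE_BYTES / seconds / 1e9
    return read_gbs, write_gbs, read_gbs + write_gbs


if __name__ == "__main__":
    # Made-up counter deltas for a 1-second sample.
    r, w, t = memory_bandwidth_gbs(250_000_000, 125_000_000, 1.0)
    print(f"read {r:.1f} GB/s, write {w:.1f} GB/s, total {t:.1f} GB/s")
```

The counters are per channel, so the deltas need to be summed across all channels and sockets of interest before calling the function.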

Best regards,

Roman

oliver_cscs · ‎08-28-2012

Thank you, these answers are very helpful.

Just one more question concerning Intel PCM: We need to analyze our code on production systems and it is therefore not an option to use anything that requires root access.

Do you know if there is a way to run Intel PCM in user space?

Regards,

Oliver

Roman_D_Intel · ‎08-28-2012

Hi Oliver,

For the Intel PCM on-core metrics (IPC, cache statistics, etc) it would be sufficient to adjust the permissions/ownership of /dev/cpu/*/msr file devices to allow non-root access (with chown/chmod).
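Once the device permissions allow it, reading a register from user space is a plain positioned read. The sketch below is not Intel PCM code; it only shows the mechanism, assuming the standard Linux msr driver, where the file offset is the MSR address and each read returns 8 bytes. The function name and the `device_template` parameter are hypothetical. Newer kernels may additionally require the CAP_SYS_RAWIO capability, in which case file permissions alone are not enough.

```python
import os
import struct


def read_msr(register, cpu=0, device_template="/dev/cpu/{cpu}/msr"):
    """Read one 64-bit MSR through the Linux msr character device.

    The driver interprets the file offset as the MSR address and
    returns the register value as 8 little-endian bytes.
    """
    path = device_template.format(cpu=cpu)
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {path} at offset {register:#x}")
    return struct.unpack("<Q", raw)[0]
```

For example, `read_msr(0x10, cpu=0)` would return the time-stamp counter of CPU 0 (MSR 0x10), provided the calling user can open the device.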

But for the uncore metrics on Intel Xeon E5-2670, such as memory bandwidth, root access is required. Otherwise the metrics will always show zeros.

Since Intel PCM is an experimental SDK (open source sample code), you can write your own monitoring daemon based on Intel PCM that runs under the root account and sends the processor metrics to your non-root user processes. This would be an alternative.
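One simple way to hand metrics from a root daemon to unprivileged processes is a world-readable snapshot file that is replaced atomically. The sketch below shows only that hand-off; the metric collection itself (through Intel PCM or otherwise) is not shown, and the function names and the JSON layout are hypothetical.

```python
import json
import os
import tempfile


def publish_metrics(metrics, path):
    """Privileged side: atomically replace `path` with a JSON snapshot."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metrics, f)
        os.chmod(tmp, 0o644)   # make the snapshot readable by non-root users
        os.replace(tmp, path)  # rename is atomic: readers never see a partial file
    except BaseException:
        os.unlink(tmp)
        raise


def read_metrics(path):
    """Unprivileged side: load the most recent snapshot."""
    with open(path) as f:
        return json.load(f)
```

The daemon would call `publish_metrics` once per sampling interval, and user processes would call `read_metrics` whenever they need the latest values. A socket or shared memory would work as well; a file is just the least code.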

Best regards,
Roman

McCalpinJohn · ‎09-26-2012

Making /dev/cpu/*/msr world writable certainly opens up the opportunity for killing the system in a wide variety of ways. ;-) A slightly safer approach would be to make /dev/cpu/*/msr group writable (with a unique group) and build a setgid application for that group that only reads and writes the performance monitoring MSRs.

We went one step further and built a variant of the /dev/cpu/*/msr kernel driver that limits access to whitelisted registers in the loadable kernel module. This was partly to reduce exposure to user-level mistakes and partly to improve the efficiency of MSR access by allowing a single device driver call to return many MSR reads. The project is almost ready for operation on our Westmere EP and AMD processors under 2.6.18 kernels, and there should be a branch that works on 2.6.32 kernels (which had some gratuitous kernel API changes). Unfortunately it needs some fairly significant extensions to handle the PCI configuration space accesses needed for Sandy Bridge EP.

The code is available at https://github.com/jhammond/pmc

Benefits:

Easy to get access to the low-level hardware for all MSR-based counters in core and uncore

Sets CR4.PCE to allow user-mode execution of the RDPMC instruction

All bits are accessible (except those that cause the performance counters to generate interrupts -- we mask those off)

No pesky virtualization confounding the results --- measurements are by core, not by process

Simple software that is easy to understand and modify

Drawbacks:

No overflow detection -- this is up to the user

No support for any of the interrupt-based performance monitor modes of operation

No support for counter multiplexing (which requires some form of virtualization)

Only has whitelists for 2-3 processors at the current time

No PCI Configuration Space support (required for some of the uncore counters in Sandy Bridge and later processors)

Overall, this approach provides the lowest overhead for gathering a single set of PMC and MSR counters per run, with the option for minimum latency inline RDPMC reads around program sections of interest and reduced overhead for multiple-MSR reads inline.
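Since overflow handling is left to the user, here is a minimal sketch of a wraparound-tolerant difference between two raw counter readings. It assumes 48-bit counters, which is an assumption to check against the processor documentation, and it is only correct if the counter wraps at most once between the two reads. The function name is hypothetical.

```python
COUNTER_BITS = 48  # assumption: counter width in bits; verify for your processor


def counter_delta(before, after, bits=COUNTER_BITS):
    """Return the increase between two raw readings, tolerating one wraparound."""
    return (after - before) % (1 << bits)


if __name__ == "__main__":
    print(counter_delta(100, 250))           # no wrap between the reads
    print(counter_delta((1 << 48) - 10, 5))  # counter wrapped once
```

Reading the counters often enough that a second wrap cannot happen within one interval keeps this approach safe.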