
How to obtain "QPI Bandwidth" viewpoint?

Tim_Day
Beginner

I have a memory-intensive application running on a dual Xeon E5-2643 under 64-bit Win7 Pro.  The workstation BIOS has NUMA and hyperthreading enabled (Windows sees two NUMA nodes).  I'm using VTune Amplifier XE 2013 Update 4 (build 270817).

My application allocates and initializes a big chunk of RAM on the main thread, then spins up TBB and does a parallel_reduce over a blocked_range2d to compute its results.
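
For anyone wanting to reproduce this access pattern, here is a minimal sketch of it. It is not the actual application code: std::thread stands in for TBB's parallel_reduce so that only the standard library is needed, the work is a plain sum, and the names make_data and banded_sum are hypothetical. Where the pages physically end up (which NUMA node) depends on the operating system's placement policy and is not something this sketch controls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: allocate and initialise the whole array on the
// calling (main) thread, as in the pattern described above.
std::vector<float> make_data(std::size_t rows, std::size_t cols, float value) {
    return std::vector<float>(rows * cols, value);
}

// Hypothetical stand-in for the reduction: split the rows into contiguous
// bands, sum each band on its own thread, then combine the partial sums.
// (TBB would instead split a blocked_range2d and join the partial results.)
double banded_sum(const std::vector<float>& data, std::size_t rows,
                  std::size_t cols, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t band = (rows + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t r0 = std::min(rows, t * band);
        const std::size_t r1 = std::min(rows, r0 + band);
        workers.emplace_back([&data, &partial, cols, r0, r1, t] {
            double s = 0.0;
            for (std::size_t r = r0; r < r1; ++r)
                for (std::size_t c = 0; c < cols; ++c)
                    s += data[r * cols + c];
            partial[t] = s;  // each thread writes only its own slot
        });
    }
    for (auto& w : workers) w.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

Calling make_data(rows, cols, 1.0f) on the main thread and then banded_sum(data, rows, cols, n) with n spanning both sockets gives the same shape of workload: one thread touches all the memory first, and threads on both nodes read it afterwards.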

I'm interested in using VTune to get some insight into bandwidth issues.  In particular, I'd like to assess how much the threads running on cores of the "other" node (the one the data was not allocated on) are being impacted by having to go over the QPI bus to access the data.

The documentation (e.g. http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/lin/ug_docs/GUID-DE99E03A-D0EA-49A0-B8F4-E263A75926ED.htm , although I've also found the same thing in the locally installed help) suggests there is a "QPI Bandwidth" viewpoint available.  But I cannot figure out how to actually get this viewpoint to appear as an option!

I have run both the "General Exploration" analysis (with "Analyze Memory Bandwidth" ticked) and the "Bandwidth" analysis (from the "Sandy Bridge/..." section of "Choose analysis type"), and neither of them seems to give me a "QPI Bandwidth" viewpoint.  Once run, the "General Exploration" analysis gives me the "Hardware Event Counts", "Hardware Event Sample Counts", "Hardware Issues", "Hotspots", "Bandwidth" and "General Exploration" viewpoints, and the "Bandwidth" analysis gives me the same list without "General Exploration".

The "Bandwidth" viewpoint itself is quite interesting, but it only shows "package_0" (of the 2) as consuming any bandwidth, and the peak it reports is about half of what I compute my whole program to be achieving from the rate at which it processes data (and Task Manager clearly shows it maxing out both NUMA nodes).  So it appears that "Bandwidth" just shows a node's direct interaction with DRAM, and not any QPI traffic to the other node, or a node's traffic to DRAM originating from another node via QPI.  (Googling this topic finds various references to "uncore events" being important, and to VTune not necessarily dealing with them that well on some architectures, but the documentation's mention of the "QPI Bandwidth" viewpoint gives me some hope it can show me something useful about what "package_1" is doing.)
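
The "computed from the processing rate" figure mentioned above is just bytes touched divided by wall-clock time. A minimal sketch of that arithmetic follows; the function names are hypothetical, 1 GB is taken as 1e9 bytes, and it assumes the caller knows how many bytes the work reads or writes (cache hits are not subtracted, so the result is an upper bound on DRAM traffic).

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Achieved bandwidth in GB/s (1 GB = 1e9 bytes) from bytes moved and elapsed time.
double achieved_gb_per_s(double bytes_processed, double seconds) {
    return bytes_processed / seconds / 1e9;
}

// Hypothetical convenience wrapper: time a callable with a monotonic clock
// and convert the caller-supplied byte count into GB/s.
template <typename Work>
double measure_gb_per_s(double bytes_processed, Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return achieved_gb_per_s(bytes_processed, elapsed.count());
}
```

Comparing this number against the peak shown in the "Bandwidth" viewpoint is the check described above: if the program-level figure is roughly twice the per-package figure, some of the traffic is not being attributed to the package that is displayed.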

What's the trick to getting a "QPI Bandwidth" viewpoint to appear?
Thanks for any help

Tim

3 Replies
Peter_W_Intel
Employee

If you need to measure QPI traffic, this experimental tool may help.

Tim_Day
Beginner

Thanks, the PCM stuff looks really useful; will try it out.

But I'm still curious to know what it takes to get the documented "QPI Bandwidth" viewpoint in a VTune analysis.  Am I just not ticking the right boxes, or is it incompatible with my system (Dell Precision T7600)?  (I'm partly motivated by us recently splurging on a few XE site licenses; it'd be nice to show some results from the tool rather than having to say, well, actually VTune can't give us the full picture of what's going on in our dual-socket workstations and we have to use something else instead :^)

Tim_Day
Beginner

Well, I seem to have managed to access the PCM stuff programmatically via PCM::initWinRing0Lib() and PCM::getInstance(); it reports pcm->good(), and pcm->program() is successful.  I haven't tried doing anything more interesting yet.

On invoking pcm->program(), I get:

Using PCM on your system might have a performance impact as per http://software.intel.com/en-us/articles/performance-impact-when-sampling-certain-llc-events-on-snb-ep-with-vtune
You can avoid the performance impact by using the option --noJKTWA, however the cache metrics might be wrong then.
ERROR: QPI LL counter programming seems not to work. Q_P0_PCI_PMON_BOX_CTL=0x0
       Please see BIOS options to enable the export of performance monitoring devices (devices 8 and 9: function 2).
ERROR: QPI LL counter programming seems not to work. Q_P1_PCI_PMON_BOX_CTL=0x0
       Please see BIOS options to enable the export of performance monitoring devices (devices 8 and 9: function 2).
ERROR: QPI LL counter programming seems not to work. Q_P0_PCI_PMON_BOX_CTL=0x0
       Please see BIOS options to enable the export of performance monitoring devices (devices 8 and 9: function 2).
ERROR: QPI LL counter programming seems not to work. Q_P1_PCI_PMON_BOX_CTL=0x0
       Please see BIOS options to enable the export of performance monitoring devices (devices 8 and 9: function 2).
Max QPI link speed: 16.0 GBytes/second (8.0 GT/second)
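
The two figures in that last line are consistent with each other if one assumes 2 bytes of data payload per transfer in each direction, which is the commonly quoted figure for a full-width QPI link. That factor is an assumption here, since PCM's own calculation is not shown in this thread, and the names below are hypothetical. A small sketch of the arithmetic, plus the utilisation ratio one would compute once the link counters are readable:

```cpp
#include <cassert>

// Assumed data payload per transfer, per direction, on a full-width QPI link.
constexpr double kQpiBytesPerTransfer = 2.0;

// Peak per-direction link bandwidth in GB/s for a given rate in GT/s.
constexpr double qpi_peak_gb_per_s(double gigatransfers_per_s) {
    return gigatransfers_per_s * kQpiBytesPerTransfer;
}

// Fraction of the peak that an observed per-direction data rate represents.
constexpr double qpi_utilisation(double observed_gb_per_s,
                                 double gigatransfers_per_s) {
    return observed_gb_per_s / qpi_peak_gb_per_s(gigatransfers_per_s);
}
```

With the 8.0 GT/s reported above, qpi_peak_gb_per_s(8.0) gives the 16.0 GBytes/second that PCM prints, so an observed 4 GB/s on one link direction would be a quarter of the peak under this assumption.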

However, it's not at all obvious to me what, if anything, I should change in this machine's BIOS to enable such monitoring (it's a Dell; maybe it's just me, but they do seem to have a habit of dumbing down BIOS options, compared with what I'd see on an "enthusiast" mobo anyway).  Aha: this seems to have been recently mentioned elsewhere, at http://software.intel.com/en-us/articles/bios-preventing-access-to-qpi-performance-counters and e.g. http://software.intel.com/en-us/forums/topic/385194 .

I'm guessing this likely has something to do with the lack of a "QPI Bandwidth" viewpoint in VTune too?
