Measuring Memory Bus Traffic

jqdu · ‎11-11-2008

I want to use VTune to measure the traffic on the memory bus while a program is running.

I searched this forum and found some clues, but they are not very clear.

The following thread gives some hints, and I hope somebody can say more about this topic.

http://software.intel.com/en-us/forums/showthread.php?t=44055

BTW, I'm playing with a box with 4 Xeon CPUs.

Thanks in advance.

Peter_W_Intel · ‎11-11-2008

I suggest you read http://softwarecommunity.intel.com/isn/downloads/softwareproducts/pdfs/cycle_accounting.pdf

jqdu · ‎11-12-2008

Quoting - Zhen Yu Wang (Intel)

I suggest you read http://softwarecommunity.intel.com/isn/downloads/softwareproducts/pdfs/cycle_accounting.pdf

Hi, this document only covers Core 2 processors. I'm playing with a box with 4 Xeon processors. Some of the events are not available on my machine. Also, I'm not sure whether two events have the same semantics, because the two families use different naming schemes.

Do you have any idea about Xeon processors?

TimP · ‎11-12-2008

Quoting - jqdu

Hi, this document only covers Core 2 processors. I'm playing with a box with 4 Xeon processors. Some of the events are not available on my machine. Also, I'm not sure whether two events have the same semantics, because the two families use different naming schemes.

Do you have any idea about Xeon processors?

Xeon is a more inclusive term, covering Core 2, as well as those which came before and after. Possibly, if you would be more specific, someone could help.

jqdu · ‎11-12-2008

Quoting - tim18

Xeon is a more inclusive term, covering Core 2, as well as those which came before and after. Possibly, if you would be more specific, someone could help.

Sorry for this. It's a Xeon Nocona processor.

The supported CPU Events for this platform are as below:

128-bit MMX Instructions Retired
1st Level Cache Load Misses Retired
2nd Level Cache Load Misses Retired
2nd Level Cache Read Misses
2nd-Level Cache Read References
2nd-Level Cache Reads Hit Exclusive
2nd-Level Cache Reads Hit Modified
2nd-Level Cache Reads Hit Shared
3rd-Level Cache Read Misses
3rd-Level Cache Read References
3rd-Level Cache Reads Hit Exclusive
3rd-Level Cache Reads Hit Modified
3rd-Level Cache Reads Hit Shared
64-bit MMX Instructions Retired
64k/4M Aliasing Conflicts
All UC Underway from The Processor (AT-E)
All UC from the Processor
All WC Underway from The Processor (AT-E)
All WC from the Processor
All WCB Evictions (TI)
All calls
All conditionals
All indirect branches
All returns
Branches Retired
Bus Accesses Underway from All Agents (AT-E)
Bus Accesses Underway from The Processor (AT-E)
Bus Accesses from All agents
Bus Accesses from the Processor
Bus Data Ready from the Processor (TI)
Bus Reads Underway from The Processor (AT-E)
Clockticks
DTLB Load Misses Retired
DTLB Load and Store Misses Retired
DTLB Page Walks (TI)
DTLB Store Misses Retired
IO Reads Chunk (BSQ) (AT-E)
IO Writes Chunk (BSQ) (AT-E)
ITLB Misses
ITLB Page Walks (TI)
Instructions Completed
Instructions Retired
Loads Retired
MOB Loads Replays (AT-E)
MOB Loads Replays Retired
Machine Clear Count
Memory Order Machine Clear
Mispredicted Branches Retired
Mispredicted calls
Mispredicted conditionals
Mispredicted indirect branches
Mispredicted returns
Non-Halted Clockticks
Non-prefetch Bus Accesses from the Processor
Non-prefetch Reads Underway from The Processor (AT-E)
Packed Double-precision Floating-point Streaming SIMD Extension Instructions Retired
Packed Single-precision Floating-point Streaming SIMD Extension Instructions Retired
Reads Invalidate Full - RFO (BSQ) (AT-E)
Reads Non-prefetch Full (BSQ) (AT-E)
Reads Non-prefetch from the Processor
Reads from the Processor
Scalar Double-Precision Floating-Point Streaming SIMD Extension Instructions Retired
Scalar Single-precision Floating-point Streaming SIMD Extension Instructions Retired
Self-Modifying Code Clear
Speculative Instructions Completed
Speculative Microcode uops
Speculative TC-built uops
Speculative TC-delivered uops
Speculative Uops Retired
Split Load Replays
Split Loads Retired (AT-E)
Split Store Replays
Split Stores Retired (AT-E)
Stalled Cycles of Store Buffer Resources (Non-Standard)
Stalls of Store Buffer Resources (Non-Standard)
Stores Retired
Streaming SIMD Extensions Input Assists (TI)
TC flushes
TC to ROM Transfers
Tagged Mispredicted Branches Retired
Trace Cache Build Mode
Trace Cache Deliver Mode
Trace Cache Misses
UC Reads Chunk (BSQ) (AT-E)
UC Reads Chunk Split (BSQ) (AT-E)
UC Reads Chunk Underway (BSQ) (AT-E)
UC Write Partial (BSQ) (AT-E)
Uops Retired
WB Writes Full Underway(BSQ) (AT-E)
WCB Full Evictions (TI)
Write WC Full (BSQ) (AT-E)
Write WC Partial (BSQ) (AT-E)
Write WC Partial Underway (BSQ) (AT-E)
Writes Underway from The Processor (AT-E)
Writes WB Full (BSQ) (AT-E)
Writes from the Processor
x87 Input Assists
x87 Instructions Retired
x87 Output Assists

jqdu · ‎11-12-2008

Hi, Tim

Now I'm using "Bus Data Ready from the Processor (TI)" to estimate the traffic on the FSB.

The VTune help documentation explains this event as follows:

"This event counts the number of front-side bus clocks that the bus is transmitting data driven by the processor core, including full reads|writes and partial reads|writes and implicit writebacks."

The following link also mentioned that this event could be used to estimate the traffic.

http://software.intel.com/en-us/forums/showthread.php?t=44398

"The Bus Data Ready event is most often used, in my limited experience, to determine the physical data rate. For Xeon/P4, this rate may be much higher than the useful data rate, particularly in cases of inefficient use of Read/Write Combining buffers. I don't know if you could correlate last level cache miss retired events with Bus Data Ready events, perhaps in cases where WCB use is efficient." by tim18

Here are some data from my experiment.

Over 0.5 s, VTune collected around 50,000,000 events on an FSB with a base frequency of 200MHz, quad pumped to 800MHz.

Therefore, the theoretical FSB bandwidth should be 800M * 8 bytes/sec = 6.4GB/s

However, the most traffic I can generate with my program is about 50M * 2 * 8 bytes/sec * 4 = 3.2GB/s. Here the factor of 2 converts the 0.5 s sample to a per-second rate, and the factor of 4 is "quad pumping". The program I used is similar to the STREAM benchmark: it creates and writes to a large array.
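The arithmetic above can be written out as a small Python sketch. It assumes that each "Bus Data Ready" event corresponds to one base bus clock (200MHz) during which data is driven, so that one event moves 4 transfers of 8 bytes each. That interpretation is an assumption drawn from the event description, not something confirmed here, and the function and constant names are made up for illustration.

```python
# Sketch of the FSB traffic estimate; all names here are hypothetical.
FSB_BASE_HZ = 200e6      # base FSB clock from the post
PUMP_FACTOR = 4          # quad pumping: 4 transfers per base clock
BUS_WIDTH_BYTES = 8      # 64-bit data bus


def peak_bandwidth(base_hz=FSB_BASE_HZ, pump=PUMP_FACTOR, width=BUS_WIDTH_BYTES):
    """Theoretical FSB bandwidth in bytes per second."""
    return base_hz * pump * width


def estimated_traffic(event_count, interval_s,
                      pump=PUMP_FACTOR, width=BUS_WIDTH_BYTES):
    """Estimated traffic in bytes per second.

    Assumption: each counted event is one base bus clock with data on the bus.
    """
    data_clocks_per_s = event_count / interval_s
    return data_clocks_per_s * pump * width


peak = peak_bandwidth()                       # 6.4e9 bytes/s
est = estimated_traffic(50_000_000, 0.5)      # 3.2e9 bytes/s
utilization = est / peak                      # 0.5

print(f"peak:        {peak / 1e9:.1f} GB/s")
print(f"estimated:   {est / 1e9:.1f} GB/s")
print(f"utilization: {utilization:.0%}")
```

Under these assumptions the measured count works out to half of the theoretical bandwidth, since 100M data clocks per second is half of the 200M base clocks available.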

Could you, or anybody, let me know how to use "Bus Data Ready from the Processor (TI)" to estimate the traffic on the FSB?

Or, are there any other events I can use to do this estimation?

Thanks!

TimP · ‎11-12-2008

As I recall, it was never possible to get more than 60% of theoretical bus bandwidth on Nocona, under realistic conditions. It looks like you already know more than I.