I want to use VTune to measure the traffic on the memory bus while a program is running.
I searched this forum and found some clues, but they are not very clear.
The following link probably gives some hints, and I hope somebody can say more about this topic.
BTW, I'm playing with a box with 4 Xeon CPUs.
Thanks in advance.
Hi, this document only covers Core 2 processors. I'm playing with a box with 4 Xeon processors, and some of the events are not available on my machine. Also, I'm not sure whether two events have the same semantics, because the two families use different naming conventions.
Do you have any idea about Xeon processors?
Sorry for this. It's a Xeon Nocona processor.
The supported CPU events for this platform are listed below:
128-bit MMX Instructions Retired
1st Level Cache Load Misses Retired
2nd Level Cache Load Misses Retired
2nd Level Cache Read Misses
2nd-Level Cache Read References
2nd-Level Cache Reads Hit Exclusive
2nd-Level Cache Reads Hit Modified
2nd-Level Cache Reads Hit Shared
3rd-Level Cache Read Misses
3rd-Level Cache Read References
3rd-Level Cache Reads Hit Exclusive
3rd-Level Cache Reads Hit Modified
3rd-Level Cache Reads Hit Shared
64-bit MMX Instructions Retired
64k/4M Aliasing Conflicts
All UC Underway from The Processor (AT-E)
All UC from the Processor
All WC Underway from The Processor (AT-E)
All WC from the Processor
All WCB Evictions (TI)
All indirect branches
Bus Accesses Underway from All Agents (AT-E)
Bus Accesses Underway from The Processor (AT-E)
Bus Accesses from All agents
Bus Accesses from the Processor
Bus Data Ready from the Processor (TI)
Bus Reads Underway from The Processor (AT-E)
DTLB Load Misses Retired
DTLB Load and Store Misses Retired
DTLB Page Walks (TI)
DTLB Store Misses Retired
IO Reads Chunk (BSQ) (AT-E)
IO Writes Chunk (BSQ) (AT-E)
ITLB Page Walks (TI)
MOB Loads Replays (AT-E)
MOB Loads Replays Retired
Machine Clear Count
Memory Order Machine Clear
Mispredicted Branches Retired
Mispredicted indirect branches
Non-prefetch Bus Accesses from the Processor
Non-prefetch Reads Underway from The Processor (AT-E)
Packed Double-precision Floating-point Streaming SIMD Extension Instructions Retired
Packed Single-precision Floating-point Streaming SIMD Extension Instructions Retired
Reads Invalidate Full - RFO (BSQ) (AT-E)
Reads Non-prefetch Full (BSQ) (AT-E)
Reads Non-prefetch from the Processor
Reads from the Processor
Scalar Double-Precision Floating-Point Streaming SIMD Extension Instructions Retired
Scalar Single-precision Floating-point Streaming SIMD Extension Instructions Retired
Self-Modifying Code Clear
Speculative Instructions Completed
Speculative Microcode uops
Speculative TC-built uops
Speculative TC-delivered uops
Speculative Uops Retired
Split Load Replays
Split Loads Retired (AT-E)
Split Store Replays
Split Stores Retired (AT-E)
Stalled Cycles of Store Buffer Resources (Non-Standard)
Stalls of Store Buffer Resources (Non-Standard)
Streaming SIMD Extensions Input Assists (TI)
TC to ROM Transfers
Tagged Mispredicted Branches Retired
Trace Cache Build Mode
Trace Cache Deliver Mode
Trace Cache Misses
UC Reads Chunk (BSQ) (AT-E)
UC Reads Chunk Split (BSQ) (AT-E)
UC Reads Chunk Underway (BSQ) (AT-E)
UC Write Partial (BSQ) (AT-E)
WB Writes Full Underway(BSQ) (AT-E)
WCB Full Evictions (TI)
Write WC Full (BSQ) (AT-E)
Write WC Partial (BSQ) (AT-E)
Write WC Partial Underway (BSQ) (AT-E)
Writes Underway from The Processor (AT-E)
Writes WB Full (BSQ) (AT-E)
Writes from the Processor
x87 Input Assists
x87 Instructions Retired
x87 Output Assists
Now I'm using "Bus Data Ready from the Processor (TI)" to estimate the traffic on the FSB.
This event is explained in the VTune help documentation as follows:
"This event counts the number of front-side bus clocks that the bus is transmitting data driven by the processor core, including full reads|writes and partial reads|writes and implicit writebacks."
The following link also mentions that this event can be used to estimate the traffic:
"The Bus Data Ready event is most often used, in my limited experience, to determine the physical data rate. For Xeon/P4, this rate may be much higher than the useful data rate, particularly in cases of inefficient use of Read/Write Combining buffers. I don't know if you could correlate last level cache miss retired events with Bus Data Ready events, perhaps in cases where WCB use is efficient." by tim18
Here are some data from my experiment.
During 0.5 s, VTune collected around 50,000,000 events on an FSB with a base frequency of 200 MHz, quad-pumped to an effective 800 MHz.
From the bus frequency, the theoretical FSB bandwidth should be 800M * 8 bytes/sec = 6.4 GB/s.
However, the largest traffic I can produce with my program is about 50M * 2 * 8 bytes/sec * 4 = 3.2 GB/s. Here 50M * 2 converts the 0.5 s count into events per second, 8 bytes is the bus width, and 4 is the quad pumping. The program I used is like Stream Bench: it creates and writes to a large array.
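To make the arithmetic above easy to check, here is a minimal Python sketch of the same estimate. The function and parameter names are my own, not anything from VTune. It assumes that each counted event is one 200 MHz bus clock during which data is transferred, and that every such clock carries 4 transfers of 8 bytes. That second assumption is exactly the part I am unsure about.

```python
# Sketch of the FSB traffic estimate from a "Bus Data Ready" event count.
# Assumptions (not confirmed by the VTune documentation quoted above):
#   - one event = one bus clock with data on the bus
#   - each such clock moves transfers_per_clock * bytes_per_transfer bytes

def fsb_traffic_bytes_per_sec(events, seconds,
                              bytes_per_transfer=8, transfers_per_clock=4):
    """Estimated data rate implied by the event count."""
    data_clocks_per_sec = events / seconds
    return data_clocks_per_sec * transfers_per_clock * bytes_per_transfer

def fsb_peak_bytes_per_sec(base_clock_hz,
                           bytes_per_transfer=8, transfers_per_clock=4):
    """Theoretical peak rate if every bus clock carried data."""
    return base_clock_hz * transfers_per_clock * bytes_per_transfer

measured = fsb_traffic_bytes_per_sec(events=50_000_000, seconds=0.5)
peak = fsb_peak_bytes_per_sec(base_clock_hz=200e6)

print(f"measured: {measured / 1e9:.1f} GB/s")
print(f"peak:     {peak / 1e9:.1f} GB/s")
print(f"bus data utilization: {measured / peak:.0%}")
```

With my numbers this gives 3.2 GB/s against a 6.4 GB/s peak, which means data was ready on the bus in about half of the bus clocks.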
Could you/anybody let me know how to use "Bus Data Ready from the Processor (TI)" to estimate the traffic on FSB?
Or, are there any other events I can use to do this estimation?