BUS_TRAN_BURST.SELF ~4x of MEM_LOAD_RETIRED.L2_LINE_MISS

cfspc · ‎04-20-2009

Greetings,

I am trying to figure out the amount of memory bus traffic in
an application.

From http://assets.devx.com/goparallel/18027.pdf I thought that
BUS_TRAN_BURST.SELF (multiplied by 64, the cache line size in bytes)
would be a good measure. I also expected that this number would be within
2x of MEM_LOAD_RETIRED.L2_LINE_MISS (there are no RFOs
etc.). However, I see that BUS_TRAN_BURST.SELF is ~4x to 5x
MEM_LOAD_RETIRED.L2_LINE_MISS. I have been trying to figure
out where the difference comes from but I have not found a reasonable
explanation yet.

I also measured L2_LD.SELF.DEMAND.MESI and L2_LD.SELF.ANY.MESI
and found that L2_LD.SELF.DEMAND.MESI is about half of
L2_LD.SELF.ANY.MESI and that L2_LD.SELF.DEMAND.MESI is about
double BUS_TRAN_BURST.SELF.

The number of L2_M_LINES_OUT.SELF.ANY events is about 1.5x
the number of MEM_LOAD_RETIRED.L2_LINE_MISS events.

Any help would be greatly appreciated.

Best regards,

Carlos

Thomas_W_Intel · ‎04-21-2009

Carlos,

So do the prefetchers kick in for your application? What does L2_LD.SELF.PREFETCH.MESI report? Have you tried the experiments with both L2 prefetchers disabled?

Kind regards
Thomas

cfspc · ‎04-21-2009

Hi Thomas,

Thank you very much for your suggestion.

I measured L2_LD.SELF.PREFETCH.MESI and L2_LD.SELF.PREFETCH.I_STATE.
Here is a table with the event counts in GEvents (10^9 events):

Event                            GEvents
BUS_TRAN_BURST.SELF              11.6
MEM_LOAD_RETIRED.L2_LINE_MISS     2.0
BUS_TRAN_WB.SELF                  2.8
BUS_TRAN_IFETCH.SELF              0.7
L2_LD.PREFETCH.MESI              21.2
L2_LD.PREFETCH.I_STATE            6.0

So it does seem that prefetch is quite active. Does it make sense to say that
the L2_LD.PREFETCH.I_STATE events cause a similar number of cache
lines to be transported on the bus?
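
A quick way to test that idea is to add up the counts from the table and compare the sum with BUS_TRAN_BURST.SELF. The sketch below does this under two assumptions that are not confirmed anywhere in this thread: that BUS_TRAN_BURST.SELF counts every full-line transaction (demand reads, prefetch reads, writebacks and instruction fetches), and that each L2_LD.PREFETCH.I_STATE event fetches exactly one line over the bus. The variable names are only illustrative.

```python
LINE_BYTES = 64  # cache line size in bytes

# Event counts in GEvents (10^9), copied from the table above.
counts = {
    "BUS_TRAN_BURST.SELF": 11.6,
    "MEM_LOAD_RETIRED.L2_LINE_MISS": 2.0,
    "BUS_TRAN_WB.SELF": 2.8,
    "BUS_TRAN_IFETCH.SELF": 0.7,
    "L2_LD.PREFETCH.MESI": 21.2,
    "L2_LD.PREFETCH.I_STATE": 6.0,
}

burst = counts["BUS_TRAN_BURST.SELF"]

# Total bus traffic in GB if every burst transaction moves one 64-byte line.
bus_gbytes = burst * LINE_BYTES

# The ratio that prompted the original question.
ratio = burst / counts["MEM_LOAD_RETIRED.L2_LINE_MISS"]

# ASSUMPTION: these four sources each put one full line on the bus per event.
explained = (
    counts["MEM_LOAD_RETIRED.L2_LINE_MISS"]   # demand load misses
    + counts["L2_LD.PREFETCH.I_STATE"]        # prefetches that missed L2
    + counts["BUS_TRAN_WB.SELF"]              # writebacks
    + counts["BUS_TRAN_IFETCH.SELF"]          # instruction fetches
)

print(f"bus traffic      : {bus_gbytes:.1f} GB")
print(f"burst / L2 miss  : {ratio:.1f}x")
print(f"explained bursts : {explained:.1f} of {burst:.1f} GEvents")
```

If those assumptions hold, the four sources sum to about 11.5 GEvents against 11.6 measured, which would be consistent with prefetches that miss in L2 accounting for roughly half of the bus transactions.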

I cannot easily disable the prefetchers in the BIOS as this is running on
a remote server. Is there a way to programmatically disable prefetching?

Best regards,

Carlos

TimP · ‎04-21-2009

There are many online posts about controlling hardware prefetch through MSRs (which requires root privilege). It is well recognized that disabling adjacent cache line prefetch is often an effective way of cutting down on prefetched cache lines that are never touched, yet it rarely has a significant impact on performance. When strided prefetch kicks in, it typically brings in 2 cache lines beyond the end of a data stream. With proper affinity settings, for example in the case of an OpenMP schedule, you might arrange that most of those are cache lines already brought in for another data stream, and thus the effect may be negligible.
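
As a rough illustration of the MSR route, here is a minimal sketch that uses the Linux msr driver, which exposes each CPU's registers as /dev/cpu/N/msr; a read or write is 8 bytes at a file offset equal to the MSR address. The prefetch-disable bit positions in IA32_MISC_ENABLE (0x1A0) are assumptions based on Core 2 class processors and should be checked against the Software Developer's Manual for the exact part. The function names are hypothetical. Writing a wrong value to an MSR can hang the machine, so read and decode first.

```python
import os
import struct

IA32_MISC_ENABLE = 0x1A0  # MSR address

# ASSUMPTION: bit positions for Core 2 class CPUs; verify for your processor.
PREFETCH_DISABLE_BITS = {
    "hw_prefetcher": 9,     # L2 streaming (strided) prefetcher
    "adjacent_line": 19,    # adjacent cache line prefetch
    "dcu_prefetcher": 37,   # L1 data cache prefetcher
    "ip_prefetcher": 39,    # L1 instruction-pointer based prefetcher
}


def read_msr(cpu, address):
    """Read one 64-bit MSR via the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def write_msr(cpu, address, value):
    """Write one 64-bit MSR; the file must be opened for writing."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), address)
    finally:
        os.close(fd)


def decode_prefetch_state(value):
    """Map each prefetcher name to True if its disable bit is set."""
    return {
        name: bool((value >> bit) & 1)
        for name, bit in PREFETCH_DISABLE_BITS.items()
    }


def with_prefetch_disabled(value, names):
    """Return value with the disable bits for the given prefetchers set."""
    for name in names:
        value |= 1 << PREFETCH_DISABLE_BITS[name]
    return value


# Example use (as root, repeated for every CPU the application runs on):
#   old = read_msr(0, IA32_MISC_ENABLE)
#   print(decode_prefetch_state(old))
#   new = with_prefetch_disabled(old, ["hw_prefetcher", "adjacent_line"])
#   write_msr(0, IA32_MISC_ENABLE, new)
```

Keeping the bit manipulation separate from the device access makes it possible to check the value that would be written before touching the hardware, and to restore the saved original value after the experiment.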

cfspc · ‎04-22-2009

Hi Tim,

Thank you very much for your post. I will try using http://etallen.com/msr.html to set the proper
MSR bits and will report the profile values after I repeat the experiments. I am not looking to
disable the prefetchers for performance but merely to see how much of the prefetching is wasted.
The application I am profiling is quite large and has some parts where the access patterns are very "random",
and those parts are causing huge numbers of cache misses and bus traffic, even with only one thread.
Gathering this data should help push for a change and will help in our optimization efforts.

Best regards,

Carlos

cfspc · ‎04-22-2009

Greetings,

Unfortunately, running the msr tool from http://etallen.com/msr.html is not working
on the machines I have access to. I get

[root@...]# ./msr IA32_MISC_ENABLE.aclp_dis=1
msr: info: IA32_MISC_ENABLE.aclp_dis=1: fell back to numeric interpretation
msr: unable to write msr file at offset 0x000001a0; errno = 9 (Bad file descriptor)

(BTW the MSR module is compiled into the kernel).

I also tried just using wrmsr from asm/msr.h but that just segfaults.

I then tried wrmsr from a stap script ... and killed the machine, ugh.

Can you point me to the appropriate way of manipulating MSRs?

Best regards,

Carlos