mfcking
Beginner
172 Views

Cache line size or cache sector size used by VTune?

Hi,
I noticed some confusing terminology, even in the Intel P4 optimization manual. In some chapters, the manual says the P4 Xeon's cache line size is 128B, and that it fetches two 64B sectors on each read but uses only one 64B sector on a write. In other chapters, however, it says the cache line size is still 64B. I'm not clear about the relationship between a cache line and a sector. Could someone shed some light on this? What is the correct cache line size for the P4 Xeon? And which size does VTune use as the cache line size when profiling?
Thanks,
L.Y.

Message Edited by mfcking@yahoo.com on 09-12-2005 12:13 PM

9 Replies
TimP
Black Belt
172 Views

This feature of the Xeon design is usually referred to as Adjacent Sector Prefetch. Cache lines generally are brought into cache in pairs for read. For some purposes, it does behave as if the cache line size were doubled. Recent models have a facility for disabling this feature, but it is on by default.
mfcking
Beginner
172 Views

Hi Tim,
Thanks a lot for your reply. I'm still not very clear about the cache sector: when we say cache line, do we actually mean cache sector? Why not just merge the two sectors into a single line instead of prefetching two sectors?
L.Y.

Message Edited by mfcking@yahoo.com on 09-12-2005 12:17 PM

mfcking
Beginner
172 Views

Hi,
If the cache line size in Xeon is really 64B instead of 128B (one sector), then the least significant 6 bits (0..5), rather than 7 bits (0..6), should not be considered in the alias comparison. But that is not what VTune does, right?
L.Y.
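
The difference between the two granularities in the question above can be sketched in C. This is only an illustration of the address arithmetic (drop bits 0..5 for a 64-byte line, bits 0..6 for a 128-byte sector pair); it is not a statement of what VTune actually does, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Index of the 64-byte line containing addr: address bits 0..5 are dropped. */
static uint64_t line_index(uint64_t addr) { return addr >> 6; }

/* Index of the 128-byte sector pair containing addr: bits 0..6 are dropped. */
static uint64_t pair_index(uint64_t addr) { return addr >> 7; }

/* True if both addresses fall in the same 64-byte line. */
static bool same_line(uint64_t a, uint64_t b)
{
    return line_index(a) == line_index(b);
}

/* True if both addresses fall in the same 128-byte pair of lines. */
static bool same_pair(uint64_t a, uint64_t b)
{
    return pair_index(a) == pair_index(b);
}
```

For example, addresses 0x1000 and 0x1040 are in different 64-byte lines but in the same 128-byte pair, so they would only be flagged as aliasing under the 7-bit comparison.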

Message Edited by mfcking@yahoo.com on 09-12-2005 12:22 PM

TimP
Black Belt
172 Views

I don't know whether this particular effect of enabling and disabling Adjacent Sector Prefetch (ASP) has been fully checked out. It has to be disabled if you wish to avoid adverse effects when one thread writes to one cache line of a pair and another thread, on a different logical processor, reads from the other line of the pair. If you maintain a 128-byte address separation between reads and writes, you avoid this variety of cache aliasing (false sharing), even with ASP enabled. This is not the usual reason for disabling it.
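
One way to keep the 128-byte separation is to align each thread's data to 128 bytes. The following is a minimal C11 sketch, assuming a compiler that supports `alignas`; the struct names and the `PAIR_BYTES` constant are hypothetical and only illustrate the layout.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size of a pair of 64-byte sectors that are fetched together. */
#define PAIR_BYTES 128

/* Each counter is aligned to, and therefore padded out to, PAIR_BYTES,
 * so no two counters can share a 128-byte pair of lines. */
struct per_thread_counter {
    alignas(PAIR_BYTES) uint64_t value;
};

/* Hypothetical shared state: one thread writes 'writer',
 * another thread reads 'reader'. */
struct shared_counters {
    struct per_thread_counter writer;
    struct per_thread_counter reader;
};
```

With this layout the two fields are 128 bytes apart, so a write to one does not touch the sector pair that holds the other, whether or not ASP is enabled. The cost is 120 bytes of padding per counter.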

Message Edited by tim18 on 09-12-2005 11:17 PM

mfcking
Beginner
172 Views

Hi Tim,
Do you know how to disable ASP on Nocona? I want to use VTune to profile the Memory Order Machine Clear event to see whether ASP has an impact on false sharing.
Thanks,
L.Y.
TimP
Black Belt
172 Views

Details may vary among distros. Root privilege is required. The msr module must be loaded:
/sbin/insmod msr
and used to modify the contents of
/dev/cpu/0,...
When HT is active, any change to 0 also applies to 1 (logical siblings).
Open the device corresponding to the logical processor of interest, use lseek() to get to file offset 0x1a0, and XOR the bit
1 << 19 at that location. A bit of black magic.

See /arch/[x86_64|i386]/kernel/msr.c and the attached program. This example program works separately on the Adjacent Sector and Strided hardware prefetch of each logical CPU.
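
The attached program is not reproduced here. As a rough sketch of the steps described above (open the device, lseek() to offset 0x1a0, XOR bit 19, write the value back), something like the following C could be used. It assumes the Linux msr driver's convention of 8-byte reads and writes at a file offset equal to the register number. The function names are hypothetical, and the device path is left to the caller. The post does not say which bit value means "disabled", so print the value before and after if in doubt.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define MSR_OFFSET 0x1a0                 /* file offset given in the post */
#define ASP_BIT    (UINT64_C(1) << 19)   /* bit given in the post */

/* Flip the ASP control bit in a 64-bit register value. */
static uint64_t toggle_asp_bit(uint64_t value)
{
    return value ^ ASP_BIT;
}

/* Read the register through the msr device at msr_path, flip the bit,
 * and write it back. Returns 0 on success, -1 on any failure.
 * Requires root and a loaded msr module. */
static int toggle_asp(const char *msr_path)
{
    uint64_t value;
    int rc = -1;
    int fd = open(msr_path, O_RDWR);

    if (fd < 0)
        return -1;
    if (lseek(fd, MSR_OFFSET, SEEK_SET) == (off_t)MSR_OFFSET &&
        read(fd, &value, sizeof value) == (ssize_t)sizeof value) {
        value = toggle_asp_bit(value);
        if (lseek(fd, MSR_OFFSET, SEEK_SET) == (off_t)MSR_OFFSET &&
            write(fd, &value, sizeof value) == (ssize_t)sizeof value)
            rc = 0;
    }
    close(fd);
    return rc;
}
```

Calling toggle_asp() twice on the same device restores the original setting, since XOR is its own inverse. As noted above, with HT active a change made through one logical processor also applies to its sibling.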

Message Edited by tim18 on 09-12-2005 11:15 PM

mfcking
Beginner
172 Views

Hi Tim,

Actually, in the BIOS of the Supermicro board for Nocona, there is a switch with which you can turn off ASP. Just FYI.

L.Y.

Message Edited by mfcking@yahoo.com on 09-15-2005 12:30 PM

mfcking
Beginner
172 Views

Hi Tim,

IPv4 forwarding performance on the 2.6.9 Linux kernel dropped by 1% when ASP was disabled. Weird?

L.Y.

TimP
Black Belt
172 Views

The greatest reported gains for disabling ASP have been with applications which use all HyperThread logical processors, and don't use the other cache line of the pair very often. In my own testing, ASP has made little difference, partly because the cache lines are brought in anyway by the strided prefetch, and there is no reasonable possibility of turning that off without big performance losses. The differences tend to be concentrated in certain functions.
I haven't seen any reports of investigation of disabling ASP to deal with the kind of false sharing we discussed. A portable application should not depend on separate threads frequently reading and writing data closer together than 128 bytes. It looks like you don't have such a problem.