a) If we use gcc's __builtin_prefetch(addr, 1); to prefetch a cache line for write, what factors determine whether this line will remain in cache? That is, as long as the code accesses it often enough, is it guaranteed to stay in L1 cache?
b) If we use gcc's __builtin_prefetch(addr, 1, 1); to prefetch a cache line for read, what factors determine whether this line will remain in cache until it is accessed once? Also, if a writer writes to this line after the reader executes the prefetch, will the new data be prefetched automatically by the hardware?
c) I am assuming __builtin_prefetch uses the PREFETCHTx instructions. Can someone confirm please?
Thanks for your help.
There are no "guarantees" that a line will stay in cache for any length of time, whether the line is brought into the cache by a hardware prefetch, a software prefetch, or a normal load or store. This is the case for virtually any general-purpose processor designed in the last decade or two.
Software prefetch instructions typically do move data into some level of the cache hierarchy, and sometimes provide special behavior depending on some combination of the "temporal" hint(s) and the actual location and cache state of the cache line requested. Unfortunately the behavior is strongly implementation-dependent and does not appear to be documented for recent Intel processors.
The Intel Optimization Reference Manual (document 248966-028, July 2013) dedicates much of Chapter 7 to a discussion of optimizing cache usage with software prefetch instructions, but the details are only provided for the Pentium 4 processor! Similarly, the Intel Architecture SW Developer's Guide, Volume 2 (document 325383-047, June 2013) describes the behavior of the PREFETCH instructions only for the Pentium III and Pentium 4 processors. (There is a bit more information about the implementation of software prefetch and temporal hints on Xeon Phi, but that information is quite unlikely to tell us about how software prefetch is implemented on more modern cores.)
It would take a strong knowledge of microarchitecture and validated hardware performance counters to design a set of microbenchmarks that could be used to test various hypotheses about the exact operation of the prefetch instructions. I am not aware of any detailed analyses of how these are implemented in recent Intel processors -- but I would be happy to be corrected!
I don't know if it is a problem on all Ivy Bridge processors, but Agner Fog (www.agner.org) reports that while Sandy Bridge can execute two software prefetch instructions per cycle, Ivy Bridge can only execute one software prefetch every 43 cycles! This should be relatively easy to test, in case you are running on an Ivy Bridge processor.
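For anyone who wants to try the test suggested above, here is a minimal sketch of a throughput check. It is only a rough starting point: the helper name ns_per_prefetch, the 4096-byte buffer, and the 64-byte line size are my assumptions, and it measures wall-clock nanoseconds rather than core cycles, so you would need to multiply by your actual core frequency (and account for loop overhead) before comparing against the per-cycle figures quoted above.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

/* Hypothetical helper (not from the thread): average nanoseconds per
   software prefetch, issuing `iterations` prefetches to a small buffer
   that should stay cache-resident. */
double ns_per_prefetch(size_t iterations)
{
    static char buffer[4096];
    struct timespec start, end;

    if (iterations == 0)
        return 0.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < iterations; i++) {
        /* Step by 64 bytes (assumed line size), wrapping inside the buffer.
           rw = 0 (read), locality = 3 (highest). */
        __builtin_prefetch(&buffer[(i * 64) % sizeof buffer], 0, 3);
        /* Compiler barrier (gcc/clang extension) so the loop is kept. */
        __asm__ volatile("" ::: "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling ns_per_prefetch(100000000) a few times and taking the minimum should give a usable number. A result of many nanoseconds per prefetch, rather than a fraction of one, would be consistent with the slow case described above, but check the disassembly first to confirm the loop really contains a prefetch instruction.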
The question about gcc's __builtin_prefetch seems better suited to the gcc-help mailing list, once you have looked over the gcc documentation and source code for the gcc version of interest and can ask a more specific question, if you still have one. It looks like prefetcht0 corresponds to a different 3rd argument from the one you wanted to use, assuming your target architecture is one where the question is relevant. The cache level hints are interpreted differently by various CPU models, so there's a good chance it won't make a difference on a CPU you may be interested in.
As John indicated, the interaction between software and hardware prefetch has changed several times with new CPU introductions.
When data are written to a cache line, other copies of that cache line are invalidated. Are you asking at what point after a cache line is flushed would other software prefetched copies of it be replaced? I'm certainly not qualified to answer that, but I'd guess maybe not until accessed, on recent CPUs.
Thanks a lot gentlemen. I will follow up.
The gcc __builtin_prefetch translates to:
4008c0: 48 83 ec 08 sub $0x8,%rsp
4008c4: bf 00 04 00 00 mov $0x400,%edi
4008c9: e8 2a fe ff ff callq 4006f8 <_Znam@plt>
4008ce: 0f 18 08 prefetcht0 (%rax)
4008d1: b8 00 00 00 00 mov $0x0,%eax
4008d6: 48 83 c4 08 add $0x8,%rsp
4008da: c3 retq
4008db: 90 nop
The processor we are on is Intel E5-2690.
The gcc source code looks as if prefetchnta would be the default; you would have to ask for t0 if you want that, but it may make no difference. You would still need to look at the source code for your choice of gcc version, or test that version. If you are looking for feedback from gcc people, asking at gcc-help would make more sense.
Digging up this old thread to point out that godbolt.org is awesome for investigating these kinds of things.
Following this link you can see exactly how gcc, clang and icc compile the various flavors of __builtin_prefetch. The summary is that all compilers behave the same, with one exception as noted below:
BTW, the original poster seems to have his read/write hints and locality hints reversed: both in their position and values. 0 is the read hint and 1 is the write hint, and this hint is the first argument after the pointer. The locality hint is the second argument after the pointer. Both are optional and default to 0 (read) and 3 (highest locality) respectively.
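To make the argument order concrete, here is a small sketch of both flavors. The function names and the PREFETCH_DISTANCE value are made up for illustration and are not tuned for any particular CPU; whether the prefetches help at all depends on the hardware, as discussed earlier in the thread.

```c
#include <stddef.h>

/* Hypothetical look-ahead distance in elements; an illustrative value only. */
enum { PREFETCH_DISTANCE = 16 };

/* Read path: second argument 0 = read, third argument 3 = highest locality.
   These match the defaults, so __builtin_prefetch(p) alone is equivalent. */
long sum_with_read_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        total += data[i];
    }
    return total;
}

/* Write path: second argument 1 = write, third argument 1 = low locality. */
void fill_with_write_prefetch(long *data, size_t n, long value)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 1, 1);
        data[i] = value;
    }
}
```

Note that both hint arguments must be compile-time constants. The prefetch is only a hint, so the functions return the same results with or without it; the only way to see what was emitted is to look at the generated assembly.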