
Size of data fetched by the prefetch instruction

I would like to know the size of the data fetched when I prefetch a pointer with the PREFETCHh instruction (prefetcht0, prefetcht1, prefetcht2 or prefetchnta).
In the Intel 64 and IA-32 Architectures Software Developer's Manual, I read this:
"These instructions fetch 32 aligned bytes (or more, depending on the implementation) containing the addressed byte to a location in the cache hierarchy specified by the temporal locality hint."
So the minimum size of data fetched is 32 bytes, but how can I find out the actual size for a given implementation?
I need to know this because I work on an image and want to prefetch several pixels around a given pixel, so I need to know how many prefetch instructions I must issue.
Thanks, nicolas
Example :
4 Replies

I believe all modern processors use 64-byte cache lines. That is the granularity of a prefetch.
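To make the "how many prefetch instructions" question concrete, here is a minimal sketch that assumes 64-byte lines as stated in this reply. The names lines_spanned and prefetch_window are hypothetical, and __builtin_prefetch is the GCC/Clang builtin (with locality 3 it is typically compiled to prefetcht0). It issues one prefetch per cache line overlapped by each row of a pixel window.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u  /* assumption: 64-byte cache lines */

/* Number of cache lines overlapped by the byte range [addr, addr + len). */
static size_t lines_spanned(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = addr / LINE_BYTES;
    uintptr_t last  = (addr + len - 1) / LINE_BYTES;
    return (size_t)(last - first + 1);
}

/* Prefetch `rows` rows of `row_bytes` bytes each, starting at `p`;
   consecutive rows are `stride` bytes apart.
   Returns the number of prefetches issued. */
static size_t prefetch_window(const unsigned char *p, size_t stride,
                              size_t rows, size_t row_bytes)
{
    size_t issued = 0;
    for (size_t r = 0; r < rows; ++r) {
        const unsigned char *row = p + r * stride;
        size_t n = lines_spanned((uintptr_t)row, row_bytes);
        /* Round down to the start of the line so each step lands on a new line. */
        const unsigned char *line = (const unsigned char *)
            ((uintptr_t)row & ~(uintptr_t)(LINE_BYTES - 1));
        for (size_t i = 0; i < n; ++i, ++issued)
            __builtin_prefetch(line + i * LINE_BYTES, 0, 3); /* read, keep in all levels */
    }
    return issued;
}
```

For a window of 8 bytes per row that starts 60 bytes into a line, each row straddles two lines, so three rows cost six prefetches. Whether the extra prefetches actually help depends on the access pattern and on the hardware prefetchers, so it is worth measuring.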

If you take a look at the cpuid instruction documented in the Software Developer's Manual, you will find that Intel's answer is rather complicated. The prefetch size may be 32, 64, or 128 bytes, and the only correct way to know is to query it via cpuid. cpuid itself is crazy: I just spent 5 hours implementing the latest spec, and of course the overlap between AMD and Intel in that regard is marginal.

Anyway, all the systems I have access to, Intel or AMD, have answered with a prefetch size of 64 bytes. As far as I know, only some Intel CPUs of family 15 actually had a prefetch size of 128 bytes, but I never had one of those. And 32 bytes is probably from the Pentium II/III era...
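One query that works the same way on Intel and AMD is the CLFLUSH line size in CPUID leaf 1 (EBX bits 15-8, in units of 8 bytes, valid when the CLFSH feature bit, EDX bit 19, is set). The sketch below assumes GCC or Clang with the cpuid.h helper header. The function names are hypothetical, and treating the CLFLUSH line size as the prefetch granularity is an assumption in line with the replies above, not something the manual promises.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (toolchain assumption) */
#endif

/* Decode the CLFLUSH line size from CPUID.1:EBX (bits 15-8, units of 8 bytes). */
static unsigned clflush_line_bytes(unsigned ebx)
{
    return ((ebx >> 8) & 0xFFu) * 8u;
}

/* Returns the line size in bytes, or 0 if it cannot be queried. */
static unsigned query_line_bytes(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 19)))
        return clflush_line_bytes(ebx);
#endif
    return 0;
}
```

A caller would use query_line_bytes() once at startup and fall back to a conservative default when it returns 0.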

On some CPUs (e.g. early P4), when alternate sector prefetch is active, prefetching a cache line could trigger the prefetch of the companion cache line, or 128 bytes in all. The linking of software and hardware prefetch was removed in more recent CPUs, AFAIK.

Thanks for your answer,
With this small program, I can confirm that the prefetch instruction fetches 64 bytes into the cache on my machine.
The program:
#include <stdio.h>

int main(void)
{
    unsigned int input = 0x2, eax, ebx, ecx, edx;
    /* A single asm statement: registers are not preserved between
       separate asm statements, so cpuid and its operands stay together. */
    asm volatile ("cpuid"
                  : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                  : "a" (input), "c" (0));
    printf("eax = 0x%08X, \nebx = 0x%08X, \necx = 0x%08X, \nedx = 0x%08X \n", eax, ebx, ecx, edx);
    return 0;
}
The result on my X5670 is:
eax = 0x55035A01,
ebx = 0x00F0B2FF, <== bits 23-16 = F0 = Prefetch: 64-byte prefetching*
ecx = 0x00000000,
edx = 0x00CA0000
*: Table 3-25, Encoding of CPUID Leaf 2 Descriptors, in the Intel 64 and IA-32 Architectures Software Developer's Manual (Volume 2A: Instruction Set Reference, A-M)
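Instead of reading the descriptor by eye, the four registers can be scanned for the prefetch descriptors. This is only a sketch: the function name prefetch_bytes_from_leaf2 is hypothetical, and it handles just the two prefetch entries of the leaf 2 descriptor table (0xF0 = 64-byte prefetching, 0xF1 = 128-byte prefetching). It follows the leaf 2 rules that the low byte of EAX is not a descriptor and that a register with bit 31 set holds no valid descriptors.

```c
/* Scan CPUID leaf 2 output for a prefetch descriptor.
   Returns 64 or 128, or 0 if no prefetch descriptor is present. */
static unsigned prefetch_bytes_from_leaf2(unsigned eax, unsigned ebx,
                                          unsigned ecx, unsigned edx)
{
    const unsigned regs[4] = { eax, ebx, ecx, edx };

    for (int r = 0; r < 4; ++r) {
        if (regs[r] & 0x80000000u)      /* bit 31 set: no valid descriptors */
            continue;
        for (int i = 0; i < 4; ++i) {
            if (r == 0 && i == 0)       /* AL is not a descriptor byte */
                continue;
            unsigned d = (regs[r] >> (8 * i)) & 0xFFu;
            if (d == 0xF0u)
                return 64;              /* 64-byte prefetching */
            if (d == 0xF1u)
                return 128;             /* 128-byte prefetching */
        }
    }
    return 0;
}
```

With the register values shown above (0x55035A01, 0x00F0B2FF, 0x00000000, 0x00CA0000), the scan finds 0xF0 in bits 23-16 of ebx and returns 64.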
Thanks, nicolas