Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Cache Management


All modern CPUs have caches to speed up processing.

I made some simple benchmarks to see the impact. It is pretty surprising that things become much faster under high load, while the programs become even slower under low CPU load.

I think that the main problem is the placement of the prefetch instruction relative to the actual use of the data.

Does anybody have experience with a good method for filling the cache?

There must be some distance between the prefetch instruction and the actual use of the data. The problem is that if the distance is too big, the data gets evicted before being used and has to be reloaded when it is actually needed. If the distance is too small, the program reaches the data before the prefetch completes and has to wait for the cache line to arrive.

1 Reply
Unfortunately, I don't know if there is a "right" answer to this. It depends on the cache architecture, the data types and sizes, and the program layout/access patterns. Also, the way compilers optimize loops in which prefetching is possible comes into play.

For the Intel Compilers, with the vectorization/processor targeting switches (-QxW, -QaxW, etc.) for IA-32, and/or the high-level optimization (-O3 or HLO switch) for IA-32 and Intel Itanium processors, the compiler will either generate prefetches/streaming stores (IA-32) or lay out data accesses so that hardware prefetching can kick in as much as possible. Experimentation is required in any case.

There's a good write-up on IA-32 cache/memory utilization in the "Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual": there's a chapter on IA-32 cache utilization, and an appendix on a theoretical/mathematical view of data prefetch distance.

The upshot is that due to all of the variables involved, experimentation is needed to determine what is best for your program.