Hi,
All modern CPUs have caches to speed up processing.
I made some simple benchmarks to see the impact. It is quite surprising how much faster things become under high load, while under low CPU load the programs actually became slower.
I think the main problem is the placement of the prefetch instruction relative to the actual use of the data.
Does anybody have experience with a good method for filling the cache?
There must be some distance between the prefetch instruction and the actual use of the data. The problem is: if the distance is too big, the data gets evicted before being used and has to be reloaded when it is actually needed. If the distance is too small, the program reaches the data before the prefetch completes and has to wait until the cache line is filled.
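To illustrate what I mean, here is a minimal sketch of a loop with an explicit prefetch issued some fixed distance ahead of the current access. It uses GCC/Clang's `__builtin_prefetch`, and the distance of 64 elements is just a placeholder that would have to be tuned experimentally:

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements -- must be tuned per CPU,
   cache size, and access pattern. Too large: data is evicted before use.
   Too small: the load is still in flight when the data is needed. */
#define PREFETCH_DISTANCE 64

double sum_with_prefetch(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* Prefetch the element we will need PREFETCH_DISTANCE iterations
           from now (read access, low temporal locality). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 1);
        sum += data[i];
    }
    return sum;
}
```

Whether this helps at all depends on whether the hardware prefetcher already recognizes the sequential pattern, in which case the explicit prefetches are pure overhead.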
Erich
Unfortunately, I don't know if there is a "right" answer to this. It depends on the cache architecture, the data types and sizes, and the program layout/access patterns. Also, the way compilers optimize loops in which prefetching is possible comes into play.
For the Intel Compilers, with the vectorization/processor targeting switches (-QxW, -QaxW, etc.) for IA-32, and/or the high-level optimization switch (-O3, HLO) for IA-32 and Intel Itanium processors, the compiler will either generate prefetches/streaming stores (IA-32) or lay out data accesses such that hardware prefetching can kick in as much as possible. Experimentation is required in any case.
There's a good write-up on IA-32 cache/memory utilization in the "Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual": ftp://download.intel.com/design/Pentium4/manuals/24896607.pdf. There's a chapter on IA-32 cache utilization, and an appendix on a theoretical/mathematical view of data prefetch distance.
The upshot is that due to all of the variables involved, experimentation is needed to determine what is best for your program.