Hi everybody !
My application is well balanced in terms of computation power and external memory access. However, since it works on data chunks of approximately 8KB, I haven't found any convenient way of prefetching/streaming such a large amount of data (the regular technique of data-streaming concatenation is very inflexible, since I don't know the data size in advance).
Can you direct me to any instruction or group of instructions that prefetches an adjustable amount of data, on the order of a few KB (up to 16KB)?
Thanks, Boaz
I don't see what is inflexible about the standard prefetching techniques, as they extend to any size of data blocks. Maybe you could give an example, and specify an architecture.
On P4/Xeon architectures, hardware prefetch is interrupted when crossing a page boundary. That has no great effect on performance, if all data from each cache line are used. There would be no apparent reason for using software prefetch, unless you had to disable hardware prefetch. Effect of software prefetch on hardware prefetch varies between models.
On Itanium, software prefetch is engaged automatically at -O3. You could check the code generated by the compiler to see how it is done. The objective is to issue prefetch instructions several hundred clock cycles before the data access. You could place your data blocks so that they fit within a single page, if you had a reason to do so. At the instruction level, you have a choice of cache level.
Prefetch intrinsics are written up in the Intel compiler documentation, and the instructions are described in the chip architecture documents. If those details are more than you care to know, perhaps the default techniques are sufficient.