I have a loop that reads a large array and does some processing on input data, then reads a small array and does some processing on input data, and then repeats (using the same arrays but different input data).
Here, by large I mean too large to fit in cache, while small will fit in cache.
At the moment, when the large array is read, it kicks the small array out of cache. However, I expect performance could be improved if the small array could be kept within cache the whole time. So, is there a way to read the large array but not store it in cache, and then read the small array (which would then be kept within cache, since nothing would be kicking it out)?
I've tried using madvise(), but that doesn't seem to help.
The recommended way to tell the processor to minimize displacement of other data in the cache is to use the PREFETCHNTA instruction when reading the large array (but not when reading the small array).
This is most easily accessed using the "_mm_prefetch(addr, hint)" intrinsic (described in the compiler manuals in a section called "Cacheability Support Intrinsics") with the "_MM_HINT_NTA" hint. You will want to set the address to be some distance ahead of the accesses that occur in the loop body. I would probably start by setting the address to something like 64 cache lines ahead of the current pointer, but experimentation will be necessary.
On systems with 512-bit SIMD support, the Intel compiler's "#pragma prefetch arrayname:4" can automagically generate non-temporal prefetches (hint "4" is for data that will not be re-used).
All of these are hints to the hardware, and are treated differently in different processor generations.
Note that if the execution time is limited by the time required to load the large array, then one would not expect the time required to re-load the small array to be a significant addition to the total time.