I am trying to perform a parallel surface extraction from volume data and need to fit a data structure, a 3D grid of values, in cache. My machine has 4 Xeon 7500 processors with 8 cores per CPU. How much L2 cache do I have per core? Is there an instruction to prefetch data into L2?
I guess this is a Nehalem-EX machine. Here's an article about the cache system of that machine (note that 3 MB of L3 per core was available only on the top-priced model). If your code is organized so that each of the 32 L2 caches needs the entire data set locally, performance is bound to be poor. You should restructure it so that only a fraction of the data needs to be local to each of the 4 L3 caches, with each core working on only a fraction of that. Then the default automatic hardware prefetcher settings will take care of prefetching in most situations. With a lot more work, which would be wasted if you haven't accomplished the basics, you could turn off the prefetchers in the BIOS (if this machine isn't shared with other users) and use _mm_prefetch intrinsics or prefetch pragmas in your source code. The Intel compiler's opt-prefetch option works alongside the normal BIOS prefetch settings and speeds up a few problems.
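To make the blocking idea concrete, here is a rough sketch in C of walking the grid tile by tile, with an optional _mm_prefetch hint for the next row. Everything in it is an assumption for illustration: the names sum_grid and sum_tile, the float grid with x as the fastest-varying index, and the tile edge of 32 (32^3 floats is 128 KB, about half of the 256 KB per-core L2 I would expect on a Nehalem-class part; check your own machine). The summation is only a stand-in for the real surface-extraction work, and the prefetch is a hint that may well make no measurable difference while the hardware prefetchers are enabled.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T1 (x86 only) */

/* Hypothetical tile edge: 32^3 floats = 128 KB, sized against an
   assumed 256 KB per-core L2. Tune for the real machine. */
enum { TILE = 32 };

/* Process one tile of an nx*ny*nz grid (x is the fastest index).
   The sum is a placeholder for the real per-cell work. */
static double sum_tile(const float *grid, size_t nx, size_t ny, size_t nz,
                       size_t x0, size_t y0, size_t z0)
{
    /* Clamp the tile to the grid bounds. */
    size_t x1 = (x0 + TILE < nx) ? x0 + TILE : nx;
    size_t y1 = (y0 + TILE < ny) ? y0 + TILE : ny;
    size_t z1 = (z0 + TILE < nz) ? z0 + TILE : nz;
    double acc = 0.0;

    for (size_t z = z0; z < z1; ++z) {
        for (size_t y = y0; y < y1; ++y) {
            const float *row = grid + (z * ny + y) * nx;
            /* Hint: pull the start of the next row of this tile toward L2. */
            if (y + 1 < y1)
                _mm_prefetch((const char *)(row + nx + x0), _MM_HINT_T1);
            for (size_t x = x0; x < x1; ++x)
                acc += row[x];
        }
    }
    return acc;
}

/* Visit the whole grid one tile at a time, so the working set at any
   moment is a single tile rather than the full volume. */
double sum_grid(const float *grid, size_t nx, size_t ny, size_t nz)
{
    double total = 0.0;
    for (size_t z0 = 0; z0 < nz; z0 += TILE)
        for (size_t y0 = 0; y0 < ny; y0 += TILE)
            for (size_t x0 = 0; x0 < nx; x0 += TILE)
                total += sum_tile(grid, nx, ny, nz, x0, y0, z0);
    return total;
}
```

In a parallel version, the tiles are the natural unit to hand out to cores, so each core touches only its own fraction of the data. Measure with and without the prefetch call before keeping it.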