I am using a function (compute_sad_16x16) that reads two buffers line by line and computes the sum of absolute differences between them. compute_sad_16x16 is called by unaligned_cost, which is called by set_init. I am using VTune Amplifier XE 2011 to analyze cache performance, and the original function shows a 100% LLC miss rate. This was expected, since my buffer stride is very large. So I decided to use the prefetch instruction to bring these two buffers into the cache ahead of time and reduce the cache misses.
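To make the access pattern concrete, here is a minimal sketch of what such a function does. The signature, the `_sketch` suffix and the stride parameters are assumptions for illustration, not the actual code:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical signature: the real compute_sad_16x16 may differ.
 * Walks both 16x16 blocks row by row; each row starts `stride` bytes
 * after the previous one, so a large stride touches a new cache line
 * (at least) on every row. */
static unsigned compute_sad_16x16_sketch(const uint8_t *a, ptrdiff_t stride_a,
                                         const uint8_t *b, ptrdiff_t stride_b)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; ++y) {
        for (int x = 0; x < 16; ++x)
            sad += (unsigned)abs((int)a[x] - (int)b[x]);
        a += stride_a;  /* advance to the next row of buffer a */
        b += stride_b;  /* advance to the next row of buffer b */
    }
    return sad;
}
```

With a stride much larger than a cache line, the 16 rows of each block land on 16 different cache lines, which is the pattern I expect to be causing the misses.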
I tried issuing the prefetches at various call levels: just before compute_sad_16x16 is called, just before unaligned_cost is called, and just before set_init is called. I also tried all four prefetch hint values (0, 1, 2 and 3). I profiled each variant with VTune, and the percentage of LLC misses stayed at 100% every time. I do see variations in the absolute numbers of LLC misses, LFB hits, L1 hits and L2 hits, and the CPI also varies across profiles. However, the total time taken by the function does not decrease compared to the original.
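For illustration, the kind of prefetch call I mean looks roughly like the sketch below. It assumes the `_mm_prefetch` intrinsic from `<xmmintrin.h>`; the helper name `prefetch_block_16` is hypothetical, and `_MM_HINT_T0` stands in for whichever hint is being tested:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Hypothetical helper: issue one prefetch per row of a 16-row block.
 * The hint must be a compile-time constant; swap _MM_HINT_T0 for
 * _MM_HINT_T1, _MM_HINT_T2 or _MM_HINT_NTA to try the other modes.
 * Note: a 16-byte row that is not 64-byte aligned can straddle two
 * cache lines, and this sketch only requests the first one. */
static void prefetch_block_16(const uint8_t *p, ptrdiff_t stride)
{
    for (int y = 0; y < 16; ++y)
        _mm_prefetch((const char *)(p + y * stride), _MM_HINT_T0);
}
```

A prefetch is only a hint and does not change the data, so calling this for both buffers at any of the three levels above leaves the result of the computation unchanged; only the timing can differ.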
The PC I am using has an Intel Core i5 processor with the Sandy Bridge architecture. The program runs on a single thread.
Please let me know exactly how to use the prefetch instruction (which hint to use and at which call level to issue it), or whether it makes sense to use the prefetch instruction here at all.