Would I be correct in saying that, if I distributed my data structure in memory properly, I could use multiple cache lines at the same time? Would each of these cache lines pre-fetch memory as well? I might have misunderstood something, but it seems to me that if I used memory regions properly, I could effectively have multiple cache lines in the processor at the same time, which suggests I could exploit multiple cache lines in my algorithm to increase performance.
Is my understanding accurate?
On pg. 113 of "The Software Optimization Cookbook" it says:
"Each row has been designed to hold unique 2-kilobyte or 4-kilobyte blocks of memory, based upon the row number and the memory address, to speed up the time it takes to determine a cache hit. If all 256 lines could hold any memory location, the processor would have to test all 256 lines to determine a cache hit. But, by assigning each row a specific range of possible addresses, the processor tests only eight lines to determine a cache hit."
This led me to believe that the processor works the way I think it does...
Also, Raf, come onto #tbb sometime, it would be interesting to have some discussions with you!
I noticed I should have been more precise in my previous post: "I would think that any cache line/sector can be used for any part of memory" was not meant to imply a fully associative cache, but any line within the set that the address maps to.
But if any of that is wrong (please check for yourself!), I would very much like to be educated. And thanks for the invitation, but this forum should be enough for me (provided it doesn't get enough of me!).
(Added) Oops! Any other mistakes?
Thanks everyone for answering my questions, and helping out while I learn these low-level details.
Yes, as you may now realize, your attempt to squeeze some extra performance out of the cache may have increased contention between cache lines, depending on how regular your dispersal was: if those locations in VM happened to map to the same set, they would fill up its ways faster.
Here's a real question, directly relevant to how programmers should reason about locality issues: does false sharing happen between sectors (worse, because sectors are bigger) or between lines (the same sector happily residing in more than one processor's cache, with one line dirty in one processor and another line dirty in another)? And is that always so, or does it depend on how the cache-coherency logic is implemented? I didn't read those documents in detail, so there may be an explicit statement I have missed, but a few places lead me to suspect that false sharing occurs between 64-byte lines, not between 128-byte sectors, in contradiction with what I think I have heard from TBB. The clearest may be in the second document, at 2.2 C).
Nicest new idea I learned from these documents (the second one, actually): decoupled sectored caches.