mapping between the L2 cache and RAM

svetlana_m · 05-02-2008

Hello,

I am writing an OpenMP program for the Q6600 with 4 parallel threads. The threads are not using any shared data sets; each thread reads and writes its own data set. I want to achieve the maximum performance with the parallel program. I assume that with the 16-way set associative L2 cache in the Intel Q6600, the cache is divided into 4096 blocks, with 16 cache lines in each block. I read that the main memory is also divided into blocks and that every block of memory is mapped to a block in the L2 cache. I was wondering whether there is more detailed information about this mapping between the cache and the memory. Is it correct to assume that if two parallel threads are using data in the same block of main memory, and the threads are executing on cores that share an L2 cache, they will share only one block of the cache, not the entire 4 MB L2 cache? How can I compute the size of the blocks for the main memory?

Thank you,

Svetlana Marinova

TimP · 05-02-2008

I don't see where you are going with this, so my answer may not be on topic. From what you describe, the main performance issue would be pinning each thread consistently to the same core (and cache). Any recent Linux kernel (e.g. since RHEL 4.3) should accomplish this automatically. With the Intel OpenMP library, KMP_AFFINITY environment variable settings should help, particularly on Windows. Hard-coding affinity into an application may work if it is the only application running.
Both cores have access to any cache line in L2, except while the other core holds write ownership of it. Each core has its own L1 cache, read-combining buffers, and write-combining buffers, so access to L2 is indirect, except with "streaming stores." Memory is mapped in 4 KB pages, unless huge pages are in use, as the Java heap manager may do. The DTLB keeps 256 recently used page mappings, so it is possible for the 2 cores to be competing for the DTLB as well as the L2 cache. If your application is such that each thread needs an entire L2 or DTLB, running 2 threads, each on a separate L2 (KMP_AFFINITY=scatter), may be useful.