I'm wondering at which cache level Ivy Bridge manages coherence. I could not find a document explaining this (or at least not one that I actually understood). L1 and L2 caches are private to each core, right? So is it done at L3?
Any help will be appreciated. Thanks.
I suppose it has been basically the same since Nehalem. There are lots of papers, not all entirely agreeing. Yes, a hit on a modified line (HITM) on the same CPU has to be resolved among the cores via L3. This seems to have become relatively expensive on the dual 12-core Ivy Bridge.
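If you want a rough feel for that cost on your own machine, a minimal sketch is to bounce a single cache line between two threads and time the round trips. The function name `ping_pong_ns` is made up for illustration, the 64-byte line size is an assumption, and the threads are not pinned, so the OS decides which cores (or which socket) they land on. Treat the number as an indication only.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: two threads take turns writing the same cache line.
// Returns the average time per round trip in nanoseconds.
double ping_pong_ns(std::uint64_t round_trips) {
    assert(round_trips > 0);
    // Assumption: 64-byte cache lines; the flag gets a line of its own.
    alignas(64) std::atomic<std::uint64_t> flag{0};

    std::thread responder([&flag, round_trips] {
        for (std::uint64_t i = 0; i < round_trips; ++i) {
            // Wait for the "ping" (odd value), then answer with the next even value.
            while (flag.load(std::memory_order_acquire) != 2 * i + 1) { }
            flag.store(2 * i + 2, std::memory_order_release);
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < round_trips; ++i) {
        // Send the "ping", then spin until the responder has written its answer.
        flag.store(2 * i + 1, std::memory_order_release);
        while (flag.load(std::memory_order_acquire) != 2 * i + 2) { }
    }
    auto stop = std::chrono::steady_clock::now();
    responder.join();

    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(round_trips);
}
```

Calling `ping_pong_ns(1000000)` from `main` and printing the result gives a per-round-trip figure. Each round trip forces the line to move between the two cores twice, which is the modified-line case discussed here.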
Hello Tim, thank you for your comments.
What are the advantages of resolving coherence at L3? If you are worried about sharing data among cores, I guess you would want to do it as efficiently as possible (by this I mean at the L1/L2 level). Isn't it a little paradoxical to create a processor with many cores (thus encouraging the use of multithreaded apps) with a "slow coherence bus" between the cores?
Please correct me if I'm mistaken. I'm not a "black belt" yet. =p
I suppose the architecture designers take a much wider range of considerations and specific application traces into account than we can consider here. The switch to inclusion of L3 was made when the CPUs went beyond 4 cores, and now CPUs with more than 32 cores again have no L3, although their L2 caches are organized in a ring, which seems to bear some resemblance to an L3.
There was also a switch, when L3 was introduced, from the write combine buffer scheme, where L2 (shared between 2 cores, possibly 4 thread contexts) had to be updated before L1, to the fill buffer scheme, where L1 is updated first and is exclusive of L2 (the cache lines in current use aren't in L2 until evicted from L1).
Are you proposing that L3 should not be updated when a hit modified occurs, at least not until the most recent data have to be evicted from L2? I'd speculate this might make sense in a situation where it's unusual that more than one cache needs updating, but that might be a situation where L3 isn't advantageous at all.
These forums are more software than hardware oriented. I think I'm not the only one who tends to restrict interest in hardware to the aspects which we need to consider when working on software.
Hi Tim, thanks for your comments.
I'm not proposing anything. I'm parallelizing an application and it seems that the communication/synchronization through L3 is the main bottleneck, which is why I'm interested in knowing at which level coherence is done and why it is so. I have no particular interest in the hardware beyond what I need to know in order to improve the performance of my application.
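For what it's worth, one common software-side way to cut this kind of traffic is to let each thread accumulate in a local variable and write to shared memory only once, into a slot padded to its own cache line. The sketch below assumes a 64-byte line and C++17 (so that `std::vector` honors the over-alignment). `PaddedSum` and `parallel_sum` are made-up names and the workload is a toy, not the application discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// One result slot per thread, padded so that two slots never share a line.
struct alignas(kCacheLine) PaddedSum {
    std::uint64_t value = 0;
};
static_assert(sizeof(PaddedSum) == kCacheLine, "slot should fill exactly one line");

// Toy workload: sum the integers 0..n-1, split across num_threads threads.
std::uint64_t parallel_sum(std::uint64_t n, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<PaddedSum> partial(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, n, num_threads, t] {
            std::uint64_t local = 0;  // thread-private accumulator, not shared
            for (std::uint64_t i = t; i < n; i += num_threads) {
                local += i;
            }
            partial[t].value = local;  // single write to the shared array
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    std::uint64_t total = 0;
    for (const auto& p : partial) {
        total += p.value;  // combine after all threads have finished
    }
    return total;
}
```

The point of the design is that no cache line is written by more than one thread while the workers run, so lines do not have to migrate between cores until the final combine step.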
You were very helpful.
I think that L1/L2 is supposed to be shared by the threads of a process that run on the same core.
When using HyperThreading, L1 is supposed to be shared on a demand basis (the cache entries of one thread are tagged and hidden from the other). I suppose there should be some resolution locally in L2 of coherence of data used only by the local threads, but I haven't seen this discussed.
It was stated in some publication that usage of L2 was tuned for multithreaded processes. As for the move toward the L3 cache, I suppose it was probably dictated by multithreaded applications (processes) sharing the same L2, and by associativity issues.