What is the nature and performance impact of cache line contention really?
I realized that I don't fully understand the nature of the cache line contention performance problem, and I failed to find a solid paper on the subject. Maybe the experts in this forum can help.
Note, I'm not talking about the false-sharing case, which is trivial to discover, to estimate its impact, and to fix. Rather, I'm talking about a regular spin-lock mutex vs. a queuing spin-lock.
Consider the regular spin-lock case first. Say thread1 unlocks the mutex (its core issues an RFO to the other caches, takes the line in state E and modifies it). Then thread2 and thread3, which were spinning on the variable in this line, would see that their copy is Invalidated, would fetch the new line content and hold it as Shared. Now both of them want to enter the critical region and they know that the mutex is unlocked, so they'll both need to try to modify the mutex variable.
Where would the overall performance impact come from? Both caches (for the cores that run thread2 and thread3) would try to get the Exclusive state for the cache line? Do they waste cycles doing this, or is only one "E" allowed, so that the second one just sees it immediately and assumes an "I" for the line?
In some other discussions that I've read on the subject, I'm seeing that the performance problem comes from "line/snoop traffic". Is it a measurable value? In other words, if I know how many lock()-s/unlock()-s happened in one second and I know the average number of threads waiting on the mutex, would I be able to estimate this "traffic value"? It looks like I should be able to, at least for the MESI single-socket case, no?
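To make the scenario concrete, here is roughly the kind of regular spin-lock I have in mind. This is only a sketch under my own assumptions: the names SpinLock and run_spinlock_demo are made up for illustration, and the comments describe what I believe happens at the cache level, which is exactly the part I'd like confirmed or corrected.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical test-and-test-and-set spin-lock (illustrative names).
// All waiters spin on the SAME variable, hence on the same cache line.
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // Read-only spin: each waiter can keep the line in Shared state.
            while (locked_.load(std::memory_order_relaxed)) {
                // Yield keeps the demo usable with more threads than cores;
                // a production lock would more likely use a pause instruction.
                std::this_thread::yield();
            }
            // The lock looks free: the exchange needs exclusive ownership of
            // the line, so every waiter that gets here requests it.
            if (!locked_.exchange(true, std::memory_order_acquire)) {
                return;  // we won the race
            }
            // We lost: go back to the read-only spin.
        }
    }
    void unlock() {
        // The store invalidates the copies held by all spinning waiters.
        locked_.store(false, std::memory_order_release);
    }
};

// Hypothetical driver: each thread increments a shared counter under the lock.
long run_spinlock_demo(int threads, int iters) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // the critical region
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

With N waiters, my understanding is that every unlock() triggers N re-reads of the line plus up to N competing exchange attempts, and that is the "traffic" I am trying to quantify.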
Any help is greatly appreciated. And yes, I can be wrong about anything I wrote while describing the example, so don't hesitate to correct me or rewrite the whole scenario. Thanks!
P.S. I mentioned the queuing mutex and then never got back to it. The point was: when looking at the traffic, how different is the spin-lock traffic value from the queuing spin-lock traffic value?
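For comparison, by a queuing spin-lock I mean something along the lines of an MCS-style lock, where each waiter spins on a flag in its own queue node instead of on the shared lock word. The sketch below is an assumption about the general technique, not any particular library's implementation; QNode, QueueLock, run_queue_lock_demo and the 64-byte line size are all illustrative choices of mine.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// One node per waiting thread, padded so each node sits on its own cache
// line (64 bytes is an assumed line size).
struct alignas(64) QNode {
    std::atomic<QNode*> next{nullptr};
    std::atomic<bool> waiting{false};
};

// Hypothetical MCS-style queuing lock (illustrative names).
class QueueLock {
    std::atomic<QNode*> tail_{nullptr};
public:
    void lock(QNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.waiting.store(true, std::memory_order_relaxed);
        // Append ourselves to the queue: one atomic exchange on the shared tail.
        QNode* prev = tail_.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            prev->next.store(&me, std::memory_order_release);
            // Spin on OUR OWN node; only the predecessor ever writes it.
            while (me.waiting.load(std::memory_order_acquire)) {
                std::this_thread::yield();  // demo-friendly; see note below
            }
        }
    }
    void unlock(QNode& me) {
        QNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // Nobody visible behind us: try to reset the tail to "empty".
            QNode* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
                return;
            }
            // A successor is in the middle of enqueueing; wait for its link.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) {
                std::this_thread::yield();
            }
        }
        // Hand the lock over: this write touches only the successor's line.
        succ->waiting.store(false, std::memory_order_release);
    }
};

// Hypothetical driver: each thread increments a shared counter under the lock.
long run_queue_lock_demo(int threads, int iters) {
    QueueLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            QNode node;  // this thread's private queue node
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;  // the critical region
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The yield calls are there only so the demo behaves when there are more threads than cores. If my reading is right, an unlock() here should invalidate a single waiter's line rather than every waiter's copy, and that difference in traffic is what I am asking about.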
I'm not sure I understand your questions and scenario correctly. Here are my points:
1. Whichever you use (mutex, spin-lock, or queuing mutex), the cache access logic does not change. You can use VTune Amplifier XE's "Memory Access" analysis type to measure the L2/LLC cache miss count. Changing the algorithm or the data layout can help reduce cache misses.
2. However, the different implementations do affect performance. A mutex is a heavyweight lock object that costs more system resources and suits a large critical code area; a spin-lock consumes a lot of CPU time and suits protecting a critical section as short as one source line; a queuing mutex can reduce the CPU time consumed. I suggest that you use the tool's "Locks and Waits" analysis to compare the performance of the different implementations: look at the "wait time" and the "wait count" (the count should be the same).