I recently wrote a program with a lot of concurrent access (concurrent r/w and w/w) to common cache lines, on a machine with an Intel Nehalem processor. The concurrent access creates a lot of MOMC (memory order machine clear) overhead.
The "MOMC" event is described as: "The Hyper-Threading system must ensure data consistency across the pipeline. Due to potential out-of-order execution, modified shared variables can lead to accessing memory in an incorrect order. If that happens, the entire pipeline of the machine has to be cleared." It introduces an overhead of about 300-350 cycles per MOMC event.
So my question is: Is there any way to turn off (or screen) this kind of event, or is there any way to bypass it without changing the program structure (keeping the shared data structure)?
I guess the overhead has little to do with MOMC itself, but rather with cache coherence. Write sharing is extremely expensive and completely non-scalable on all modern commodity architectures. You can't eliminate the costs just by turning off HT.
I think the description of MOMC might be a bit incomplete; its author assumes the reader knows more than the description presents. I will try to make a blind assessment of the situation. There is a high likelihood my assessment is incorrect, but it may nonetheless lead you to a solution that reduces MOMC events in your code.
Depending on the processor....
HT siblings (SMT threads within a single core of a multi-core processor) share L1 and L2 cache lines. They also share L3, but together with the other cores in the processor. *** Some processors have multiple cores sharing L2 (Q6600). My suggestion here is to try to reduce the data shared by the HT siblings to the caches used exclusively by the core those siblings run on. For Nehalem this would be L1 and L2.
Note, this does not necessarily mean your working set is limited to the size of L2. In some cases it may mean that as one HT sibling "plows" through L3 and RAM, the other HT sibling must tag along, re-using that data while it remains in L1 or L2. Sometimes this may require each HT sibling to monitor the progress of the other and to restrict its own advancement should the other fall too far behind. Generally a fall-behind will seldom happen, because the second HT sibling has faster access to the preloaded L1/L2 cache data. However, a fall-behind can occur when the O/S preempts one of the HT sibling threads and/or one of them experiences communication overhead to other cores (Interlocked... or spin waits).
I have nothing concrete to support my assumptions other than working experience. If you look at my article http://www.quickthreadprogramming.com/Superscalar_programming_101_parts_1-5.pdf, which you can also read in parts on the ISN Parallel Programming Communities (click on the Communities link at the top of this page), you can observe performance chart data that seems to support my assertions.
In case a simplification helps: When one thread modifies part of a cache line which other threads are using, it may be necessary to write the entire modified cache line back to main memory and require all threads to get a fresh copy. This would be a MOMC event. In general, this is known as "false sharing." This article discusses false sharing at more length, with more explanation of what the programmer can do about it.
It does look like the official MOMC description has aged since the days of single-core HT CPUs. Each new architecture changes the issues associated with false sharing, both those involving HT and those that don't.
There are more common, milder forms of false sharing which don't trigger MOMC but can still kill performance. For example, the default setup of Intel CPUs enables alternate sector prefetch, where each access keeps a pair of cache lines up to date even when only one of them is currently in use. The preferred way to avoid this is to avoid writing data stored within 128 bytes of where another thread will be reading.

Referring back to the Q6600 mentioned by Jim and in that article, the effect is less serious when the cache line to be refreshed is needed only by another thread running on the same L2 cache. On Nehalem, any copy of this cache line in L1 or L2 anywhere on the system would be invalidated and refreshed immediately by each thread upon touching an entry in the other cache line of the alternate-sector pair. Needless to say, if the threads aren't on the same CPU, this involves going through main memory, though it doesn't stall everything as severely as MOMC.

In a well-behaved case, like Jim's matrix multiplication, you have an ideal opportunity to avoid false sharing and make the inner loop vectorizable, with each thread having exclusive use of the cache line it is modifying (except at the boundaries between data segments used by threads). By giving each thread a contiguous block of the array being modified, with n threads you have only n-1 boundaries where threads might modify the same cache line (if a boundary falls in the middle of a line).