Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Memory Order Machine Clear Detail Issues

daniel1103
Beginner

Hi, all

I have recently been looking into false sharing misses in multithreaded programs (e.g., SPLASH-2).
There was a thread (from 2004) in these forums discussing the "Memory Order Machine Clear" event.
I also read the "Reduce False Sharing in .NET*" article and the help documentation in VTune.

However, I couldn't figure out the meaning of "Memory Order Machine Clear"... :(
False sharing is independent data in the same cache line being used by many threads.
I only know that false sharing is caused by cache coherence...
Why do we say that "Memory Order Machine Clear" is related to false sharing?

In brief, my questions are:
1. Is "Memory Order Machine Clear" the same thing as a false sharing miss?
2. What is the meaning of "Memory Order Machine Clear"?

Could anyone answer my questions?
Thank you very much :)

Regards
Dennise

11 Replies
TimP
Honored Contributor III

The articles on Memory Order Machine Clear refer to the Intel NetBurst CPU architecture. Back then, this was a recommended VTune event for verification of performance problems related to false sharing.
More generally, if a thread updates a cached copy of a cache line (typically, a 64-byte section of the virtual memory address space), all other cached copies of that cache line become invalid. This impacts performance when it results in unexpected cache misses. So, you would be looking for cache misses impacting performance of repeated access to the same cache line, particularly those which are associated with what Intel calls HITM events (cache hits on a modified cache line).
We have had two generations of Intel CPUs since NetBurst, each attempting to improve on the cache coherency schemes. MOMC might have been an inefficient way to implement coherency; Core i7 doesn't necessarily require taking the time to wipe out the cache physically, as it allows for seeing that there is another, more up-to-date copy. Recent efforts in diagnosis of false sharing seem to center on analyzing memory locality, which is supported by PTU for Core i7 (see the WhatIf forum).
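To make the false sharing case concrete, here is a minimal C++ sketch (the struct and variable names, loop counts, and the 64-byte line size are illustrative assumptions, not anything taken from VTune or the articles mentioned above). Two threads increment two different counters: when the counters share a cache line, each write invalidates the other core's copy, while aligning each counter to its own line removes the physical sharing.

```cpp
// Minimal false-sharing sketch. Illustrative only: names and the 64-byte
// cache-line size are assumptions for the example.
#include <cstdint>
#include <cstdio>
#include <thread>

struct Counters {
    // Both fields normally land in the same 64-byte cache line, so each
    // increment by one thread invalidates the other core's cached copy.
    std::uint64_t a = 0;
    std::uint64_t b = 0;
};

struct PaddedCounters {
    // alignas(64) gives each field its own cache line, removing the physical
    // sharing even though the threads never touch the same variable.
    alignas(64) std::uint64_t a = 0;
    alignas(64) std::uint64_t b = 0;
};

template <typename T>
void run(T& c) {
    std::thread t1([&] { for (int i = 0; i < 100000000; ++i) ++c.a; });
    std::thread t2([&] { for (int i = 0; i < 100000000; ++i) ++c.b; });
    t1.join();
    t2.join();
}

int main() {
    Counters shared;        // false sharing: heavy cache-line ping-pong expected
    PaddedCounters padded;  // no sharing: each counter has its own line
    run(shared);
    run(padded);
    std::printf("%llu %llu %llu %llu\n",
                (unsigned long long)shared.a, (unsigned long long)shared.b,
                (unsigned long long)padded.a, (unsigned long long)padded.b);
}
```

The first case is roughly the situation that the MOMC (on NetBurst) and HITM events discussed above are meant to flag; the padded case should show neither.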
daniel1103
Beginner
Thank you very much for your reply :)

So, is "Memory Order Machine Clear" related to both false sharing misses and true sharing misses?
Memory Order Machine Clear = False Sharing miss + True Sharing miss?

If that is correct, then I could roughly estimate the false sharing misses in the benchmark if I know the true sharing misses.

Daniel



TimP
Honored Contributor III
MOMC, on those CPUs which are no longer in production, would confirm cache misses in the same loop as the false sharing. If I understand your meaning of true sharing miss, it would not include cache misses generated by hardware prefetch in the case where those cache lines end up unused. You also have the opportunity to measure all cache misses (including duplicates) and cache misses retired, which seem to relate more directly to the true and false misses you are discussing.
Dmitry_Vyukov
Valued Contributor I
From the name and brief description, I understand MOMC as follows: a core executes a load operation, but before retirement of the load another core changes the cache line where the value resides. So the core has to flush the whole pipeline and re-execute all instructions.

"False sharing" is in most cases a wrong and misleading term. Processors really have no means to distinguish false sharing from true sharing, penalizing the former while leaving the latter unpenalized. In most cases false sharing and true sharing have exactly the same consequences. In some cases it's even difficult to say whether something is false sharing or true sharing. Consider an array of elements, where some threads iterate over the whole array and some threads update individual elements. If we place several elements into the same cache line, is that false sharing?

So what you must try to eliminate is just sharing (not at the source code level, but at the physical level).
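Here is a rough C++ sketch of that array scenario (the element type, sizes, and function names are assumptions made up for illustration). With 4-byte atomic elements, sixteen of them typically share one 64-byte line, so a single writer invalidates data the scanning readers genuinely need as well as neighbours that merely happen to share the line, and the false/true distinction blurs.

```cpp
// Readers scan the whole array while writers update individual elements.
// Whether the resulting cache-line traffic counts as "false" or "true"
// sharing depends on which bytes of the line each thread actually needed.
#include <atomic>
#include <cstddef>

constexpr std::size_t kElems = 1024;
std::atomic<int> elems[kElems];   // roughly 16 elements per 64-byte cache line

long reader_scans_whole_array() {
    long sum = 0;
    for (std::size_t i = 0; i < kElems; ++i)
        sum += elems[i].load(std::memory_order_relaxed);
    return sum;
}

void writer_updates_one_element(std::size_t i) {
    elems[i].fetch_add(1, std::memory_order_relaxed);
}
```

Padding each element out to its own cache line would remove the physical sharing for the writers, but at roughly a 16x memory cost and at the expense of the readers' locality, which is the kind of trade-off that eliminating sharing at the physical level forces you to weigh.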

Dmitry_Vyukov
Valued Contributor I
Quoting - daniel1103
If that is correct, then I could roughly estimate the false sharing misses in the benchmark if I know the true sharing misses.


I am curious as to how you are going to measure true sharing misses?

mattb348
Beginner
Quoting - Dmitriy Vyukov

I am curious as to how you are going to measure true sharing misses?

I was actually wondering the same thing myself; care to comment on this?

- Matt
TimP
Honored Contributor III
Do you mean race detection? Intel Thread Checker is intended to find potential race conditions. So, if you found cache line sharing stalls matching a known race condition, you would have confirmation that it is both a danger and a performance issue.
The memory analysis in PTU also should assist diagnosis of multiple threads hitting the same address or the same cache line.
Dmitry_Vyukov
Valued Contributor I
Quoting - tim18
Do you mean race detection? Intel Thread Checker is intended to find potential race conditions. So, if you found cache line sharing stalls matching a known race condition, you would have confirmation that it is both a danger and a performance issue.
The memory analysis in PTU also should assist diagnosis of multiple threads hitting the same address or the same cache line.

I don't mean race detection. I mean true sharing, i.e. sharing of data, but not false sharing.
TimP
Honored Contributor III
Quoting - Dmitriy Vyukov

I don't mean race detection. I mean true sharing, i.e. sharing of data, but not false sharing.
So you want to see VTune events for read-only shared data, as opposed to recently modified data for which HITM events would occur? PTU memory locality, for Core i7, seems to fit, but I'm not seeing your goals.
Dmitry_Vyukov
Valued Contributor I
Quoting - tim18
So you want to see VTune events for read-only shared data, as opposed to recently modified data for which HITM events would occur? PTU memory locality, for Core i7, seems to fit, but I'm not seeing your goals.

Why only read-only? Isn't a plain mutex read-only true sharing?
Just to make it explicit, I am only curious how it's possible to estimate false sharing as false_sharing = total_sharing - true_sharing. I can imagine how to measure total_sharing, for example as the total number of cache line transfers between cores. But how does one measure true_sharing?
You've provided one hook: the memory analysis in PTU may show that threads access the same cache line but different addresses. However, I don't think that is a correct gauge of false sharing. Consider a data structure protected by a mutex, both contained in the same cache line. While the current owner of the mutex works with the data structure, other threads access the mutex. That is definitely not false sharing.
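A small C++ sketch of that mutex example (layout and names are illustrative assumptions; on typical implementations the lock word and the protected value fit within one 64-byte line). An address-based tool would see the waiters and the owner touching different addresses in the same line and report it as false sharing, yet the traffic is inherent to the locking protocol rather than an accidental layout problem.

```cpp
#include <cstdint>
#include <mutex>

struct alignas(64) Protected {
    std::mutex    lock;       // touched by every thread trying to acquire it
    std::uint64_t value = 0;  // touched only by the current lock owner
};

void owner_updates(Protected& p) {
    std::lock_guard<std::mutex> g(p.lock);
    ++p.value;                              // owner writes the data...
}

void waiter_spins(Protected& p) {
    while (!p.lock.try_lock()) {
        // ...while waiters keep hitting the lock word in the same cache line
    }
    p.lock.unlock();
}
```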
