Hello,
I wrote some code with OpenMP. I use the critical construct so that a specific storage location will not be updated simultaneously by more than one thread.
I run my code on a Supermicro X7QC3 server board with 4 quad-core Intel Xeon E7320 processors. The OS is Linux and the compiler is gcc 4.3.0.
The main part of my code is as follows:
#pragma omp parallel for private(j, k, s, e)
for ( i = 0; i < N; i++)
{
    ...
    for ( k = s ; k < e; k++)
    {
        j = n;
        ...
        ...
        #pragma omp critical
        {
            m[j] += ... ;
        }
    }
}
But one of my friends says: "On Intel platforms, the hardware maintains memory coherence. You do not need atomic or critical in the code."
So my question is: should I use the critical construct for the reduction on the array m[] in my code?
Thanks!
There's a difference between memory coherence and data races. Intel processors do have a cache coherence protocol that assures predictable updates of memory, but it can't protect programmers from their own coding errors.
In the example you provided, the omp parallel for will partition its worker threads across the span of i, so in the innermost block you could probably get away without the critical section for the references to m[]: each worker thread would operate on its own chunk of m[], and while there may be a little cache-line ping-ponging where the thread partition boundaries don't line up with the cache line boundaries, there should be little problem.
However, the critical section also modifies m[j], and if different threads can compute the same index j, those updates still need to be protected.
I compile my code with only the -O3 option.
Thanks!
Hi Robert Reed (Intel), thank you very much!
More than one thread will update the same element of m[].
But I do not really understand the cache coherence protocol. I only know that Intel uses the MESI protocol, which maintains cache coherence: if a piece of data has been updated by core A, then core B's copy of that data will be notified (invalidated).
Is it that MESI can't work across multiple sockets?
Thanks!
Hi,
If you are sure that multiple threads will access the same memory location through m[j], then the update has to be synchronized.
While both atomic and critical can be used for mutual exclusion, there are key differences: atomic protects only a single scalar update and can often be mapped to one hardware read-modify-write instruction, whereas critical serializes an arbitrary block of code through a runtime lock and is therefore considerably more expensive.
So, my advice would be to use atomic instead of critical.
Cheers
-michael
But, I do not really understand the cache coherence protocol. I only know that Intel uses the MESI protocol, which maintains cache coherence: if a piece of data has been updated by core A, then core B's copy will be notified.
Is it that MESI can't work across multiple sockets?
MESI is designed to handle multiple memory masters, but rather than me regurgitating details on MESI, I recommend you read the Wikipedia article. Michael has given you good advice on critical versus atomic.
Consider changing

#pragma omp critical
{
    m[j] += ... ;
}

To:

#pragma omp atomic
m[j] += ... ;
or something like this:
#pragma omp parallel private(j, k, s, e)
{
    mType* _m = (mType*)_alloca(sizeof(mType)*sizeFor_m); // must fit on stack
    for (int q = 0; q < sizeFor_m; ++q)
        _m[q] = 0;
    #pragma omp for
    for ( i = 0; i < N; i++)
    {
        ...
        for ( k = s ; k < e; k++)
        {
            j = n;
            ...
            ...
            _m[j] += ... ;
        }
    }
    #pragma omp critical
    {
        for (int q = 0; q < sizeFor_m; ++q)
            m[q] += _m[q];
    }
}
Jim Dempsey
I see what you mean. Thank you very much!
It's a good idea to use a private array.
Thank you very much!
zhouyi1999
Hi Robert and Michael,
Though my question is not related to this post, I thought you guys might be in a good position to help me.
I am a graduate student at UT, Austin currently working on cache algorithms in multicore architecture. I am in need of a few data points from Intel for my paper. Could you please help me with these questions?
1. Does Intel ensure inclusion property at L2 and L3 levels of caches in its latest multicore processors?
2. How does Intel solve the cache coherency problem? I have read about the QuickPath technology, but I am not sure how exactly it helps.
Thanks,
Anil.
The last level cache (L3) in the Intel Core i7 is inclusive, as you might have been able to glean from a quick scan of Part 1 of the System Programming Guide (available here):
Intel Architecture uses the MESI protocol to ensure cache coherency, which is true whether you're on one of the older processors that use a common bus to communicate or using the new Intel QuickPath point-to-point interconnection technology. (The SPG also has a section on the MESI protocol as implemented in IA.)
Thank you Robert,
The information you provided was very valuable.
Do you have any insights on the implementation of the inclusion property? Or can you point me to resources which talk about this?
As I understand it, block requests which miss in L1 go as requests to L2, and those which miss in L2 go as requests to L3. Is my understanding correct w.r.t. Intel architectures (say, Nehalem)?
L3 is made inclusive in order to prevent L1 and L2 from wasting resources snooping main memory, right? When the MESIF protocol takes care of coherency, is there any need for snooping at any level of cache?
If someone writes into a location in main memory, it should be L3 that sees it, right? In that case, why should L3 snoop?
Thanks Again Robert!
- Anil.
Hi Robert,
I really appreciate your response. I am getting a clear picture now. I had created a new thread for my questions here http://software.intel.com/en-us/forums/showthread.php?t=72135 but I couldn't get any response there. I guessed that you guys would have subscribed to this thread, and you seemed super resourceful. That made me intrude on this thread :)
Anyways, I'll post my response to your reply at the other location. Please help me out with a few more details.
Thanks,
Anil.