Hello,
I wrote some code with OpenMP. I use the critical construct so that a specific storage location will not be updated simultaneously by more than one thread.
I run my code on a Supermicro X7QC3 server board with 4 quad-core Intel Xeon E7320 processors. The OS is Linux and the compiler is gcc 4.3.0.
The main part of my code is as follows:
#pragma omp parallel for private(j, k, s, e)
for ( i = 0; i < N; i++)
{
    ...
    for ( k = s ; k < e; k++)
    {
        j = n;
        ...
        ...
        #pragma omp critical
        {
            m[j] += ... ;
        }
    }
}
But one of my friends says: "On Intel platforms, the hardware maintains memory coherence. You do not need atomic or critical in the code."
So my question is: should I use the critical construct for the reduction on the array m[] in my code?
Thanks!
There's a difference between memory coherence and data races. Intel processors do have a cache coherence protocol that assures predictable updates of memory, but it can't protect programmers from their own coding errors.
In the example you provided, the omp parallel for will partition its worker threads across the span of i, so in the innermost block you could probably get away without the critical section for the references to m[]: each worker thread would operate on its own chunk of m[], and while there may be a little cache-line ping-ponging where the thread partition boundaries don't line up with the cache line boundaries, there should be little problem.
However, the critical section also modifies m[j], and if different threads can compute the same index j, those updates still need to be protected.
I compile my code with only the -O3 option.
Thanks!
Hi Robert Reed (Intel), thank you very much!
More than one thread will update the same element of m[].
But I do not really understand the cache coherence protocol. I only know that Intel uses the MESI protocol, which maintains cache coherence: if a piece of data has been updated by core A, then core B's copy of that data will be notified (invalidated).
Is it that MESI can't work across multiple sockets?
Thanks!
Hi,
If you are sure that multiple threads will access the same memory location through m[j], then the update has to be synchronized.
While both atomic and critical can be used for mutual exclusion, there are key differences: atomic protects only a single scalar update and can often be mapped to one hardware read-modify-write instruction, whereas critical serializes an arbitrary block of code through a runtime lock and is therefore considerably more expensive.
So, my advice would be to use atomic instead of critical.
Cheers
-michael
But, I do not really understand the cache coherence protocol. I only know that Intel uses the MESI protocol, which maintains cache coherence: if a piece of data has been updated by core A, then core B's copy will be notified.
Is it that MESI can't work across multiple sockets?
MESI is designed to handle multiple memory masters, but rather than me regurgitating details on MESI, I recommend you read the Wikipedia article. Michael has given you good advice on critical versus atomic.
Consider changing

#pragma omp critical
{
    m[j] += ... ;
}

To:

#pragma omp atomic
m[j] += ... ;
or something like this:
#pragma omp parallel private(j, k, s, e)
{
    mType* _m = (mType*)_alloca(sizeof(mType)*sizeFor_m); // must fit on stack
    for (int q = 0; q < sizeFor_m; ++q)
        _m[q] = 0;
    #pragma omp for
    for ( i = 0; i < N; i++)
    {
        ...
        for ( k = s ; k < e; k++)
        {
            j = n;
            ...
            ...
            _m[j] += ... ;
        }
    }
    #pragma omp critical
    {
        for (int q = 0; q < sizeFor_m; ++q)
            m[q] += _m[q];
    }
}
Jim Dempsey
I see what you mean. Thank you very much!
It's a good idea to use a private array.
Thank you very much!
zhouyi1999
Hi Robert and Michael,
Though my question is not related to this post, I thought you guys might be in a good position to help me.
I am a graduate student at UT, Austin currently working on cache algorithms in multicore architecture. I am in need of a few data points from Intel for my paper. Could you please help me with these questions?
1. Does Intel ensure inclusion property at L2 and L3 levels of caches in its latest multicore processors?
2. How does Intel solve the cache coherency problem? I have read about the QuickPath technology, but I am not sure how exactly it helps.
Thanks,
Anil.
The last level cache (L3) in the Intel Core i7 is inclusive, as you might have been able to glean from a quick scan of Part 1 of the System Programming Guide (available here):
Intel Architecture uses the MESI protocol to ensure cache coherency, which is true whether you're on one of the older processors that use a common bus to communicate or using the new Intel QuickPath point-to-point interconnection technology. (The SPG also has a section on the MESI protocol as implemented in IA.)
Thank you Robert,
The information you provided was very valuable.
Do you have any insights on the implementation of the inclusion property? Or can you point me to resources which talk about this?
As I understand it, block requests which miss in L1 go as requests to L2, and those which miss in L2 go as requests to L3. Is my understanding correct w.r.t. Intel architectures (say, Nehalem)?
L3 is made inclusive in order to prevent L1 and L2 from wasting resources snooping main memory, right? When the MESIF protocol takes care of coherency, is there any need for snooping at any level of cache?
If someone writes into a location in main memory, it should be L3 that sees it, right? In that case, why should L3 snoop?
Thanks Again Robert!
- Anil.
Hi Robert,
I really appreciate your response. I am getting a clear picture now. I had created a new thread for my questions here http://software.intel.com/en-us/forums/showthread.php?t=72135 but I couldn't get any response there. I guessed that you guys would have subscribed to this thread, and you seemed super resourceful. That made me intrude on this thread :)
Anyways, I'll post my response to your reply at the other location. Please help me out with a few more details.
Thanks,
Anil.