Data race problem

Anupam_Dev1 · ‎04-10-2010

#include <omp.h>
#include <stdio.h>
#include <conio.h>

int main()
{
    omp_set_num_threads(2);
    int i=0;
    int a=0;
    #pragma omp parallel private(i) shared(a)
    {
        #pragma omp for
        for(i=0;i<900000000;i++)
        {
            a++;
        }
    }
    printf("%d",a);
    getch();
}
The result should be 900000000, but every time I run this program I get a different result. When I checked with the Intel Parallel Inspector, it reported a data race at a++. Please help.

When the loop condition is a small integer, I get the correct result.
Any comments will be appreciated.

Dmitry_Vyukov · ‎04-10-2010

Well, yes, it's a plain data race on the variable a: two threads access it without any synchronization or mutual exclusion, which is prohibited by POSIX/Win32/C1x/C++0x.

The following won't scale, but it will do the job:

#pragma omp for
for(i=0;i<900000000;i++)
{
#pragma omp atomic
a++;
}
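
For readers wondering why the unsynchronized a++ loses counts: an increment is typically carried out as a separate load, add, and store, so two threads can both read the same old value. The sketch below replays one such interleaving serially, with no real threads. The function name lost_update_demo and the temporaries t1 and t2 are made up for illustration, and since a real data race is undefined behavior, actual runs can misbehave in other ways too.

```c
/* Illustrative only: replays ONE possible interleaving of two threads
   that each execute a++ on a shared counter. No real threads are used. */
int lost_update_demo(void)
{
    int a = 0;      /* shared counter */

    int t1 = a;     /* "thread 1" loads a (reads 0) */
    int t2 = a;     /* "thread 2" loads a before thread 1 stores (reads 0) */

    t1 = t1 + 1;    /* thread 1 computes 1 */
    t2 = t2 + 1;    /* thread 2 also computes 1 */

    a = t1;         /* thread 1 stores 1 */
    a = t2;         /* thread 2 stores 1, overwriting thread 1's update */

    return a;       /* 1, although two increments were performed */
}
```

Two increments leave a at 1 instead of 2, which is the kind of lost update that makes the original total fall short of 900000000.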

Dmitry_Vyukov · ‎04-10-2010

And regarding this point:
> when the loop condition is a small integer, i get correct result.

When the loop count is small, the first thread finishes its work before the second thread gets started, so the 2 threads do not actually run concurrently and the race goes away (of course, you can still hit the race on any given run, depending on how the stars align).

jimdempseyatthecove · ‎04-11-2010

Consider using the reduction clause

#include <omp.h>
#include <stdio.h>
#include <conio.h>

int main()
{
    omp_set_num_threads(2);
    int i=0;
    int a=0;
    #pragma omp parallel private(i) reduction(+:a)
    {
        #pragma omp for
        for(i = 0; i < 900000000; i++)
        {
            a++;
        }
    }
    printf("%d",a);
    getch();
}

Using reduction, each thread receives a local copy of the specified variable, initialized to 0. The reduction clause contains an operator, in this case "+", that is used to join the local copies together on exit from the parallel region (subregion), the result being placed into the variable visible to the master thread of the region containing the reduction clause (IOW, in the scope immediately outside the parallel region containing the reduction clause).

In the above example, the initialization of a=0 is not required due to each thread initializing its local copy of a to 0 at start of parallel region with reduction on a, however it makes clear your intentions.

** note ** in the above program sequence, should both (all) threads be perfectly synchronized, then no thread will observe variable a exceeding (900000000/nThreads)+1. Upon exit of the parallel region, the reduction operator, + in this case, is applied to the thread local copies of a, to produce the proper sum in the outer scope a.

With the reduction clause and 2 threads, your program will incur the overhead of 2 atomic additions, whereas with the explicit #pragma omp atomic statement your program will incur the overhead of 900000000 atomic additions.

Use #pragma omp atomic when you have few such additions - or - when all threads need to keep track of the sum total.

Jim Dempsey
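
As a rough sketch of what the reduction clause does conceptually (not how any particular OpenMP runtime implements it), the hand-written version below gives each thread its own partial count and merges it into the shared total once per thread. The function name count_with_manual_reduction and the variable local are hypothetical, and the num_threads(2) clause stands in for the omp_set_num_threads(2) call used in the thread.

```c
/* Hypothetical helper: counts to n roughly the way reduction(+:a) does,
   written out by hand. Build with OpenMP enabled (e.g. -fopenmp on GCC);
   without it the pragmas are ignored and the code simply runs serially. */
long count_with_manual_reduction(long n)
{
    long a = 0;                 /* shared total, initialized before the region */

    #pragma omp parallel num_threads(2)
    {
        long local = 0;         /* per-thread partial count, starts at 0 */
        long i;                 /* declared inside the region, so private */

        #pragma omp for
        for (i = 0; i < n; i++)
            local++;            /* no synchronization needed: local is private */

        #pragma omp atomic
        a += local;             /* one synchronized update per thread */
    }

    return a;                   /* a == n once all threads have merged */
}
```

With 2 threads, count_with_manual_reduction(900000000) performs one atomic addition per thread, compared with one per iteration in the #pragma omp atomic version.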

Grant_H_Intel · ‎04-12-2010

Quoting jimdempseyatthecove

> Consider using the reduction clause
>
> #include <omp.h>
> #include <stdio.h>
> #include <conio.h>
>
> int main()
> {
>     omp_set_num_threads(2);
>     int i=0;
>     int a=0;
>     #pragma omp parallel private(i) reduction(+:a)
>     {
>         #pragma omp for
>         for(i = 0; i < 900000000; i++)
>         {
>             a++;
>         }
>     }
>     printf("%d",a);
>     getch();
> }
>
> ...
>
> In the above example, the initialization of a=0 is not required due to each thread initializing its local copy of a to 0 at start of parallel region with reduction on a, however it makes clear your intentions.
>
> ** note ** in the above program sequence, should both (all) threads be perfectly synchronized, then no thread will observe variable a exceeding (900000000/nThreads)+1. Upon exit of the parallel region, the reduction operator, + in this case, is applied to the thread local copies of a, to produce the proper sum in the outer scope a.
>
> ...
> Jim Dempsey

I must disagree with Jim on one point: It is absolutely necessary to initialize the outer variable "a" before the parallel region. The reason is actually given in Jim's post above. The private copies of "a" (which are automatically initialized to zero) will all be added to the outer variable "a" at the end of the parallel region. If the outer variable "a" is uninitialized, the result value of "a" will be undefined.

A program with slightly better style would be the following:

#include <omp.h>
#include <stdio.h>
#include <conio.h>

int main()
{
    omp_set_num_threads(2);
    int i=0;
    int a=0;

    #pragma omp parallel for private(i) reduction(+:a)
    for(i = 0; i < 900000000; i++)
    {
        a++;
    }
    printf("%d",a);
    getch();
}

Note that here the parallel and for directives are combined, since there is no code between them. The reduction works the same way in this case, but the implicit barrier at the end of the for loop has a better chance of being optimized away, since with this OpenMP pragma it is necessarily followed immediately by the parallel region's join barrier.

jimdempseyatthecove · ‎04-12-2010

Grant,

If your statement is true (that the out-of-scope variable being reduced must be initialized), then there is a bug in the OpenMP implementation. Consider the following:

Should the master thread directly use the variable a from the outer scope (as opposed to a local copy of a, as the other team members do), then the master thread's updates of a can get boinked by other threads finishing first (they will perform an atomic reduction +, while the master is performing a non-atomic ++). Therefore, to prevent this, the master thread must also use a local copy of the variable(s) to be reduced. Since these are local variables, the auto-init to 0 applies.

When the variable is NOT a reduction variable, AND when the variable is PRIVATE, then the outer scope variable is used by the master thread, and private copies are used by other threads. The other threads receive either junk or out of scope value for a should COPYIN be in effect.

Jim

Grant_H_Intel · ‎04-12-2010

Jim,

I never said the master thread doesn't get its own private (thread-specific) copy of a. What I said is that the final step of the reduction must update the shared (outer scope) copy of a. The implementation must do this not by copying the summed value of all private copies of a, but by adding the final sum of the private copies of a to the original value of the shared copy of a. So if there are n threads, there are n+1 copies of a. If the code does not initialize the shared copy of a explicitly, then it may magically end up getting zero-initialized, but that is not guaranteed by the base language specifications. In general, the program has unspecified behavior if the outer scope copy of a is not initialized, added to, and then printed.

This method for reduction allows the outer copy of a to maintain a "running accumulation" if there were more than one loop or parallel region involved. The kind of reduction where the outer copy of a just gets the final value of the code inside the reduction construct (omp for loop) is functionally inferior, and has never been in any of the OpenMP API specifications. From OpenMP v3.0 API, Section 2.9.3.6, reduction clause, p. 98:

"A private copy of each list item is created, one for each implicit task, as if the private clause had been used. The private copy is then initialized to the initialization value for the operator, as specified above. At the end of the region for which the reduction clause was specified, the original list item is updated by combining its original value with the final value of each of the private copies, using the operator specified. (The partial results of a subtraction reduction are added to form the final value.)"

The reason I know this for sure is that I've been on every OpenMP language committee since the very first one was convened (in 1996). Furthermore, I've implemented OpenMP reductions in the KAI (Kuck and Associates, Inc.) KAP/Pro Toolset and Intel Compiler run-time libraries. This is the first time that anyone has claimed our implementation of reduction is broken in such a fundamental way.

- Grant
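
The "running accumulation" behavior described above can be sketched as follows. This is a minimal illustration, assuming a conforming OpenMP compiler; the function name accumulate_over_two_loops and its parameters start and n are hypothetical and do not come from the thread.

```c
/* Hypothetical helper: shows that reduction(+:a) combines the private
   partial sums with the ORIGINAL value of a, so a keeps accumulating
   across successive parallel loops. Build with OpenMP enabled
   (e.g. -fopenmp on GCC); without it the loops simply run serially. */
int accumulate_over_two_loops(int start, int n)
{
    int a = start;      /* outer (shared) copy, explicitly initialized */
    int i;

    #pragma omp parallel for reduction(+:a) num_threads(2)
    for (i = 0; i < n; i++)
        a++;            /* each thread increments its private copy, which starts at 0 */
    /* after the loop: a == start + n */

    #pragma omp parallel for reduction(+:a) num_threads(2)
    for (i = 0; i < n; i++)
        a++;
    /* after the loop: a == start + 2*n */

    return a;
}
```

For example, accumulate_over_two_loops(100, 1000) returns 2100: the starting value 100 is preserved and each loop adds 1000 to it. If the outer a were left uninitialized, the same arithmetic would be applied to an indeterminate starting value.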

jimdempseyatthecove · ‎04-12-2010

Thanks for clarifying this. The documentation (in particular IVF) needs to explain this more clearly. On re-reading the IVF documentation, I see that a large table is placed in the middle of the paragraph explaining this, separating the interior initialization

{big table here}

from the application of the reduction operator against the original value.

Jim