Roundoff error in OpenMP

Bon · ‎04-18-2009

Hello, I'm new in OpenMP. When studying the OpenMP, I got the different result in the following code:

C
C --- Matrix - Vector multiplication
C Sparse storage(CSR)
C
implicit real*8(a-h,o-z)

.............

!$OMP Parallel
!$OMP Do
do 100 i = 1, n
temp = 0.0d0
do 50 j = ix(i), ix(i+1) - 1
temp = temp + v(i) + ir(x(i))
50continue
y(i) = temp
100 continue

When run with OpenMP parallel, the resultis different from sequential code.

Testing system is
Window XP/Professional 64 bit, Intel fortran compiler 10.x, MicroSoft Visual Studio 2005, Intel Xeon 3.8GHz

Compiler Option

Fortran -> Optimization -> Paralleization -> No
-> Preprocessor -> OpenMP Conditional Compilation -> Yes
-> Process OpenMP Directives -> Generate Parallel Code
-> Data -> Default Double Precision KIND -> 8
Others -> Default

I want to get the answer; How to change the Compiler Options to get the same value.
I read the same question, but I don't understand the answer. ( i'm novice )

thanks in advance

Bon, Koo

jimdempseyatthecove · ‎04-18-2009

Bon, Koo

The variable temp is being used as read/modify/write by multiple threads.
Same with the variable j.
Make temp and jprivate to each thread.
Note, variable i is implicitly private due to it being the indexing variable of a OMP DO

Use
!$OMP Parallel private(temp, j)
!$OMP Do
do 100 i = 1, n
temp = 0.0d0
do 50 j = ix(i), ix(i+1) - 1
temp = temp + v(i) + ir(x(i))
50continue
y(i) = temp
100 continue

Jim Dempsey

TimP · ‎04-18-2009

Quoting - jimdempseyatthecove

The variable temp is being used as read/modify/write by multiple threads.
Same with the variable j.
Make temp and jprivate to each thread.
Note, variable i is implicitly private due to it being the indexing variable of a OMP DO

Use
!$OMP Parallel private(temp, j)
!$OMP Do
do 100 i = 1, n
temp = 0.0d0
do 50 j = ix(i), ix(i+1) - 1
temp = temp + v(i) + ir(x(i))
50continue
y(i) = temp
100 continue

j is private by default, according to the rules of OpenMP for Fortran. This is a significant difference from the C rules. It doesn't seem clear whether default(none) would change this, so specifying it is a good recommendation.
It is extremely important to specify temp as a private so as to permit optimization as well as avoiding unpredictable results.
If the work isn't balanced among threads, schedule(guided) may be useful in such situations, after the correctness issues are dealt with.
Roundoff error may be controlled, and performance improved, if the order of addition is specified:
temp = temp + (v(i) + ir(x(i)))

ifort would require the option /assume:protect_parens to require following the Fortran standard here.
There could still be variations in roundoff error, for example between SSE2 and SSE4 code, if the SSE4 code optimized by increasing the number of individual sums in the inner loop.

robert-reed · ‎04-18-2009

Quoting - tim18

j is private by default, according to the rules of OpenMP for Fortran. This is a significant difference from the C rules.

Not sure where this bit of wisdom is coming from, but it's not correct. The OpenMP 3.0 standard describes the syntax of the Loop construct for C/C++ and defines veryspecific forms allowed for the associated for-loop. Part of that specification describes the loop inference variable thusly:

var One of the following
A variable of a signed or unsigned integer type.
For C++, a variable of a random access iterator type.
For C, a variable of a pointer type.
If this variable would otherwise be shared, it is implicitly made private in the loop construct....

Thus the OpenMP behavior for the loop-construct inference variableis the same, whether Fortran or C/C++.

TimP · ‎04-18-2009

The parallelized loop variable defaults to private for both C and Fortran. As far as I can see, that is what the reference in the OpenMP 3.0 standard is discussing. No publicized change has been made in OpenMP 3.0 to the default for inner loops inside a parallized loop. Those did not have a default private in C, in OpenMP 2.0, accounting for Jim's statement. If OpenMP 3.0 has made them default private, that would be interesting news.
If I read Robert's comment correctly, he disagrees with both Jim and me about the C defaults, but agrees with me about the Fortran defaults.
The textbook "Using OpenMP," by Chapman, Jost, and van der Paas, written less than a year ago, states that it is based on OpenMP 2.5. It appears to recommend not relying on the default private for the parallel loop iteration parameter, but doesn't explain to what extent default(none) may influence the situation. It mentions 2.5 features briefly, without mentioning the lack of implementation in compilers for x86. It further inclines me to second Jim's recommendation, as the distinction I ponted out between Fortran and C is not well explained.
I haven't seen any discussion of why the inner loop parameters were treated differently in Fortran and C, but one evident difference is the definition of Fortran, where the loop count is calculated before the loop starts, while C or C++ compilers typically look for tinkering with the loop count inside the loop before proceeding with optimization.
C99 has some convenient features for OpenMP, but those still aren't available in all C compilers, and they don't change the situation about permitting dynamic alteration of loop count, so I don't see why OpenMP would have changed the relative treatment of Fortran and C, without publicizing any such change.
It would also be interesting to know which level of OpenMP is implemented in Intel compilers, since there seems to be priority on 3.0 features, while adopting a wait-and-see attitude about 2.5. gfortran 4.5 is trying to implement OpenMP 2.5, so maybe the wait-and-seers will have something to see.
The marketing guys think Intel compilers should gain a competitive advantage by implementing 3.0 features without 2.5, but it seems reasonable to expect most people to try to stay with 2.0 until 2.5 is generally implemented. If 3.0 has made a change in the defaults, intentionally or not, that introduces a potential additonal level of confusion.

Bon · ‎04-19-2009

Quoting - tim18

The parallelized loop variable defaults to private for both C and Fortran. As far as I can see, that is what the reference in the OpenMP 3.0 standard is discussing. No publicized change has been made in OpenMP 3.0 to the default for inner loops inside a parallized loop. Those did not have a default private in C, in OpenMP 2.0, accounting for Jim's statement. If OpenMP 3.0 has made them default private, that would be interesting news.
If I read Robert's comment correctly, he disagrees with both Jim and me about the C defaults, but agrees with me about the Fortran defaults.
The textbook "Using OpenMP," by Chapman, Jost, and van der Paas, written less than a year ago, states that it is based on OpenMP 2.5. It appears to recommend not relying on the default private for the parallel loop iteration parameter, but doesn't explain to what extent default(none) may influence the situation. It mentions 2.5 features briefly, without mentioning the lack of implementation in compilers for x86. It further inclines me to second Jim's recommendation, as the distinction I ponted out between Fortran and C is not well explained.
I haven't seen any discussion of why the inner loop parameters were treated differently in Fortran and C, but one evident difference is the definition of Fortran, where the loop count is calculated before the loop starts, while C or C++ compilers typically look for tinkering with the loop count inside the loop before proceeding with optimization.
C99 has some convenient features for OpenMP, but those still aren't available in all C compilers, and they don't change the situation about permitting dynamic alteration of loop count, so I don't see why OpenMP would have changed the relative treatment of Fortran and C, without publicizing any such change.
It would also be interesting to know which level of OpenMP is implemented in Intel compilers, since there seems to be priority on 3.0 features, while adopting a wait-and-see attitude about 2.5. gfortran 4.5 is trying to implement OpenMP 2.5, so maybe the wait-and-seers will have something to see.
The marketing guys think Intel compilers should gain a competitive advantage by implementing 3.0 features without 2.5, but it seems reasonable to expect most people to try to stay with 2.0 until 2.5 is generally implemented. If 3.0 has made a change in the defaults, intentionally or not, that introduces a potential additonal level of confusion.

Thank you everyone!!! There's some error in question. Please see the code
C
C --- Matrix - Vector multiplication
C Sparse storage(CSR)
C
implicit real*8(a-h,o-z)
real*8 ir(*)
integer x(*)
.............
C--- OpenMP
!$OMP Parallel private(temp, is, ie ) !!!! variable 'j' -> private as default in fortran
!$OMP Do
do 100 i = 1, n
temp = 0.0d0
is = ix(i)
ie = ix(i+1) - 1
do 50 j = is, ie
temp = temp + v(i)* ir(x(i)) !!!! Error in question
50 continue
y(i) = temp
100 continue
!$OMP End parallel
C --- Serial Code
do 200 i = 1, n
temp = 0.0d0
is = ix(i)
ie = ix(i+1) - 1
do 150 j = is, ie
temp = temp + v(i) * ir(x(i))
150 continue
y1(i) = temp
200 continue
C
s = 0.0d0
t = 0.0d0
do 300 i = 1, n
s = s + y(i)
t = t + y1(i)
300 continue
c
print *,' s - t ',s-t

After execution, 's-t' have a order of value about 1.0d-13 ~ 1.0d-21
and, sometimes s-t equals to zero.

Change of options,

Optimization -> Use intel Processor Extensions -> Intel Pentium 4 Processor with SSE3 ( /Qaxp )

But i got the same result!!!!

I want to know how to fix this error !!!!!

jimdempseyatthecove · ‎04-20-2009

The inner loop (do 50) is NOT an OpenMPparallel DO. Therefore its indexing variable J has default characteristics. For Fortran default variable attribute is SHARED. You must makeJ PRIVATE. Place the private on the !$OMP PARALLEL. That is, add J to !$OMP Parallel private(temp, is, ie, j)

Note, your outer loop

!$OMP Do
do 100 i = 1, n

IS an OpenMPparallel DO. Therefore its indexing variableI has default characteristics of PRIVATE. But you can add PRIVATE(i) to make the code clear to the (novice) reader that the variable i is private to each thread.

Also Fortran computes the iteration space of a DO loop once. Therefor, unless is and ie are used elsewhere (or if use makes code easire to read), use ofthe expression may be clearer

do 50 j = ix(i), ix(i+1)-1

With respect to C++ you must identify the scope of the variable in addition toif it is or is not an indexing variable

#pragma omp parallel for
for(int i=1;i<=n;++i)
...
for(int j=is;j<=ie;++j)

In the above, j is declared inside the scope of the for loop and therefore is private.

Whereas

int j;
#pragma omp parallel for
for(int i=1;i<=n;++i)
...
for(j=is;j<=ie;++j)

In the above, j is declared outside the scope of the for loop and therefore is shared.

So

In the first case, when you read the OpenMP specification example that simply states

"the inner loop variable j is private"

You MUST factor in the scope of j. That is

because j is declared inside the parallel region "the inner loop variable j is private"

Unfortunately, many of the document writers assume the reader will know this (and thus omit what they would think is obvious).

Jim Dempsey

robert-reed · ‎04-20-2009

Quoting - jimdempseyatthecove

"the inner loop variable j is private"

You MUST factor in the scope of j. That is

because j is declared inside the parallel region "the inner loop variable j is private"

This looks correct and is supported by the 3.0 specification:

The loop iteration variable(s) in the associated do-loop(s) of a do or parallel do construct is(are) private.
A loop iteration variable for a sequential loop in a parallel or task construct is private in the innermost such construct that encloses the loop.

Given the variability of the errors Koo reported (1.0e-13 to 1.0e-21), perhaps the problem is not the sharing or privatization of variables in parallel loops at all. Perhaps it's simple accumulatedfloating point round-off error due to variabilities in order of evaluation?

Bon · ‎04-21-2009

Hi, I changed machine and OS.
Window XP/32bit, Intel Xeon X5355 Quad Core( I used 2-single CPU before)

I settle the problem.

But, I confused!!!!

TimP · ‎04-21-2009

Quoting - kbh90@vmtech.co.kr

Hi, I changed machine and OS.
Window XP/32bit, Intel Xeon X5355 Quad Core( I used 2-single CPU before)

I settle the problem.

But, I confused!!!!

When you have race conditions, as in the code you posted, it is possible on the single quad core CPU for hardware locks to delay read-after-write and write-after-write conflicts so that they don't often create errors. So, the dual socket is a more severe test of threading correctness. In combination with the compiler optimizing the variable which should have been set private, giving each core its own register copy as if it were private, the code bug may not show up immediately.