OMP different results

kinni · ‎02-03-2022

Hi,

I have a fortran code which is programmed by using OMP and therefore executes in parallel. I have a problem, that the results differ sometimes for more than 1% to 2 %. However when I run the code on only 1 thread than it is exact. But when I compile the program by using PGI compiler it works fine even with pretty much threads. So this suggests to me the programming should be OK.

I found on internet that it is possible that results differ in combination with intel OMP and I tried different strategies I found, but nothing helped.

PS. The code is pretty large so I cannot post it here and also because of rights.

Thanks for any help / suggestions!

Ron_Green · ‎02-03-2022

When I saw the title of this post my first reaction was "yes, of course you did".

Followed by reading your post and thinking "only 1-2% difference with OMP without any controls. lucky."

Which is to say your expectations for parallelism, optimization, and accuracy and not in line with the reality of finite precision floating point arithmetic.

Please read the attached PDF. Also when you are ready to understand more about how computers do arithmetic, here are more articles:

https://www.intel.com/content/dam/develop/external/us/en/documents/fp-consistency-102511-326704.pdf

https://www.intel.com/content/dam/develop/external/us/en/documents/fp-control-2013-03.pdf

So a quick thing you can go try is this

compiler option -qno-opt-dynamic-align linux or macOS or /Qopt-dynamic-align- on windows.

It's probably in a VS property sheet under floating point properties or runtime properties. Someone with a Windows system can chime in.

Next, set static scheduling by adding the SCHEDULE( STATIC ) clause(modifier) to your OMP clauses. If you want to get really strict, use the MONOTONIC modifier but expect performance to tank. However this is the only way to control reductions to get the same behavior as the serial ordering. Ultimately the other suggestions I have here can help with run-to-run but can't change the fact that reductions are done differently with OMP reductions without MONOTONIC (or removing the OMP directives around loops with reductions).

Before you run

linux macOS: export KMP_DETERMINISTIC_REDUCTION=yes

Windows command line: set KMP_DETERMINISTIC_REDUCTION=yes

This will help in run-to-run reproducibility. Under VS I think there is a place to set environment variables, under Properties, General, Debug, Environment if I remember. Again, need a Windows user to chime in to help here.

In short, I would expect slightly different results as you add OpenMP as the order of operations can change. 1-2% is enviable.

Now you made a comment about not seeing the same with another compiler. To which I would say you cannot expect 2 different compilers and OpenMP runtimes to behave the same. Just as if you compare a Ferrari and a Honda. If you plant your foot on the accelerator in a Ferrari you get much different result than doing the same in a Honda Civic.

kinni · ‎02-03-2022

Ron thank you very much for your comments!

I have tried what you suggested:

1) I included the compiler option /Qopt-dynamic-align- into my *.bat when compiling on command line.

2) I apply all REDUCTIONS by myself explicitely after finishing the loops, so I did not use the SCHEDULE( STATIC ) clause as you suggested, as I think there is no need to do this.

3) I included the "set KMP_DETERMINISTIC_REDUCTION=yes" into my *.bat when starting run.

The result is that nothing has changed at all.

Here is the part of my code, how I perallelize DO LOOPS and performe REDUCTIONS:

c*************************************************************
iend_k = INOD

NPROC_k = NPROC
nstep_k = 1 + ( iend_k / NPROC_k )
small = 1.e-6

!$OMP PARALLEL DO
!$OMP+ DEFAULT (SHARED)
!$OMP+ PRIVATE( ll, is, ie, I, IJ, ahu_p_max )

DO k = 1, NPROC_k

is = 1 + (k-1) * nstep_k
ie = k * nstep_k

is = max ( is, istart_k )
ie = min ( ie, iend_k )

ahu_p_max = small
do ll = is, ie
ij = Nodes(ll)
ahu_p_max = amax1( ahu_p_max, ah_p(ij) )
enddo
fi_pr_1(k) = ahu_p_max
ENDDO
!$OMP END PARALLEL DO

ahu_p_max = 1.
do k = 1, nproc
ahu_p_max = amax1( ahu_p_max, fi_pr_1(k) )
enddo

c*************************************************************

Sincerely

Steve_Lionel · ‎02-03-2022

In Visual Studio on Windows you would have to enter this in Command Line > Additional Options.

jimdempseyatthecove · ‎02-03-2022

Add some diagnostic code:

c*************************************************************
iend_k = INOD

NPROC_k = NPROC
nstep_k = 1 + ( iend_k / NPROC_k )
small = 1.e-6


!$OMP PARALLEL DO
!$OMP+ DEFAULT (SHARED)
!$OMP+ PRIVATE( ll, is, ie, I, IJ, ahu_p_max )

DO k = 1, NPROC_k

is = 1 + (k-1) * nstep_k
ie = k * nstep_k

is = max ( is, istart_k )
ie = min ( ie, iend_k )

!$omp critical
print *, omp_get_thread_num(), is, ie
!$omp end critical

ahu_p_max = small
do ll = is, ie
ij = Nodes(ll)
ahu_p_max = amax1( ahu_p_max, ah_p(ij) )
enddo
fi_pr_1(k) = ahu_p_max
ENDDO
!$OMP END PARALLEL DO


ahu_p_max = 1.
do k = 1, nproc
ahu_p_max = amax1( ahu_p_max, fi_pr_1(k) )
enddo

c*************************************************************

Jim Dempsey

kinni · ‎02-03-2022

Here is the screenshot after inserting critical + print:

For me it looks like it should be. Just to mention: INOD = 2001.

Ron_Green · ‎02-03-2022

tell me about function 'amax1'. Is it PURE ?

kinni · ‎02-03-2022

the arguments in amax1 are all defined as real*4: ahu_p_max, ah_p(ij), fi_pr_1(k).

kinni · ‎02-04-2022

Just to mention, his part of code is not raunning in parallel:

ahu_p_max = 1.
do k = 1, nproc
ahu_p_max = amax1( ahu_p_max, fi_pr_1(k) )
enddo

kinni · ‎02-05-2022

Just to mention, this part ...

JohnNichols · ‎02-06-2022

Try real*8

jimdempseyatthecove · ‎02-07-2022

Slightly modified code:


c*************************************************************
iend_k = INOD

NPROC_k = NPROC
! nstep_k = 1 + ( iend_k / NPROC_k )
nstep_k = 1 + ( (iend_k + NPROC_k - 1) / NPROC_k )

small = 1.e-6

!$OMP PARALLEL DO
!$OMP+ DEFAULT (SHARED)
!$OMP+ PRIVATE( ll, is, ie, I, IJ, ahu_p_max )

DO k = 1, NPROC_k

  is = 1 + (k-1) * nstep_k
  ie = k * nstep_k

  if( ie >= istart_k ) then
    if( is <= iend_k ) then
      is = max ( is, istart_k )
      ie = min ( ie, iend_k )

      ahu_p_max = small
      do ll = is, ie
        ij = Nodes(ll)
        ahu_p_max = amax1( ahu_p_max, ah_p(ij) )
      enddo
      fi_pr_1(k) = ahu_p_max
    endif
  endif
ENDDO
!$OMP END PARALLEL DO

ahu_p_max = 1.
do k = 1, nproc
  ahu_p_max = amax1( ahu_p_max, fi_pr_1(k) )
enddo

c*************************************************************

Note, lines 25:30 can be replaced with:

fi_pr_1(k) = max( small, maxval( ah_p(is:ie) ) )

Similar thing for lines 36:39

Jim Dempsey

kinni · ‎02-07-2022

Jim, do you really think these changes should solve the main problem? Well, I'm not sure about it.

Just to mention, I also tried /fp:precise, /fp:strict, /fpe-all:1 etc. with no results.

There is no need to change to real*8, because real*4 works satisfactory in junction with other compilers (like PGI compiler).

kinni · ‎02-10-2022

Jim,

I changed the things as you suggested, but the problem still remains the same.

Note that the changes you suggested:

if( ie >= istart_k ) then

if( is <= iend_k ) then

have no influence, because "is" is always <= iend_k, and "ie" is always >= istart_k.

Also:

nstep_k = 1 + ( (iend_k + NPROC_k - 1) / NPROC_k )

nstep_k , has no effect on the final result. It is normally very large no. (> 10000), so doesn't matter if 9999 or 10001.

I would appreciate any further ideas?

Thanks!

jimdempseyatthecove · ‎02-10-2022

Try the following:

!DIR$ NOVECTOR

do ll = is, ie

That code may possibly have generated SIMD gather instructions. The above will disable SIMD for that loop (had it been enabled).

!DIR$ VECTOR ALWAYS might force it on (in the event that !$omp parallel... forced it off).

Experiment both ways. (ignore warnings)

Jim Dempsey

kinni · ‎02-16-2022

Jim,
thank you for your suggestions. I also tried !DIR$ NOVECTOR but it did not help.

I have analysed the problem intensively these days and I checked the code but
did not found any errors.

However, it seems the solution to the particullar problem I have considered is not stable.
That means, there is a kind of resonance and the solution oszillates a little bit.
I assumed however a stable (steady state) solution.

Seems that here OMP reacts more sensitively and the oszillations are more pronaunced
as well as they appear earlier if more threads are used.
If only one thread is used, the oszillations are very small and can be neglected.

Also the oszillations are smaller when using the PGI compiler and this was confusing
to me and apparently led to the false conclusions.

I think the problem has been resolved so far.
Sorry for bordering you.
In any case, thank you for your comments as well as to the others contributors!

Sincerely

jimdempseyatthecove · ‎02-16-2022

>>OMP reacts more sensitively and the oszillations are more pronaunced

Consider this section of your code:

      ahu_p_max = small
      do ll = is, ie
        ij = Nodes(ll)
        ahu_p_max = amax1( ahu_p_max, ah_p(ij) )
      enddo
      fi_pr_1(k) = ahu_p_max

As you parallelize the (surrounding) code, the range is:ie gets smaller, and as a result, the probability that the value small will be returned as opposed to a value above small.

Varying the range (as a result of varying the number of threads) will change the probability of using the substituted value small for the selected largest value.

I presume you will need to make adjustments to your code.

One possibility would be to return fi_pr_1(k) = max(ah_p(Nodes(is:ie))

This could be a number less than small, and then the code that later uses fi_pr_1(k) would adjust for values less than small.

Jim Dempsey

kinni · ‎02-18-2022

Jim,

thank you for your suggestions, but I think this is not the critical part of the code.

small is 1.e-6, and

ahu_p_max

is usually mcu much bigger O(1).

Thanks again.