I run this code on an Intel Core 2 Duo:
!$OMP DO PRIVATE(I)
DO I = 1,NEQTOT
   XA(I) = XA(I) + RMIN * DD(I)
END DO
!$OMP END DO
!$OMP END PARALLEL
When run with OpenMP disabled, or sequentially, the values of XA are the same.
When run with OpenMP in parallel, some of the values of XA are slightly different, in the last significant digit.
Should the values not be the same with 2 or more CPUs?
Thanks for any feedback.
This looks like vectorizable Fortran. You haven't shown enough to answer the question definitively, so I'll assume you have no threading errors. If not every thread gets a 16-byte-aligned chunk, there are various ways (particularly with ifort default real, and more so with past versions) in which a few loop iterations at the beginning and end of each chunk could be computed with extra precision, while the vectorized portion of the loop is computed in source precision. With vectorization disabled, you might find that you get the same results regardless of the number of threads.
There may be no benefit in attempting OpenMP on top of vectorization in a loop such as this.
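One way to test this is to run the same update serially and under OpenMP on identical inputs and compare the results bit for bit. The following is only a minimal sketch: the array size, the initial values, the double-precision kind, and the names xa_ser, xa_par and ndiff are assumptions made for illustration, not taken from the original routine.

```fortran
program compare_serial_parallel
  ! Hypothetical test harness. Only XA, DD, RMIN and NEQTOT correspond
  ! to names in the original loop; everything else is assumed.
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: neqtot = 100000          ! assumed problem size
  real(real64), allocatable :: xa_ser(:), xa_par(:), dd(:)
  real(real64) :: rmin
  integer :: i, ndiff

  allocate(xa_ser(neqtot), xa_par(neqtot), dd(neqtot))

  ! Arbitrary but reproducible test data (assumed values).
  rmin = 0.1_real64
  do i = 1, neqtot
     dd(i)     = 1.0_real64 / real(i, real64)
     xa_ser(i) = sqrt(real(i, real64))
  end do
  xa_par = xa_ser

  ! Reference: plain serial loop.
  do i = 1, neqtot
     xa_ser(i) = xa_ser(i) + rmin * dd(i)
  end do

  ! Same update, with the iterations divided among the threads.
  ! The loop index i is private by default in a parallel do.
  !$omp parallel do
  do i = 1, neqtot
     xa_par(i) = xa_par(i) + rmin * dd(i)
  end do
  !$omp end parallel do

  ! Compare bit patterns rather than values, so that a difference
  ! in the last digit cannot be hidden by output formatting.
  ndiff = 0
  do i = 1, neqtot
     if (transfer(xa_ser(i), 0_int64) /= transfer(xa_par(i), 0_int64)) then
        ndiff = ndiff + 1
        if (ndiff <= 10) print '(a,i0,2(1x,es24.16))', &
             'differs at I = ', i, xa_ser(i), xa_par(i)
     end if
  end do
  print '(a,i0)', 'elements that differ: ', ndiff
end program compare_serial_parallel
```

If the indices that differ cluster at the start or end of each thread's chunk, that would be consistent with the alignment explanation above; differences scattered elsewhere would point to some other cause. Building the test with and without vectorization, and with different thread counts, should show which setting the discrepancy follows.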
Thank you for your response.
I tried to disable automatic vectorization as you suggested. I turned on full vectorization diagnostics to see whether the loop was vectorized, but it was not. I also tried to disable it explicitly using !DEC$ NOVECTOR. However, the results did not change.
There are many other OpenMP loops in the routine, some involving the same arrays as the current loop. The discrepancy in results between (no OMP, sequential) and parallel OMP was identified by commenting out the OpenMP directives one by one, eventually leaving only this loop.
I am trying to make my OpenMP code produce repeatable results regardless of the number of CPUs, so that debugging is easier.