We wish to perform a reduction using quad precision. Using Intel MPI 2019.1.144, I consistently get wrong answers on only one node of a multi-node job. Setting I_MPI_COLL_INTRANODE=pt2pt corrects the problem, but this is not an acceptable solution: we cannot advise our users that they must set this environment variable or else the code will hang.
I'll first ask for verification that we are writing the correct code to accomplish what we want. Here is a test case:
! Program initially copied from Jeff Hammond's answer to
! https://stackoverflow.com/questions/33109040/strange-result-of-mpi-allreduce-for-16-byte-real/34230377
program test_reduce_sum_real16
  use mpi
  implicit none
  integer, parameter :: qp = selected_real_kind(p=32, r=4931)
  integer, parameter :: n = 1
  real(kind=qp) :: output(n)
  real(kind=qp) :: input(n)
  real(kind=qp) :: error
  integer :: me, np
  integer :: mysum ! MPI_Op
  integer :: i, j, k, ierr
  logical :: fail

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, np, ierr)
  call MPI_Op_create(Mpi_Quad_Sum, .true., mysum, ierr)

  output = 0.0

  ! Try many times, until failure
  j = 0
  fail = .false.
  do while (.not. fail)
    j = j + 1
    ! Seed the input array
    input(1:n) = j/(1.0_qp + me)
    ! Sum over all processes
    call MPI_Allreduce(input, output, n, MPI_REAL16, mysum, MPI_COMM_WORLD, ierr)
    ! Check answer, one component at a time
    do k = 1, n
      error = output(k)
      do i = 1, np
        error = error - real(j,qp)/real(i,qp)
      enddo
      fail = (error > 0.0_qp)
      if (fail) print *, 'FAIL', me, error, j
      call MPI_Allreduce(MPI_IN_PLACE, fail, 1, MPI_LOGICAL, MPI_LOR, MPI_COMM_WORLD, ierr)
      if (me == 0) then
        if (fail) then
          print *, 'Failed', k, j
        else
          print *, 'Passed', k, j
        endif
      endif
    enddo
  enddo

  call MPI_Op_free(mysum, ierr)
  call MPI_Finalize(ierr)

contains

  subroutine Mpi_Quad_Sum(invec, inoutvec, len, datatype)
    implicit none
    integer, intent(in) :: len, datatype
    real(kind=qp), intent(in) :: invec(len)
    real(kind=qp), intent(inout) :: inoutvec(len)
    integer :: i
    do i = 1, len
      inoutvec(i) = invec(i) + inoutvec(i)
    end do
  end subroutine Mpi_Quad_Sum

end program test_reduce_sum_real16
I run this with the following command line (this is on a system with 24-core nodes):
mpiexec -genv I_MPI_PIN_DOMAIN=omp \
-genv I_MPI_COLL_INTRANODE=shm \
-genv OMP_NUM_THREADS=6 -genv OMP_PROC_BIND=spread -genv OMP_PLACES=cores \
-genv MV2_USE_APM=0 -genv I_MPI_DEBUG=5 -genv KMP_AFFINITY=verbose \
-np 20 -ppn 4 ./test
And I see failures only on ranks 12-15 (one node):
Passed 1 1
Passed 1 2
Failed 1 3
FAIL 12 7.703719777548943412223911770339709E-0034 3
FAIL 13 7.703719777548943412223911770339709E-0034 3
FAIL 14 7.703719777548943412223911770339709E-0034 3
FAIL 15 7.703719777548943412223911770339709E-0034 3
Hi,
Thank you for posting in Intel Communities.
We don't see any OpenMP code in the test program you provided, but you are passing OpenMP-related flags when running it. You can omit those flags when running this program.
Could you please provide the following details to investigate more on your issue?
1. Operating system
2. Libfabric provider(FI_PROVIDER) being used.
3. A screenshot (or copy) of the expected output.
4. Please provide us the complete debug log using the below commands:
mpiifort test.f90 -o test
I_MPI_DEBUG=30 mpirun -n 20 -ppn 4 -f hostfile ./test
Thanks & Regards,
Hemanth
Thank you for the reply. The information you requested:
- CentOS 7
- [0] MPI startup(): libfabric provider: verbs;ofi_rxm
- The expected behavior is for all processes to agree on the result. It is likely that all will fail the test, for example:
Passed 1 1
Passed 1 2
Passed 1 3
Passed 1 4
FAIL 0 2.311115933264683023667173531101913E-0033 5
FAIL 1 2.311115933264683023667173531101913E-0033 5
FAIL 2 2.311115933264683023667173531101913E-0033 5
FAIL 3 2.311115933264683023667173531101913E-0033 5
FAIL 16 2.311115933264683023667173531101913E-0033 5
FAIL 8 2.311115933264683023667173531101913E-0033 5
FAIL 12 2.311115933264683023667173531101913E-0033 5
FAIL 13 2.311115933264683023667173531101913E-0033 5
FAIL 9 2.311115933264683023667173531101913E-0033 5
FAIL 14 2.311115933264683023667173531101913E-0033 5
FAIL 10 2.311115933264683023667173531101913E-0033 5
FAIL 15 2.311115933264683023667173531101913E-0033 5
FAIL 11 2.311115933264683023667173531101913E-0033 5
FAIL 17 2.311115933264683023667173531101913E-0033 5
FAIL 18 2.311115933264683023667173531101913E-0033 5
FAIL 19 2.311115933264683023667173531101913E-0033 5
Failed 1 5
FAIL 4 2.311115933264683023667173531101913E-0033 5
FAIL 5 2.311115933264683023667173531101913E-0033 5
FAIL 6 2.311115933264683023667173531101913E-0033 5
FAIL 7 2.311115933264683023667173531101913E-0033 5
- Requested output:
[0] MPI startup(): libfabric version: 1.7.0a1-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 214272 cn0056 {0,1,2,3,4,5}
[0] MPI startup(): 1 214273 cn0056 {12,13,14,15,16,17}
[0] MPI startup(): 2 214274 cn0056 {6,7,8,9,10,11}
[0] MPI startup(): 3 214275 cn0056 {18,19,20,21,22,23}
[0] MPI startup(): 4 195540 cn0171 {0,1,2,3,4,5}
[0] MPI startup(): 5 195541 cn0171 {12,13,14,15,16,17}
[0] MPI startup(): 6 195542 cn0171 {6,7,8,9,10,11}
[0] MPI startup(): 7 195543 cn0171 {18,19,20,21,22,23}
[0] MPI startup(): 8 183946 cn0201 {0,1,2,3,4,5}
[0] MPI startup(): 9 183947 cn0201 {12,13,14,15,16,17}
[0] MPI startup(): 10 183948 cn0201 {6,7,8,9,10,11}
[0] MPI startup(): 11 183949 cn0201 {18,19,20,21,22,23}
[0] MPI startup(): 12 197548 cn0284 {0,1,2,3,4,5}
[0] MPI startup(): 13 197549 cn0284 {12,13,14,15,16,17}
[0] MPI startup(): 14 197550 cn0284 {6,7,8,9,10,11}
[0] MPI startup(): 15 197551 cn0284 {18,19,20,21,22,23}
[0] MPI startup(): 16 195780 cn0347 {0,1,2,3,4,5}
[0] MPI startup(): 17 195781 cn0347 {12,13,14,15,16,17}
[0] MPI startup(): 18 195782 cn0347 {6,7,8,9,10,11}
[0] MPI startup(): 19 195783 cn0347 {18,19,20,21,22,23}
Passed 1 1
Passed 1 2
Failed 1 3
FAIL 12 7.703719777548943412223911770339709E-0034 3
FAIL 13 7.703719777548943412223911770339709E-0034 3
FAIL 14 7.703719777548943412223911770339709E-0034 3
FAIL 15 7.703719777548943412223911770339709E-0034 3
Hi,
We are working on this internally and will get back to you soon.
Thanks & Regards,
Hemanth
The Intel MPI behavior is correct; it is the provided example that has to take floating-point rounding and representation into account.
The differences between the results of MPI_Allreduce and the locally computed values are close to machine epsilon for this kind (e.g. ~3.8E-0034; see https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/language-reference/a-to-z-reference/e-to-f/epsilon.html).
So you may replace the line below in the provided example:
fail = (error > 0.0_qp)
with something like this (please note that this is not generic advice):
fail = (error > (100*epsilon(error)))
See the discussion at https://stackoverflow.com/questions/4915462/how-should-i-do-floating-point-comparison for a short guideline, or the following paper for a more thorough justification: David Goldberg. 1991. What Every Computer Scientist Should Know About Floating-Point Arithmetic. ACM Comput. Surv. 23, 1 (March 1991), 5–48. https://doi.org/10.1145/103162.103163
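For illustration only, here is a minimal, self-contained sketch of the relative-tolerance style of comparison those references suggest (this program and the helper name nearly_equal are invented for this example and are not part of the thread; the factor of 100 on epsilon is an arbitrary illustrative choice, not a recommendation):

program tolerance_check_sketch
  implicit none
  integer, parameter :: qp = selected_real_kind(p=32, r=4931)
  real(kind=qp) :: a, b

  ! Two quantities that are equal mathematically but may differ by
  ! accumulated rounding error when computed in floating point.
  a = 1.0_qp/3.0_qp + 1.0_qp/7.0_qp
  b = 10.0_qp/21.0_qp

  print *, 'bitwise equal:          ', (a == b)
  print *, 'equal within tolerance: ', nearly_equal(a, b, 100.0_qp)

contains

  ! Relative comparison: |a - b| <= factor * epsilon(a) * max(|a|, |b|).
  ! The caller chooses the factor based on how much rounding error the
  ! computation (e.g. a long reduction) can be expected to accumulate.
  logical function nearly_equal(a, b, factor)
    real(kind=qp), intent(in) :: a, b, factor
    nearly_equal = abs(a - b) <= factor * epsilon(a) * max(abs(a), abs(b))
  end function nearly_equal

end program tolerance_check_sketch

In the test case above, the analogous change is the one suggested earlier: test the accumulated error against a scaled epsilon rather than against exact zero.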
