We wish to perform a reduction using quad precision. Using Intel MPI 2019.1.144, I consistently get wrong answers on only one node of a multi-node job. Setting I_MPI_COLL_INTRANODE=pt2pt corrects the problem, but this is not an acceptable solution: we cannot advise our users that they must set this environment variable or else the code will hang.
I'll first ask for verification that we are writing the correct code to accomplish what we want. Here is a test case:
! Program initially copied from Jeff Hammond's answer to
! https://stackoverflow.com/questions/33109040/strange-result-of-mpi-allreduce-for-16-byte-real/34230377
program test_reduce_sum_real16
  use mpi
  implicit none
  integer, parameter :: qp = selected_real_kind(p=32, r=4931)
  integer, parameter :: n = 1
  real(kind=qp) :: output(n)
  real(kind=qp) :: input(n)
  real(kind=qp) :: error
  integer :: me, np
  integer :: mysum ! MPI_Op
  integer :: i, j, k, ierr
  logical :: fail

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, np, ierr)
  call MPI_Op_create(Mpi_Quad_Sum, .true., mysum, ierr)

  output = 0.0

  ! Try many times, until failure
  j = 0
  fail = .false.
  do while (.not. fail)
    j = j + 1
    ! Seed the input array
    input(1:n) = j/(1.0_qp + me)
    ! Sum over all processes
    call MPI_Allreduce(input, output, n, MPI_REAL16, mysum, MPI_COMM_WORLD, ierr)
    ! Check answer, one component at a time
    do k = 1, n
      error = output(k)
      do i = 1, np
        error = error - real(j,qp)/real(i,qp)
      enddo
      fail = (error > 0.0_qp)
      if (fail) print *, 'FAIL', me, error, j
      call MPI_Allreduce(MPI_IN_PLACE, fail, 1, MPI_LOGICAL, MPI_LOR, MPI_COMM_WORLD, ierr)
      if (me == 0) then
        if (fail) then
          print *, 'Failed', k, j
        else
          print *, 'Passed', k, j
        endif
      endif
    enddo
  enddo

  call MPI_Op_free(mysum, ierr)
  call MPI_Finalize(ierr)

contains

  subroutine Mpi_Quad_Sum(invec, inoutvec, len, datatype)
    implicit none
    integer, intent(in) :: len, datatype
    real(kind=qp), intent(in) :: invec(len)
    real(kind=qp), intent(inout) :: inoutvec(len)
    integer :: i
    do i = 1, len
      inoutvec(i) = invec(i) + inoutvec(i)
    end do
  end subroutine Mpi_Quad_Sum

end program test_reduce_sum_real16
I run this with the following command line (this is on a system with 24-core nodes):
mpiexec -genv I_MPI_PIN_DOMAIN=omp \
-genv I_MPI_COLL_INTRANODE=shm \
-genv OMP_NUM_THREADS=6 -genv OMP_PROC_BIND=spread -genv OMP_PLACES=cores \
-genv MV2_USE_APM=0 -genv I_MPI_DEBUG=5 -genv KMP_AFFINITY=verbose \
-np 20 -ppn 4 ./test
And I see failures only on ranks 12-15 (one node):
Passed 1 1
Passed 1 2
Failed 1 3
FAIL 12 7.703719777548943412223911770339709E-0034 3
FAIL 13 7.703719777548943412223911770339709E-0034 3
FAIL 14 7.703719777548943412223911770339709E-0034 3
FAIL 15 7.703719777548943412223911770339709E-0034 3
Hi,
Thank you for posting in Intel Communities.
We don't see any OpenMP code in the test program you provided, but you are passing OpenMP-related flags when running it. You can omit those flags when running this program.
Could you please provide the following details to investigate more on your issue?
1. Operating system
2. Libfabric provider(FI_PROVIDER) being used.
3. A screenshot (or copy) of the expected output.
4. Please provide us the complete debug log using the below commands:
mpiifort test.f90 -o test
I_MPI_DEBUG=30 mpirun -n 20 -ppn 4 -f hostfile ./test
Thanks & Regards,
Hemanth
Thank you for the reply. The information you requested:
- CentOS 7
- [0] MPI startup(): libfabric provider: verbs;ofi_rxm
- The expected behavior is for all processes to agree on the result. It is likely that all will fail the test, for example:
Passed 1 1
Passed 1 2
Passed 1 3
Passed 1 4
FAIL 0 2.311115933264683023667173531101913E-0033 5
FAIL 1 2.311115933264683023667173531101913E-0033 5
FAIL 2 2.311115933264683023667173531101913E-0033 5
FAIL 3 2.311115933264683023667173531101913E-0033 5
FAIL 16 2.311115933264683023667173531101913E-0033 5
FAIL 8 2.311115933264683023667173531101913E-0033 5
FAIL 12 2.311115933264683023667173531101913E-0033 5
FAIL 13 2.311115933264683023667173531101913E-0033 5
FAIL 9 2.311115933264683023667173531101913E-0033 5
FAIL 14 2.311115933264683023667173531101913E-0033 5
FAIL 10 2.311115933264683023667173531101913E-0033 5
FAIL 15 2.311115933264683023667173531101913E-0033 5
FAIL 11 2.311115933264683023667173531101913E-0033 5
FAIL 17 2.311115933264683023667173531101913E-0033 5
FAIL 18 2.311115933264683023667173531101913E-0033 5
FAIL 19 2.311115933264683023667173531101913E-0033 5
Failed 1 5
FAIL 4 2.311115933264683023667173531101913E-0033 5
FAIL 5 2.311115933264683023667173531101913E-0033 5
FAIL 6 2.311115933264683023667173531101913E-0033 5
FAIL 7 2.311115933264683023667173531101913E-0033 5
- Requested output:
[0] MPI startup(): libfabric version: 1.7.0a1-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 214272 cn0056 {0,1,2,3,4,5}
[0] MPI startup(): 1 214273 cn0056 {12,13,14,15,16,17}
[0] MPI startup(): 2 214274 cn0056 {6,7,8,9,10,11}
[0] MPI startup(): 3 214275 cn0056 {18,19,20,21,22,23}
[0] MPI startup(): 4 195540 cn0171 {0,1,2,3,4,5}
[0] MPI startup(): 5 195541 cn0171 {12,13,14,15,16,17}
[0] MPI startup(): 6 195542 cn0171 {6,7,8,9,10,11}
[0] MPI startup(): 7 195543 cn0171 {18,19,20,21,22,23}
[0] MPI startup(): 8 183946 cn0201 {0,1,2,3,4,5}
[0] MPI startup(): 9 183947 cn0201 {12,13,14,15,16,17}
[0] MPI startup(): 10 183948 cn0201 {6,7,8,9,10,11}
[0] MPI startup(): 11 183949 cn0201 {18,19,20,21,22,23}
[0] MPI startup(): 12 197548 cn0284 {0,1,2,3,4,5}
[0] MPI startup(): 13 197549 cn0284 {12,13,14,15,16,17}
[0] MPI startup(): 14 197550 cn0284 {6,7,8,9,10,11}
[0] MPI startup(): 15 197551 cn0284 {18,19,20,21,22,23}
[0] MPI startup(): 16 195780 cn0347 {0,1,2,3,4,5}
[0] MPI startup(): 17 195781 cn0347 {12,13,14,15,16,17}
[0] MPI startup(): 18 195782 cn0347 {6,7,8,9,10,11}
[0] MPI startup(): 19 195783 cn0347 {18,19,20,21,22,23}
Passed 1 1
Passed 1 2
Failed 1 3
FAIL 12 7.703719777548943412223911770339709E-0034 3
FAIL 13 7.703719777548943412223911770339709E-0034 3
FAIL 14 7.703719777548943412223911770339709E-0034 3
FAIL 15 7.703719777548943412223911770339709E-0034 3
Hi,
We are working on this internally and will get back to you soon.
Thanks & Regards,
Hemanth
The Intel MPI behavior is correct; it is the provided example that has to take floating-point rounding and representation into account.
The differences between the results of MPI_Allreduce and the locally computed values are close to machine epsilon for this kind (e.g. ~3.8E-0034; see https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/language-reference/a-to-z-reference/e-to-f/epsilon.html).
So you may replace the line below in the provided example:
fail = (error > 0.0_qp)
with something like this (please note that this is not generic advice):
fail = (error > (100*epsilon(error)))
See the discussion at https://stackoverflow.com/questions/4915462/how-should-i-do-floating-point-comparison for a short guideline, or the following paper for a more thorough justification: David Goldberg. 1991. What Every Computer Scientist Should Know About Floating-Point Arithmetic. ACM Comput. Surv. 23, 1 (March 1991), 5–48. https://doi.org/10.1145/103162.103163
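For illustration only, here is a minimal, self-contained sketch of the relative-tolerance style of comparison those references suggest (this program and the helper name nearly_equal are invented for this example and are not part of the thread; the factor of 100 on epsilon is an arbitrary illustrative choice, not a recommendation):

program tolerance_check_sketch
  implicit none
  integer, parameter :: qp = selected_real_kind(p=32, r=4931)
  real(kind=qp) :: a, b

  ! Two quantities that are equal mathematically but may differ by
  ! accumulated rounding error when computed in floating point.
  a = 1.0_qp/3.0_qp + 1.0_qp/7.0_qp
  b = 10.0_qp/21.0_qp

  print *, 'bitwise equal:          ', (a == b)
  print *, 'equal within tolerance: ', nearly_equal(a, b, 100.0_qp)

contains

  ! Relative comparison: |a - b| <= factor * epsilon(a) * max(|a|, |b|).
  ! The caller chooses the factor based on how much rounding error the
  ! computation (e.g. a long reduction) can be expected to accumulate.
  logical function nearly_equal(a, b, factor)
    real(kind=qp), intent(in) :: a, b, factor
    nearly_equal = abs(a - b) <= factor * epsilon(a) * max(abs(a), abs(b))
  end function nearly_equal

end program tolerance_check_sketch

In the test case above, the analogous change is the one suggested earlier: test the accumulated error against a scaled epsilon rather than against exact zero.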
