Wrong limit on disp parameter in ILP64 version of MPI_File_set_view?

Jakub_Benda · ‎04-25-2019

Hi,

With both an older (17.0.2) and the newest (19.0.3) Intel Fortran Compiler + Intel MPI, I experience problems with the function MPI_File_set_view when using the 64-bit integer Fortran interface of Intel MPI (aka ILP64, enabled with the compiler switch -i8).

Whenever I set the write offset argument "disp" of this function outside of the int32 range, the call fails with the error code 201389836. However, the argument is of kind MPI_OFFSET_KIND, so it is supposed to support large values without any problems.

Curiously, this failure happens only with the ILP64 version, not with the LP64 version, which is the opposite of what one would expect. It looks as if there were an erroneous range check somewhere in the ILP64 interface before it calls the underlying LP64 implementation, which actually supports large offsets.

Below is an example program that demonstrates the issue. The program writes an integer array of exactly 2 GiB to a file, starting at offset 0. Then it attempts to position the next file view at the end of the chunk just written and to write one extra integer. It works well with Open MPI 4.0.0 ILP64 and Intel MPI 17.0.2/19.0.3 LP64 (prints 0) but fails with Intel MPI 17.0.2/19.0.3 ILP64 (prints 201389836). NB: This sample program is intended to be run as a single process only.

Did I hit a bug in Intel MPI?

program mpi_io_offset

    use iso_fortran_env, only: int32
    use mpi

    implicit none

    integer(int32), parameter :: mpiint = kind(MPI_COMM_WORLD)
    integer(int32), parameter :: mpiofs = MPI_OFFSET_KIND

    integer(mpiint) :: ierr, fh, stat(MPI_STATUS_SIZE), one = 1, num = 2**29
    integer(mpiofs) :: zero = 0, two_GiB_bytes

    integer(int32)              :: four_B_int = -1
    integer(int32), allocatable :: two_GiB_array(:)

    allocate (two_GiB_array(num))
    two_GiB_array(:) = 0
    two_GiB_bytes = num * 4_mpiofs

    call MPI_Init(ierr)
    call MPI_File_open(MPI_COMM_WORLD, 'file.bin', MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
    call MPI_File_set_size(fh, zero, ierr)
    call MPI_File_set_view(fh, zero, MPI_INTEGER4, MPI_INTEGER4, 'native', MPI_INFO_NULL, ierr)
    call MPI_File_write_all(fh, two_GiB_array, num, MPI_INTEGER4, stat, ierr)
    call MPI_File_set_view(fh, two_GiB_bytes, MPI_INTEGER4, MPI_INTEGER4, 'native', MPI_INFO_NULL, ierr)

    print *, ierr

    call MPI_File_write_all(fh, four_B_int, one, MPI_INTEGER4, stat, ierr)
    call MPI_File_close(fh, ierr)
    call MPI_Finalize(ierr)

end program mpi_io_offset

jbenda · ‎02-10-2021

This problem still persists in Intel oneAPI 2021 (or more specifically Intel(R) MPI Library for Linux* OS, Version 2021.1 Build 20201112).

jbenda · ‎01-12-2022

The program still fails in ILP64 mode with Intel(R) MPI Library for Linux* OS, Version 2021.5 Build 20211102.

jbenda · ‎10-12-2022

This program still fails in ILP64 mode with Intel(R) MPI Library for Linux* OS, Version 2021.7 Build 20220909.

$ mpiexec --version
Intel(R) MPI Library for Linux* OS, Version 2021.7 Build 20220909 (id: 6b6f6425df)
Copyright 2003-2022, Intel Corporation.

$ ifx --version
ifx (IFORT) 2022.2.0 20220730
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.

$ mpifc -fc=ifx test.f90 -o test
$ ./test
           0

$ mpifc -fc=ifx test.f90 -o test -i8
$ ./test
             201385484

jbenda · ‎07-24-2023

To make some progress on this issue I have been debugging the library "/opt/intel/oneapi/mpi/2021.10.0/lib/release/libmpi_ilp64.so.4.1" in my installation of the latest Intel MPI. This is the disassembly of the broken function:

(gdb) disassemble mpi_file_set_view_
Dump of assembler code for function mpi_file_set_view__:
   0x000000000001f9e0 <+0>:     push   r12
   0x000000000001f9e2 <+2>:     push   r13
   0x000000000001f9e4 <+4>:     push   r14
   0x000000000001f9e6 <+6>:     push   r15
   0x000000000001f9e8 <+8>:     push   rbx
   0x000000000001f9e9 <+9>:     push   rbp
   0x000000000001f9ea <+10>:    sub    rsp,0x28
   0x000000000001f9ee <+14>:    mov    rbx,r8
   0x000000000001f9f1 <+17>:    mov    r15,QWORD PTR [rip+0x2225e8]        # 0x241fe0
   0x000000000001f9f8 <+24>:    mov    rbp,rcx
   0x000000000001f9fb <+27>:    mov    r12,rdx
   0x000000000001f9fe <+30>:    mov    r13,rsi
   0x000000000001fa01 <+33>:    mov    r14,rdi
   0x000000000001fa04 <+36>:    cmp    DWORD PTR [r15],0x0
   0x000000000001fa08 <+40>:    jne    0x1facf <mpi_file_set_view__+239>
   0x000000000001fa0e <+46>:    movsxd rdx,DWORD PTR [r13+0x0]
   0x000000000001fa12 <+50>:    mov    QWORD PTR [rsp],rdx
   0x000000000001fa16 <+54>:    mov    rdx,QWORD PTR [r12]
   0x000000000001fa1a <+58>:    mov    eax,DWORD PTR [r14]
   0x000000000001fa1d <+61>:    mov    DWORD PTR [rsp+0x8],eax
   0x000000000001fa21 <+65>:    lea    rcx,[rdx-0x4c000405]
   0x000000000001fa28 <+72>:    cmp    rcx,0x40
   0x000000000001fa2c <+76>:    jae    0x1fa48 <mpi_file_set_view__+104>
   0x000000000001fa2e <+78>:    mov    eax,0x1
   0x000000000001fa33 <+83>:    shl    rax,cl
   0x000000000001fa36 <+86>:    test   rax,0x1400001
   0x000000000001fa3c <+92>:    je     0x1fa48 <mpi_file_set_view__+104>
   0x000000000001fa3e <+94>:    mov    DWORD PTR [rsp+0xc],0x4c000809
   0x000000000001fa46 <+102>:   jmp    0x1fa4c <mpi_file_set_view__+108>
   0x000000000001fa48 <+104>:   mov    DWORD PTR [rsp+0xc],edx
   0x000000000001fa4c <+108>:   mov    rdx,QWORD PTR [rbp+0x0]
   0x000000000001fa50 <+112>:   lea    rcx,[rdx-0x4c000405]
   0x000000000001fa57 <+119>:   cmp    rcx,0x40
   0x000000000001fa5b <+123>:   jae    0x1fa77 <mpi_file_set_view__+151>
   0x000000000001fa5d <+125>:   mov    eax,0x1
   0x000000000001fa62 <+130>:   shl    rax,cl
   0x000000000001fa65 <+133>:   test   rax,0x1400001
   0x000000000001fa6b <+139>:   je     0x1fa77 <mpi_file_set_view__+151>
   0x000000000001fa6d <+141>:   mov    DWORD PTR [rsp+0x10],0x4c000809
   0x000000000001fa75 <+149>:   jmp    0x1fa7b <mpi_file_set_view__+155>
   0x000000000001fa77 <+151>:   mov    DWORD PTR [rsp+0x10],edx
   0x000000000001fa7b <+155>:   mov    r9d,DWORD PTR [r9]
   0x000000000001fa7e <+158>:   lea    rbp,[rsp+0x18]
   0x000000000001fa83 <+163>:   movsxd rax,DWORD PTR [rsp+0x68]
   0x000000000001fa88 <+168>:   mov    r8,rbx
   0x000000000001fa8b <+171>:   mov    DWORD PTR [rbp-0x4],r9d
   0x000000000001fa8f <+175>:   push   rax
   0x000000000001fa90 <+176>:   push   rbp
   0x000000000001fa91 <+177>:   lea    rdi,[rsp+0x18]
   0x000000000001fa96 <+182>:   lea    rsi,[rsp+0x10]
   0x000000000001fa9b <+187>:   lea    rdx,[rsp+0x1c]
   0x000000000001faa0 <+192>:   lea    rcx,[rsp+0x20]
   0x000000000001faa5 <+197>:   lea    r9,[rsp+0x24]
   0x000000000001faaa <+202>:   call   0x19600 <pmpi_file_set_view_@plt>
   0x000000000001faaf <+207>:   add    rsp,0x10
   0x000000000001fab3 <+211>:   mov    rdx,QWORD PTR [rsp+0x60]
   0x000000000001fab8 <+216>:   movsxd rax,DWORD PTR [rsp+0x18]
   0x000000000001fabd <+221>:   mov    QWORD PTR [rdx],rax
   0x000000000001fac0 <+224>:   add    rsp,0x28
   0x000000000001fac4 <+228>:   pop    rbp
   0x000000000001fac5 <+229>:   pop    rbx
   0x000000000001fac6 <+230>:   pop    r15
   0x000000000001fac8 <+232>:   pop    r14
   0x000000000001faca <+234>:   pop    r13
   0x000000000001facc <+236>:   pop    r12
   0x000000000001face <+238>:   ret
   0x000000000001facf <+239>:   mov    QWORD PTR [rsp],r9
   0x000000000001fad3 <+243>:   call   0x19100 <ilp64_mpirinitf_@plt>
   0x000000000001fad8 <+248>:   mov    r9,QWORD PTR [rsp]
   0x000000000001fadc <+252>:   mov    DWORD PTR [r15],0x0
   0x000000000001fae3 <+259>:   jmp    0x1fa0e <mpi_file_set_view__+46>
   0x000000000001fae8 <+264>:   nop    DWORD PTR [rax+rax*1+0x0]

The problem, as I understand it, is in the instructions at "+46" and "+50". The register `r13` (originally `rsi`, see +30) holds the address of the `disp` argument, since Fortran passes it by reference. The `movsxd` at +46 treats that address as an `int32_t*`: it loads only 4 bytes and sign-extends them to 8 bytes, and the result is then stored at +50 into an 8-byte slot at the address held in the stack pointer `rsp`. But `disp` is an MPI_Offset value, so this narrowing discards its upper half. The other argument registers are `rdi` = `fh`, `rdx` = `etype`, `rcx` = `filetype` and `r9` = `info`.

To prove that there is a narrowing of an 8-byte integer to a 4-byte integer I used the MPI profiling interface and wrote an interception implementation of `pmpi_file_set_view_`:

#include <mpi.h>
#include <stdio.h>

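/* Forwarding target: the variant without the trailing underscore, provided by the MPI library. */
void pmpi_file_set_view (int* fh, MPI_Offset* disp, int* etype, int* filetype, const char* datarep, int* info, int* err, int datarep_len);

/* Interceptor: preloaded so that the ILP64 wrapper calls this function instead of the library's pmpi_file_set_view_. */
void pmpi_file_set_view_ (int* fh, MPI_Offset* disp, int* etype, int* filetype, const char* datarep, int* info, int* err, int datarep_len)
{
    /* Print the displacement in octal as received; the cast matches the %llo conversion. */
    printf("[pmpi_file_set_view_] disp: o%llo\n", (unsigned long long) *disp);

    /* Pass all arguments on unchanged. */
    pmpi_file_set_view(fh, disp, etype, filetype, datarep, info, err, datarep_len);
}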

I used it together with a modified reproducer from the very first post:

program mpi_io_offset

    use iso_fortran_env, only: int32, int64
    use mpi

    implicit none

    integer(int32), parameter :: mpiint = kind(MPI_COMM_WORLD)
    integer(int32), parameter :: mpiofs = MPI_OFFSET_KIND

    integer(mpiint) :: ierr, stat(MPI_STATUS_SIZE), one = 1, nelem = int(o'12345671234', mpiint)
    integer(mpiofs) :: zero = 0, bytes
    integer(mpiint), target :: fh

    integer(int64)              :: number = -1
    integer(int64), allocatable :: array(:)

    allocate (array(nelem))
    array = 0
    bytes = nelem
    bytes = bytes * bit_size(number)/8

    call MPI_Init(ierr)
    call MPI_File_open(MPI_COMM_WORLD, 'file.bin', MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
    call MPI_File_set_size(fh, zero, ierr)
    call MPI_File_set_view(fh, zero, MPI_INTEGER8, MPI_INTEGER8, 'native', MPI_INFO_NULL, ierr)
    call MPI_File_write_all(fh, array, nelem, MPI_INTEGER8, stat, ierr)
    call MPI_File_set_view(fh, bytes, MPI_INTEGER8, MPI_INTEGER8, 'native', MPI_INFO_NULL, ierr)

    print *, ierr

    call MPI_File_write_all(fh, number, one, MPI_INTEGER8, stat, ierr)
    call MPI_File_close(fh, ierr)
    call MPI_Finalize(ierr)

end program mpi_io_offset

This time, the program creates a large (~10 GiB) array of 8-byte integers. The byte offset (`disp`) in the second call to `MPI_File_set_view` is `0123456712340` in octal, which makes it easy to see where the value gets cropped. I compile the codes with the following Makefile:

all:
	icx -fPIC -c -debug full -O0 -traceback pmpi_file_set_view.c -o pmpi_file_set_view.o
	icx -shared pmpi_file_set_view.o -o libpmpi_file_set_view.so
	ifort -i8 test_intel_mpi.f90 -L/opt/intel/oneapi/mpi/2021.10.0/lib/release -o test_ilp64.x -lmpi_ilp64 -debug full

I can then run the test as

LD_PRELOAD=$(pwd)/libpmpi_file_set_view.so ./test_ilp64.x

The output is

[pmpi_file_set_view_] disp: o0
[pmpi_file_set_view_] disp: o1777777777763456712340
   201385484

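Apparently, the low 32 bits of the `disp` parameter are passed correctly to the interception function, but the high 32 bits are not: they arrive as all ones, which is what sign extension of a 32-bit value with its top bit set would produce. The high part is most likely lost due to a bug in the ILP64-to-LP64 translation in `mpi_file_set_view_` from "libmpi_ilp64.so".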