Operating system and version: CentOS Linux release 7.5.1804
Intel MPI version: 2019.5.281
Compiler and version: 22.214.171.1241
Fabric: Mellanox Technologies MT27500
Libfabric version: 1.7.2
I have a quick question regarding Intel MPI and large (> 1GB) files created by MPI at runtime. We maintain part of a large code, and the standard test suite for this code imposes a limit on file size to keep the tests small. File size is restricted using "limit filesize 1024m" (csh), but I also tested with "ulimit -f 1024000" (bash) with the same results. When I start mpirun for any code requesting more than a single core, I apparently exceed this file size limit and the code crashes. A minimal example:
      program hello
      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      print*, 'node', rank, ': Hello world'
      call MPI_FINALIZE(ierror)
      end
This runs fine when called as follows:
#!/bin/bash
ulimit -f 1024000
nproc=1
mpirun -n $nproc -ppn $nproc ./mpi-test
It also runs fine without the ulimit and with nproc increased, but crashes with the ulimit and nproc >= 2 with the following error:
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 78358 RUNNING AT pc-beethoven.cluster
=   KILLED BY SIGNAL: 25 (File size limit exceeded)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 78359 RUNNING AT pc-beethoven.cluster
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
As this minimal example doesn't read or write any files, I guess the files must belong to mpirun. Is this "normal" behavior with IMPI (something I need to contact the main developers about), or does it indicate a problem with our IMPI installation?
Thanks in advance for any help!
I have also reproduced the same issue after setting the limits. With a single core it works fine, but with multiple cores it gives me the same error, so I don't think there is a problem with your installation.
We will get back to you after debugging it.
Thanks, that's good to know. In case it helps at all for comparison, the minimum value for "limit filesize" as a function of number of cores that we get for this example is:
1 core: limit filesize 2
2 cores: limit filesize 1661460
4 cores: limit filesize 1745940
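For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a bisection over "ulimit -f". The helper name min_fsize_limit and the upper bound are my own illustrative choices, not anything provided by Intel MPI, and the sketch assumes that the command succeeds at the upper bound and that success is monotonic in the limit. The demo uses dd in a temporary directory so that it runs without MPI.

```shell
#!/bin/bash
# Sketch: find the smallest "ulimit -f" value under which a command succeeds.
# min_fsize_limit is a hypothetical helper, not part of Intel MPI.
# Usage: min_fsize_limit UPPER_BOUND command [args...]
min_fsize_limit() {
    local lo=-1 hi=$1 mid
    shift
    # Invariant: the command succeeds at limit hi (assumed on entry) and
    # fails at limit lo, where lo=-1 is a sentinel that is never tested.
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        # Apply the limit in a subshell so the calling shell keeps its own.
        if ( ulimit -c 0; ulimit -f "$mid"; "$@" ) >/dev/null 2>&1; then
            hi=$mid
        else
            lo=$mid
        fi
    done
    echo "$hi"
}

# Demo without MPI: a writer that needs exactly 20 KiB of file space.
demo_dir=$(mktemp -d)
result=$(min_fsize_limit 1024 dd if=/dev/zero of="$demo_dir/out" bs=1024 count=20)
rm -rf "$demo_dir"
echo "smallest passing limit: $result blocks"
```

With the test program from this thread, the call would look like "min_fsize_limit 4194304 mpirun -n 2 -ppn 2 ./mpi-test", where 4194304 is just an assumed safe upper bound. Note that the block size used by "ulimit -f" depends on the shell (1024 bytes in bash's default mode), so numbers from different shells are not directly comparable.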
The file size you are observing comes from our shared memory fabric / user memory implementation, which allocates a shared memory heap from /dev/shm. That allocation counts against the file size limit, so the behavior is actually as intended.
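The mechanism can be seen in isolation, without MPI. The following is a minimal sketch under a few assumptions: it writes to a temporary directory from mktemp rather than to /dev/shm, the 8-block limit is an arbitrary illustrative value, and dd stands in for whatever process creates the large file. A process that tries to grow a file past the limit is normally terminated by SIGXFSZ (signal 25 on x86 Linux), which is the signal shown in the error output above.

```shell
#!/bin/bash
# Sketch: a file size limit stops a process that writes a file larger than
# the limit. The limit value and the temporary directory are illustrative.

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

limit_blocks=8   # bash counts "ulimit -f" in 1024-byte blocks by default

# Run the writer in a subshell so the limit does not affect this shell.
(
    ulimit -c 0                  # avoid leaving a core file behind
    ulimit -f "$limit_blocks"
    # Try to write 64 KiB, well above the limit.
    exec dd if=/dev/zero of="$workdir/big" bs=1024 count=64 2>/dev/null
)
status=$?

# Normalize the byte count to a plain integer.
size=$(( $(wc -c < "$workdir/big") ))

echo "writer exit status: $status"
echo "bytes written before the limit was hit: $size"
```

A status above 128 means the writer was killed by a signal (status minus 128 gives the signal number). If SIGXFSZ happens to be ignored in the calling environment, the write fails with an error instead, so the status is still nonzero.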
In theory, you could disable it by switching off the shared memory implementation and also disabling shared memory for user allocations.
However, you would then also lose shared memory communication performance.