sebastien_c_2

MPI_Send/MPI_Recv odd behavior

Hi all,

I am new here, but following advice from Intel, I am asking my question here.

I use the Intel MPI beta version (2017.0.042). I have some codes that run locally, and everything works well except in at least one case, where I get odd behavior. Inside my code, I do a first send/recv to exchange the data size, and then I send/recv the data itself. When the size is small, everything works fine. But when I send more than 10,000 doubles, the program never returns, as if it were stuck in an infinite loop. I ran GDB on the two MPI processes as shown below, pressed Ctrl+C, and looked at the backtrace. What I saw was very strange.

mpirun -n 2 xterm -hold -e gdb --args ./foo -m <datafilename>

The scheme is to send from process 1 to process 0, as a small reduction. In that configuration, the destination is 0 and the source is 1. But in the backtrace, this information is corrupted, i.e. source = -1. This would explain the infinite loop. Moreover, the tag variable, which is set to 0, has changed to another value.

So my idea is that there might be a buffer overflow. To check, I switched to MPICH 3.2, and with it everything works fine.
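
The buffer-overflow hypothesis can also be checked on the application side: in a size-then-data exchange, the count announced by the first message should be validated against the receive buffer before the second receive is posted. The following is only a minimal sketch in C under that assumption. `payload_bytes` and `count_fits` are hypothetical helper names, not part of MPI or of the code discussed in this thread, and the block makes no MPI calls.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed to hold `count` doubles.
   Returns 0 when count is not positive or the product would overflow size_t. */
size_t payload_bytes(long count)
{
    if (count <= 0)
        return 0;
    if ((unsigned long)count > SIZE_MAX / sizeof(double))
        return 0;
    return (size_t)count * sizeof(double);
}

/* Hypothetical helper: nonzero when the count announced by the size message
   is positive, representable as an int (the type of an MPI count argument),
   and no larger than the receive buffer's capacity in doubles. */
int count_fits(long announced, size_t capacity)
{
    if (announced <= 0 || announced > INT_MAX)
        return 0;
    return (size_t)announced <= capacity;
}
```

On the receiving rank, `count_fits` would be called with the value taken from the size message and the number of doubles actually allocated, before the data receive is posted. A zero result would point at the application's own bookkeeping rather than at the MPI library.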

 

Finally, following Gergana's advice, I looked at the troubleshooting guide and tried a few ideas. Once again, I got odd behavior: using the following option fixes the bug (https://software.intel.com/fr-fr/node/535596):

mpirun -n 2 -genv I_MPI_FABRICS tcp ./foo -m <datafilename>

So my question is: could I get some help, information, and/or an explanation about this? Is it a bug coming from my usage of Intel MPI?

Thank you in advance for taking the time to read this.

Sebastien

PS: additional information: Asus UX31 laptop with Ubuntu 14.04 LTS and Intel® Core™ i5-2557M CPU @ 1.70GHz × 4


Hi Sebastien,

Do you also have Intel Trace Analyzer and Collector [ITAC] installed (e.g. as part of Parallel Studio XE 2017 beta)? If yes, I suggest running Intel MPI with correctness checking and looking for any errors or warnings in the output. Using correctness checking is very easy if you have sourced the environment from either psxevars.sh or itacvars.sh (found in the installation paths of the tools). Then you only need the extra flag -check_mpi:

mpirun -check_mpi ...

If the code runs perfectly well you will finally see
[0] INFO: Error checking completed without finding any problems.

Or you will see hints about errors or warnings.
In case of a deadlock the code will be stopped automatically after about 60s.

See also https://software.intel.com/en-us/node/528771 and https://software.intel.com/en-us/node/561293

Best regards
Klaus-Dieter

sebastien_c_2

Hi Klaus-Dieter,

Thank you for taking the time to answer. I just tried your idea, and it seems there is a problem. Below is the output of the run.

mpirun -check_mpi -n 2 ./callTSQR_mkl -m rdb800l.mtx 

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20

PARAM 	-m rdb800l.mtx	with-mkl	without-lapack
Vector Starting row number of each block of size 3 : 0 400 800 
Matrix 800x800 with 4640 nnz	local data : 400x800 with 2320 lnnz Structure = Unsymmetric
Matrix 800x800 with 4640 nnz	local data : 400x800 with 2320 lnnz Structure = Unsymmetric
[0] ERROR: no progress observed in any process for over 1:08 minutes, aborting application
[0] WARNING: starting premature shutdown

[0] ERROR: GLOBAL:DEADLOCK:HARD: fatal error
[0] ERROR:    Application aborted because no progress was observed for over 1:08 minutes,
[0] ERROR:    check for real deadlock (cycle of processes waiting for data) or
[0] ERROR:    potential deadlock (processes sending data to each other and getting blocked
[0] ERROR:    because the MPI might wait for the corresponding receive).
[0] ERROR:    [0] no progress observed for over 1:08 minutes, process is currently in MPI call:
[0] ERROR:       MPI_Recv(*buf=0x7fac0fd8e010, count=320000, datatype=MPI_DOUBLE, source=1, tag=0, comm=MPI_COMM_WORLD, *status=0x7ffe442e7a20)
[0] ERROR:       TSQR (../bin/callTSQR_mkl)
[0] ERROR:       main (../bin/callTSQR_mkl)
[0] ERROR:       __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)
[0] ERROR:       (../bin/callTSQR_mkl)
[0] ERROR:    [1] no progress observed for over 1:08 minutes, process is currently in MPI call:
[0] ERROR:       MPI_Send(*buf=0x136f300, count=320000, datatype=MPI_DOUBLE, dest=0, tag=0, comm=MPI_COMM_WORLD)
[0] ERROR:       TSQR (../bin/callTSQR_mkl)
[0] ERROR:       main (../bin/callTSQR_mkl)
[0] ERROR:       __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)
[0] ERROR:       (../bin/callTSQR_mkl)

[0] INFO: GLOBAL:DEADLOCK:HARD: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 14676 RUNNING AT UX31E
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

If you understand what is going on, let me know :)

In the meantime, I'll look at your links. Thank you.

Best,

Sebastien

PS: I have ITAC and have set all the variables (manually, since I use a module load environment).
