MPI_Scatterv/ Gatherv using C++ with "large" 2D matrices throws MPI errors

Jonas_H_ · ‎05-17-2017

I implemented some `MPI_Scatterv` and `MPI_Gatherv` routines for a parallel matrix matrix multiplication. Everything works fine for small matrix sizes up to N = 180, if I exceed this size, e.g. N = 184 MPI throws some errors while using `MPI_Scatterv`.

For the 2D Scatter I used some constructions with MPI_Type_create_subarray and MPI_TYPE_create_resized. Explanations of these constructions can be found in this question http://stackoverflow.com/questions/9269399/sending-blocks-of-2d-array-in-c-using-mpi.

The minimal example code I wrote filles a matrix A with some values scatters it to the local processes and write the rank number of each process in the local copy of the scattered A. After that the local copies will be gathered to the master process.

    #include "mpi.h"
    
    #define N 184 // grid size
    #define procN 2  // size of process grid
    
    int main(int argc, char **argv) {
        double* gA = nullptr; // pointer to array
        int rank, size;       // rank of current process and no. of processes
    
        // mpi initialization
        MPI_Init(&argc, &argv);
    	MPI_Comm_size(MPI_COMM_WORLD, &size);
    	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    
        // force to use correct number of processes
        if (size != procN * procN) {
    		if (rank == 0) fprintf(stderr,"%s: Only works with np = %d.\n", argv[0], procN *  procN);
            MPI_Abort(MPI_COMM_WORLD,1);
        }
    
        // allocate and print global A at master process
        if (rank == 0) {
            gA = new double[N * N];
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < N; j++) {
                    gA[j * N + i] = j * N + i;
    			}
            }
    
            printf("A is:\n");
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < N; j++) {
                    printf("%f ", gA[j * N + i]);
    			}
                printf("\n");
            }
        }
    
        // create local A on every process which we'll process
        double* lA = new double[N / procN * N / procN];
    
        // create a datatype to describe the subarrays of the gA array
        int sizes[2]    = {N, N}; // gA size
        int subsizes[2] = {N / procN, N / procN}; // lA size
        int starts[2]   = {0,0}; // where this one starts
        MPI_Datatype type, subarrtype;
        MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_DOUBLE, &type);
        MPI_Type_create_resized(type, 0, N / procN * sizeof(double), &subarrtype);
        MPI_Type_commit(&subarrtype);
    
        // compute number of send blocks
        // compute distance between the send blocks
        int sendcounts[procN * procN];
        int displs[procN * procN];
    
        if (rank == 0) {
            for (int i = 0; i < procN * procN; i++) {
                sendcounts = 1;
            }
            int disp = 0;
            for (int i = 0; i < procN; i++) {
                for (int j = 0; j < procN; j++) {
                    displs[i * procN + j] = disp;
                    disp += 1;
                }
                disp += ((N / procN) - 1) * procN;
            }
        }
    
        // scatter global A to all processes
        MPI_Scatterv(gA, sendcounts, displs, subarrtype, lA,
                     N*N/(procN*procN), MPI_DOUBLE,
                     0, MPI_COMM_WORLD);
    
        // print local A's on every process
        for (int p = 0; p < size; p++) {
        	if (rank == p) {
        		printf("la on rank %d:\n", rank);
                for (int i = 0; i < N / procN; i++) {
                    for (int j = 0; j < N / procN; j++) {
                        printf("%f ", lA[j * N / procN + i]);
                    }
                    printf("\n");
                }
            }
        	MPI_Barrier(MPI_COMM_WORLD);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    
        // write new values in local A's
        for (int i = 0; i < N / procN; i++) {
            for (int j = 0; j < N / procN; j++) {
                lA[j * N / procN + i] = rank;
            }
        }
    
        // gather all back to master process
        MPI_Gatherv(lA, N*N/(procN*procN), MPI_DOUBLE,
                    gA, sendcounts, displs, subarrtype,
                    0, MPI_COMM_WORLD);
    
        // print processed global A of process 0
        if (rank == 0) {
            printf("Processed gA is:\n");
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < N; j++) {
                    printf("%f ", gA[j * N + i]);
                }
                printf("\n");
            }
        }
    
        MPI_Type_free(&subarrtype);
    
        if (rank == 0) {
            delete gA;
        }
    
        delete lA;
    
        MPI_Finalize();
    
        return 0;
    }

It can be compiled and run using

mpicxx -std=c++11 -o test test.cpp
mpirun -np 4 ./test

For small N=4,...,180 everything goes fine

    A is:
    0.000000 6.000000 12.000000 18.000000 24.000000 30.000000 
    1.000000 7.000000 13.000000 19.000000 25.000000 31.000000 
    2.000000 8.000000 14.000000 20.000000 26.000000 32.000000 
    3.000000 9.000000 15.000000 21.000000 27.000000 33.000000 
    4.000000 10.000000 16.000000 22.000000 28.000000 34.000000 
    5.000000 11.000000 17.000000 23.000000 29.000000 35.000000 
    la on rank 0:
    0.000000 6.000000 12.000000 
    1.000000 7.000000 13.000000 
    2.000000 8.000000 14.000000 
    la on rank 1:
    3.000000 9.000000 15.000000 
    4.000000 10.000000 16.000000 
    5.000000 11.000000 17.000000 
    la on rank 2:
    18.000000 24.000000 30.000000 
    19.000000 25.000000 31.000000 
    20.000000 26.000000 32.000000 
    la on rank 3:
    21.000000 27.000000 33.000000 
    22.000000 28.000000 34.000000 
    23.000000 29.000000 35.000000 
    Processed gA is:
    0.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
    0.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
    0.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
    1.000000 1.000000 1.000000 3.000000 3.000000 3.000000 
    1.000000 1.000000 1.000000 3.000000 3.000000 3.000000 
    1.000000 1.000000 1.000000 3.000000 3.000000 3.000000

Here you see the errors when I use N = 184:

    Fatal error in PMPI_Scatterv: Other MPI error, error stack:
    PMPI_Scatterv(655)..............: MPI_Scatterv(sbuf=(nil), scnts=0x7ffee066bad0, displs=0x7ffee066bae0, dtype=USER<resized>, rbuf=0xe9e590, rcount=8464, MPI_DOUBLE, root=0, MPI_COMM_WORLD) failed
    MPIR_Scatterv_impl(205).........: fail failed
    I_MPIR_Scatterv_intra(265)......: Failure during collective
    I_MPIR_Scatterv_intra(259)......: fail failed
    MPIR_Scatterv(141)..............: fail failed
    MPIC_Recv(418)..................: fail failed
    MPIC_Wait(269)..................: fail failed
    PMPIDI_CH3I_Progress(623).......: fail failed
    pkt_RTS_handler(317)............: fail failed
    do_cts(662).....................: fail failed
    MPID_nem_lmt_dcp_start_recv(288): fail failed
    dcp_recv(154)...................: Internal MPI error!  cannot read from remote process
    Fatal error in PMPI_Scatterv: Other MPI error, error stack:
    PMPI_Scatterv(655)..............: MPI_Scatterv(sbuf=(nil), scnts=0x7ffef0de9b50, displs=0x7ffef0de9b60, dtype=USER<resized>, rbuf=0x21a7610, rcount=8464, MPI_DOUBLE, root=0, MPI_COMM_WORLD) failed
    MPIR_Scatterv_impl(205).........: fail failed
    I_MPIR_Scatterv_intra(265)......: Failure during collective
    I_MPIR_Scatterv_intra(259)......: fail failed
    MPIR_Scatterv(141)..............: fail failed
    MPIC_Recv(418)..................: fail failed
    MPIC_Wait(269)..................: fail failed
    PMPIDI_CH3I_Progress(623).......: fail failed
    pkt_RTS_handler(317)............: fail failed
    do_cts(662).....................: fail failed
    MPID_nem_lmt_dcp_start_recv(288): fail failed
    dcp_recv(154)...................: Internal MPI error!  cannot read from remote process

I found some information abut an issue with MPI_Bcast hang on large user defined types, see (https://software.intel.com/en-us/articles/intel-mpi-library-2017-known-issue-mpi-bcast-hang-on-large-user-defined-datatypes) but I'm not sure if its the same for Scatterv and Gatherv. I'm using Intel MPI Library 2017 Update 2 for Linux.

I hope someone knows a sollution for this problem.

Michael_Intel · ‎05-24-2017

Hello,

I tried to reproduce your issue without success.

Can you make sure that gA is valid (e.g. not out of memory) before entering the Scatterv operation?

Have you tried different algorithms for the MPI_Scatterv (I_MPI_ADJUST_SCATTERV=1/2)?

Best regards,

Michael