Intel® MPI Library

Cannot Bcast large data (about 6 GB) after initializing with MPI_Init_thread

sliu
Beginner

Hi all,

 

I am a developer using the Intel MPI Library from C++, with oneAPI version 2022.1 and gcc 10.2.0.

I have run into a weird problem: when I initialize the MPI program with MPI_Init_thread, I cannot broadcast large data between two nodes. The program hangs for a long time and prints no output message.

With MPI_Init, it works fine.

 

How can I broadcast large data normally?
Why does this happen?

 

My compilation command is:

mpiicpc test.cpp

My execution command is:

mpirun -np 2 ./a.out

 

Here is my test.cpp code:

 

#include "mpi.h"
#include <cstdlib>
#include <iostream>
#include <climits>
#include <cstdio>   // for printf

int main(int argc, char* argv[]) {
    std::cout << "INT_MAX = " << INT_MAX << std::endl;
    int *provided(nullptr);
    int myrank,nProcs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, provided);
    MPI_Comm_rank(MPI_COMM_WORLD,&myrank);
    MPI_Comm_size(MPI_COMM_WORLD,&nProcs);
    const size_t num_data = 1e8 * 14;

    double* B = new double[num_data];
    if(myrank == 0){
        B[0]=123456;
        B[100000]=9876;
        B[99999999]=5555;
    }
    if(myrank != 0){
        std::cout<<"data is "<<B[0] <<B[100000]<<B[99999999]<<std::endl;
    }
    double start, end;
    std::cout <<"data size = " << num_data << std::endl;
    std::cout <<"start to broadcast" << std::endl;

    start = MPI_Wtime();

    //MPI_Bcast((void*)B, 1, newtype, 0, MPI_COMM_WORLD);

    MPI_Bcast((void*)B, num_data, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if(myrank != 0){
        std::cout<<"data is "<<B[0] <<B[100000]<<B[99999999]<<std::endl;
    }
    end = MPI_Wtime();
    printf("Runtime = %f\n", end-start);
    delete[] B;
    MPI_Finalize();
    return 0;
}
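
As a quick sanity check on the sizes involved, the sketch below works out the arithmetic for test.cpp. It assumes an 8-byte double; the helper names payload_bytes and fits_in_int_count are illustrative and not part of MPI. The element count (1,400,000,000) is below INT_MAX, so it is a valid int count for MPI_Bcast, while the payload comes to about 11.2 GB rather than the roughly 6 GB mentioned in the title.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Payload size in bytes for `count` doubles, computed in 64-bit arithmetic.
constexpr std::uint64_t payload_bytes(std::uint64_t count) {
    return count * sizeof(double);
}

// True when `count` can be passed as the `int count` argument of MPI_Bcast.
constexpr bool fits_in_int_count(std::uint64_t count) {
    return count <= static_cast<std::uint64_t>(INT_MAX);
}

// 1e8 * 14 elements, as in test.cpp.
constexpr std::uint64_t kNumData = 1400000000ULL;

// Compile-time checks (these assume sizeof(double) == 8).
static_assert(fits_in_int_count(kNumData), "element count fits in int");
static_assert(payload_bytes(kNumData) == 11200000000ULL, "payload is 11.2e9 bytes");
```

Whether the payload size is what triggers the hang is not established in this thread.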

    

SantoshY_Intel
Moderator

Hello,

 

Thank you for posting in the Intel communities.

 

We recommend you use the supported version of the Intel oneAPI HPC Toolkit, which is specified in the following link:

https://www.intel.com/content/www/us/en/developer/articles/release-notes/intel-parallel-studio-xe-supported-and-unsupported-product-versions.html

 

Additionally, we noticed an issue with your sample code: provided is initialized to nullptr, so MPI_Init_thread receives a null pointer for its output argument. To confirm this, you can use the following command:

mpirun -np 2 -check_mpi ./a.out

To address the issue, we suggest modifying a single line of your code. You can replace line 9 with the following statement:

int *provided = new int();

Alternatively, you can replace lines 9 and 11 with the following code:

line 9: int provided;

line 11: MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

 

Please make the necessary changes and let us know if you encounter any further issues.

 

Thanks & Regards,

Santosh
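
The difference between the original declaration and the second suggested fix can be illustrated without MPI. The sketch below uses report_thread_level, a hypothetical stand-in that only mimics the output-pointer shape of the last argument of MPI_Init_thread; it is not an MPI function, and the level value passed to it in any example is arbitrary.

```cpp
#include <cassert>

// Hypothetical stand-in with the same output-pointer shape as the last
// argument of MPI_Init_thread. It is NOT part of MPI.
// Returns false when there is nowhere to store the result.
bool report_thread_level(int required, int* provided) {
    if (provided == nullptr) {
        return false;  // a null pointer gives the callee no place to write
    }
    *provided = required;  // a real library may grant a lower level instead
    return true;
}

// Mirrors the second suggested fix: a plain int on the stack, passed by address.
int query_with_stack_int(int required) {
    int provided = -1;
    bool ok = report_thread_level(required, &provided);
    assert(ok);  // succeeds because &provided is a valid address
    return provided;
}
```

In the real program, after MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided) returns, it is also worth comparing provided with MPI_THREAD_MULTIPLE, since an MPI library is allowed to grant a lower thread level than requested.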

 

 

 

sliu
Beginner

Hi Santosh,

First of all, thanks for your reply and suggestions. I have changed my code for the variable "provided".

Then I tried the -check_mpi option in my mpirun command, and I got new information about a deadlock, shown below:

 

INT_MAX = 2147483647
INT_MAX = 2147483647

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20

data is 000
data size = 1400000000
start to broadcast
data size = 1400000000
start to broadcast
[0] ERROR: no progress observed in any process for over 1:09 minutes, aborting application
[0] WARNING: starting premature shutdown

[0] ERROR: GLOBAL:DEADLOCK:HARD: fatal error
[0] ERROR: Application aborted because no progress was observed for over 1:09 minutes,
[0] ERROR: check for real deadlock (cycle of processes waiting for data) or
[0] ERROR: potential deadlock (processes sending data to each other and getting blocked
[0] ERROR: because the MPI might wait for the corresponding receive).
[0] ERROR: [0] no progress observed for over 1:09 minutes, process is currently in MPI call:
[0] ERROR: MPI_Bcast(*buffer=0x2b0f38000010, count=1400000000, datatype=MPI_DOUBLE, root=0, comm=MPI_COMM_WORLD)
[0] ERROR: main (/public1/wshome/ws39/sc81844/ls/sc81798/test.cpp:36)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/public1/wshome/ws39/sc81844/ls/sc81798/a.out)
[0] ERROR: [1] no progress observed for over 1:09 minutes, process is currently in MPI call:
[0] ERROR: MPI_Bcast(*buffer=0x2b83c4000010, count=1400000000, datatype=MPI_DOUBLE, root=0, comm=MPI_COMM_WORLD)
[0] ERROR: main (/public1/wshome/ws39/sc81844/ls/sc81798/test.cpp:36)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/public1/wshome/ws39/sc81844/ls/sc81798/a.out)

[0] INFO: GLOBAL:DEADLOCK:HARD: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.

 

May I ask whether this problem is related to the MPI_Init_thread function?

When I use MPI_Init instead of MPI_Init_thread, the program executes normally. But my real project code is multi-threaded, so I need MPI_Init_thread.

Thanks again, and I look forward to your reply!

SantoshY_Intel
Moderator

Hi,

 

We tried with the latest Intel MPI 2021.8 and did not encounter any issues, as the following output shows:

$ mpirun -bootstrap ssh -n 2 -ppn 1 -check_mpi ./hang
INT_MAX = 2147483647
INT_MAX = 2147483647

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20

data is 000
data size = 1400000000
start to broadcast
data size = 1400000000
start to broadcast
data is 12345698765555
Runtime = 6.894417
Runtime = 8.114953

[0] INFO: Error checking completed without finding any problems.

So, we recommend that you try the latest Intel MPI version to resolve the issue. If you still face problems, please provide your environment details: operating system, CPU, job scheduler, interconnect hardware, and fabric provider. This information will help us understand your system configuration and identify potential issues more effectively.

 

Thanks & Regards,

Santosh
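
If upgrading is not immediately possible, one commonly used workaround for very large broadcasts is to split the buffer into smaller pieces and broadcast each piece separately. This is not something proposed in this thread, and it has not been verified against the hang reported here. The sketch below only computes the split; Chunk, plan_chunks, and the chunk size are illustrative assumptions, and the MPI_Bcast call appears as a comment so the block compiles without MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One piece of a large buffer: starting element and number of elements.
struct Chunk {
    std::size_t offset;
    std::size_t count;
};

// Splits `total` elements into consecutive pieces of at most `max_chunk`.
std::vector<Chunk> plan_chunks(std::size_t total, std::size_t max_chunk) {
    assert(max_chunk > 0);
    std::vector<Chunk> plan;
    for (std::size_t off = 0; off < total; off += max_chunk) {
        std::size_t remaining = total - off;
        std::size_t n = (remaining < max_chunk) ? remaining : max_chunk;
        plan.push_back({off, n});
    }
    return plan;
}

// In an MPI program every rank would issue one broadcast per piece, e.g.:
//   for (const Chunk& c : plan_chunks(num_data, max_chunk))
//       MPI_Bcast(B + c.offset, static_cast<int>(c.count), MPI_DOUBLE,
//                 0, MPI_COMM_WORLD);
```

With num_data = 1400000000 and a chunk size of 134217728 elements (1 GiB of doubles, an arbitrary choice), the plan has 11 pieces, the last one holding 57822720 elements. All ranks must use the same chunk size so that their broadcast calls match up.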

 

sliu
Beginner

Hi Santosh,

Thanks for your help! I will try the latest Intel MPI version to resolve this issue.

SantoshY_Intel
Moderator

Hi,


Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Santosh

