Intel® oneAPI HPC Toolkit

Porting code from MPI/Pro 1.7 to Intel MPI 3.1

I am in the process of switching from MPI/Pro 1.7 to Intel MPI 3.1 and I am seeing very strange (and poor) performance issues that have stumped me.

I am seeing poor performance throughout the entire code, but the front end is a good illustration of some of the problems I am seeing. The front end consists of two processes, process 0 (or I/O process) reads in a data header and data and passes them to process 1 (or compute process). Process 1 then processes the data and sends the output field(s) back to process 0 which saves them to disk.

Here is the outline of the MPI framework for the two processes for the simple case of 1 I/O process and 1 compute process:

Process 0:

for (ifrm = 0; ifrm <= totfrm; ifrm++) {

    if (ifrm != totfrm) {
        data_read (..., InpBuf, HD1, ...);
        MPI_Ssend (HD1, ...);
        MPI_Ssend (InpBuf, ...);
    }

    if (ifrm > 0) {
        MPI_Recv (OutBuf, ...);
        sav_data (OutBuf, ...);
    }

} // for (ifrm = 0 ...

// No more data, send zero-length termination message
MPI_Send (MPI_BOTTOM, 0, ...);

Process 1:

// Initialize persistent communication requests
MPI_Recv_init (HdrBuf, ..., req_recvhdr);
MPI_Recv_init (InpBuf, ..., req_recvdat);
MPI_Ssend_init (OutBuf, ..., req_sendout);

// Get header and data for first frame
MPI_Start (req_recvhdr);
MPI_Start (req_recvdat);

while (1) {

    MPI_Wait (req_recvhdr, status);
    MPI_Get_count (status, ..., count);
    if (count == 0) {
        // execute termination code
        break;
    }

    MPI_Wait (req_recvdat, status);

    // Start receives for the next frame while processing the current one
    MPI_Start (req_recvhdr);
    MPI_Start (req_recvdat);

    // ... process data ...

    if (curr_frame > start_frame) {
        MPI_Wait (req_sendout, status);
    }

    // ... process data ...

    // Send output field(s) back to I/O process
    MPI_Start (req_sendout);

} // while (1)

The problem I am having is that the MPI_Wait calls are chewing up a lot of CPU cycles for no obvious reason, and in a very erratic way. Under MPI/Pro, the above MPI framework works in a very reliable and predictable way. With Intel MPI, however, the code can spend almost no time (expected) or several minutes (very unexpected) in one of the MPI_Wait calls. The two waits giving me the most trouble are the ones associated with req_recvhdr and req_sendout.

The code is compiled using the 64-bit versions of the Intel compiler 10.1 and Intel MKL 10.0 and is run on RHEL4 nodes. Both processes are run on the same core.

Like I have already said, this framework works well under MPI/Pro and I am stumped in terms of locating the problem(s) or what things I should try in order to fix the code. Any insight or guidance you could provide would be greatly appreciated.
Hi jburri,

Thanks for posting to the Intel HPC forums and welcome!

You probably need to use wait mode: please try setting the environment variable I_MPI_WAIT_MODE to 'on'.
You could also try setting the environment variable I_MPI_RDMA_WRITE_IMM to 'enable'.
And you could experiment with different values of the I_MPI_SPIN_COUNT variable.
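For example, a launch could look like the following sketch (the rank count and the binary name my_app are just placeholders for your own setup):

```shell
# Placeholder launch script; adjust the values after experimenting.
export I_MPI_WAIT_MODE=on          # MPI_Wait blocks instead of busy-polling
export I_MPI_RDMA_WRITE_IMM=enable # use RDMA writes with immediate data
export I_MPI_SPIN_COUNT=1          # spins before yielding; try other values too
echo "$I_MPI_WAIT_MODE $I_MPI_RDMA_WRITE_IMM $I_MPI_SPIN_COUNT"
# Then launch as usual, e.g.:
#   mpiexec -n 2 ./my_app
```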

Best wishes,

Thanks Dmitry. I will play around with those parameters and see what the impact is on performance.