Intel® MPI Library

mixing Intel MPI and TBB

Hyokun_Y_
Beginner

I have been using a mixture of MPICH2 and TBB very successfully:
MPICH2 for machine-to-machine communication and TBB for thread management within each machine.

Now I am running the very same code on a system that uses Intel MPI instead of MPICH2,
and I am observing very odd behavior: some messages sent with MPI_Ssend are never received
at the destination, and I am wondering whether it is because Intel MPI and TBB do not work well
together.

The following document

http://software.intel.com/en-us/articles/intel-mpi-library-for-linux-product-limitations

says that the environment variable I_MPI_PIN_DOMAIN has to be set properly when
OpenMP and Intel MPI are used together. When TBB is used with Intel MPI instead of
OpenMP, is there anything I should be careful about? Is this combination
guaranteed to work?
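
For reference, for the OpenMP case I understand that to mean a per-job setting along these lines (the program name is just a placeholder, and I do not know whether an equivalent value exists for TBB):

[plain]
# hybrid MPI + OpenMP setting suggested for Intel MPI
export I_MPI_PIN_DOMAIN=omp
mpirun -n 4 ./my_hybrid_app

# or per invocation
mpirun -genv I_MPI_PIN_DOMAIN omp -n 4 ./my_hybrid_app
[/plain]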

 

Thanks,
Hyokun Yun 

 

Hyokun_Y_
Beginner

I have attached a simple test program that mixes TBB with Intel MPI. It worked perfectly fine on the previous cluster, which uses MPICH2, but on a new cluster with Intel MPI some messages are never delivered, so the blocking send never completes.

Hyokun_Y_
Beginner

[cpp]

#include <iostream>
#include <utility>
#include <algorithm>   // std::fill_n

#include <mpi.h>       // MPI_Init_thread, MPI_Irecv, MPI_Ssend, ...

#include "tbb/tbb.h"
#include "tbb/scalable_allocator.h"
#include "tbb/tick_count.h"
#include "tbb/spin_mutex.h"
#include "tbb/concurrent_queue.h"
#include "tbb/pipeline.h"
#include "tbb/compat/thread"
#include <boost/format.hpp>

using namespace std;
using namespace tbb;


int main(int argc, char **argv) {

    // initialize TBB (note: no parentheses here --
    // "tbb::task_scheduler_init init();" would declare a function, not an object)
    tbb::task_scheduler_init init;

    // initialize MPI
    int numtasks, rank, hostname_len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    int mpi_thread_provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &mpi_thread_provided);

    if (mpi_thread_provided != MPI_THREAD_MULTIPLE) {
        cerr << "MPI multiple thread not provided!!! "
             << mpi_thread_provided << " " << MPI_THREAD_MULTIPLE << endl;
        return 1;
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Get_processor_name(hostname, &hostname_len);

    cout << boost::format("processor name: %s, number of tasks: %d, rank: %d\n")
            % hostname % numtasks % rank;

    // run program for 10 seconds
    const double RUN_SEC = 10.0;
    // size of message
    const int MBUFSIZ = 100;

    tick_count start_time = tick_count::now();

    // receive thread: keep receiving messages from any source
    thread receive_thread([&]() {
        int monitor_num = 0;
        double elapsed_seconds;

        int data_done;
        MPI_Status data_status;
        MPI_Request data_request;

        char recvbuf[MBUFSIZ];

        MPI_Irecv(recvbuf, MBUFSIZ, MPI_CHAR,
                  MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &data_request);

        while (true) {
            elapsed_seconds = (tbb::tick_count::now() - start_time).seconds();

            // heartbeat roughly once per second
            if (monitor_num < elapsed_seconds + 0.5) {
                cout << "rank: " << rank << ", receive thread alive" << endl;
                monitor_num++;
            }

            // give the senders 5 extra seconds before shutting down
            if (elapsed_seconds > RUN_SEC + 5.0) {
                break;
            }

            MPI_Test(&data_request, &data_done, &data_status);
            if (data_done) {
                cout << "rank: " << rank << ", message received!" << endl;
                MPI_Irecv(recvbuf, MBUFSIZ, MPI_CHAR,
                          MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &data_request);
            }
        }

        // cancel the outstanding receive and let MPI complete the cancellation
        MPI_Cancel(&data_request);
        MPI_Wait(&data_request, &data_status);

        cout << "rank: " << rank << ", recv thread dying!" << endl;
    });

    // send thread: send one (meaningless) message to (rank + 1) every second
    thread send_thread([&]() {
        int monitor_num = 0;
        double elapsed_seconds;

        char sendbuf[MBUFSIZ];
        fill_n(sendbuf, MBUFSIZ, 0);

        while (true) {
            elapsed_seconds = (tbb::tick_count::now() - start_time).seconds();

            if (monitor_num < elapsed_seconds) {
                cout << "rank: " << rank << ", start sending message" << endl;
                monitor_num++;

                // synchronous send: does not return until the matching
                // receive has started on the destination rank
                MPI_Ssend(sendbuf, MBUFSIZ, MPI_CHAR,
                          (rank + 1) % numtasks, 1, MPI_COMM_WORLD);

                cout << "rank: " << rank << ", send successfully done!" << endl;
            }

            if (elapsed_seconds > RUN_SEC) {
                break;
            }
        }

        cout << "rank: " << rank << ", send thread dying!" << endl;
    });

    receive_thread.join();
    send_thread.join();

    MPI_Finalize();

    return 0;
}

[/cpp]
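
A build/run line along these lines should reproduce it (the file name and flags are only indicative; the Intel compiler wrappers plus TBB and Boost need to be on the paths set up by the compiler environment scripts):

[plain]
# compile with the Intel MPI C++ wrapper; -std=c++0x is needed for the lambdas
mpiicpc -std=c++0x -O2 tbb_mpi_test.cpp -ltbb -o tbb_mpi_test

# run one rank per node across two nodes
mpirun -n 2 -ppn 1 ./tbb_mpi_test
[/plain]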

TimP
Honored Contributor III

Did you take care to adjust the environment settings according to your intended method of scheduling?  I'm guessing that with MPICH2 you have left scheduling to the OS.  If using MPI_THREAD_FUNNELED mode, you can easily set the Intel environment variable to get multiple hardware threads per rank, not relying on MPI to understand tbb as it does OpenMP.  I believe then you can explicitly affinitize the tbb threads to that rank, but I don't know the details.  I suppose if you have forced threads from different ranks to use the same hardware resources, deadlock should not be a surprise.
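
Roughly, the pattern I've seen for pinning TBB threads by hand is a task_scheduler_observer that sets the affinity of each thread as it joins the scheduler. A minimal, untested sketch (the flat core numbering and the cores_per_rank value are just assumptions you would have to adapt to your nodes; the offset only matters when several ranks share a node):

[cpp]
#include <sched.h>                         // sched_setaffinity (Linux; may need _GNU_SOURCE)
#include "tbb/task_scheduler_observer.h"
#include "tbb/atomic.h"

// Pins every thread that enters the TBB scheduler to its own core,
// starting from this rank's first core.
class pinning_observer : public tbb::task_scheduler_observer {
    int first_core_;                       // first core owned by this rank
    tbb::atomic<int> next_slot_;           // next free core slot for this rank
public:
    pinning_observer(int local_rank, int cores_per_rank)
        : first_core_(local_rank * cores_per_rank) {
        next_slot_ = 0;
        observe(true);                     // start receiving entry callbacks
    }
    void on_scheduler_entry(bool /*is_worker*/) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(first_core_ + next_slot_++, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);   // 0 = the calling thread
    }
};

// usage, after MPI_Comm_rank(MPI_COMM_WORLD, &rank):
//   pinning_observer pin(rank_local_to_this_node, 8);   // e.g. 8 cores per rank
[/cpp]

Whether that plays nicely with the library's own pinning is another question; you may need I_MPI_PIN=off so the library and the observer don't fight over the affinity masks.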

Hyokun_Y_
Beginner

Thank you very much for the response!

TimP (Intel) wrote:

If using MPI_THREAD_FUNNELED mode, you can easily set the Intel environment variable to get multiple hardware threads per rank, not relying on MPI to understand tbb as it does OpenMP.  I believe then you can explicitly affinitize the tbb threads to that rank, but I don't know the details.  

I am using MPI_THREAD_MULTIPLE, but I guess what you are saying is still relevant? I tried I_MPI_PIN=off, but it did not help.

TimP (Intel) wrote:

I suppose if you have forced threads from different ranks to use the same hardware resources, deadlock should not be a surprise. 

Actually, I am using a Linux cluster and every rank is assigned to a different machine. But do you still think a particular setting of the environment variables could cause a deadlock? Note that in the example above I am using only two threads, which do MPI_Ssend and MPI_Recv respectively. So at most four threads can possibly be running, and I have 16 cores, so I thought I had plenty of resources.

I have just implemented an OpenMP version of the code, and I am experiencing the same problem: it works fine with MPICH2 but not with Intel MPI (I can share the code if anyone wants). I tried I_MPI_PIN_DOMAIN=omp and I_MPI_PIN=off, but neither helped. Is there any other environment variable to adjust? Any comments are appreciated!

James_T_Intel
Moderator

Hi Hyokun,

I compiled and ran your test program here using Intel® MPI Library Version 4.1.0.030, and Intel® Threading Building Blocks Version 4.1.3.163.  I am not seeing any deadlocks, with all settings at default.  What is the output you get from

[plain]env | grep I_MPI[/plain]

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hyokun_Y_
Beginner

Hi James, 

Thanks for the response! Does my program terminate properly? On the several clusters I tried, it does not terminate, because only one thread makes progress at a time. Note that by the end of the run each machine should have sent (approximately) ten messages.

By default I have only 

I_MPI_FABRICS=shm:tcp

I tried I_MPI_PIN=off but it did not help. I am using impi/4.1.0.024/ and composer_xe_2013.3.163 (icpc 13.1.1.163).

Thanks,
Hyokun Yun 

James_T_Intel
Moderator

Hi Hyokun,

How many nodes are you using?  I was only using two nodes, and testing up to 64 ranks per node (in case oversubscribing led to the problem).  I'm going to try with TCP and see if that causes it to hang.  I'm also going up to 8 nodes.

As far as I can tell, your program completed successfully.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hyokun_Y_
Beginner

James, thanks for looking into this so thoroughly.

James Tullos (Intel) wrote:

How many nodes are you using?

I tried using 2 nodes and 4 nodes.

I reproduced this problem on two different clusters at different institutions, so I do not think this is a hardware-specific issue. Would you please let me know which versions of Intel MPI and TBB you are using? (Probably the most recent?)

In another post on the TBB forum, another person was able to reproduce the problem, and he told me there is a known issue with multi-threading in Intel MPI: http://software.intel.com/en-us/forums/topic/392226 Do you happen to know anything about this issue? I was wondering whether you were using the fixed version.

Thanks,
Hyokun Yun 

James_T_Intel
Moderator

Hi Hyokun,

Let me check with Roman to find out which fix he is talking about, and I'll test with that build.

In my testing, I try to stick to versions that are publicly released, so as to more accurately reproduce what a customer would see.  In this case, I am using IMPI 4.1.0.030 and the latest TBB released with the 2013.3 Composer XE.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

James_T_Intel
Moderator

Hi Hyokun,

Please try upgrading to Intel® MPI Library 4.1.0.030.  This version should correct the problem you are seeing.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hyokun_Y_
Beginner

Switching from 4.1.0.024 to 4.1.0.030 indeed fixed the problem. Thanks very much!

Best,
Hyokun Yun 

James_T_Intel
Moderator

Hi Hyokun,

Great!  I'm glad it's working now.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
