Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI_Comm_rank, MPI_THREAD_MULTIPLE, and performance

B___Christoph
Beginner

Hi everyone,

We found the following behavior in Intel MPI (5.0.3), using both the Intel compilers and GCC:

In a hybrid OpenMP/MPI environment, the performance of MPI_Comm_rank degrades if MPI is initialized with MPI_THREAD_MULTIPLE. I attach two files to show the behavior. They can be compiled with

mpiicpc main.cpp -o test.exe -openmp
mpiicpc main2.cpp -o test_nothreads.exe -openmp

Both executables run a simple parallelized for loop twice: the first pass performs only an arithmetic operation on each iteration, while the second pass additionally calls MPI_Comm_rank inside the loop.

test.exe uses MPI_THREAD_MULTIPLE. Here is a typical runtime (with one thread, OMP_NUM_THREADS=1) for the two loops:

MPI_THREAD_MULTIPLE w/o rank: 0.0411851
MPI_THREAD_MULTIPLE w rank: 1.03309

test_nothreads.exe does not use MPI_THREAD_MULTIPLE, and we get:

w/o rank: 0.0452909
w rank: 0.181268

The slowdown becomes much more severe with, e.g., 16 OpenMP threads:

MPI_THREAD_MULTIPLE w rank: 6.07238
versus
w rank: 0.345186

Using a profiler, we find that a spin lock inside MPI_Comm_rank is responsible for the slowdown.

I understand that with MPI_THREAD_MULTIPLE, some locking is needed for MPI operations. However, I do not see why this should apply to MPI_Comm_rank, since I assume it is a purely local operation; in Open MPI, for example, it internally just returns a member of a struct, namely the cached rank.
I would therefore like to understand whether this is a known problem or a bug.
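
To illustrate what I mean, here is a rough sketch of what I would have expected a rank query to look like internally. The names (my_comm_t, cached_rank, my_comm_rank) are invented for illustration and are not taken from any actual MPI implementation:

// Hypothetical sketch only: a rank query as a plain struct read,
// purely local, without any lock or communication.
struct my_comm_t {
	int cached_rank;   // stored once when the communicator is created
	int cached_size;
};

int my_comm_rank(const my_comm_t* comm, int* rank)
{
	*rank = comm->cached_rank;
	return 0;   // i.e. MPI_SUCCESS
}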

All the best, Christoph.

I cannot attach the .cpp files for some reason, so here is the code:

main2.cpp:

#include "mpi.h"
#include "omp.h"

#include <iostream>
#include "math.h"
#include "stdlib.h"

int main(int argc,char* args[])
{
	MPI_Init(NULL,NULL);
	long n=1000000;
	double start = MPI_Wtime();
	double *d = new double;
	double *d2 = new double;

#pragma omp parallel for
	for(long i=0;i<n;i++)
	{
        
		d2 = cos(d)*pow(d,3.0);


	}
	delete[] d;
	delete[] d2;


	double end1 = MPI_Wtime();
	std::cout << "w/o rank: " << end1-start << std::endl;

	d = new double;
	d2 = new double;

#pragma omp parallel for
	for(long i=0;i<n;i++)
	{
		int myProcID;
		for(int j=0;j<10;j++)
	          MPI_Comm_rank(MPI_COMM_WORLD,&myProcID);
		d2 = cos(d)*pow(d,3.0);


	}
	
	double end2 = MPI_Wtime();
	std::cout << "w rank: " << end2-end1 << std::endl;
	MPI_Finalize();

	return 0;

}

 

main.cpp:

#include "math.h"
#include "mpi.h"
#include "omp.h"

#include <iostream>
#include "stdlib.h"

int main(int argc,char* args[])
{
	int required = MPI_THREAD_MULTIPLE;
	  int provided = 0;
	    MPI_Init_thread(NULL,NULL,required,&provided);
	      if(provided!=required)
		        {
				      std::cout << "Error: MPI thread support insufficient! required " << required << " provided " << provided;
   			            abort();
					          
		        }
	long n=1000000;
	double start = MPI_Wtime();
	double *d = new double;
	double *d2 = new double;

#pragma omp parallel for
	for(long i=0;i<n;i++)
	{
		
		d2 = cos(d)*pow(d,3.0);


	}
	delete[] d;
	delete[] d2;


	double end1 = MPI_Wtime();
	std::cout << "MPI_THREAD_MULTIPLE w/o rank: " << end1-start << std::endl;

	d = new double;
	d2 = new double;

#pragma omp parallel for
	for(long i=0;i<n;i++)
	{
		int myProcID;
		for(int j=0;j<10;j++)
	          MPI_Comm_rank(MPI_COMM_WORLD,&myProcID);
		d2 = cos(d)*pow(d,3.0);


	}
	
	double end2 = MPI_Wtime();
	std::cout << "MPI_THREAD_MULTIPLE w rank: " << end2-end1 << std::endl;
	MPI_Finalize();

	return 0;

}

 

Michael_Intel
Moderator

Hi Christoph,

I will have a look at the issue and get back to you soon.

Best regards,

Michael

Michael_Intel
Moderator

Hi Christoph,

I could reproduce the behaviour you described, and indeed there is a lock inside the local MPI_Comm_rank operation, even though it does not involve communication with other ranks.

However, this is an intentional restriction inherited from the MPICH code base, and it won't be changed, since the performance impact is rather artificial and won't affect real-world applications.

Best regards,

Michael

 

B___Christoph
Beginner

Hi Michael,

Thank you for your reply. I agree that, at least in the cases I can think of, such a call can always be moved out of the loop and the result stored locally. However, I found this behavior in a real-world application, where it caused a slowdown of about a factor of 10, so I do not agree that the performance impact is purely artificial. I found it rather unexpected that this call makes such a difference; it might be a good idea to document this somewhere.
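
For reference, the workaround we use now is simply to query the rank once before the parallel loop and reuse it. A minimal sketch of that pattern (the helper compute and its arguments are made up for illustration, not our actual application code):

#include "mpi.h"

// Hypothetical helper, for illustration only.
void compute(double* d, double* d2, long n)
{
	int myProcID;
	MPI_Comm_rank(MPI_COMM_WORLD, &myProcID);   // query the rank once, outside the loop

#pragma omp parallel for
	for (long i = 0; i < n; i++)
	{
		// reuse the cached rank instead of calling MPI_Comm_rank per iteration
		d2[i] = d[i] * (myProcID + 1);
	}
}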
Besides, I would be very interested to understand the reason for this restriction. Is it portability?

Best regards,

Christoph.
 

Michael_Intel
Moderator

Hi Christoph,

The reason for this implementation style is mainly safety. Since an MPI library that supports MPI_THREAD_MULTIPLE has to implement a locking mechanism in certain places anyway, the locking is kept around functions in general. As you already pointed out, it is not strictly required everywhere, but it is kept for safety, which allows future implementation changes without violating the MPI standard.
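
Conceptually, you can think of it as a single coarse-grained lock taken on entry to MPI calls, roughly like the following sketch (the names global_mpi_mutex and locked_comm_rank are purely illustrative and do not correspond to the actual Intel MPI/MPICH internals):

#include <mutex>

// Illustration only: one coarse-grained lock guarding every MPI entry point
// when the library was initialized with MPI_THREAD_MULTIPLE.
static std::mutex global_mpi_mutex;

int locked_comm_rank(int* rank)
{
	std::lock_guard<std::mutex> guard(global_mpi_mutex);   // serializes all threads
	*rank = 0;   // stands in for the actual (purely local) rank lookup
	return 0;
}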

Best regards,

Michael

 
