Beginner

Why is it slower to access memory allocated through MPI_Win_allocate_shared?

Please have a look at this simple code:

 

Why is filling the shared-window buffer roughly 1000x slower than filling a buffer obtained from malloc through a raw pointer?

 

#include <mpi.h>
#include <stdio.h>   /* needed for printf */
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Win win;
    MPI_Comm MPI_COMM_SHM;
    int rank;
    long int mb = 0;
    double *bufptr;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int MB = 10000;
    if (rank == 0) mb = MB * 1e6 / sizeof(double);

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &MPI_COMM_SHM);

    MPI_Aint size_bytes, size_of_T, lb;

    MPI_Info infoAlloc;
    MPI_Info_create(&infoAlloc);
    MPI_Info_set(infoAlloc, "alloc_shared_noncontig", "true");
    /* MPI_DOUBLE is the C datatype; MPI_DOUBLE_PRECISION is the Fortran one */
    MPI_Type_get_extent(MPI_DOUBLE, &lb, &size_of_T);

    size_bytes = mb * size_of_T;
    int disp_unit = (int)size_of_T;
    printf("%ld %ld %ld\n", (long)size_bytes, mb, (long)size_of_T);
    MPI_Win_allocate_shared(size_bytes, disp_unit, infoAlloc,
                            MPI_COMM_SHM, &bufptr, &win);

    double *rbuf;
    MPI_Aint rsize;
    int disp;
    if (rank != 0)
    {
        MPI_Win_shared_query(win, 0, &rsize, &disp, &rbuf);
        printf("rbuf %d %ld\n", rank, (long)rsize);
    }

    if (rank == 0)
    {
        double t1 = MPI_Wtime();
        for (long int i = 0; i < mb; i++)
            bufptr[i] = (double)(rank + 1) * i;
        double t2 = MPI_Wtime();
        printf("%e\n", t2 - t1);
    }

    MPI_Win_free(&win);

    if (rank == 0)
    {
        double *test_ = (double *)malloc(sizeof(double) * mb);
        double t1 = MPI_Wtime();
        for (long int i = 0; i < mb; i++)
            test_[i] = (double)(rank + 1) * i;
        double t2 = MPI_Wtime();
        printf("%e\n", t2 - t1);

        double sum_ = 0.0;  /* was uninitialized */
        for (long int i = 0; i < mb; i++)
            sum_ += test_[i];
        printf("__ %e\n\n", sum_);
        free(test_);
    }

    MPI_Finalize();
    return 0;
}


 

Moderator

Hi,

Thanks for reaching out to us!

Could you please share details about your system environment and the commands you used to compile and run the code?

Please provide the details below:

Number of ranks on which you are running the code:

Across how many nodes you are launching the executable:

MPI version:

Interconnect details:

Output of the code on your system:


Regards

Goutham



This may be a "first touch" issue.

Please test:

1) allocate buffer(s) once
2) run timed loop (at least) twice

Check the time(s) of the 2nd (and later) loops.

"first touch": Allocation obtains addresses in virtual memory (but no physical RAM, and no storage in the page file). After the virtual allocation, the first time the process touches (writes or reads) a page, a page fault occurs. The O/S then allocates a page-size number of blocks in the page file (and might wipe them too) and obtains a page-size number of bytes from RAM (which may require paging some other process's data out to the page file).

Jim Dempsey

Beginner

Hi

 

I have used various compilers, including Intel's, and multiple MPI implementations, including Intel MPI. I only run on one node, since this is about testing the shared-memory functionality. I am running on an Intel Gold system with 40 physical cores.

None of this really matters, though. I also tried my own laptop with an Intel Core i9. It would take you two seconds to run the code I put above, and you will see the same outcome as my observations!

 

The problem is more about the page size. It seems that malloc can be flexible with page sizes, so if you allocate a large chunk of memory it will try to adjust the page size.

However, when using MPI's MPI_Win_allocate_shared, it looks like the page size is fixed.

 

I got this information from this paper:

 

https://www.mcs.anl.gov/~balaji/pubs/2015/ppmm/ppmm15.stencil.pdf

 

 

Is this something you plan to address, or do you have any advice on what to do without having to modify the MPI source code itself, or the actual Linux system? (For HPC systems the latter is not ideal, as we don't normally have root access, while the former is impossible if you use Intel MPI.)

Beginner

@jimdempseyatthecove I understand why you are suggesting this. I did put a loop over the process of filling/accessing the buffers, up to 20 times. It did not change the observation. I then read the paper I referred to in my last post, which seemed to explain why: for large buffers, you get a lot of page misses when using arrays allocated through MPI's API.

Beginner

@GouthamK_Intel I saw your reply by email, but it does not seem to appear here in this thread.

Thanks for trying to run my small code, and I'm glad to hear you can reproduce my problem. Please keep me updated on the developers' progress.

 

Thanks

Ali

Moderator

Hi @Ali Thari

We have gone through the source code you provided. We observed that you are using the MPI_Win_allocate_shared call to allocate shared memory for the processes in a node, but only a single rank computes the data. Similarly, you are creating a buffer with a malloc call for a single rank and computing the data.

However, the functionality/usability of the two calls is different. When you create a buffer using malloc for a rank, that buffer is restricted to that rank only, and no other rank can access it.

MPI_Win_allocate_shared, by contrast, is a collective call executed by all processes in the group of comm. On each process, it allocates memory of at least size bytes that is shared among all processes in comm, and returns a pointer to the locally allocated segment in baseptr that can be used for load/store accesses on the calling process. The locally allocated memory can be the target of load/store accesses by remote processes.

For more information on the MPI_Win_allocate_shared call, refer to the PDF below (page 407: "Window That Allocates Shared Memory"):

https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf

As the functionality of the two calls is different, their performance shouldn't be compared directly.


Regards

Goutham
