Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Why is it slower to access memory through MPI_Win_allocate_shared

AThar2
Beginner

Please have a look at this simple code:

 

Why is accessing and filling the shared buffer 1000x slower than doing the same through the raw malloc'ed pointer?

 

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Win  win;
    MPI_Comm MPI_COMM_SHM;
    int      rank;
    long int mb = 0;
    double  *bufptr;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 allocates the whole buffer; all other ranks allocate 0 bytes. */
    int MB = 10000;
    if (rank == 0) mb = MB * 1e6 / sizeof(double);

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &MPI_COMM_SHM);

    MPI_Aint size_bytes, size_of_T, lb;

    MPI_Info infoAlloc;
    MPI_Info_create(&infoAlloc);
    MPI_Info_set(infoAlloc, "alloc_shared_noncontig", "true");
    MPI_Type_get_extent(MPI_DOUBLE, &lb, &size_of_T);

    size_bytes    = mb * size_of_T;
    int disp_unit = (int)size_of_T;
    printf("%ld %ld %ld\n", (long)size_bytes, mb, (long)size_of_T);

    MPI_Win_allocate_shared(size_bytes, disp_unit, infoAlloc,
                            MPI_COMM_SHM, &bufptr, &win);

    /* Non-zero ranks query the base address of rank 0's segment. */
    double  *rbuf;
    MPI_Aint rsize;
    int      disp;
    if (rank != 0)
    {
        MPI_Win_shared_query(win, 0, &rsize, &disp, &rbuf);
        printf("rbuf %d %ld\n", rank, (long)rsize);
    }

    /* Time the fill of the shared buffer on rank 0. */
    if (rank == 0)
    {
        double t1 = MPI_Wtime();
        for (long int i = 0; i < mb; i++)
        {
            bufptr[i] = (double)(rank + 1) * i;
        }
        double t2 = MPI_Wtime();
        printf("%e\n", t2 - t1);
    }

    MPI_Win_free(&win);

    /* Time the same fill of a plain malloc'ed buffer on rank 0. */
    if (rank == 0)
    {
        double *test_ = (double *)malloc(sizeof(double) * mb);

        double t1 = MPI_Wtime();
        for (long int i = 0; i < mb; i++)
        {
            test_[i] = (double)(rank + 1) * i;
        }
        double t2 = MPI_Wtime();
        printf("%e\n", t2 - t1);

        double sum_ = 0.0;
        for (long int i = 0; i < mb; i++)
        {
            sum_ += test_[i];
        }
        printf("__ %e\n\n", sum_);
        free(test_);
    }

    MPI_Finalize();
    return 0;
}


 

GouthamK_Intel
Moderator

Hi,

Thanks for reaching out to us!

Could you please share details about your system environment and the commands you used to compile and run the code?

Please provide the following details:

Number of ranks on which you are running the code:

Number of nodes across which you are launching the executable:

MPI version:

Interconnect details:

Output of the code on your system:


Regards

Goutham


AThar2
Beginner

Hi

 

I have used various compilers, including the Intel compiler, and multiple MPI libraries, including Intel MPI. I only run on 1 node, since this is about testing the shared-memory behaviour. The machine is an Intel Xeon Gold with 40 physical cores.

None of this really matters, though. I also tried it on my own laptop with an Intel Core i9. It would take you two seconds to run the code I posted above, and you will see the same outcome I observed!

 

The problem seems to be more about the page size. malloc can be flexible with page sizes, so if you allocate a large chunk of memory the allocation can end up backed by larger pages.

However, when using MPI's MPI_Win_allocate_shared, it looks like the page size is fixed.

 

I have this information from the following paper:

 

https://www.mcs.anl.gov/~balaji/pubs/2015/ppmm/ppmm15.stencil.pdf

 

 

Is this something you plan to address, or do you have any advice on what to do without having to modify the MPI source code itself or the Linux system? (For HPC clusters the latter is not ideal, as we do not normally have root access, while the former is impossible with Intel MPI.)
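
(For reference, this is the kind of experiment I mean, without touching the MPI source: on Linux you can ask the kernel for transparent huge pages on the window's address range with madvise. The helper below is only a sketch; the function name is mine, and whether it helps at all depends on how the MPI library backs the segment, e.g. a /dev/shm file mapping, and on the kernel's THP settings.)

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: request transparent huge pages for the range returned by
 * MPI_Win_allocate_shared.  Linux only; may have no effect depending on
 * how the MPI library maps the shared segment. */
static void try_request_huge_pages(void *buf, size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);

    /* madvise needs a page-aligned start address. */
    uintptr_t start = (uintptr_t)buf & ~(uintptr_t)(page - 1);
    uintptr_t end   = (uintptr_t)buf + bytes;

    if (madvise((void *)start, end - start, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");
}

/* In the code above, right after MPI_Win_allocate_shared(...):
 *     if (rank == 0) try_request_huge_pages(bufptr, size_bytes);
 */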

AThar2
Beginner

@GouthamK_Intel I saw your reply by email, but it does not seem to appear here in this thread.

But thanks for trying to run my small code and glad to hear you can reproduce my problem. Please keep me updated with the progress of the developers.

 

Thanks

Ali

jimdempseyatthecove
Honored Contributor III

This may be a "first touch" issue.

Please test:

1) allocate buffer(s) once
2) run timed loop (at least) twice

Check the time(s) of the 2nd (and later) loops.

"first touch": Allocation obtains addresses in virtual memory (but no physical RAM, nor storage in the Page FIle). Subsequent to VM allocation, first time the process touches (write or read) of a VM page, page fault occurs, the O/S then allocates a page size number of blocks in the page file (migh wipe them too) and obtains a page size number of bytes from RAM (which may require paging some other processor's data out to the page file).

Jim Dempsey

AThar2
Beginner

@jimdempseyatthecove I understand why you are suggesting this. I did put a loop around the filling/accessing of the buffers, repeating it up to 20 times, and it did not change the observation. I then read the paper I referred to in my previous post, which seemed to explain why: for large buffers you get many more page misses when using arrays allocated through MPI's API.
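
For anyone who wants to see the page misses directly, here is a small sketch (Linux only, using getrusage; it was not part of my original test) that prints the minor page-fault count around two fills of a malloc'ed buffer. The same pair of calls can be put around the fill of the shared buffer in the code above.

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Return the current minor (soft) page-fault count of this process. */
static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    const long n = 125000000L;                /* ~1 GB of doubles */
    double *buf = malloc(n * sizeof(double));

    long before = minor_faults();
    for (long i = 0; i < n; i++)
        buf[i] = (double)i;
    printf("first fill:  %ld minor faults\n", minor_faults() - before);

    before = minor_faults();
    for (long i = 0; i < n; i++)
        buf[i] = (double)i;
    printf("second fill: %ld minor faults\n", minor_faults() - before);

    free(buf);
    return 0;
}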

GouthamK_Intel
Moderator

Hi @Ali Thari

We have gone through the source code you provided and observed that you are using the MPI_Win_allocate_shared call to allocate shared memory for the processes in a node, but you are using only a single rank to compute the data. Similarly, you are creating a buffer with a malloc call for a single rank and computing the data.

But the functionality/usability of the two calls is different. When you create a buffer with a malloc call on a rank, that buffer can be used by that rank only; no other ranks can access it.

Whereas MPI_Win_allocate_shared is a collective call executed by all processes in the group of comm. On each process, it allocates memory of at least size bytes that is shared among all processes in comm, and returns a pointer to the locally allocated segment in baseptr that can be used for load/store accesses on the calling process. The locally allocated memory can be the target of load/store accesses by remote processes.

For more information on the MPI_Win_allocate_shared call, refer to the PDF below, page 407: Window That Allocates Shared Memory.

https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf

As the functionality of the two calls is different, their performance should not be compared.
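
For illustration only, below is a small sketch of the intended usage pattern, where every rank contributes a slice of the shared window, fills its own slice, and then reads a neighbour's slice through direct load/store after MPI_Win_shared_query. The sizes and names are just examples, not a fix for your benchmark.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm shm_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shm_comm);

    int rank, nranks;
    MPI_Comm_rank(shm_comm, &rank);
    MPI_Comm_size(shm_comm, &nranks);

    /* Every rank contributes an equal slice to the shared window. */
    const long n_local = 1000000;
    double  *my_seg;
    MPI_Win  win;
    MPI_Win_allocate_shared(n_local * (MPI_Aint)sizeof(double), sizeof(double),
                            MPI_INFO_NULL, shm_comm, &my_seg, &win);

    /* Each rank fills its own slice, so the work is distributed. */
    for (long i = 0; i < n_local; i++)
        my_seg[i] = rank + 1.0;

    MPI_Win_fence(0, win);   /* synchronize before reading remote segments */

    /* Read the first element of the next rank's segment via direct load/store. */
    int      neighbour = (rank + 1) % nranks;
    double  *nbr_seg;
    MPI_Aint nbr_size;
    int      nbr_disp;
    MPI_Win_shared_query(win, neighbour, &nbr_size, &nbr_disp, &nbr_seg);
    printf("rank %d sees %f in rank %d's segment\n", rank, nbr_seg[0], neighbour);

    MPI_Win_fence(0, win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}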


Regards

Goutham


GouthamK_Intel
Moderator

Hi,

Could you please let us know whether your issue has been resolved?


Regards

Goutham


GouthamK_Intel
Moderator

Hi,

As we haven't heard back from you, we are assuming that your issue has been resolved and that we have answered all your queries, so we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Have a good day!


Thanks & Regards

Goutham

