Please have a look at this simple code:
Why is accessing and filling the shared buffer 1000x slower than doing the same through the raw pointer?
Thanks for reaching out to us!
Could you please share details about your system environment and the commands you used to compile and run the code?
Please provide the following details:
Number of ranks on which you are running the code:
Number of nodes across which you are launching the executable:
Output of the code on your system:
This may be a "first touch" issue.
1) allocate buffer(s) once
2) run timed loop (at least) twice
Check the time(s) of the 2nd (and later) loops.
"First touch": allocation obtains addresses in virtual memory, but no physical RAM and no storage in the page file. The first time the process touches (writes or reads) a page of that virtual memory, a page fault occurs. The O/S then allocates a page-sized number of blocks in the page file (and might wipe them too) and obtains a page-sized number of bytes from RAM, which may require paging some other process's data out to the page file.
I have used various compilers, including Intel's, and several MPI implementations, including Intel MPI. I only run on 1 node, since this is about testing the shared memory functionality. I am running on an Intel Gold machine with 40 physical cores.
None of this really matters, though. I also tried my own laptop with an Intel Core i9. It would take you two seconds to run the code I posted above, and you would see the same result as I do!
The problem is more about the page size. It seems that malloc can be flexible with page sizes, so if you allocate a large chunk of memory it will try to adjust the page size accordingly.
However, when using MPI's MPI_Win_allocate_shared, it looks like the page size is fixed.
I got this information from this paper!
Do you plan to do something about this, or do you have any advice on what to do without modifying the MPI source code or the Linux system itself? (On HPC systems the latter is not ideal, since we normally don't have root access, while the former is impossible with Intel MPI.)
@jimdempseyatthecove I understand why you are suggesting this. I did put a loop around the filling/accessing of the buffers and repeated it up to 20 times. It did not change the observation. I then read the paper I referred to in my last post. That seemed to explain why: for large buffers, you get a lot of page misses when using arrays allocated through MPI's API.
@GouthamK_Intel I saw your reply by email, but it does not seem to appear here in this thread.
But thanks for trying to run my small code, and I am glad to hear you can reproduce my problem. Please keep me updated on the developers' progress.
Hi @Ali Thari
We have gone through the source code you provided. We observed that you use the MPI_Win_allocate_shared call to allocate shared memory for the processes on a node, but only a single rank computes the data. Similarly, you create a buffer with a malloc call on a single rank and compute the data there.
However, the functionality and usage of the two calls are different. When you create a buffer with malloc on a rank, that buffer can be used by that rank only; no other rank can access it.
In contrast, MPI_Win_allocate_shared is a collective call executed by all processes in the group of comm. On each process, it allocates memory of at least size bytes that is shared among all processes in comm, and returns a pointer to the locally allocated segment in baseptr that can be used for load/store accesses on the calling process. The locally allocated memory can be the target of load/store accesses by remote processes.
For more information on the MPI_Win_allocate_shared call, refer to the PDF below, page 407: Window That Allocates Shared Memory
As the two calls serve different purposes, their performance shouldn't be compared.