MPI Isend/Irecv Bottleneck

youn__kihang · 03-22-2020

Dear all,

I have some questions about an MPI Isend/Irecv bottleneck.
Below is my subroutine named "Broadcast_boundary":

      DO NP=1,NPROCS-1
        IF(MYRANK==NP-1)THEN
         CALL MPI_ISEND( ARRAY_1D(L/2-NUMBER),NUMBER,MPI_REAL,NP  ,101,MPI_COMM_WORLD,IREQ1,IERR)
         CALL MPI_IRECV( ARRAY_1D(L/2)       ,NUMBER,MPI_REAL,NP  ,102,MPI_COMM_WORLD,IREQ2,IERR)
         CALL MPI_WAIT(IREQ1,STATUS1,IERR)
         CALL MPI_WAIT(IREQ2,STATUS2,IERR)
        ELSEIF(MYRANK==NP)THEN
         CALL MPI_ISEND( ARRAY_1D(L/2)       ,NUMBER,MPI_REAL,NP-1,102,MPI_COMM_WORLD,IREQ1,IERR)
         CALL MPI_IRECV( ARRAY_1D(L/2-NUMBER),NUMBER,MPI_REAL,NP-1,101,MPI_COMM_WORLD,IREQ2,IERR)
         CALL MPI_WAIT(IREQ1,STATUS1,IERR)
         CALL MPI_WAIT(IREQ2,STATUS2,IERR)
        ENDIF
      ENDDO

The code is designed to exchange boundary data between each pair of adjacent ranks NP-1 and NP, across all ranks from 0 to NPROCS-1.
And here is my sample program:

      L=20000; NUM=500
      ALLOCATE(A(L),B(L))
      CALL RANDOM(A)
      CALL RANDOM(B)

      CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
      TIC=MPI_WTIME() 
      CALL BROADCAST_BOUNDARY(A,NUM)
      TOC=MPI_WTIME()
      MPI_WTIMES(1)=TOC-TIC
   
      CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
      TIC=MPI_WTIME() 
      CALL BROADCAST_BOUNDARY(B,NUM)
      TOC=MPI_WTIME()
      MPI_WTIMES(2)=TOC-TIC

As far as I understand, mpi_wtimes(1) (the elapsed time to communicate array A) and mpi_wtimes(2) (array B) should be nearly the same, because A and B are the same size.
But after several experiments, I found that mpi_wtimes(1) was about eight times longer than mpi_wtimes(2).

Please let me know whether there is some initialization step in MPI communication, whether something in the first communication speeds up the following ones, or whether my code needs to be improved.
I'll attach the entire code I tested. If you can test it and let me know the results, I think many questions will be answered.
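
One possible restructuring, offered as a sketch rather than a diagnosis: in the loop above, a rank completes its exchange with its lower neighbor (both MPI_WAITs) before it starts the exchange with its upper neighbor, so the exchanges can proceed as a chain from rank 0 upward. Posting all four nonblocking calls first and completing them with one MPI_WAITALL lets both directions proceed at once. The routine name EXCHANGE_BOUNDARY and the four buffers are hypothetical. The sketch assumes separate, non-overlapping send and receive buffers; the sample reuses the same slices of ARRAY_1D in both roles, which would not be valid here because a pending send buffer must not overlap an active receive buffer. The code is free-form Fortran.

```fortran
! Sketch only: exchange NUMBER reals with both neighbours at once.
! SEND_LO/RECV_LO/SEND_HI/RECV_HI are hypothetical, non-overlapping
! buffers; the original routine uses slices of ARRAY_1D instead.
SUBROUTINE EXCHANGE_BOUNDARY(SEND_LO, RECV_LO, SEND_HI, RECV_HI, NUMBER)
  USE MPI
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: NUMBER
  REAL,    INTENT(IN)    :: SEND_LO(NUMBER), SEND_HI(NUMBER)
  REAL,    INTENT(INOUT) :: RECV_LO(NUMBER), RECV_HI(NUMBER)
  INTEGER :: MYRANK, NPROCS, LO, HI, IERR
  INTEGER :: REQS(4), STATS(MPI_STATUS_SIZE,4)

  CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYRANK, IERR)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)

  ! MPI_PROC_NULL turns the calls at the two ends of the chain into no-ops.
  LO = MYRANK - 1
  IF (MYRANK == 0) LO = MPI_PROC_NULL
  HI = MYRANK + 1
  IF (MYRANK == NPROCS-1) HI = MPI_PROC_NULL

  ! Post the receives first, then the sends. Tags follow the original:
  ! 101 travels to the higher rank, 102 travels to the lower rank.
  CALL MPI_IRECV(RECV_LO, NUMBER, MPI_REAL, LO, 101, MPI_COMM_WORLD, &
                 REQS(1), IERR)
  CALL MPI_IRECV(RECV_HI, NUMBER, MPI_REAL, HI, 102, MPI_COMM_WORLD, &
                 REQS(2), IERR)
  CALL MPI_ISEND(SEND_HI, NUMBER, MPI_REAL, HI, 101, MPI_COMM_WORLD, &
                 REQS(3), IERR)
  CALL MPI_ISEND(SEND_LO, NUMBER, MPI_REAL, LO, 102, MPI_COMM_WORLD, &
                 REQS(4), IERR)

  ! Complete all four requests together instead of one pair at a time.
  CALL MPI_WAITALL(4, REQS, STATS, IERR)
END SUBROUTINE EXCHANGE_BOUNDARY
```

With this shape every rank makes the same calls regardless of NPROCS, so the loop over NP is no longer needed.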

PrasanthD_intel · 03-23-2020

Hi,

I have executed your code and got similar results.

Out of curiosity, I added a third array and checked the mpi_wtimes of all three communications. I observed that the elapsed time for the third array is sometimes several times larger than for the second array.

So initialization in MPI communication might not be the only reason.

We will investigate this further and come back to you.

Thanks

Prasanth

jimdempseyatthecove · 03-23-2020

Try:

! MPI_WTIMES(2, nTimes) ! nTimes=10
L=20000; NUM=500
ALLOCATE(A(L),B(L))
CALL RANDOM(A)
CALL RANDOM(B)
DO I=1,nTimes
  CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
  TIC=MPI_WTIME() 
  CALL BROADCAST_BOUNDARY(A,NUM)
  TOC=MPI_WTIME()
  MPI_WTIMES(1,I)=TOC-TIC

  CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
  TIC=MPI_WTIME() 
  CALL BROADCAST_BOUNDARY(B,NUM)
  TOC=MPI_WTIME()
  MPI_WTIMES(2,I)=TOC-TIC
END DO
DO I=1,nTimes
  print *, MPI_WTIMES(:,I)
END DO

*** Also, why do you have the MPI_BARRIER when the MPI_ISEND and MPI_IRECV, together with their respective MPI_WAITs, effectively act as a barrier between adjacent ranks? IOW the send transaction won't overwrite the receiver's buffer until the receiver issues the MPI_IRECV. IOW you may have an unnecessary barrier. You will need to determine whether the MPI_BARRIER is necessary in your actual application.

Jim Dempsey
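
A small addition to the timing loop above, as a sketch: each rank measures its own elapsed time, so the numbers can differ from rank to rank. One way to get a single figure per call is to take the maximum over all ranks, since the exchange is only finished when the slowest rank is finished. TIME_BOUNDARY is a hypothetical helper name, and BROADCAST_BOUNDARY is assumed to be callable with the two-argument form used in the sample program.

```fortran
! Sketch: time one call and return the slowest rank's elapsed time.
! TIME_BOUNDARY is a hypothetical helper; BROADCAST_BOUNDARY is the
! poster's routine, assumed to be callable as in the sample program.
SUBROUTINE TIME_BOUNDARY(A, NUM, TMAX)
  USE MPI
  IMPLICIT NONE
  REAL,             INTENT(INOUT) :: A(*)
  INTEGER,          INTENT(IN)    :: NUM
  DOUBLE PRECISION, INTENT(OUT)   :: TMAX
  DOUBLE PRECISION :: TIC, TLOCAL
  INTEGER :: IERR

  CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)   ! start all ranks together
  TIC = MPI_WTIME()
  CALL BROADCAST_BOUNDARY(A, NUM)
  TLOCAL = MPI_WTIME() - TIC

  ! Every rank receives the maximum of the per-rank elapsed times.
  CALL MPI_ALLREDUCE(TLOCAL, TMAX, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                     MPI_COMM_WORLD, IERR)
END SUBROUTINE TIME_BOUNDARY
```

Inside the loop, `CALL TIME_BOUNDARY(A, NUM, MPI_WTIMES(1,I))` would replace the barrier, tic, call, and toc lines for array A.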

youn__kihang · 03-23-2020

Dear Prasanth,

The more MPI processes there are, the more serious the symptoms become.
In your opinion, could it be a problem that every process uses the same array addresses?
In well-organized MPI codes, the array is usually divided among the processes.
But I can't afford to modify the entire code, so I kept the same array size on every process and only separated the index ranges.

Thanks,
Kihang

Dear Jim,

Of course, the MPI_BARRIER is used only in the sample program; I added it to make the tic-toc timing stricter.
Here is the result:

MPI_NPROCS = 4
NTRY  FIRST(s)  SECOND(s)  RATIO(F/S)
1     3.50E-03  1.19E-03   294%
2     5.01E-04  1.91E-04   263%
3     1.91E-04  9.54E-05   200%
4     9.54E-05  9.54E-05   100%
5     9.54E-05  9.54E-05   100%
6     9.54E-05  1.19E-04   80%
7     9.54E-05  9.54E-05   100%
8     1.19E-04  9.54E-05   125%
9     9.54E-05  1.19E-04   80%
10    9.54E-05  9.54E-05   100%

From the fourth try onward I see no delay, and the ratio of first to second returns to about 100%.

In the actual code (CFD), each iteration performs calculations (DO loops) and then broadcast_array (communication).
In each iteration, the first broadcast takes much longer than the following ones, so I came up with the above sample program to ask the forum.

Thanks,
Kihang

jimdempseyatthecove · 03-23-2020

Kihang,

For CFD performance "hacks" look at an article I wrote in Chapter 5 of High Performance Parallelism Pearls

See: https://software.intel.com/en-us/articles/books-high-performance-parallelism-pearls

for summary of chapters.

And: http://lotsofcores.com/article/code-downloads-parallelism-pearls-books-volumes-one-and-two

Select the Volume One code: http://lotsofcores.com/sites/lotsofcores.com/files/code/Examples.PearlsVolume1.zip

***

The code was written to run on Knights Corner coprocessor, in OpenMP, and was not an MPI application.

What may be of interest is the technique used to greatly reduce the barrier overhead by using inter-thread barriers as opposed to whole-team barriers (somewhat like what you are doing with your BROADCAST_BOUNDARY). This same technique should be able to be worked into MPI.

Also, look at: http://lotsofcores.com/

Scroll down about 1/3rd the page to find a video of the effect on barriers (Plesiochronous Phasing Barriers in Action)

Jim Dempsey

PrasanthD_intel · 03-26-2020

Hi Kihang,

Could you please provide the following details:

1. Hardware details (total nodes, cores, RAM)

2. Interconnect and provider

Also, can you tell us the mpirun/mpiexec.hydra command you are using to run the executable?

Thanks

Prasanth

youn__kihang · 03-27-2020

Thank you all,

Dear Jim,
I read your articles and was impressed by the performance gained with the phasing barrier.
I will need some time to understand and evaluate how to apply the technique in my code.

Dear Prasanth,
Here are my details:
1. Hardware
1) 10 nodes
2) Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
3) 64GB RAM
2. Software
1) Intel 2017u1(2017.1.132)
3. OS
1) CentOS release 6.7
2) Infiniband switch
CA 'mlx4_0'
CA type: MT4099
Number of ports: 1
Firmware version: 2.33.5100
4. Command
1) mpirun and mpiexec.hydra: both give the same results.

I have an additional request.
Where can I get an MPI tuning manual for IMPI 17, 18, and 19?
Please let me know the URL (or files), or how to obtain them, if you know.

Thank you.

jimdempseyatthecove · 03-28-2020

Kihang,

>>Then I need some time to understand and evaluate how to apply the technique in my codes.

It likely will take some time, but the payoff ought to be worthwhile.

In the article, the test platform was the Intel Knights Corner 5110p coprocessor with 60 cores, each with 4 HT, for a total of 240 threads. Each thread using AVX512 variant for KNC.

Background. I found an article by James Reinders, Jim Jeffers, et al. describing how to optimize a CFD application for use on a many-core system. I was curious as to what they did, as I view them as the best of the best programmers. I was wondering if they happened to overlook something, so I VTuned the application. To my surprise, the application exhibited a substantial amount of time at the OpenMP barrier and/or the equivalent TBB barrier that is implicit at the end of the major parallel loop. My optimization strategy was to reduce the barrier wait time. This was done by partitioning the working data into sufficiently more tiles than threads. The other optimizations also used tiling, but their tiles were parallelepipeds whereas I used linear striping (a column of X Y Z). In their model they would calculate the next state for all tiles, barrier with all threads, advance tiles, and loop. IOW the entire model would advance in phase, with the inconvenient overhead of waiting for all threads to reach the next state interval.

In my optimization, each column maintains a state counter. Columns are picked in sequence, and a test is made to make sure the adjacent columns have finished the prior state advancement, spin-waiting when necessary. IOW each thread has essentially its own state-dependency barrier with adjacent columns, as opposed to a barrier across all threads in the team. There was no time where the complete model was at a given state. In the sample program, there were waves propagating through the model in three different time states, and very little "barrier" (spinwait) was experienced by any single thread.
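
One way to read the scheme described above, as a minimal sketch and not the code from the book chapter: tasks are (step, column) pairs handed out in step-major order, and a thread spin-waits only until the column itself and its two neighbors have finished the previous step. Every name here (ADVANCE_COLUMNS, U, DONE, NEXT) is hypothetical, the "column" is a single value, and the stencil is an assumed three-point average with two time levels. The result for columns 1..NCOL ends up in U(:, MOD(NSTEP,2)).

```fortran
! Sketch of per-column dependency waits instead of a team-wide barrier.
! All names are hypothetical; the stencil is a 1-D three-point average.
SUBROUTINE ADVANCE_COLUMNS(U, NCOL, NSTEP)
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: NCOL, NSTEP
  REAL,    INTENT(INOUT) :: U(0:NCOL+1, 0:1)  ! two time levels
  INTEGER :: DONE(0:NCOL+1)                   ! last finished step per column
  INTEGER :: NEXT, TASK, S, C, OLD, NEW, DL, DM, DR

  U(0,1) = U(0,0)                             ! fixed boundary values
  U(NCOL+1,1) = U(NCOL+1,0)
  DONE = 0
  DONE(0) = NSTEP                             ! boundary columns never block
  DONE(NCOL+1) = NSTEP
  NEXT = 0
  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(TASK, S, C, OLD, NEW, DL, DM, DR)
  DO
    ! Hand out (step, column) tasks in step-major order.
    !$OMP ATOMIC CAPTURE
    TASK = NEXT
    NEXT = NEXT + 1
    !$OMP END ATOMIC
    IF (TASK >= NSTEP*NCOL) EXIT
    S = TASK/NCOL + 1
    C = MOD(TASK, NCOL) + 1
    OLD = MOD(S-1, 2)
    NEW = MOD(S, 2)
    ! Spin until this column and both neighbours have finished step S-1.
    DO
      !$OMP ATOMIC READ
      DL = DONE(C-1)
      !$OMP ATOMIC READ
      DM = DONE(C)
      !$OMP ATOMIC READ
      DR = DONE(C+1)
      IF (DL >= S-1 .AND. DM >= S-1 .AND. DR >= S-1) EXIT
    END DO
    !$OMP FLUSH
    U(C,NEW) = (U(C-1,OLD) + U(C,OLD) + U(C+1,OLD))/3.0
    ! Make the new value visible before publishing the state counter.
    !$OMP FLUSH
    !$OMP ATOMIC WRITE
    DONE(C) = S
  END DO
  !$OMP END PARALLEL
END SUBROUTINE ADVANCE_COLUMNS
```

Because tasks are handed out in order, the lowest-numbered unfinished task never depends on a task that has not been started, so the spin-waits cannot deadlock in this sketch.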

With OpenMP, barrier completion notification is relatively fast compared with barrier notification across InfiniBand with MPI. Therefore, I strongly suspect that a similar technique on MPI would see a similar benefit.

In your test code in #1, it appears that you have constructed the equivalent to a pipeline. While your sketch was minimalized, the communication times clearly illustrate the opportunity you have to assure you have something to work on while finished and upcoming data is in flight.

The optimal solution, in my opinion, would likely be to use MPI with 1 process per node, then OpenMP within each node. The shared-memory OpenMP environment will permit efficient work scheduling to the most available next task. Assume one OpenMP thread handles the MPI communication (this can be an oversubscribed thread). Each incoming message is placed into a work-needed buffer (using a free list of buffers) containing the data and a bitmask of the partitions on the node still to be done. The worker threads, when idle, atomically compete to apply the work buffer to the node's data. Note, with sufficient partitioning of the global (all nodes) data you might reach the point where the next work data is always available. Also, data logging/checkpointing can occur without interfering with state advancement.

Jim Dempsey

PrasanthD_intel · 04-20-2020

Hi Kihang,

Sorry for the late reply, I have tested this behavior and after a few experiments found that the issue is just initialization costs.

I am providing a tabulated comparison of results for reference:

             Fully Subscribe   Undersubscribe   Oversubscribe
MPI_RATIO1   8.06              4.91             8.33
MPI_RATIO2   1.04              1.09             1.10
MPI_RATIO3   1.02              0.94             0.995
MPI_RATIO4   1.00              1.07             1.02

Here I am using the naming scheme from your code.

MPI_RATIO<n> = (elapsed time for array n) / (elapsed time for array n+1)

The reason for the initialization cost is that the buffer initializations happen in the first call, and those allocations are one source of extra cost.

Also, in real-world applications, a large number of MPI calls are made, generally within loops, and so the higher cost of the first call amortizes and is not generally an issue.

Is your code really affected by this, or are you just interested in understanding this behavior?

Regards

Prasanth
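
If the first-call cost described above should stay out of a measurement, one option is an untimed warm-up exchange before the timed ones. The sketch below assumes that BROADCAST_BOUNDARY is callable with the two-argument form used in the sample program; TIME_AFTER_WARMUP is a hypothetical helper name.

```fortran
! Sketch: one untimed warm-up exchange, then NTIMES timed exchanges.
! TIME_AFTER_WARMUP is a hypothetical helper; BROADCAST_BOUNDARY is
! the poster's routine, assumed to be callable as in the sample program.
SUBROUTINE TIME_AFTER_WARMUP(A, NUM, NTIMES, TIMES)
  USE MPI
  IMPLICIT NONE
  REAL,             INTENT(INOUT) :: A(*)
  INTEGER,          INTENT(IN)    :: NUM, NTIMES
  DOUBLE PRECISION, INTENT(OUT)   :: TIMES(NTIMES)
  DOUBLE PRECISION :: TIC
  INTEGER :: I, IERR

  ! Warm-up: absorbs any first-call setup inside the MPI library.
  CALL BROADCAST_BOUNDARY(A, NUM)

  DO I = 1, NTIMES
    CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
    TIC = MPI_WTIME()
    CALL BROADCAST_BOUNDARY(A, NUM)
    TIMES(I) = MPI_WTIME() - TIC
  END DO
END SUBROUTINE TIME_AFTER_WARMUP
```

This only moves the one-time cost out of the timed region; it does not remove it from the application's total run time.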

youn__kihang · 04-20-2020

Hi Prasanth,

First of all, I was giving up on this issue a little bit, but thank you for your continued interest and reply.
In conclusion, I can't reduce the initialization cost, so I'll accept it and work around it.

When I tested it, the slowdown did not occur only on the very first call (as with mpi_init).
For example, in the following sequence it also occurs in Communication #2.

Communication #1
Calculation #1
Communication #2

I thought there would be room to improve the code, because I believed that the more MPI nodes there are, the higher the ratio becomes (not tested properly). But I think it's harder than I thought. :-(

I'll try to find ways to reduce communication as much as possible and combine it into a single exchange.
I think we can close this case at this point, and I will let you know if another problem comes up.

Thank you so much, Prasanth and Jim.

Best Regards,
Kihang

jimdempseyatthecove · 04-21-2020

Kihang,

If you haven't done so already, consider adjusting your code to run 1 rank per node, and OpenMP within the node. This will reduce the number of inter-node communications.

Jim Dempsey

youn__kihang · 04-21-2020

Hi Jim,

Thank you for your reply.
I am converting a serial code to a hybrid parallel code (MPI+OpenMP).

In the calculation phase, my current code seems to scale less well with OpenMP than with MPI.
Therefore, each combination of MPI processes and OpenMP threads has its pros and cons.
Increasing the number of MPI processes increases the communication time, while increasing the number of OpenMP threads gives poorer scaling than MPI.
As you recommended, I think my code will be good enough if the OpenMP scalability approaches that of MPI.
I'm sure the moment will come when I need to optimize the OpenMP part.

I'll ask you again then.

Thank you.

jimdempseyatthecove · 04-22-2020

>>My current code seems to be less scalable in the calculation process with the OpenMP compared to MPI.

Are you using MKL as well?

If so, you should be aware that MKL has two different libraries:

multi-threaded (default)
single-threaded

The manner in which the ...-threaded is defined is from the perspective of the workings of the MKL library and .NOT. that of the calling application. This naming convention is the inverse of the C/C++ usage where multi-threaded xxx.lib means thread-safe.

When you code your application using OpenMP, you should (generally) link with the MKL single-threaded library.
When you code your application as single threaded, you should (generally) link with the MKL multi-threaded library.

IOW only one of the domains (app vs MKL) is multi-threaded.

This said, with some finesse, you can combine an OpenMP application with multi-threaded MKL by carefully restricting the combinations of:

Number of threads used by the application: OMP_NUM_THREADS = n
Number of threads used by MKL: MKL_NUM_THREADS = n
and/or reducing OpenMP parallel region spinwait time: KMP_BLOCKTIME=0

Failure to effectively coordinate thread usage between OpenMP application and MKL multi-threaded results in over-subscription, which to some extent can be mitigated by using KMP_BLOCKTIME=0.

Jim Dempsey

PrasanthD_intel · 04-23-2020

Hi Kihang,

You can try Multiple EndPoints ( https://software.intel.com/en-us/mpi-developer-guide-linux-multiple-endpoints-support )
to increase the performance of your hybrid application.

Let us know if we can close this thread.

Thanks

Prasanth

jimdempseyatthecove · 04-23-2020

From my experience with MPI with OpenMP...

As long as you restrict the MPI communication to the serial portion of the process (or master thread of first parallel region), then you do not need to use Multiple EndPoints as referenced in post #14 above.

*** Note: omp_get_thread_num() returns the current parallel region's team member number and not a global thread number.

When parallel regions are .NOT. nested, omp_get_thread_num() == 0 is the main thread of the parallel region .AND. is also the main thread of the process.

When parallel regions are nested, omp_get_thread_num() == 0 is the main thread of the nested parallel region, which was spawned by an arbitrary team member of the next higher nest level (and was not necessarily omp_get_thread_num() == 0 of that nest level).

If (when) your requirements exceed the above restrictions then use the Multiple EndPoints as referenced in post #14 above.

Jim Dempsey
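
The restriction described above corresponds to the MPI_THREAD_FUNNELED support level, where only the thread that initialized MPI makes MPI calls. A minimal sketch, assuming a non-nested parallel region and the two-argument BROADCAST_BOUNDARY from the sample program; INIT_FUNNELED and COMPUTE_AND_EXCHANGE are hypothetical names, and the loop body is a placeholder for the real calculation.

```fortran
! Sketch: hybrid MPI+OpenMP where only the main thread calls MPI.
! INIT_FUNNELED and COMPUTE_AND_EXCHANGE are hypothetical names.
SUBROUTINE INIT_FUNNELED()
  USE MPI
  IMPLICIT NONE
  INTEGER :: PROVIDED, IERR

  ! Replaces MPI_INIT; call it from the main thread.
  CALL MPI_INIT_THREAD(MPI_THREAD_FUNNELED, PROVIDED, IERR)
  IF (PROVIDED < MPI_THREAD_FUNNELED) THEN
    PRINT *, 'MPI library does not provide MPI_THREAD_FUNNELED'
    CALL MPI_ABORT(MPI_COMM_WORLD, 1, IERR)
  END IF
END SUBROUTINE INIT_FUNNELED

SUBROUTINE COMPUTE_AND_EXCHANGE(A, N, NUM)
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: N, NUM
  REAL,    INTENT(INOUT) :: A(N)
  INTEGER :: I

  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(I)
  !$OMP DO
  DO I = 1, N
    A(I) = 0.5*A(I)               ! placeholder for the real calculation
  END DO
  !$OMP END DO
  ! The implicit barrier above guarantees that all updates are finished.
  !$OMP MASTER
  CALL BROADCAST_BOUNDARY(A, NUM) ! only thread 0 of this team calls MPI
  !$OMP END MASTER
  !$OMP BARRIER
  ! After this barrier every thread can use the exchanged boundary data.
  !$OMP END PARALLEL
END SUBROUTINE COMPUTE_AND_EXCHANGE
```

If COMPUTE_AND_EXCHANGE were called from inside another parallel region, thread 0 of the inner team would not necessarily be the main thread of the process, which is the nested case the post warns about.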

PrasanthD_intel · 05-12-2020

Hi Kihang,

We are closing this thread assuming that your issue got resolved.

Please raise a new thread for any further queries.

Regards,

Prasanth