Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

MPI Vs Serial version, difference in the solution

Julio
Novice
10,044 Views

Dear Community,

We often read that parallel computation (either OpenMP or MPI) is not deterministic. This is usually easy to see in I/O. However, I am comparing the serial and parallel (MPI) versions of the same code, and the difference in the error is close to 0.3%. The error is defined as the maximum difference between two subsequent time steps, so the error reported by the serial version and by the MPI version differs by about 0.3%. Although that difference is small, the solution itself also shows slight differences, up to 1% in certain regions of the domain. When I increase the number of processes, the difference becomes more noticeable, although it remains small.

I am wondering about the cause of this behaviour, and I would also like to ask two more questions. First, could the problem come from compiling with the -O3 flag? I have heard that using aggressive optimization with MPI is sometimes discouraged, but I have not found anything consistent to back that up.

Second, as a matter of good practice, can we replace MPI_REDUCE + MPI_BCAST with a single MPI_ALLREDUCE? Again, I have heard that making this replacement is not recommended, but the literature does not seem to settle it either way. I use MPI_REDUCE + MPI_BCAST so that the whole computation uses the same time step.
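To be concrete, this is the kind of replacement I mean. It is only a sketch: dt_local and dt_global are illustrative names, and MPI_MIN assumes the global time step is the smallest candidate over all ranks.

! Current pattern: reduce the candidate time step to rank 0, then broadcast the result
Call MPI_REDUCE(dt_local, dt_global, 1, MPI_REAL8, MPI_MIN, 0, MPI_COMM_WORLD, ierr)
Call MPI_BCAST(dt_global, 1, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)

! Candidate replacement: a single collective that leaves dt_global on every rank
Call MPI_ALLREDUCE(dt_local, dt_global, 1, MPI_REAL8, MPI_MIN, MPI_COMM_WORLD, ierr)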

Finally, regarding 1D versus 2D domain decomposition, is there any rule for deciding between the two approaches? My problem is 2D, but I have seen other people use a 2D decomposition and end up with ghost cells in four directions (up, down, left and right), whereas I only have left and right neighbours.
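For reference, my left/right exchange looks roughly like this. It is only a sketch: phi, Ibeg, Iend, left, right and GhostCells are illustrative names (Ibeg/Iend are the first/last interior columns owned by the rank, GhostCells is a strided datatype describing one column), and left/right are MPI_PROC_NULL at the ends of the domain.

! 1D decomposition along i: swap one ghost column with each neighbour
Call MPI_SENDRECV(phi(Iend,1),   1, GhostCells, right, 0, &
                  phi(Ibeg-1,1), 1, GhostCells, left,  0, &
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
Call MPI_SENDRECV(phi(Ibeg,1),   1, GhostCells, left,  1, &
                  phi(Iend+1,1), 1, GhostCells, right, 1, &
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)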

 

Thank you all!!!

 

0 Kudos
25 Replies
jimdempseyatthecove
Honored Contributor III
1,922 Views

Ok, I see your indexing scheme. Although it is not efficient, the interchange is in proper order (provided strided transfers take place).

The issue now is that your update interval of 10 integrations means any influence between ranks is not observed for 10 time steps.

To mitigate this to a great extent, the width of the ghost cell array should be expanded from 1 to 10 (your update interval). IOW, your objective is to introduce redundant calculations and an increased amount of data transferred in order to reduce the frequency of the transfers. This can be a valid tradeoff because transfer initiation and data transfer have different latencies: two columns of ghost cells can be transferred faster in one message than in two transfers of one column each. You will have to determine the sweet spot between the number of ghost cells and the update interval.
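A sketch of what I mean, assuming the field is a REAL(8) array phi(Imax,Jmax) split along the first index and NG is the ghost width (= update interval); the names are illustrative:

! One strided type covering NG adjacent boundary columns instead of 1:
! Jmax blocks of NG contiguous REAL(8), with a stride of Imax between blocks.
Call MPI_TYPE_VECTOR(Jmax, NG, Imax, MPI_REAL8, GhostBlock, ierr)
Call MPI_TYPE_COMMIT(GhostBlock, ierr)
! Exchange once every NG time steps; between exchanges each rank redundantly
! integrates its NG ghost columns, so the data stays usable until the next exchange.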

Note, your first presumption may be that doing this costs too much additional computation. This isn't necessarily so. If your C++ programmer gets out of his array-of-structures mentality and codes for structures of arrays, you can make your Imax a multiple of the SIMD vector width. On an AVX-512 system, on average, the 10 ghost cells would add only about 1 additional vector operation (8 doubles). Even with AVX2 you might want to adjust the width of the ghost cells (and the update frequency) so that Imax is a multiple of the number of doubles in a cache line (8), i.e. choose the ghost width in the range 8:15.
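For example (illustrative arithmetic only; Interior and NG are names I am using for the interior width and the ghost width):

! Round the local row length (interior plus ghost columns on both sides) up to a
! multiple of 8 doubles, i.e. one cache line / one AVX-512 vector.
ImaxLocal = ((Interior + 2*NG + 7) / 8) * 8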

Jim Dempsey

0 Kudos
Julio
Novice
1,922 Views

Thank you Jim, you said something very important that made me change my mind. I will set the frequency to 1, so that I update the ghost cells every time step. I had not thought about the fact that I need "N" ghost columns for an update frequency of "N". I thought I could always use the same single ghost column and just update it at whatever frequency I wanted.

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,922 Views

Please report back your findings. Confirmation is invaluable for the readers of this thread.

Don't forget to try multiple ghost cells (e.g. 2) to test both the performance difference and numerical consistency.

If this looks promising, then experiment with making Imax a multiple of the vector width, plus one additional vector width of ghost cells.

Jim Dempsey

0 Kudos
Gregg_S_Intel
Employee
1,922 Views

Copying to a buffer takes time, but it saves time in the communication, which is by far the slowest part. Overall it is a win.

You can't expect the same results when computing with different data. If the full set of neighboring data is not communicated at every time step, then the parallel version is doing a different calculation than the serial version. A useful calculation, and certainly an approximation commonly used in parallel computing, but a different one.

 

0 Kudos
Julio
Novice
1,922 Views

Dear All,

After several tests, I found that the implementation is correct. I let the code run for more iterations, up to the steady-state solution. At that stage, the difference between the serial and the parallel version was around 0.43%. I tested it with different numbers of processes and the solution was the same.

I want to thank you all for your comments and suggestions, including Mr. Gregg's advice, although I think my implementation of it was wrong. I tried to build another datatype out of the MPI_TYPE_VECTOR type, and it did not work. I do not blame MPI for it; perhaps I am simply not constructing the nested datatype correctly.

This is what I did:

Call MPI_TYPE_VECTOR(Jmax,1,Imax,MPI_REAL8,GhostCells,ierr)
Call MPI_TYPE_COMMIT(GhostCells, ierr)

And out of it I tried:

Call MPI_TYPE_CONTIGUOUS(Jmax,GhostCells,GhostCellsCont,ierr)
Call MPI_TYPE_COMMIT(GhostCellsCont,ierr)

 

When I call Send and Recv, I use the last datatype (GhostCellsCont) in the datatype argument. The error I got was return code 174.
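I did not dig further, but one plausible cause (an assumption on my part, not something I verified) is the extent of the vector type: when a strided type is replicated with MPI_TYPE_CONTIGUOUS, the replicas are placed one full extent apart, and the extent of GhostCells already spans nearly the whole array. The usual fix is to shrink the extent with MPI_TYPE_CREATE_RESIZED first, roughly like this (GhostCol and NG are illustrative names):

! Sketch only, not verified code. lb and ext are INTEGER(KIND=MPI_ADDRESS_KIND).
Call MPI_TYPE_GET_EXTENT(MPI_REAL8, lb, ext, ierr)
! Pretend each strided column occupies just one REAL(8), so replicas advance
! element by element and pick up adjacent columns.
Call MPI_TYPE_CREATE_RESIZED(GhostCells, 0_MPI_ADDRESS_KIND, ext, GhostCol, ierr)
Call MPI_TYPE_COMMIT(GhostCol, ierr)
! NG adjacent boundary columns can then be sent as count = NG of GhostCol,
! or packed into one type with MPI_TYPE_CONTIGUOUS(NG, GhostCol, GhostCellsCont, ierr).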

Now I am facing other issues with my Hybrid formulation (MPI+OpenMP) that I will post in a separate thread.

 

Thanks

Julio

0 Kudos
Reply