Dear Community;
We usually read that parallel computation (either OpenMP or MPI) is not deterministic, and that is easy to see in I/O. However, I am comparing the serial and MPI versions of the same code, and the difference in the error is close to 0.3%. The error is defined as the maximum difference between two subsequent time steps; the error reported by the serial version and by the MPI version differs by about 0.3%. Although that is small, the solutions themselves also show slight differences, around 1% in certain regions of the computation, and the difference becomes more noticeable as I increase the number of processes. Still, the difference remains small.
I am wondering about the cause of this behavior, and I would also like to ask two more questions. First, could the difference come from compiling with -O3? I have heard that it is usually not recommended to use optimization with MPI, but I have not found anything solid to back that up.
Second, as a matter of good practice, can we replace MPI_REDUCE + MPI_BCAST with a single MPI_ALLREDUCE? Again, I have heard that this replacement is not a good idea, but the literature does not seem to prove it directly. I use MPI_REDUCE + MPI_BCAST so that the whole computation uses the same time step.
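For clarity, here is a rough sketch of the replacement I mean. The variable names (delta_t, delta_tGlobal) are the ones from my code; the current reduce-plus-broadcast pattern is shown in the comments.

! Sketch only: synchronizing the global minimum time step on every rank.
subroutine sync_time_step(delta_t)
  use mpi
  implicit none
  real(kind=8), intent(inout) :: delta_t
  real(kind=8) :: delta_tGlobal
  integer :: ierr

  ! Current pattern: reduce to rank 0, then broadcast back to everyone:
  !   call MPI_REDUCE(delta_t, delta_tGlobal, 1, MPI_REAL8, MPI_MIN, 0, MPI_COMM_WORLD, ierr)
  !   call MPI_BCAST(delta_tGlobal, 1, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)

  ! Single-call alternative: every rank receives the global minimum directly.
  call MPI_ALLREDUCE(delta_t, delta_tGlobal, 1, MPI_REAL8, MPI_MIN, MPI_COMM_WORLD, ierr)
  delta_t = delta_tGlobal
end subroutine sync_time_step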
Finally, regarding 1D versus 2D domain decomposition, is there any rule for deciding between the two approaches? My problem is 2D, but I have seen other people use a 2D decomposition and end up with ghost cells in four directions (up, down, left, and right), whereas I only have left and right neighbours.
Thank you all!
>>We usually read that parallel computation (either OpenMP or MPI) is not deterministic
This statement has to be qualified.
When a serial program produces results containing error (limited precision, hence round-off error), then a partitioned version of the problem will also produce results containing error. When these errors are accumulated, either within the serial application or in pieces across parallel processes, the order of accumulation can affect the results. In other words, the results typically differ. Note that even in the serial program, if you perform the accumulation in the other direction (e.g. indexing an array of partial results backwards), you can end up with different values.
It should not be overlooked that a partitioned (parallelized) program can, under the right circumstances, produce more accurate results. In other words, your 0.3% could potentially be 0.3% closer to the correct result.
Explanation: when accumulating relatively large values together with relatively small values (values with long sequences of binary fraction bits), the small values will have a larger portion of their fractional part, and in some cases a good portion of their integral part, discarded. When the problem is partitioned, less of this loss tends to occur.
The real questions for you to answer are:
Can you figure out why the difference occurs?
And more importantly, is the difference in the direction of a better result?
Here is a suggested test. If your serial program produces an array of results that you reduce to a sum, sort the array by absolute value, compute the sum starting from the smallest magnitude (summing the signed values themselves, not their absolute values), and compare the two sums. I would expect the difference to be larger than 0.3%. If not, then your conversion to MPI may have an underlying problem.
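A minimal sketch of that test, in case it helps; the random fill here is only a placeholder, you would use the result array produced by your serial run:

! Compare a forward (as-stored) sum with a sum taken in order of increasing magnitude.
program sum_order_test
  implicit none
  integer, parameter :: n = 10000
  real(kind=8) :: a(n), sorted(n), sum_forward, sum_sorted, tmp
  integer :: i, j

  call random_number(a)
  a = (a - 0.5d0) * 1.0d6            ! placeholder data of mixed sign and magnitude

  sum_forward = 0.0d0                ! plain left-to-right accumulation
  do i = 1, n
     sum_forward = sum_forward + a(i)
  end do

  sorted = a                         ! simple insertion sort by absolute value
  do i = 2, n
     tmp = sorted(i)
     j = i - 1
     do while (j >= 1)
        if (abs(sorted(j)) <= abs(tmp)) exit
        sorted(j+1) = sorted(j)
        j = j - 1
     end do
     sorted(j+1) = tmp
  end do

  sum_sorted = 0.0d0                 ! smallest magnitudes accumulated first
  do i = 1, n
     sum_sorted = sum_sorted + sorted(i)
  end do

  print *, 'forward sum :', sum_forward
  print *, 'sorted sum  :', sum_sorted
  print *, 'difference  :', sum_forward - sum_sorted
end program sum_order_test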
Jim Dempsey
At the heart of it all is the fact that floating-point arithmetic is not associative; for example, a + (b + c) need not equal (a + b) + c. It follows that a parallel reduction generally produces a different result than a serial reduction, even when the parallel reduction is deterministic. A full description can be found in a textbook on numerical analysis, such as chapter one of Atkinson. It is a common mistake to believe that a serial result is more accurate than a parallel result.
Thank you very much for your kind replies. As a disclaimer, I am a mechanical engineer with very little background in compilers, computer architecture, and so forth. I use computers to solve my problems, and although I know MPI, OpenMP, and Fortran well, I would not call myself an expert in this field, so some things may be challenging for me to grasp. Nonetheless, thank you very much for your time in this thread. I still have a few questions about the previous replies.
Mr. Gregg S.: I guess the associative property does not hold in parallel computing because the round-off error behaves differently in each process, since each process has a different chunk of the domain. However, I would like to know why you said we usually believe that a serial result is more accurate than the parallel one. Finally, you said that under the right conditions partitioning the problem will deliver more accurate results; why is that, if the ghost cells carry all that information to the neighbouring cells? What happens at the compiler or hardware level that improves the solution?
Any suggestions on replacing MPI_REDUCE + MPI_BCAST with MPI_ALLREDUCE, and on using -O3 with MPI?
Thank you all!
Mr. jimdempseyatthecov
I see the point about summing in a different direction; it makes a lot of sense that the accumulated error will differ depending on the direction. However, when we send the ghost cells to the neighbours, that effect may be suppressed, because the ghost cells already carry that accumulated error implicitly.
Floating-point arithmetic is not associative, period. Regardless of parallelism.
For example, in double precision, adding 0.1 + (0.2 + 0.3) may yield 0.59999999999999998, while adding (0.1 + 0.2) + 0.3 may yield 0.60000000000000009. Both results are correct for floating-point arithmetic; neither is a problem. I encourage you to read a full discussion, such as chapter one of Atkinson, because this topic cannot be compressed effectively into a brief explanation.
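A few lines of Fortran are enough to see this; the exact printed digits may vary with compiler and platform:

! Demonstrates that floating-point addition is not associative.
program assoc_demo
  implicit none
  real(kind=8) :: a, b, c
  a = 0.1d0; b = 0.2d0; c = 0.3d0
  print '(a,f20.17)', 'a + (b + c) = ', a + (b + c)   ! typically 0.59999999999999998
  print '(a,f20.17)', '(a + b) + c = ', (a + b) + c   ! typically 0.60000000000000009
end program assoc_demo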
Differences between serial and MPI parallel results are not a problem; they are normal. When a reduction happens in parallel, the order of the floating-point operations changes, and with a new order of operations the accumulated error may be larger or smaller.
Avoiding -O3 with MPI sounds like bad advice.
I do not know of any reason to choose reduce + broadcast over allreduce.
>> I would like to know why you said we usually believe that a serial result is more accurate than the parallel one.
This is (I assume) because the serial program is used to produce the reference dataset, and any variation from that reference is then considered an error, even though the deviating program may in fact be producing results closer to the true answer.
>> under the right conditions partitioning the problem will deliver more accurate results, why is that?
You are aware that floating-point operations can produce round-off errors. While the round-off error of a single operation can be thought of as 1/2 LSB in the result, what actually happens in addition and subtraction is that one of the arguments suffers no loss while the other loses anywhere from zero to all of its bits. The larger the difference in magnitude, the more bits are lost. For summation, the most accurate result is produced by summing from smallest magnitude to largest, although sorting the array of values to be summed may be impractical.
Assume, as an example, that you have a 2-digit calculator, which can produce these non-zero results:
0.1, 0.2, ... 9.9, 10., 11., ... 99.
What happens when you sum the numbers:
10, 1.0, 1.9, 1.8
The answer you get is 13
Sum the numbers in reverse order (lowest to highest):
The answer is 14
Which is closer to the correct answer of 14.7?
Now assume you parallelize the first instance:
10 + 1.0 || 1.9 + 1.8
11 || 3.7
reduce
14
(the above was done without round-up for illustrative purposes)
In the above "parallel" example, the end result was closer. Different orderings of the numbers will affect the result.
RE: ghost cells
There is insufficient information as to how these cells are produced and used. In my experience, the boundary cells of a domain can be representative of the corresponding boundary of the adjacent domain (tile). These values can be synchronized at each iteration of the simulation, .OR., to reduce synchronization, they can be extrapolated for a few iterations and synchronized/resolved less frequently.
Jim Dempsey
Mr. Dempsey,
I greatly appreciate your kindness and time in this question. Your explanation helped me a lot.
Thanks
Julio
Dear Mr. Dempsey,
I have checked my results under different conditions and found the same trend: the solution varies as I increase the number of processes (npes). If I keep npes small, i.e. 2, 3, or 4, the solution approaches the serial version (here I am assuming the serial version is right).
I was discussing this with my advisor, and we both agreed that the round-off error in Fortran is fixed, since we declared every variable as kind=8. Therefore, we do not see how the order of the floating-point operations affects the round-off error.
Based on your last explanation, I concluded that parallel computing can yield better results because, with fewer floating-point operations accumulated in each process, the round-off error decreases too. However, how is that possible if we are using the same precision in every process?
I am attaching the solution, that is, the vertical component of the velocity vector at some point in the channel. It is interesting that in the first part the three solutions are very similar, but the differences start after roughly the first quarter. The difference is around 32% (between the serial run and the run with npes=10).
It would be very interesting to know your point of view.
Thanks in advance
Even double-precision floating-point has round-off error and is not associative. The perturbations in parallel reductions are well known and well documented; see, for example, http://bebop.cs.berkeley.edu/reproblas/docs/talks/SIAM_AN13.pdf
Both you and your advisor would do well to read chapter one of Atkinson promptly. It would be much easier and more comprehensive than trying to teach the material question by question in this forum.
Thank you, Mr. Gregg; my ultimate goal was never to bother the people of this forum.
I am quite sure my advisor understands errors and their sources very well. I have already read the first chapter, and it does not address my specific problem: why is the solution different when the number of CPUs increases (a 32% difference), and why is the difference so small (0.3%) when the number of CPUs is small (up to 3)? This problem is not addressed in the textbook, and I have not found anything similar on the web. You are stressing the reduction too much, but that reduction is only used for the error and the time step (both are used to stop the iterations; they do not control the final stage), because delta_t is the same in all the ranks.
Finally, I strongly believe that forums are for this type of healthy discussion. I am not asking anyone to solve my problem, just for different perspectives and advice.
Thanks for the material you recommended.
Julio
The variation you are seeing does not look especially large. Validated, provably correct industry codes produce much worse examples. Some applications add a flag or command which allows the user to select serial reductions, sacrificing performance for consistency.
This is simply what happens with floating-point arithmetic when the order of operations is changed, and it gives an idea of how much error (plus or minus) is in the serial calculation. The same kind of variation can occur when moving from one architecture to another, with both runs serial. Some steps can be taken to reduce the effect, such as ordering operations to avoid catastrophic cancellation.
If you don't like the usual explanation about the order of floating-point operations, the alternative is a bug in your parallel code. The simple act of calculating on multiple processors does not, by itself, produce different results.
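To make the "serial reductions for consistency" option mentioned above concrete, here is a minimal sketch; the subroutine and variable names are invented, not from any particular application. It gathers the per-rank partial sums and accumulates them in a fixed rank order, so repeated runs on the same number of ranks agree bit for bit; the result still changes if the number of ranks changes.

! Sketch: a sum reduction whose result does not depend on the library's internal reduction tree.
subroutine consistent_sum(partial, total)
  use mpi
  implicit none
  real(kind=8), intent(in)  :: partial   ! this rank's partial sum
  real(kind=8), intent(out) :: total     ! same value on every rank
  real(kind=8), allocatable :: parts(:)
  integer :: npes, i, ierr

  call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
  allocate(parts(npes))
  call MPI_ALLGATHER(partial, 1, MPI_REAL8, parts, 1, MPI_REAL8, MPI_COMM_WORLD, ierr)
  total = 0.0d0
  do i = 1, npes                         ! fixed accumulation order: rank 0, 1, 2, ...
     total = total + parts(i)
  end do
  deallocate(parts)
end subroutine consistent_sum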
Thank you, Mr. Gregg. I have also thought about a bug, but for a small number of CPUs the difference is negligible (0.5%), so I believed the implementation was right.
I have also checked the behavior of the code thoroughly as the number of CPUs increases, checking the ghost cells in each array, and everything looks right. Nonetheless, I am not saying the implementation is 100% free of bugs. I will try to set up other tests and see how it behaves.
>> checking the ghost cells in each array
Can you show how you implement your ghosting cells (including the time step integration)?
Note, your serial implementation has no ghosting cells.
3-way parallel has two ghosting cell boundaries
10-way parallel has nine ghosting cell boundaries.
The skew observed in your data leads me to suspect that your ghosting method is the cause as opposed to round off error.
One of the characteristics of introducing inter-process coordination data points (a.k.a. ghost cells) is that they can introduce a time-delay skew in the results data.
Are you using Euler method for integration or one of the Adams multi-step methods for integration?
Euler tends to accumulate errors more than Adams (YMMV).
Jim Dempsey
Hi Mr. Jim, thanks for your advice and your interest in this issue. I will share part of my implementation.
The following section takes care of the communication setup, since I am using persistent communication.
**********************************************************************************************
Subroutine MPI_StartUp
  use Variables
  use mpi
  implicit none

  ! Starting up MPI
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,npes,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,MyRank,ierr)

  ! Compute the size of the local block (1D decomposition)
  Jmax = JmaxGlobal
  Imax = ImaxGlobal/npes
  if (MyRank.lt.(ImaxGlobal - npes*Imax)) then
    Imax = Imax + 1
  end if
  if (MyRank.ne.0.and.MyRank.ne.(npes-1)) then
    Imax = Imax + 2
  else
    Imax = Imax + 1
  endif

  ! Computing neighbours
  if (MyRank.eq.0) then
    Left = MPI_PROC_NULL
  else
    Left = MyRank - 1
  end if
  if (MyRank.eq.(npes -1)) then
    Right = MPI_PROC_NULL
  else
    Right = MyRank + 1
  end if

  ! Initializing the arrays in each processor, according to the number of local nodes
  Call InitializeArrays

  ! Creating the datatype that takes care of the non-contiguous data in the arrays
  Call MPI_TYPE_VECTOR(Jmax,1,Imax,MPI_REAL8,GhostCells,ierr)
  Call MPI_TYPE_COMMIT(GhostCells,ierr)

  ! This section defines the displacements and communicates the local Imax among the ranks
  ! for communication purposes (valid for BC and I/O)
  Displacement = 0
  ImaxRecv = 0
  if (MyRank ==0) then
    do i = 1, npes -1
      Displacement(i+1) = ((ImaxRecv(i) - 1) + Displacement(i))-1
    end do
  end if
  Call MPI_GATHER(Imax,1,MPI_INTEGER,ImaxRecv,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)

  ! Creating the channels of communication for this computation,
  ! sending and receiving u_old (ghost cells)
  Call MPI_SEND_INIT(u_old(2,1),1,GhostCells,Left,Tag,MPI_COMM_WORLD,Req(1),ierr)
  Call MPI_RECV_INIT(u_old(Imax,1),1,GhostCells,Right,Tag,MPI_COMM_WORLD,Req(2),ierr)
  Call MPI_SEND_INIT(u_old(Imax-1,1),1,GhostCells,Right,Tag,MPI_COMM_WORLD,Req(3),ierr)
  Call MPI_RECV_INIT(u_old(1,1),1,GhostCells,Left,Tag,MPI_COMM_WORLD,Req(4),ierr)
! I have more of these for each variable.
****************************************************************************************************************************
This section is the main program, where I use MPI_REDUCE and MPI_BCAST to obtain the maximum error and the minimum time step across all ranks.
! Setting up the partition for each processor and initializing MPI
Call MPI_StartUp
UpdateBoundaries = 10 ! Sets the frequency for updating the ghost cells
Call Freestream
iRestart = 0
if(iRestart.eq.0)then
  Call GridBC
  Call Grid
  Call WAG
  Call BC_Supersonic
else
  Call GridBC
  Call Grid
  Call Read_SolutionRestart
endif
Call MPI_BARRIER(MPI_COMM_WORLD,ierr)

DO kk=1, 2001
  ! Different subroutines computing stresses and so forth.

  ! Communicating the maximum error among the processes and delta_t
  Call MPI_REDUCE(eps,epsGlobal,1,MPI_REAL8,MPI_MAX,0,MPI_COMM_WORLD,ierr)
  Call MPI_BCAST(epsGlobal,1,MPI_REAL8,0,MPI_COMM_WORLD,ierr)
  Call MPI_REDUCE(delta_t,delta_tGlobal,1,MPI_REAL8,MPI_MIN,0,MPI_COMM_WORLD,ierr)
  if(MyRank.eq.0) delta_t = delta_tGlobal
  Call MPI_BCAST(delta_t,1,MPI_REAL8,0,MPI_COMM_WORLD,ierr)

  ! Printing the maximum error and the iteration
  if(MyRank.eq.0) then
    write(*,*) kk,epsGlobal,(kk*delta_t)
    write(32,*) kk,epsGlobal
  endif

  Call Swap ! This subroutine updates the new variables to be used in the next iteration

  ! Starts the communication
  if (mod(kk,UpdateBoundaries).eq.0) then
    Call MPI_STARTALL(28,Req,ierr)
    Call MPI_WAITALL(28,Req,Status,ierr)
  end if

  if (epsGlobal.le.10.0**(-8.0)) then
    exit
  endif

  Call MPI_BARRIER(MPI_COMM_WORLD,ierr) ! Barrier in MPI
Enddo
!Output sections
***************************************************************************************************************************
I am using Euler for my time integration, and since I am using a 1D decomposition for a 2D problem, I only have two ghost-cell boundaries, one on each side in the horizontal direction.
Thanks in advance.
Julio
It appears that you are re-computing a variable called delta_t. Are you using variable time steps?
If so, it may be enlightening to produce and save the table of variable time steps as used by the serial program, and then re-use those time steps in the 3-rank and 10-rank programs. Note, I am not suggesting your final application do this; rather, it may isolate where the difference is being generated. Knowing that, you can then devise a workable solution.
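A rough sketch of the replay step, assuming the serial run writes one "iteration, delta_t" pair per line (the file name, unit number, and subroutine name here are placeholders, not from your code):

! Read the serial run's time-step table on rank 0 and broadcast it to all ranks.
subroutine load_serial_time_steps(dt_table, nsteps)
  use mpi
  implicit none
  integer, intent(in)       :: nsteps
  real(kind=8), intent(out) :: dt_table(nsteps)
  integer :: MyRank, ierr, k, kdum

  call MPI_COMM_RANK(MPI_COMM_WORLD, MyRank, ierr)
  if (MyRank == 0) then
     open(90, file='dt_table.dat', status='old')
     do k = 1, nsteps
        read(90, *) kdum, dt_table(k)    ! serial run wrote: iteration, delta_t
     end do
     close(90)
  end if
  call MPI_BCAST(dt_table, nsteps, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
end subroutine load_serial_time_steps

Inside the MPI time loop you would then set delta_t = dt_table(kk) instead of reducing and broadcasting it.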
I do not see how you are using your ghost cells.
What happens to the results data when you set UpdateBoundaries so that the exchange happens on every iteration (I am not sure whether that means 0 or 1)?
Jim Dempsey
Hi Jim, thank you for your suggestion. In fact, the time step is practically the same: I printed the time step from the serial version and from the parallel versions with different npes, and the result is very encouraging because the predicted time step was the same, well, almost the same, with a difference of 0.0001%. However, that was not the case for the error, and here I want to draw your attention. I believe the error is being reduced and broadcast at the wrong time, or I might be using the error from the previous time step.
The main code computes the time step near the top of the main loop and the error near the end. Therefore, I am thinking that when the code hits the MPI_REDUCE and MPI_BCAST for the time step, the buffer holds the right values, but that may not be the case for the error. Let me show you.
DO kk=1, 2001
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call Time_Step ! Here is where I compute the time step
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call Update ! Here is where I compute the error

  ! Communicating the maximum error among the processes and delta_t
  Call MPI_REDUCE(eps,epsGlobal,1,MPI_REAL8,MPI_MAX,0,MPI_COMM_WORLD,ierr)
  Call MPI_BCAST(epsGlobal,1,MPI_REAL8,0,MPI_COMM_WORLD,ierr)
  Call MPI_REDUCE(delta_t,delta_tGlobal,1,MPI_REAL8,MPI_MIN,0,MPI_COMM_WORLD,ierr)
  if(MyRank.eq.0) delta_t = delta_tGlobal
  Call MPI_BCAST(delta_t,1,MPI_REAL8,0,MPI_COMM_WORLD,ierr)

  ! Printing the maximum error and the iteration
  if(MyRank.eq.0) then
    write(*,*) kk,epsGlobal,(kk*delta_t)
    write(32,*) kk,epsGlobal
  endif

  Call Swap

  ! Starts the communication
  if (mod(kk,UpdateBoundaries).eq.0) then
    Call MPI_STARTALL(28,Req,ierr)
    Call MPI_WAITALL(28,Req,Status,ierr)
  end if

  if (epsGlobal.le.10.0**(-8.0)) then
    exit
  endif

  Call MPI_BARRIER(MPI_COMM_WORLD,ierr) ! Barrier in MPI
Enddo
Due to restrictions I wrote (name of Subroutine) instead of the real names, sorry about that. As you can see, the subroutine Update might not be finished on every rank when I reduce the error; some ranks might lag behind, I do not know. I have read that collective communications have implicit barriers, but I am not sure, and I do not know what the specification says. I am thinking about making the following modification:
DO kk=1, 2001
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call Time_Step ! Here is where I compute the time step
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call (name of Subroutine)
  Call Update ! Here is where I compute the error

  Call Swap

  ! Starts the communication
  if (mod(kk,UpdateBoundaries).eq.0) then
    Call MPI_STARTALL(28,Req,ierr)
    Call MPI_WAITALL(28,Req,Status,ierr)
  end if

  ! Here I will wait until everything is done and every rank has sent/received the ghost cells
  ! updated by Swap. Thus every rank must hit the MPI_WAITALL, a kind of barrier.

  ! Communicating the maximum error among the processes and delta_t
  Call MPI_REDUCE(eps,epsGlobal,1,MPI_REAL8,MPI_MAX,0,MPI_COMM_WORLD,ierr)
  Call MPI_BCAST(epsGlobal,1,MPI_REAL8,0,MPI_COMM_WORLD,ierr)
  Call MPI_REDUCE(delta_t,delta_tGlobal,1,MPI_REAL8,MPI_MIN,0,MPI_COMM_WORLD,ierr)
  if(MyRank.eq.0) delta_t = delta_tGlobal
  Call MPI_BCAST(delta_t,1,MPI_REAL8,0,MPI_COMM_WORLD,ierr)

  ! Printing the maximum error and the iteration
  if(MyRank.eq.0) then
    write(*,*) kk,epsGlobal,(kk*delta_t)
    write(32,*) kk,epsGlobal
  endif

  if (epsGlobal.le.10.0**(-8.0)) then
    exit
  endif

  Call MPI_BARRIER(MPI_COMM_WORLD,ierr) ! Barrier in MPI
Enddo
In this way the MPI_WAITALL acts as a barrier, but I am still not sure whether this will require another MPI_BARRIER.
Thanks for your time!
Julio
! Creating the channels of communication for this computation,
! sending and receiving u_old (ghost cells)
Call MPI_SEND_INIT(u_old(2,1),1,GhostCells,Left,Tag,MPI_COMM_WORLD,Req(1),ierr)
Call MPI_RECV_INIT(u_old(Imax,1),1,GhostCells,Right,Tag,MPI_COMM_WORLD,Req(2),ierr)
Call MPI_SEND_INIT(u_old(Imax-1,1),1,GhostCells,Right,Tag,MPI_COMM_WORLD,Req(3),ierr)
Call MPI_RECV_INIT(u_old(1,1),1,GhostCells,Left,Tag,MPI_COMM_WORLD,Req(4),ierr)
This looks like you have mis-constructed your ghost cells.
In Fortran, elements adjacent in the left-most index are adjacent in memory:
A(n,m) and A(n+1,m) are in adjacent memory locations.
From my little understanding of what you are trying to do, I think you should be partitioning the second index by the number of ranks.
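For illustration only (these names are mine, not from your code): with the split along the second index, the local slice is Aslice(ImaxGlobal, Jlocal), each ghost layer is a whole contiguous column, and plain counts of MPI_REAL8 replace the MPI_TYPE_VECTOR:

! Sketch: persistent halo requests when the domain is split along the SECOND index.
subroutine setup_column_halos(Aslice, ImaxGlobal, Jlocal, Left, Right, Tag, Req)
  use mpi
  implicit none
  integer, intent(in)         :: ImaxGlobal, Jlocal, Left, Right, Tag
  real(kind=8), intent(inout) :: Aslice(ImaxGlobal, Jlocal)
  integer, intent(out)        :: Req(4)
  integer :: ierr

  ! First interior column goes to the left neighbour, last interior column to the right;
  ! the ghost columns 1 and Jlocal are filled by the matching receives.
  call MPI_SEND_INIT(Aslice(1,2),        ImaxGlobal, MPI_REAL8, Left,  Tag, MPI_COMM_WORLD, Req(1), ierr)
  call MPI_RECV_INIT(Aslice(1,Jlocal),   ImaxGlobal, MPI_REAL8, Right, Tag, MPI_COMM_WORLD, Req(2), ierr)
  call MPI_SEND_INIT(Aslice(1,Jlocal-1), ImaxGlobal, MPI_REAL8, Right, Tag, MPI_COMM_WORLD, Req(3), ierr)
  call MPI_RECV_INIT(Aslice(1,1),        ImaxGlobal, MPI_REAL8, Left,  Tag, MPI_COMM_WORLD, Req(4), ierr)
end subroutine setup_column_halos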
Jim Dempsey
Hi Jim, you are right. The problem is that the person who developed the code used Fortran but was thinking in C++. As a result, I needed to create a datatype in order to pick up the right values, since they are non-contiguous.
Call MPI_TYPE_VECTOR(Jmax,1,Imax,MPI_REAL8,GhostCells,ierr)
Call MPI_TYPE_COMMIT(GhostCells,ierr)
I will keep you posted in this thread instead of bugging you all constantly.
It would be much faster to copy the data to a contiguous buffer before sending.
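Something along these lines, as a sketch only; the subroutine and buffer names are invented, it shows just one side of the exchange, and it follows the u_old(Imax,Jmax) layout posted above:

! Pack the strided boundary line into a contiguous buffer, exchange it with the
! right neighbour, and unpack the received values into the ghost line.
subroutine exchange_right_boundary(u_old, Imax, Jmax, Right, Tag)
  use mpi
  implicit none
  integer, intent(in)         :: Imax, Jmax, Right, Tag
  real(kind=8), intent(inout) :: u_old(Imax, Jmax)
  real(kind=8) :: sendbuf(Jmax), recvbuf(Jmax)
  integer :: j, ierr, stat(MPI_STATUS_SIZE)

  do j = 1, Jmax
     sendbuf(j) = u_old(Imax-1, j)        ! gather the strided values once
  end do
  call MPI_SENDRECV(sendbuf, Jmax, MPI_REAL8, Right, Tag, &
                    recvbuf, Jmax, MPI_REAL8, Right, Tag, &
                    MPI_COMM_WORLD, stat, ierr)
  do j = 1, Jmax
     u_old(Imax, j) = recvbuf(j)          ! scatter into the ghost line
  end do
end subroutine exchange_right_boundary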
HPC == High Performance Computing
.not.
HPC == High Power Consumption
The data should be laid out, and accessed, in a manner that yields high performance. This may require a little re-conditioning of your C++ programmer for use of Fortran.
Julio, even if you use a strided type, what you are expressing still doesn't make sense.
Effectively, you have a virtual array A(ImaxGlobal, JmaxGlobal), you determine each rank's portion of ImaxGlobal (different ranks may have different values of Imax), and each rank then allocates a slice Aslice(Imax, JmaxGlobal). Is this correct?
If this is the case, then:
rank 0 has no left ghost cells, and has Aslice(Imax, 1:JmaxGlobal) for right ghost cells
rank 1 has Aslice(1, 1:JmaxGlobal) for left ghost cells, and has Aslice(Imax, 1:JmaxGlobal) for right ghost cells
rank last has Aslice(1, 1:JmaxGlobal) for left ghost cells, and has no right ghost cells
Note, each rank has the same names Aslice and Imax, but with different values and storage locations.
Now then, assume (incorrectly, for abstraction purposes) that Fortran indexed the way C++ does.
Explain why your 2nd MPI_SEND_INIT uses u_old(Imax-1,1) instead of u_old(1, 1)???
*** Note, this is not to say that the correction to your code is to use u_old(1,1), as that only fixes the abstraction of the inverted indexing. The actual fix will involve more than this.
I agree with Gregg: in cases where you have to use strided array sections, you should copy the strided values into a contiguous temporary buffer and then send that. However, now you visibly see the overhead you have created in the code (an embarrassment in optimization is now exposed).
Jim Dempsey
Hi Jim, I greatly appreciate your comments; they are very valuable to me.
I have two important things to mention. First, I am a mechanical engineer with no background in computer science, so my knowledge of memory layout, access patterns, and so forth is very limited; nonetheless, I try to understand as much as possible. Second, my code was developed by my advisor and a former PhD student several years ago, and he told me, "Julio, do not change anything in the kernel." Although I have taken MPI classes and a few workshops with John Urbanic, there are still a lot of things I need to learn.
Having said that, I will reply to your comments.
Effectively, you have a virtual array A(ImaxGlobal, JmaxGlobal), you determine each rank's portion of ImaxGlobal (different ranks may have different values of Imax), and each rank then allocates a slice Aslice(Imax, JmaxGlobal). Is this correct?
That's correct. That is why I am using dynamic arrays: each rank has its own size. Since I am using a 1D decomposition, only the 'i' direction is divided among the available ranks.
If this is the case, then:
rank 0 has no left ghost cells, and has Aslice(Imax, 1:JmaxGlobal) for right ghost cells
rank 1 has Aslice(1, 1:JmaxGlobal) for left ghost cells, and has Aslice(Imax, 1:JmaxGlobal) for right ghost cells
rank last has Aslice(1, 1:JmaxGlobal) for left ghost cells, and has no right ghost cells
You are also right here, but my implementation takes care of this by itself using MPI_PROC_NULL; look at this:
! Computing neighbours
if (MyRank.eq.0) then
  Left = MPI_PROC_NULL
else
  Left = MyRank - 1
end if
if (MyRank.eq.(npes -1)) then
  Right = MPI_PROC_NULL
else
  Right = MyRank + 1
end if
Now then, assume (incorrectly, for abstraction purposes) that Fortran indexed the way C++ does.
Explain why your 2nd MPI_SEND_INIT uses u_old(Imax-1,1) instead of u_old(1, 1)???
Your point here is also valid, but if you look at the destination it says "Right". Thus, if my rank is (npes-1), the previous snippet takes care of it.
I agree with Gregg: in cases where you have to use strided array sections, you should copy the strided values into a contiguous temporary buffer and then send that. However, now you visibly see the overhead you have created in the code (an embarrassment in optimization is now exposed).
Here is the problem: by copying the data to a contiguous buffer, I need to create new arrays and an extra step between computing and sending. I am still fetching the data from non-contiguous locations, which takes time, so adding a copy to a contiguous buffer does not relieve the non-contiguous fetch. Creating more arrays for this would also cause more problems, because I would need to allocate more variables.
My objective was to create an MPI application in an easy and friendly manner, because that is the philosophy my advisor and I share. The idea is to share this code with junior members of our team. Still, I want to make sure my implementation is as efficient as possible without being so fancy that only I will be able to understand it.
