
Run a Serial Fortran code with multiple cores?

Hi experts, 

I wrote a serial CFD code in Fortran.

I just built a workstation with two Xeon E5-26xx v3 CPUs, so I want to run the code on multiple cores/threads. I know there is the auto-parallelization option: "ifort -parallel name-of-program.f"

I'm new to the Fortran compiler, so I would like to ask you some questions:

1. Can a serial code run on multiple cores?

2. Can I submit a batch job that specifies the number of cores to use for a serial code?

Thank you very much,

25 Replies

Welcome to FORTRAN! You'll love it like most of us do! 

To answer your questions:

1. Yes it can... sorta by itself, but with tremendous gains when paired with another software solution - OpenMP. The -parallel option doesn't parallelize the code the way I think you're looking for, but OpenMP does, and the ifort compiler provides significant support for it. OpenMP is really not that difficult to implement. My code is over 20,000 lines long and it only took me a week to set it up for OpenMP.

One of my favorite tutorials comes from the computing group out of the University of Colorado Boulder and is found at: https://curc.readthedocs.io/en/latest/programming/OpenMP-Fortran.html. There are many other very useful tutorials online. This is just one of them.

2. You can with OpenMP. The documentation for it is on the website above.

Documentation for the -parallel option and what it does is found at: https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-parallel-qparallel.

If you're looking for parallelization of simple "do-loops", then -parallel should work, but if you're looking for serious parallelization on a single "node", then look into OpenMP. If you're looking into parallelization across a cluster of nodes, then look into MPI (e.g., Open MPI). I hope that helps.
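To make that concrete, here is a minimal, self-contained sketch of an OpenMP-parallelized loop (an illustrative program, not from the original post; note that ifort's OpenMP flag is -qopenmp, which is separate from -parallel):

```fortran
! Build with: ifort -qopenmp omp_demo.f90
program omp_demo
   use omp_lib                      ! OpenMP runtime library
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real(8) :: a(n)
   !$omp parallel do default(shared) private(i)
   do i = 1, n
      a(i) = sqrt(real(i, 8))       ! iterations are split among threads
   end do
   !$omp end parallel do
   print *, 'max threads: ', omp_get_max_threads()
end program omp_demo
```

The number of threads is then controlled at run time with the OMP_NUM_THREADS environment variable, which is also how a batch job specifies the core count (question 2).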

Good Luck!

Andrew


Auto-parallelization is only functional for trivial loops, generally large arrays processed in a manner without dependencies. For CFD-like applications, parts of the application may benefit from auto-parallelism, but much of it will not.

Use of explicit parallelism via OpenMP (with proper coding) will get most of the application parallelized. While this presents more work on your part, the rewards (less unproductive wait time) are often worth it. Andrew's experience in post #2 point 1 is quite typical.

Do expect some initial confusion (I prefer to say enlightenment) in converting your CFD to use OpenMP. Once you see (understand) what went wrong with an initial attempt at parallelizing a loop, it will become knowledge that you will apply to the next loop.

For CFD simulation, the typical approach is to partition the volume into smaller sub-volumes (often called tiles). Each thread works on a tile in an inner loop, then any tile-tile interaction is handled in an outer loop. IOW what may have been coded in the serial program as one loop is now programmed as two nested loops: the inner loop works within each tile (intra-tile), after which you process inter-tile activity, either serially or in parallel. Sketch:

loop
   parallel loop
      intra-tile ! inside tile
   end parallel loop
   parallel loop
      inter-tile ! tile boundaries
   end parallel loop
end loop
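In OpenMP terms, that sketch might look like the following (a sketch only; process_tile and exchange_boundaries are hypothetical routines standing in for your intra-tile and inter-tile work):

```fortran
do step = 1, nsteps                    ! outer time/iteration loop
   !$omp parallel do default(shared) private(t)
   do t = 1, ntiles
      call process_tile(t)             ! intra-tile work, independent per tile
   end do
   !$omp end parallel do
   !$omp parallel do default(shared) private(t)
   do t = 1, ntiles
      call exchange_boundaries(t)      ! inter-tile boundary work
   end do
   !$omp end parallel do
end do
```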

Jim Dempsey


Hints,

Save and maintain your serial program as reference.
Generate and save known (assumed) good outputs for use in difference testing of results data.
Note, you may need to generate a copy of the serial program, modified to produce intermediate results data for comparison with your parallel development.

Whenever possible: parallelize outer loops, vectorize inner loops.
*** This often means you do not parallelize the hottest loop found by VTune; rather, you vectorize that loop and parallelize its caller (when appropriate).
Employing (entering) a parallel region has an up-front cost. When there is insufficient work in the loop, do not parallelize it.
Start with the simplest features of OpenMP.

Note, floating-point calculations often produce inexact results, IOW results with a round-off error. The sum of an array containing round-off errors not only has a round-off error itself, but the resultant value may differ depending on the order in which the summation is performed. In a serial loop, the order is as stated in the source language when vectorization is not used, and may differ when vectorization is used. On top of this, should the procedure be parallelized (with a reduction for the sum), each thread produces a partial sum, and then those partial sums are summed for the result.

Be aware that such summations may (often do) produce an approximation of the sum that differs from the approximation produced by the serial code. Therefore, determination of incorrect/correct results must take this difference into consideration (an error is indicated only when the difference is .gt. some epsilon).
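A sketch of such a tolerance check (the variable names and epsilon value are illustrative and problem-dependent):

```fortran
! Compare serial and parallel results with a relative tolerance,
! since an OpenMP reduction may sum the terms in a different order.
real(8) :: serial_sum, parallel_sum, eps
eps = 1.0d-12                        ! tolerance, chosen per problem
if (abs(parallel_sum - serial_sum) .gt. eps * abs(serial_sum)) then
   write(*,*) 'difference exceeds round-off tolerance'
end if
```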
Jim Dempsey


Thank you for your suggestion, Andrew



Thank you for your advice, Jim



Last time, I ran my serial CFD code on a workstation using a non-interactive batch file (a job file) like this:

      #!/bin/sh
     #PBS -N Ogive
     #PBS -o output
     #PBS -j oe
     #PBS -l ncpus=2
     cd <directory>
     time ./2D-TFlow2Wall-MCM-HKO

The compiler worked well! To build the program for the batch job, I use this makefile:

     SHELL=/bin/sh
     F = ifort
     FFLAGS= -parallel -par-threshold0

     .SUFFIXES : .f .o
     O=2D-TFlow2Wall-MCM-HKO.o
     2D-TFlow2Wall-MCM-HKO : $(O)
     	$(F) $(FFLAGS) -o $@ $(O)
     .f.o :
     	$(F) $(FFLAGS) -c $< -o $@

However, now I cannot run it on my own PC (Ubuntu 18.04), which uses the Intel Fortran compiler from Intel Parallel Studio XE 2019. When I type the command:

     make 2D-TFlow2Wall-MCM-HKO

it looks OK. But the command: qsub job

fails with this message:

   ~~/Documents/NACA0015-25C$ qsub job
There was an error running the SLURM sbatch command.
The command was:
'/usr/bin/sbatch -e job.e%A -o job.o%A job 2>&1'
and the output was:
'sbatch: error: s_p_parse_file: unable to status file /etc/slurm-llnl/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
 sbatch: fatal: Unable to process configuration file'

Could you please tell me how to fix this problem?

Thank you.


>>file /etc/slurm-llnl/slurm.conf: No such file or directory

Self-explanatory. You haven't configured your SLURM system (its .conf file is missing).
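For reference, a single-workstation slurm.conf can be quite small. The sketch below is illustrative only (host name, CPU count, and partition name are placeholders you must adapt; running `slurmd -C` prints a NodeName line matching your actual hardware, and older Slurm versions use ControlMachine instead of SlurmctldHost):

```
# /etc/slurm-llnl/slurm.conf -- minimal single-node sketch (placeholder values)
ClusterName=localcluster
SlurmctldHost=localhost
NodeName=localhost CPUs=20 State=UNKNOWN
PartitionName=main Nodes=localhost Default=YES State=UP
```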

Jim Dempsey



Dear jimdempseyatthecove (Blackbelt),

Thank you, I solved the problem. Now I can run the code with the makefile and batch job submission.

I tried to run both the serial code and the parallel code (the serial code modified with !$omp parallel do / !$omp end parallel do). I used 20 CPUs for both the serial and the parallel calculation. There was no difference in calculation speed, although the calculation speed increased compared with running without a job file.

Here is my batch job file:

#!/bin/bash 

#SBATCH --job-name=testOMP
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --exclusive
#SBATCH --time=0-20:00:00

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
time ./2D-test

And the picture below shows the example of serial and parallel code.

Did I write the parallel code incorrectly? And why can the serial code run with the job file above?

Thank you very much.


Example of serial code:

      DO J=1,NJ
      DO I=1,I0
      DO K=1,NPHI
         F(K,I,J)=AM1(K,1,I,J)*(PHI(1,I+1,J)-PHI(1,I,J))*BAJ1(I,J)
     &           +AM1(K,2,I,J)*(PHI(2,I+1,J)-PHI(2,I,J))*BAJ1(I,J)
     &           +AM1(K,3,I,J)*(PHI(3,I+1,J)-PHI(3,I,J))*BAJ1(I,J)
     &           +AM1(K,4,I,J)*(PHI(4,I+1,J)-PHI(4,I,J))*BAJ1(I,J)
     &           +AM1(K,5,I,J)*(PHI(5,I+1,J)-PHI(5,I,J))*BAJ1(I,J)
     &           +AM1(K,6,I,J)*(PHI(6,I+1,J)-PHI(6,I,J))*BAJ1(I,J)
     &           +AM1(K,7,I,J)*(PHI(7,I+1,J)-PHI(7,I,J))*BAJ1(I,J)
     &           +AM1(K,8,I,J)*(PHI(8,I+1,J)-PHI(8,I,J))*BAJ1(I,J)
      ENDDO
      ENDDO
      ENDDO

Modify to parallel code

      !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I,J,K)
      DO J=1,NJ
      DO I=1,I0
      DO K=1,NPHI
         F(K,I,J)=AM1(K,1,I,J)*(PHI(1,I+1,J)-PHI(1,I,J))*BAJ1(I,J)
     &           +AM1(K,2,I,J)*(PHI(2,I+1,J)-PHI(2,I,J))*BAJ1(I,J)
     &           +AM1(K,3,I,J)*(PHI(3,I+1,J)-PHI(3,I,J))*BAJ1(I,J)
     &           +AM1(K,4,I,J)*(PHI(4,I+1,J)-PHI(4,I,J))*BAJ1(I,J)
     &           +AM1(K,5,I,J)*(PHI(5,I+1,J)-PHI(5,I,J))*BAJ1(I,J)
     &           +AM1(K,6,I,J)*(PHI(6,I+1,J)-PHI(6,I,J))*BAJ1(I,J)
     &           +AM1(K,7,I,J)*(PHI(7,I+1,J)-PHI(7,I,J))*BAJ1(I,J)
     &           +AM1(K,8,I,J)*(PHI(8,I+1,J)-PHI(8,I,J))*BAJ1(I,J)
      ENDDO
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO


Your parallel DO is correct. The only comment to add is:

When the outer loop iteration range is not substantially larger than the number of threads performing the work (many multiples of the thread count), OR the outer loop iteration range is not a multiple of the number of threads, the work distribution among threads is not balanced.

In your above example, should the work not be balanced, it might be mitigated through use of the COLLAPSE clause. You could experiment with adding COLLAPSE(2) and COLLAPSE(3).
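Applied to the loop you posted, each experiment is a one-line change to the directive (a sketch; COLLAPSE fuses the listed perfectly nested loops into one iteration space that is then divided among threads):

```fortran
C Collapse the J and I loops into one iteration space of NJ*I0 chunks:
      !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I,J,K) COLLAPSE(2)
C ...or collapse all three loops (valid here because they are
C perfectly nested):
C     !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I,J,K) COLLAPSE(3)
```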

Additionally,

If this nested loop represents a large portion of the computation, then consider swapping the first two indices of AM1 (and swap elsewhere in your program). This might improve vectorization opportunities of the compiler.

    F(K,I,J)=AM1(1,K,I,J)*(PHI(1,I+1,J)-PHI(1,I,J))*BAJ1(I,J)
     &           +AM1(2,K,I,J)*(PHI(2,I+1,J)-PHI(2,I,J))*BAJ1(I,J)
     &           +AM1(3,K,I,J)*(PHI(3,I+1,J)-PHI(3,I,J))*BAJ1(I,J)
     &           +AM1(4,K,I,J)*(PHI(4,I+1,J)-PHI(4,I,J))*BAJ1(I,J)
     &           +AM1(5,K,I,J)*(PHI(5,I+1,J)-PHI(5,I,J))*BAJ1(I,J)
     &           +AM1(6,K,I,J)*(PHI(6,I+1,J)-PHI(6,I,J))*BAJ1(I,J)
     &           +AM1(7,K,I,J)*(PHI(7,I+1,J)-PHI(7,I,J))*BAJ1(I,J)
     &           +AM1(8,K,I,J)*(PHI(8,I+1,J)-PHI(8,I,J))*BAJ1(I,J)

Jim Dempsey


Without knowing the values of NJ, I0, or NPHI, and based on the limited code you have provided, the following change appears to be justified and could improve the performance.
If NJ >> num_threads, then collapse is not necessary, but if NJ is smaller than I0, then collapse(2) would be effective.

Modify to parallel code
    dimension dphi(8)

    !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I,J,K,dphi) COLLAPSE(2)
      DO J=1,NJ
       DO I=1,I0
        dphi(1:8) = ( PHI(1:8,I+1,J)-PHI(1:8,I,J) ) * BAJ1(I,J)
        
        DO K=1,NPHI
          F(K,I,J) = dot_product ( AM1(K,1:8,I,J), dphi(1:8) ) 
        ENDDO
        
       ENDDO
      ENDDO
    !$OMP END PARALLEL DO

Changing the definition of AM1(1:NPHI,1:8,I,J) to AM1(1:8,1:NPHI,I,J) would improve this loop although you would need to consider other effects.

You would need to consider these changes, based on your knowledge of other parts of the code. 



Dear John,

Thank you for your advice.

In this code, NJ and I0 are the numbers of grid points in the Oxy domain: 70 and 279, respectively. NPHI is the number of equations to solve, equal to 8.

I changed the above part of the code based on your suggestion (now NPHI=4 instead of 8). Here are some parts of the code where I use the OMP parallel clauses (I changed the code posted here before, based on Jim's comments in the post below).

If you have time, please take a look and give me your advice.

Thank you very much,

C ------------------------------------------
C --------------- MAIN LOOP ----------------
C ---------------\/ \/ \/ \/----------------
C ------------------------------------------
      DO ITIMST=1,NTIMST
           DO J=1,NJ
           DO I=1,NI
           RN1(I,J)=ROU(I,J)
           UN1(I,J)=  U(I,J)
           VN1(I,J)=  V(I,J)
           WN1(I,J)=WAI(I,J)
           PN1(I,J)=  P(I,J)
           TN1(I,J)=  T(I,J)
           EN1(I,J)=  E(I,J)
           ENDDO
           ENDDO
C     
      CALL SOLPHIX1
      CALL TVD
      CALL SOLPHIX2
C   
           DO J=1,NJ
           DO I=1,NI
           PT1(I,J)=PT1(I,J)+  P(I,J)
           RT1(I,J)=RT1(I,J)+ROU(I,J)
           UT1(I,J)=UT1(I,J)+  U(I,J)
           VT1(I,J)=VT1(I,J)+  V(I,J)
           WT1(I,J)=WT1(I,J)+WAI(I,J)
           TT1(I,J)=TT1(I,J)+  T(I,J)
           ET1(I,J)=ET1(I,J)+  E(I,J)
           ENDDO
           ENDDO
C    
         IF(MOD(ITIMST,ITPRI).EQ.0) CALL RESIDUAL
         IF(MOD(ITIMST,ITOUT).EQ.0) THEN
            NOUT=NOUT+1
            CALL OUTPUT(NOUT)
            ENDIF
      ENDDO
C ------------------------------------------
C ---------------/\ /\ /\ /\----------------
C --------------- MAIN LOOP ----------------
C ------------------------------------------
      DO J=1,NJ
      DO I=1,NI
      PT1(I,J)=PT1(I,J)/REAL(NTIMST)
      RT1(I,J)=RT1(I,J)/REAL(NTIMST)
      UT1(I,J)=UT1(I,J)/REAL(NTIMST)
      VT1(I,J)=VT1(I,J)/REAL(NTIMST)
      WT1(I,J)=WT1(I,J)/REAL(NTIMST)
      ENDDO
      ENDDO

      OPEN(4,FILE='./AV/av')
      WRITE(4,*) RT1,UT1,VT1,PT1,WT1
      CLOSE(4)
C ***
 5000 CONTINUE
 2000 FORMAT('  IterNo', '    dRmax ', '   dUmax ','   dVmax ',
     &           '     dPmax ', '     dEmax ', '     dTmax ',
     &           '      CFL')
 1000 FORMAT(1X, 'RUN#',I2,'    ',I4,'/',I2,'/',I2, ' (HGrid)')
      STOP
      END

C--------------------------------------------------------------------
      SUBROUTINE METRIC                                   ! METRIC
C--------------------------------------------------------------------
      INCLUDE 'subcomMCMBLPC.inc'
C
      CALL DIFCX(X,W1)             ! W1=XXI
      CALL DIFCX(Y,W2)             ! W2=YXI
      CALL DIFCY(X,W3)             ! W3=XET
      CALL DIFCY(Y,W4)             ! W4=YET
C
      !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ                  ! AJ=XXI*YET-XET*YXI
      DO I=1,NI
      AJ(I,J) = W1(I,J)*W4(I,J)-W2(I,J)*W3(I,J)
        IF(AJ(I,J).LE.0.) write(*,*)'I=',I,' J=',J,' AJ=',AJ(I,J)
      XIX(I,J)= W4(I,J)/AJ(I,J)
      ETX(I,J)=-W2(I,J)/AJ(I,J)
      XIY(I,J)=-W3(I,J)/AJ(I,J)
      ETY(I,J)= W1(I,J)/AJ(I,J)
      G11(I,J)= XIX(I,J)**2+XIY(I,J)**2
      G22(I,J)= ETX(I,J)**2+ETY(I,J)**2
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
! 
      RETURN
      END

C--------------------------------------------------------------
      SUBROUTINE PHICAL
C--------------------------------------------------------------
      INCLUDE 'subcomMCMBLPC.inc'
C
      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ
      DO I=1,NI
         PHI(1,I,J)=   AJ(I,J)*ROU(I,J)
         PHI(2,I,J)=PHI(1,I,J)*  U(I,J)
         PHI(3,I,J)=PHI(1,I,J)*  V(I,J)
         PHI(4,I,J)=PHI(1,I,J)*WAI(I,J)
         PHI(5,I,J)=   AJ(I,J)*  E(I,J)
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
!
      RETURN
      END

C--------------------------------------------------------------
      SUBROUTINE PHINCAL
C--------------------------------------------------------------
      INCLUDE 'subcomMCMBLPC.inc'
C
      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ
      DO I=1,NI
         PHIN(1,I,J)=    AJ(I,J)*ROU(I,J)
         PHIN(2,I,J)=PHIN(1,I,J)*  U(I,J)
         PHIN(3,I,J)=PHIN(1,I,J)*  V(I,J)
         PHIN(4,I,J)=PHIN(1,I,J)*WAI(I,J)
         PHIN(5,I,J)=    AJ(I,J)*  E(I,J)
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
!
      RETURN
      END

C -----------------------------------------------------------------
      SUBROUTINE SOLPHIX1                                    ! SOLPHIX1
C -----------------------------------------------------------------
      INCLUDE 'subcomMCMBLPC.inc'

C //////////////// Predictor Step ///////////////
      CALL CONTCAL
      CALL PHICAL       
      CALL FLUXX
      CALL VISX
C
      !$OMP PARALLEL DO COLLAPSE(3) DEFAULT(SHARED) PRIVATE(I,J,K) 
      DO J=1,NJ
      DO I=2,I0
      DO K=1,NPHI
         PHIN(K,I,J)=PHI(K,I,J)
     &              -DT5*FL(K,I,J)
     &              +DT5*DH*(EV(K,I,J)-EV(K,I-1,J))
      ENDDO
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
!      
      DO J=1,NJ
      DO K=1,NPHI                             ! dummy
         PHIN(K, 1,J)=PHIN(K, 2,J)
         PHIN(K,NI,J)=PHIN(K,I0,J)
      ENDDO
      ENDDO
      CALL VARICAL
      CALL BOUNDW
      CALL BOUNDIO
C
C /////////////// Corrector Step ///////////////
      CALL CONTCAL
      CALL PHINCAL    
      CALL FLUXX
      CALL VISX 
C
      !$OMP PARALLEL DO COLLAPSE(3) DEFAULT(SHARED) PRIVATE(I,J,K)  
      DO J=1,NJ
      DO I=2,I0
      DO K=1,NPHI
         PHIN(K,I,J)=0.5D0*(PHI(K,I,J)+PHIN(K,I,J))
     &              -0.5D0*DT5*FL(K,I+1,J)
     &              +0.5D0*DT5*DH*(EV(K,I,J)-EV(K,I-1,J))
      ENDDO
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
!
      DO J=1,NJ                                             
      DO K=1,NPHI                          ! dummy
         PHIN(K, 1,J)=PHIN(K, 2,J)
         PHIN(K,NI,J)=PHIN(K,I0,J)
      ENDDO
      ENDDO
      CALL VARICAL
      CALL BOUNDW
      CALL BOUNDIO
      RETURN
      END
C------------------------------------------------------------------
C------------------------------------------------------------------
      SUBROUTINE TVD                                       ! TVD
C------------------------------------------------------------------
C------------------------------------------------------------------
      INCLUDE 'subcomMCMBLPC.inc'

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J,BR)
      DO J=1,NJ
      DO I=1,I0
         BROU(I,J)=.5*(ROU(I  ,J)+ROU(I+1,J))
                BR=1./BROU(I  ,J)
         BWAI(I,J)=.5*(ROU(I  ,J)*WAI(I  ,J)
     &                +ROU(I+1,J)*WAI(I+1,J))*BR
         BE(  I,J)=.5*(  E(I  ,J)+  E(I+1,J))
         BP(  I,J)=.5*(  P(I  ,J)+  P(I+1,J))
         BT(  I,J)=.5*(  T(I  ,J)+  T(I+1,J))
         BU(  I,J)=.5*(ROU(I  ,J)*  U(I  ,J)
     &                +ROU(I+1,J)*  U(I+1,J))*BR
         BV(  I,J)=.5*(ROU(I  ,J)*  V(I  ,J)
     &                +ROU(I+1,J)*  V(I+1,J))*BR
         BUC( I,J)=BXIX1(I,J)*BU(I,J)
     &            +BXIY1(I,J)*BV(I,J)
         BVC( I,J)=BETX1(I,J)*BU(I,J)
     &            +BETY1(I,J)*BV(I,J)
       BON(I,J)=SQRT(A3/(A1-A2/CP)*BT(I,J))                      ! speed of sound
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

C *** Set eigen vector [RA^-1]

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J) 
      DO J=1,NJ
      DO I=1,I0
      AM1(1,1,I,J)=BWAI(I,J)/(BWAI(I,J)-BETA(I,J))
      AM1(2,1,I,J)=0.D0
      AM1(3,1,I,J)=0.D0
      AM1(4,1,I,J)=-1./(BWAI(I,J)-BETA(I,J))
      AM1(1,2,I,J)=-.5*(BUC(I,J)/BON(I,J)/SQRT(BG11(I,J))
     &                    +BETA(I,J)/(BWAI(I,J)-BETA(I,J)))
      AM1(2,2,I,J)= .5*BXIX1(I,J)/BON(I,J)/SQRT(BG11(I,J))
      AM1(3,2,I,J)= .5*BXIY1(I,J)/BON(I,J)/SQRT(BG11(I,J))
      AM1(4,2,I,J)= .5/(BWAI(I,J)-BETA(I,J))
      AM1(1,3,I,J)=(BXIY1(I,J)*BU(I,J)-BXIX1(I,J)*BV(I,J))
     &               /BG11(I,J)/BON(I,J)+BWAI(I,J)
     &               /(BWAI(I,J)-BETA(I,J))
      AM1(2,3,I,J)=-BXIY1(I,J)/BON(I,J)/BG11(I,J)
      AM1(3,3,I,J)=BXIX1(I,J)/BON(I,J)/BG11(I,J)
      AM1(4,3,I,J)=-1./(BWAI(I,J)-BETA(I,J))
      AM1(1,4,I,J)= .5*(BUC(I,J)/BON(I,J)/SQRT(BG11(I,J))
     &                     -BETA(I,J)/(BWAI(I,J)-BETA(I,J)))
      AM1(2,4,I,J)=-.5*BXIX1(I,J)/BON(I,J)/SQRT(BG11(I,J))
      AM1(3,4,I,J)=-.5*BXIY1(I,J)/BON(I,J)/SQRT(BG11(I,J))
      AM1(4,4,I,J)= .5/(BWAI(I,J)-BETA(I,J))
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

C *** Compute PHIN =Rho, Rho u, etc without 1/J

      DO J=1,NJ
      DO I=1,NI
         PHI(1,I,J)=ROU(I,J)
         PHI(2,I,J)=ROU(I,J)*  U(I,J)   
         PHI(3,I,J)=ROU(I,J)*  V(I,J)
         PHI(4,I,J)=ROU(I,J)*WAI(I,J) 
      ENDDO
      ENDDO

C *** Compute alpha A(i+1/2)=[RA^-1*dQ] ==> F(I,J)

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J,K)
       DO J=1,NJ
       DO I=1,I0
         E(1,I,J)=(PHI(1,I+1,J)-PHI(1,I,J))*BAJ1(I,J)
         E(2,I,J)=(PHI(2,I+1,J)-PHI(2,I,J))*BAJ1(I,J)
         E(3,I,J)=(PHI(3,I+1,J)-PHI(3,I,J))*BAJ1(I,J)
         E(4,I,J)=(PHI(4,I+1,J)-PHI(4,I,J))*BAJ1(I,J)
        DO K= 1,NPHI
         F(K,I,J)= AM1(1,K,I,J)*E(1,I,J)
     &            +AM1(2,K,I,J)*E(2,I,J)
     &            +AM1(3,K,I,J)*E(3,I,J)
     &            +AM1(4,K,I,J)*E(4,I,J)
        ENDDO
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

C *** Compute minmod function

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J,K)
      DO J=1,NJ
      DO I=2,I1
      DO K=1,NPHI
         G(K,I,J)=SIGN(1.,F(K,I,J))*MAX(0.,MIN(ABS(F(K,I,J)),
     &            SIGN(1.,F(K,I,J))*F(K,I-1,J)))
         Q(K,I,J)=SIGN(1.,F(K,I,J))*MAX(0.,MIN(ABS(F(K,I,J)),
     &            SIGN(1.,F(K,I,J))*F(K,I+1,J)))
      ENDDO
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

C *** Set Limiter Function EV(I,J)=minmod F(i-1)+minmod F(i+1)-F(i+1)

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J,K)
      DO J=1,NJ
      DO I=1,I0
      DO K=1,NPHI
         EV(K,I,J)=G(K,I,J)+Q(K,I,J)-F(K,I,J)
      ENDDO
      ENDDO
      ENDDO  
      !$OMP END PARALLEL DO

C *** Set Eigen Value Q(I,J)=ZA  
 
      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ
      DO I=1,I0
         Q(1,I,J)=DT5*DH* BUC(I,J)
         Q(2,I,J)=DT5*DH*(BUC(I,J)+BON(I,J)*SQRT(BG11(I,J)))
         Q(3,I,J)=DT5*DH* BUC(I,J)
         Q(4,I,J)=DT5*DH*(BUC(I,J)-BON(I,J)*SQRT(BG11(I,J)))
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

C *** Cal phi -----> PHIN(I,J)=phi A(i+1/2)
      DELTI=0.1000                               ! Control parameter

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J,K)
      DO J=1,NJ
      DO I=1,I0
      DO K=1,NPHI
         PHIN(K,I,J)=.5*(ABS(Q(K,I,J))-Q(K,I,J)**2)
     &              *(F(K,I,J)-EV(K,I,J))
      ENDDO
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

C *** Set Right eigen vector [RA]

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ
      DO I=1,I0
      AM1(1,1,I,J)= 1.D0
      AM1(2,1,I,J)= 1.D0
      AM1(3,1,I,J)= 0.D0
      AM1(4,1,I,J)= 1.D0
      AM1(1,2,I,J)= BU(I,J)+BXIY1(I,J)*BON(I,J)
      AM1(2,2,I,J)= BU(I,J)+BXIX1(I,J)*BON(I,J)/SQRT(BG11(I,J))
      AM1(3,2,I,J)= -BXIY1(I,J)*BON(I,J)
      AM1(4,2,I,J)= BU(I,J)-BXIX1(I,J)*BON(I,J)/SQRT(BG11(I,J))
      AM1(1,3,I,J)= BV(I,J)-BXIX1(I,J)*BON(I,J)
      AM1(2,3,I,J)= BV(I,J)+BXIY1(I,J)*BON(I,J)/SQRT(BG11(I,J))
      AM1(3,3,I,J)= BXIX1(I,J)*BON(I,J)
      AM1(4,3,I,J)= BV(I,J)-BXIY1(I,J)*BON(I,J)/SQRT(BG11(I,J))
      AM1(1,4,I,J)= BETA(I,J)
      AM1(2,4,I,J)= BWAI(I,J)
      AM1(3,4,I,J)= 0.D0
      AM1(4,4,I,J)= BWAI(I,J)
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

C *** Compute TVD correction term R*phi

      !$OMP PARALLEL DO COLLAPSE(3) DEFAULT(SHARED) PRIVATE(I,J,K)
      DO J=1,NJ
      DO I=1,I0
      DO K=1,NPHI-1
         F(K,I,J)=AM1(1,K,I,J)*PHIN(1,I,J)
     &           +AM1(2,K,I,J)*PHIN(2,I,J)
     &           +AM1(3,K,I,J)*PHIN(3,I,J)
     &           +AM1(4,K,I,J)*PHIN(4,I,J)
      ENDDO
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

C *** TVD correction for XI-direction
      DO J=1,NJ
      DO I=2,I0
      DO K=1,NPHI
         PHIM(K,I,J)=F(K,I,J)-F(K,I-1,J)
      ENDDO
      ENDDO
      ENDDO
      DO J=1,NJ
      DO K=1,NPHI 
         PHIM(K, 1,J)=2.*PHIM(K, 2,J)-PHIM(K, 3,J)
         PHIM(K,NI,J)=2.*PHIM(K,I0,J)-PHIM(K,I1,J)
      ENDDO
      ENDDO 
      RETURN
      END
C---------------------------------------------------------------
C---------------------------------------------------------------
      SUBROUTINE FLUXX
C---------------------------------------------------------------
C---------------------------------------------------------------
      INCLUDE 'subcomMCMBLPC.inc'

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J,DJX,DJY)
      DO J=1,NJ
      DO I=2,NI
      DJX = BAJ1(I-1,J)*BXIX1(I-1,J)
      DJY = BAJ1(I-1,J)*BXIY1(I-1,J)
        FL(1,I,J)=(DJX*(ROU(I,J)*U(I,J)-ROU(I-1,J)*U(I-1,J))
     &            +DJY*(ROU(I,J)*V(I,J)-ROU(I-1,J)*V(I-1,J)))*DH
        FL(2,I,J)=(DJX*(ROU(I  ,J)*U(I  ,J)**2+P(I  ,J)
     &                 -ROU(I-1,J)*U(I-1,J)**2-P(I-1,J))
     &            +DJY*(ROU(I  ,J)*U(I  ,J)*V(I  ,J)
     &                 -ROU(I-1,J)*U(I-1,J)*V(I-1,J)))*DH
        FL(3,I,J)=(DJX*(ROU(I  ,J)*U(I  ,J)*V(I  ,J)
     &                 -ROU(I-1,J)*U(I-1,J)*V(I-1,J))
     &            +DJY*(ROU(I  ,J)*V(I  ,J)**2+P(I  ,J)
     &                 -ROU(I-1,J)*V(I-1,J)**2-P(I-1,J)))*DH
        FL(4,I,J)=(DJX*(ROU(I  ,J)*U(I  ,J)*WAI(I  ,J)
     &                 -ROU(I-1,J)*U(I-1,J)*WAI(I-1,J))
     &            +DJY*(ROU(I  ,J)*V(I  ,J)*WAI(I  ,J)
     &                 -ROU(I-1,J)*V(I-1,J)*WAI(I-1,J)))*DH
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
!
      RETURN
      END
C--------------------------------------------------------------------
C--------------------------------------------------------------------
      SUBROUTINE VISX
C--------------------------------------------------------------------
C--------------------------------------------------------------------
      INCLUDE 'subcomMCMBLPC.inc'

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J,UP2,UP1,VP2,VP1)
      DO J=2,J0
      DO I=1,I0
        UP2=(U(I,J)+U(I+1,J)+U(I,J+1)+U(I+1,J+1))*.25
        UP1=(U(I,J)+U(I+1,J)+U(I,J-1)+U(I+1,J-1))*.25
        VP2=(V(I,J)+V(I+1,J)+V(I,J+1)+V(I+1,J+1))*.25
        VP1=(V(I,J)+V(I+1,J)+V(I,J-1)+V(I+1,J-1))*.25
        W1(I,J)=(U(I+1,J)-U(I,J))*DH         ! W1=UXi
        W2(I,J)=(UP2     -UP1   )*DH         ! W2=UEt
        W3(I,J)=(V(I+1,J)-V(I,J))*DH         ! W3=VXi
        W4(I,J)=(VP2     -VP1   )*DH         ! W4=VEt
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

C--------------------------------------------- PERIOD
      DO I=1,I0                            !period 1*NJ,2*NJ,0*NJ
      UP2=(U(I,     1)+U(I+1,     1)+U(I,     2)+U(I+1,     2))*.25
      UP1=(U(I,     1)+U(I+1,     1)+U(I,  NJ-1)+U(I+1,  NJ-1))*.25
      VP2=(V(I,     1)+V(I+1,     1)+V(I,     2)+V(I+1,     2))*.25
      VP1=(V(I,     1)+V(I+1,     1)+V(I,  NJ-1)+V(I+1,  NJ-1))*.25
      W1(I,     1)=(U(I+1,     1)-U(I,     1))*DH
      W2(I,     1)=(UP2          -UP1        )*DH
      W3(I,     1)=(V(I+1,     1)-V(I,     1))*DH
      W4(I,     1)=(VP2          -VP1        )*DH
      ENDDO
C--------------------------------------------------------

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J,UX,UY,VX,VY,TAU11,TAU12,TAU22)
      DO J=1,NJ
      DO I=1,I0
        UX=W1(I,J)*BXIX1(I,J)+W2(I,J)*BETX1(I,J)
        UY=W1(I,J)*BXIY1(I,J)+W2(I,J)*BETY1(I,J)
        VX=W3(I,J)*BXIX1(I,J)+W4(I,J)*BETX1(I,J)
        VY=W3(I,J)*BXIY1(I,J)+W4(I,J)*BETY1(I,J)
        TAU11=2./3*(HCMYU           )*(2.*UX-VY)
        TAU12=     (HCMYU           )*(   UY+VX)
        TAU22=2./3*(HCMYU           )*(2.*VY-UX)

        EV(1,I,J)=0.D0
        EV(2,I,J)=BAJ1(I,J)*(BXIX1(I,J)*TAU11
     &                      +BXIY1(I,J)*TAU12)
        EV(3,I,J)=BAJ1(I,J)*(BXIX1(I,J)*TAU12
     &                      +BXIY1(I,J)*TAU22)
        EV(4,I,J)=0.D0
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

      RETURN
      END
C--------------------------------------------------------------------
C--------------------------------------------------------------------
      SUBROUTINE VARICAL
C--------------------------------------------------------------------
C--------------------------------------------------------------------
      INCLUDE 'subcomMCMBLPC.inc'

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ
      DO I=1,NI
         ROU(I,J)=PHIN(1,I,J)/AJ(I,J)
           U(I,J)=PHIN(2,I,J)/PHIN(1,I,J)
           V(I,J)=PHIN(3,I,J)/PHIN(1,I,J)
         WAI(I,J)=PHIN(4,I,J)/PHIN(1,I,J)
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO

      DO J=1,NJ
      DO I=1,NI
         WAI(I,J)=MAX(0.d+0,WAI(I,J))
         WAI(I,J)=MIN(1.d+0,WAI(I,J))
      ENDDO
      ENDDO

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ
      DO I=1,NI                          !!! Inst Equi Evap Model
               A1=1.d+0-WAI(I,J)
               A2=E(I,J)-ROU(I,J)*H0-.5+0*ROU(I,J)*UU
             B1=WAI(I,J)*CR+A1*CK-CP
               C1=WAI(I,J)*CR*PC
               C2=(WAI(I,J)*CR+A1*CK)*A2
               C3=CP*(ROU(I,J)*A1*CK*TC-PC)
             B2=C1+C2+C3
             B3=WAI(I,J)*CR*PC*A2
           P(I,J)=(-B2-SQRT(ABS(B2**2-4.d+0*B1*B3)))/(2.d+0*B1)
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
!
      RETURN
      END
C--------------------------------------------------------------------
C--------------------------------------------------------------------
      SUBROUTINE DIFCY(C,D)
C--------------------------------------------------------------------
C--------------------------------------------------------------------
      INCLUDE 'subcomMCMBLPC.inc'
      DIMENSION C(NI,NJ),D(NI,NJ)
C
      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=2,J0                                  
      DO I=1,NI
         D(I,J) =(C(I,J+1)-C(I,J-1))*H1
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
C
C----------------------------------------------- !!! PERIOD
      DO I=1,NI
           DC=C(I,NJ)-C(I,NJ-1)
         D(I,   1)=(C(I,     2)-(C(I,1) -DC))*H1
         D(I,NJ  )= D(I,     1)
      ENDDO                       
C
C------------------------------------------------- !!! WALL
      DO I=IL1+1,IT1-1
         D(I,   1)=(-C(I,   1)+C(I,   2))*DH
         D(I,NJ  )=( C(I,NJ  )-C(I,NJ-1))*DH
      ENDDO
C
      RETURN
      END

 


!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ITMST)
DO ITIMST=1,NTIMST ...

Are we to assume NTIMST is "Number of Time Steps"?

If so, then the above PARALLEL DO will compute the time steps out of order.
Each thread gets a slice of the iteration space 1:NTIMST:
Thread 0 has 1:NTIMST/nThreads
Thread 1 has 1+NTIMST/nThreads:NTIMST/nThreads*2
...
 

!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ITMST)
DO ITIMST=1,NTIMST
   ...
*** CAUTION
*** Because each thread owns a slice of the iteration space 1:NTIMST,
*** the following CALLs may occur before the necessary data are
*** available, and may also execute out of order
  IF(MOD(ITIMST,ITPRI).EQ.0) CALL RESIDUAL
  IF(MOD(ITIMST,ITOUT).EQ.0) THEN
    NOUT=NOUT+1
    CALL OUTPUT(NOUT)
  ENDIF
ENDDO
!$OMP END PARALLEL DO

Line 47 ends the parallel (DO) region.
Yet, line 54 assumes it is within a parallel region (benign problem)

Line 78 (ENTRY METRIC) defines an alternate subroutine entry point while not within the context of a subroutine. Whether this results in the ENTRY being compiled as if it had been declared SUBROUTINE METRIC, I cannot say; the compiler should have complained. However, if it is compiled as a stand-alone subroutine, IMPLICIT variable declarations will be in effect.

Line 23 (CALL SOLPHIX1), line 24 (CALL TVD) and line 25 (CALL SOLPHIX2) are called within a parallel region (started at line 6)...
yet in subroutine SOLPHIX1 line 162 you are (attempting to) enter a nested parallel region.

From looking at the code, it does not appear that this was your intention.

Please sort out the time step and parallel region declarations as to whether they are intended to be nested or not.

Note, coding nested parallel regions is rather advanced programming and requires great care such that oversubscription does not occur. Leave nested parallel programming for later efforts.
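Since each time step depends on the previous one, the usual pattern is to keep the time loop serial and parallelize only the spatial loops inside it. A minimal sketch (the arrays U/UNEW and bounds NI/NJ here are hypothetical placeholders for the real spatial update):

```fortran
C     Sketch only: the time loop stays serial; only the spatial
C     loops inside each step run in parallel.
      DO ITIMST=1,NTIMST
      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ
      DO I=1,NI
         UNEW(I,J)=U(I,J)              ! spatial update for this step
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
      ENDDO                    ! next time step sees completed data
```

The implicit barrier at !$OMP END PARALLEL DO guarantees the whole field is updated before the next time step begins.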

Also, try to avoid the use of ENTRY.

Jim Dempsey


Dear Jim,

Yes, NTIMST is the number of time steps. So I may have misunderstood: parallelizing in time is likely wrong, because the calculation at time (t+dt) needs the information from time (t). Hence the calculation on thread 1 does not have the reference data from thread 0. So I think I have to remove the !$OMP clause in the MAIN LOOP. Then none of the parallel regions is nested within another.

Line 54, my mistake. It is outside the parallel region.

Line 78 is an ENTRY that is within a SUBROUTINE. Here I only wanted to show the part of the code where I use the parallel clauses. I am sorry that made the code difficult to read.

I modified the code in the post #13 above.

Le


After you get your parallel version working, consider reordering your index positions of your arrays to improve vectorization opportunities of the compiler. As an example, consider the following nested loop:

      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J,K,NPHI)  
      DO J=1,NJ
      DO I=1,I0
         Q(1,I,J)=DT5*DH* BUC(I,J)
         Q(2,I,J)=DT5*DH*(BUC(I,J)+BON(I,J)*SQRT(BG11(I,J)))
         Q(3,I,J)=DT5*DH* BUC(I,J)
         Q(4,I,J)=DT5*DH*(BUC(I,J)-BON(I,J)*SQRT(BG11(I,J)))
      ENDDO
      ENDDO

Code that is suitable for vectorization is code where the array accesses for adjacent values of the innermost loop index (leftmost in Fortran, rightmost in C) fall in adjacent memory locations, on both the right-hand side and the left-hand side of the =.

In the loop above, on the right-hand side the first index, I, increments across adjacent memory, whereas on the left-hand side I is the second index and (appears to) have a stride of 4. This will result in the loop either: a) not being vectorized, b) being vectorized inefficiently using scatter stores, or c) being vectorized on the right-hand side with shuffle instructions into temporary registers prior to scalar stores (less efficient than a).

To correct for this, consider using Q(I,J,1), Q(I,J,2)...
**** only after considering the impact on its use elsewhere in your program.

NOTE, make the index reorganization AFTER getting the program to run properly.
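For illustration, the reordered loop might look like this (a sketch only, assuming Q is redeclared as Q(NI,NJ,4) and every other use of Q in the program is updated to match):

```fortran
      !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ
      DO I=1,I0
         Q(I,J,1)=DT5*DH* BUC(I,J)     ! I is now the fastest-varying
         Q(I,J,2)=DT5*DH*(BUC(I,J)+BON(I,J)*SQRT(BG11(I,J)))
         Q(I,J,3)=DT5*DH* BUC(I,J)     ! index on both sides of the =
         Q(I,J,4)=DT5*DH*(BUC(I,J)-BON(I,J)*SQRT(BG11(I,J)))
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
```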

Jim Dempsey


Based on your code in Quote #13, I would make some comments.

I don't like DEFAULT(SHARED) and prefer an explicit SHARED(...), but with extensive use of INCLUDE'd variables this might not be practical (I am not sure of the OMP directive length limit in 72-column fixed format?). I also like IMPLICIT NONE!

With NJ = 70 and NJ >> num_threads, I would not use COLLAPSE; I don't think there would be many gains.

Instead I would try to merge the NJ !$OMP regions; in TVD and VARICAL especially this appears useful.

I would be very careful to identify all PRIVATE variables (in VARICAL: A1,A2,B1,B2,B3,C1,C2,C3, and in TVD: BR).
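For example, the VARICAL loop directive might be written with its per-iteration temporaries listed explicitly (a sketch only; the variable list is taken from the loop body quoted earlier in the thread):

```fortran
      !$OMP PARALLEL DO DEFAULT(SHARED)
      !$OMP& PRIVATE(I,J,A1,A2,B1,B2,B3,C1,C2,C3)
```

With every temporary explicitly PRIVATE, each thread gets its own copy and no iteration can read a value left behind by another thread.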

I think VISX has a PRIVATE error: HCMYU appears SHARED.

In PHICAL and PHINCAL, I introduced a private PH_l = AJ(I,J)*ROU(I,J) to minimise mixed use of PHI.

In TVD I made extensive merging of OMP regions; BR appears private and A1,A2,A3,CP appear shared.
In VARICAL I merged OMP regions. Also, some formulas are unclear: "1.d+0-WAI", "-.5+0*ROU", "-4.d+0*B1" and "2.d+0*B1". 1., .5, 2. and 4. are all usable reals. Was there corruption when including D0?

I am attaching the changes. You may wish to compare with your original code.

Make sure you compare the results to the serial code, allowing for changes in round-off errors.

You or Jim may disagree, so only use the recommendations you are happy with.

Good luck.


John has provided some good advice.

Some additional guidance:

I see places in your code where it appears that your real variables are REAL(8). By default, real literals (e.g. 0.25) are single precision (REAL(4)). Because the compiler is designed to produce results consistent with older systems and compilers, it will not, by default, promote such literals at compile time where they are used to produce a larger real; instead it promotes them at run time. This incurs some additional overhead, sometimes small, sometimes large, but more importantly it can cause a loss of precision that you may or may not wish to accept. For example, 0.1, Pi, etc. have infinitely repeating binary fractions. When the difference is significant and you want to keep the extra precision, be sure to designate the literal as double precision (e.g. 0.1D0).
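A tiny sketch illustrating the difference:

```fortran
      PROGRAM LITPREC
C     Single- vs double-precision literals assigned to REAL(8).
      REAL(8) :: X, Y
      X = 0.1                  ! REAL(4) literal, promoted at run time
      Y = 0.1D0                ! REAL(8) literal
      PRINT *, Y - X           ! nonzero: 0.1 has no exact binary form
      END
```

The REAL(4) rounding of 0.1 is baked in before the promotion, so X and Y differ in the low-order bits.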

The Intel Parallel Studio comes with a very nice tool: VTune. I suggest you use it to locate where your code may benefit from parallelization.

Entering a parallel region has startup overhead, and exiting one has thread-team barrier overhead; there are similar overheads for an enclosed !$OMP DO within !$OMP PARALLEL. Whether parallelization is warranted depends on the amount of work inside the loop. Additionally, statements can be compute bound or memory bound; memory-bound regions can sometimes benefit from using fewer threads than the full complement. Again, VTune is your friend and can help you identify the characteristics of your statements.
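For a memory-bound loop, the team size can be capped directly on the directive (a sketch with hypothetical arrays C and D; whether 4 is the right count must be measured, e.g. with VTune):

```fortran
C     Sketch: cap the thread count for a memory-bound loop.
      !$OMP PARALLEL DO NUM_THREADS(4) DEFAULT(SHARED) PRIVATE(I,J)
      DO J=1,NJ
      DO I=1,NI
         D(I,J)=C(I,J)*2.0D0           ! little compute per byte moved
      ENDDO
      ENDDO
      !$OMP END PARALLEL DO
```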

Jim Dempsey


Dear John,

Thank you very much for carefully checking the code.

In SOLPHIX1 and SOLPHIX2, I think the original code is more efficient. If I merge the NJ !$OMP regions, the PHIN(K,1,J)=PHIN(K,2,J) assignment is repeated I0-1 times.

In TVD and VARICAL, your modification may be better since it reduces the number of identical DO J=... loops.

As for the PRIVATE(...) question: I modified the code but forgot to remove some of them.

Interestingly, I ran and compared both the serial code and the parallel code (before Jim's and your comments):

- The results were the same in both runs

- There was no difference in calculation time (about 4~5 minutes / 20000 time iterations)

So I think the current mesh (279 x 70 points) cannot show the advantage of parallel calculation.

Now I will modify the code based on your suggestions and Jim's advice (change the order of the indexes), then run the code again.

Best,

 


>>No difference in calculation time (about 4~5 minutes/20000 time iteration)

I suggest you make test runs of serial and parallel code using VTune and compare the results.

Seeing the same runtimes for both serial and parallel runs is quite unusual...
... except in conditions where the program is limited by file I/O time.

Jim Dempsey
