Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.
Announcements
FPGA community forums and blogs have moved to the Altera Community. Existing Intel Community members can sign in with their current credentials.

change run time just for a line

diedro
Beginner
3,073 Views

Dear all,

I wrote a very complex code. I mean a code with many subroutines and functions. I used to worked fine. Last few days I noticed that the code became very very slow. After that I noticed that this depends by a single line. If I comment this line the code became very fast again.

To be more clear, this is a peace of may code:

                  CALL PDEPositivity(status,QR)
                          IF(status.LT.0) THEN
                             QR = QQ(:,jp)
                          ENDIF
                     VL(1)  = QL(2)/QL(1)
                     VL(2)  = QL(3)/QL(1)
                     VR(1)  = QR(2)/QR(1)
                     VR(2)  = QR(3)/QR(1)
                     
                      UU  = 0.5d0*(VL+VR)
                      !
                      VL=0.d0
                      VR=0.d0
                      UU=0.d0
                      VPL=0.d0
                      VPR=0.d0
                      !
                      FL = PDEflux(nVar,UU,QL(:))
                      FR = PDEflux(nVar,UU,QR(:))
                      FI = PDEflux(nVar,VPL,QQ(:,ip))
                      Gij =  0.5d0*(FL+FR)-FI
                      Vij =  0.5d0*(VL+VR)-VPL
! 
                      Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp)

If I comment the line

      Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp)

all became fast again.

I do bot understand why, it looks a very simple line.

Someone could explain me why or where I am wrong?

Thanks alot

0 Kudos
20 Replies
mecej4
Honored Contributor III
3,060 Views

I am suspicious that in simplifying and extracting a portion of your code you have taken away too much. First of all, the line that you finger as being the culprit has another problem. Based on what you have shown, one can argue that that line does nothing, since VPR is still zero.  Nor do we know which variables are arrays and which ones are scalars, how they were declared (static, automatic or allocatable?), etc.

0 Kudos
diedro
Beginner
3,060 Views

Dear all, Dear mecej4,

thanks. This is the subroutine:

SUBROUTINE MWSPH(nVar,npt,TayOrder,Ncoeff,Rbc,x0,y0,BCid,RP,LIJ,VOL,QQ,QQmls,LLn,VFN,istep)
 USE MOD_COMMONVARS, ONLY: FluxType
 USE MOD_NUMERICALFLUXES_RUSANOV
 USE MOD_NUMERICALFLUXES_OSHER
 USE MOD_KERNEL
 USE MOD_LISTPART
 USE MOD_GASCARAT
 USE MOD_VARS_MLS
 USE MOD_RECONSTRUCTION
 USE MOD_PDE
 USE MOD_PDEflux
 IMPLICIT NONE
 INTEGER                              ,INTENT(IN)  :: nVar
 INTEGER                              ,INTENT(IN)  :: npt
 INTEGER                              ,INTENT(IN)  :: TayOrder
 INTEGER                              ,INTENT(IN)  :: Ncoeff
 REAL                                 ,INTENT(IN)  :: Rbc
 REAL                                 ,INTENT(IN)  :: x0
 REAL                                 ,INTENT(IN)  :: y0
 INTEGER ,DIMENSION(npt)              ,INTENT(IN)  :: BCid
 REAL    ,DIMENSION(2,npt)            ,INTENT(IN)  :: RP
 REAL    ,DIMENSION(npt)              ,INTENT(IN)  :: Lij
 REAL    ,DIMENSION(npt)              ,INTENT(IN)  :: VOL
 REAL    ,DIMENSION(nVar,npt)         ,INTENT(IN)  :: QQ
 REAL    ,DIMENSION(Ncoeff,npt,nVar)  ,INTENT(IN)  :: QQmls
 REAL    ,DIMENSION(nVar+1,npt)       ,INTENT(OUT) :: LLn
 REAL    ,DIMENSION(2,npt)            ,INTENT(OUT) :: VFN
 INTEGER                              ,INTENT(IN)  :: istep
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 INTEGER                                  :: ip,jp,ipc,jpc,ii,jj,status,iVar
 REAL                                     :: r
 REAL                                     :: dxL,dxR,dyL,dyR,rr,sumW
 REAL     ,DIMENSION(nVar)                :: QL,STAMP
 REAL     ,DIMENSION(nVar)                :: QR
 REAL     ,DIMENSION(2)                   :: VL
 REAL     ,DIMENSION(2)                   :: VR
 REAL     ,DIMENSION(2)                   :: VPL,VPR
 REAL                                     :: W
 REAL     ,DIMENSION(2)                   :: dW
 REAL     ,DIMENSION(2)                   :: xij
 REAL     ,DIMENSION(2)                   :: Vij
 REAL     ,DIMENSION(Ncoeff)              :: VECT_L
 REAL     ,DIMENSION(Ncoeff)              :: VECT_R
 REAL     ,DIMENSION(2)                   :: UU
 REAL     ,DIMENSION(nVar,nVar)           :: CIJ
 REAL     ,DIMENSION(nVar,2)              :: FL,FR,FI,Gij
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 LLn=0.d0
 VFN=0.d0
 !
!  OPEN(UNIT=1, FILE='./output/particle_ip.dat', STATUS='unknown', ACTION='write',RECL=800)
 !
DO ip=1,npt
   write(*,*) ip,npt
   sumW = 0.d0
   IF(BCid(ip).EQ.0)THEN !bc
         ipc  = PART(ip)%mcii
         jpc  = PART(ip)%mcjj
         DO ii = ipc-1,ipc+1
             DO jj = jpc-1,jpc+1
                IF(ASSOCIATED(LISTPART(ii,jj)%head)==.true.) THEN
                CALL dlist_begin(LISTPART(ii,jj))
                DO !jp=1,npt
                  jp =   LISTPART(ii,jj)%current%value
                     CALL KERNEL(rp(:,ip),rp(:,jp),lij(ip),lij(jp),W,dW,xij)
                     r = SQRT( (rp(1,ip)-rp(1,jp))**2.d0 + (rp(2,ip)-rp(2,jp))**2.d0)

                     VPL(1) = QQ(2,ip)/QQ(1,ip)
                     VPL(2) = QQ(3,ip)/QQ(1,ip)
                     VPR(1) = QQ(2,jp)/QQ(1,jp)
                     VPR(2) = QQ(3,jp)/QQ(1,jp)
!                      !left state
                     dxL = (rp(1,jp)-rp(1,ip))/2.d0
                     dyL = (rp(2,jp)-rp(2,ip))/2.d0
                     CALL RECONSTRUCTION
                     DO iVar=1,nVar
                        QL(iVar) = QQ(iVar,ip) + DOT_PRODUCT((MLScoeff(:)*QQmls(:,ip,iVar)),VECT_L(:))
                     ENDDO
                        CALL PDEPositivity(status,QL)
                          IF(status.LT.0) THEN
                             QL = QQ(:,ip)
                          ENDIF
!                      !right state
                     dxR = -(rp(1,jp)-rp(1,ip))/2.d0
                     dyR = -(rp(2,jp)-rp(2,ip))/2.d0
                     CALL RECONSTRUCTION
                      DO iVar=1,nVar
                         QR(iVar) = QQ(iVar,jp) + DOT_PRODUCT((MLScoeff(:)*QQmls(:,jp,iVar)),VECT_R(:))
                      ENDDO
                          CALL PDEPositivity(status,QR)
                          IF(status.LT.0) THEN
                             QR = QQ(:,jp)
                          ENDIF
                     VL(1)  = QL(2)/QL(1)
                     VL(2)  = QL(3)/QL(1)
                     VR(1)  = QR(2)/QR(1)
                     VR(2)  = QR(3)/QR(1)
                     
                      UU  = 0.5d0*(VL+VR)
                      !
                      VL=0.d0
                      VR=0.d0
                      UU=0.d0
                      VPL=0.d0
                      VPR=0.d0
                      !
                      FL = PDEflux(nVar,UU,QL(:))
                      FR = PDEflux(nVar,UU,QR(:))
                      FI = PDEflux(nVar,VPL,QQ(:,ip))
                      Gij =  0.5d0*(FL+FR)-FI
                      Vij =  0.5d0*(VL+VR)-VPL
! 
                      Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp)
                      sumW=sumW+W*VOL(jp)
! !                      
                      VFn=0.d0
! 

                          CALL RUSANOV(QL,QR,XIJ,UU,CIJ)

                     LLn(1:nVar,ip) = LLn(1:nVar,ip) - VOL(ip)*VOL(jp)*( MATMUL((2.d0*GIJ),dW) - MATMUL(CIJ,(QR-QL))*DOT_PRODUCT(xij,dW) )
                     LLn(nVar+1,ip) = LLn(nVar+1,ip) + VOL(ip)*VOL(jp)*(2.d0*DOT_PRODUCT(Vij,dW))
!                      !
! !                      if(ip==8098.and.istep==2.and.ii==ipc.and.jj==jpc+1.and.jp==7167)then
! !                         write(1,*) ip, ',', RP(1,ip),',',RP(2,ip),',',jp, ',', RP(1,jp),',',RP(2,jp),',',STAMP(1)
! !                         write(1,*) VOL(ip),VOL(jp)
! !                         write(1,*) QL
! !                         write(1,*) QR
! !                         write(1,*) '----'
! !                         write(1,*) QQmls(:,ip,1)
! !                         stop
! !                      endif
!                      !
!                      !
                    IF(.NOT. dlist_move_next(LISTPART(ii,jj))) EXIT
                  ENDDO
                ENDIF
               ENDDO
           ENDDO
           Vfn(:,ip) = Vfn(:,ip)/sumW
   ELSE
     Vfn(:,ip) = [0.d0,0.d0]
   ENDIF

ENDDO

!  CLOSE(1)
 !
ENDSUBROUTINE

Problably, I just not remember correctly the run time

0 Kudos
mecej4
Honored Contributor III
3,060 Views

All right, with the newly provided subroutine code I can see that there are only two arguments with INTENT(OUT), namely, LLn and VFN. Of these LLn may contain some non-zero values after the subroutine is called and returns, but VFN will always contain zeros. VFN is not used in computing anything else in the subroutine.

Given these findings, why is VFN an argument of the subroutine?  Why can you not simply remove VFN as an argument and remove the lines that contain VFN in the body of the subroutine?

Is there, possibly, something unusual in the modules that you USE?

0 Kudos
diedro
Beginner
3,060 Views

Dear all, Dear mecej4,

VFN is zero because I am debugging the code. After the first debugging VFN is not going to be zero.

What do you mean for "unusual in the modules that you USE"? 

I really do not understand, in another subroutine I do a lot of more operations and it is much faster.

What do you think?

thanks a lot

0 Kudos
mecej4
Honored Contributor III
3,060 Views

If you remove VFN as an argument and remove all lines of code which are unnecessary because VFN=0, does the slowdown still occur?

The reason that I asked about the modules is because your code USEs many, and there may be declarations of attributes in those modules of the variables in the subroutine. These attributes may play a role in the behavior of the code.

I am afraid that you will have to put together a complete reproducer.

0 Kudos
Steven_L_Intel1
Employee
3,060 Views

As an interesting experiment, what happens if you replace:

Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp)

with:

Vfn(1,ip) = Vfn(1,ip)+VPR*W*VOL(jp)
Vfn(2,ip) = Vfn(2,ip)+VPR*W*VOL(jp)

Have you enabled optimization reports and if so, what do they say for this code?

 

 

0 Kudos
diedro
Beginner
3,060 Views

Dear all, Dear Steve,

It does not change. 

What do you mean with: "enabled optimization reports"?

What should can I do?

Thanks again

0 Kudos
Steven_L_Intel1
Employee
3,060 Views

-qopt-report=3 will give you lots of information about optimizations performed or not performed.

Can you upload a gzipped tar file with all of the sources needed to compile this subroutine? (Sources for all of the modules.)

0 Kudos
diedro
Beginner
3,060 Views

Dear All, Dear Steve,

this is the message when I try to use  '-qopt-report=3'.

I compile with:

ifort -r8 *.f90 -qopt-report=3 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential

and here the message:

ifort: command line warning #10006: ignoring unknown option '-qopt-report=3

Unfourtantly, I could not share my code, at least for the moment. I have to write scientific paper on it.

Do you have some suggestions?

Thanks a lot

0 Kudos
TimP
Honored Contributor III
3,060 Views

In past versions of Ifort, opt-report was spelled without q or =.

 

0 Kudos
diedro
Beginner
3,060 Views

Dear All, Dear Tim, Dear Steve,

bellow you can find the report of the "-report=3".

What do you think?

[See attached file - Steve]

thanks again.

0 Kudos
Steven_L_Intel1
Employee
3,060 Views

I changed the 500KB of inline text to an attachment.

0 Kudos
diedro
Beginner
3,060 Views

Dear Steve, Dear all,

Thanks a lot,

Are you able to see something from that file?

If yes, could you explain it to me?

I am not able to interpret it.

Thanks again

0 Kudos
Steven_L_Intel1
Employee
3,060 Views

This is not really my area of expertise - others in this thread might understand it better. I didn't see anything obvious, though we don't know if the source you quoted is the exact file or only part of it, so the line numbers may be off. It looked to me as if the compiler optimized the routine well. It would be interesting to see the report from just mwsph.f90 when you don't add the statement to compare. It might be that the compiler decided that without the assignment it could throw away a lot of code.

0 Kudos
diedro
Beginner
3,060 Views

Dear all, Dear Steve,

in the attachment you can find the report only for the mwsph.f90/mwspf sub file.

What do you think?

Thanks again

0 Kudos
mecej4
Honored Contributor III
3,060 Views

I have seen this thread wander without focus.

We do not have a single source code that can be actually compiled (because many of the modules are missing), and yet Diedro is asking us to interpret 500 kB vectorization reports. The code provided in #3 contains some usages that have dubious justification and that are likely to hinder optimization -- for example, comparing the value of a logical variable to .true., raising a real number to the power 2.0d0 when 2 would have sufficed and having variables set but never appearing to be used (r, for example). To top it off, there are subroutine calls with no information given on what happens in those subroutines.

To me the task looks fruitless. There are too many unknown factors and it is unreasonable to expect a solution to be given despite the presence of those factors.

 

0 Kudos
diedro
Beginner
3,060 Views

Dear all, Dear mecej4  ,

thanks a lot for your comment. I have put 2 instead of 2.0. As I told you I could not post the entire code. I do not know much about optimizations.

Here the subroutine with the essential parts:

MODULE MOD_MWSPH
 CONTAINS
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 SUBROUTINE MWSPH(nVar,npt,TayOrder,Ncoeff,Rbc,x0,y0,BCid,RP,LIJ,VOL,QQ,QQmls,LLn,VFN,istep)
 USE MOD_COMMONVARS, ONLY: FluxType
 USE MOD_NUMERICALFLUXES_RUSANOV
 USE MOD_NUMERICALFLUXES_OSHER
 USE MOD_KERNEL
 USE MOD_LISTPART
 USE MOD_GASCARAT
 USE MOD_VARS_MLS
 USE MOD_RECONSTRUCTION
 USE MOD_PDE
 USE MOD_PDEflux
 IMPLICIT NONE
 INTEGER                              ,INTENT(IN)  :: nVar
 INTEGER                              ,INTENT(IN)  :: npt
 INTEGER                              ,INTENT(IN)  :: TayOrder
 INTEGER                              ,INTENT(IN)  :: Ncoeff
 REAL                                 ,INTENT(IN)  :: Rbc
 REAL                                 ,INTENT(IN)  :: x0
 REAL                                 ,INTENT(IN)  :: y0
 INTEGER ,DIMENSION(npt)              ,INTENT(IN)  :: BCid
 REAL    ,DIMENSION(2,npt)            ,INTENT(IN)  :: RP
 REAL    ,DIMENSION(npt)              ,INTENT(IN)  :: Lij
 REAL    ,DIMENSION(npt)              ,INTENT(IN)  :: VOL
 REAL    ,DIMENSION(nVar,npt)         ,INTENT(IN)  :: QQ
 REAL    ,DIMENSION(Ncoeff,npt,nVar)  ,INTENT(IN)  :: QQmls
 REAL    ,DIMENSION(nVar+1,npt)       ,INTENT(OUT) :: LLn
 REAL    ,DIMENSION(2,npt)            ,INTENT(OUT) :: VFN
 INTEGER                              ,INTENT(IN)  :: istep
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 INTEGER                                  :: ip,jp,ipc,jpc,ii,jj,status,iVar
 REAL                                         :: dxL,dxR,dyL,dyR,rr,sumW
 REAL     ,DIMENSION(2)          :: VPL,VPR
 REAL                                         :: W
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 LLn=0.d0
 VFN=0.d0


DO ip=1,npt
   write(*,*) ip,npt
   sumW = 0.d0
   IF(BCid(ip).EQ.0)THEN !bc
         ipc  = PART(ip)%mcii
         jpc  = PART(ip)%mcjj
         DO ii = ipc-1,ipc+1
             DO jj = jpc-1,jpc+1
                IF(ASSOCIATED(LISTPART(ii,jj)%head)==.true.) THEN
                CALL dlist_begin(LISTPART(ii,jj))
                DO !jp=1,npt
                  jp =   LISTPART(ii,jj)%current%value
                  !       
                  Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp)

                   sumW=sumW+W*VOL(jp)                  
                   VFn=0.d0

                    IF(.NOT. dlist_move_next(LISTPART(ii,jj))) EXIT
                  ENDDO
                ENDIF
               ENDDO
           ENDDO
           Vfn(:,ip) = Vfn(:,ip)/sumW
   ELSE
     Vfn(:,ip) = [0.d0,0.d0]
   ENDIF

ENDDO

 Vfn(:,:) = 0.d0


ENDSUBROUTINE
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ENDMODULE

 

I have remove all the subroutine inside mwsph.90 and still the subroutione mwsph is slow if i use:

 Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp)
 sumW=sumW+W*VOL(jp)

I am sorry to insist and bore you, but I am here to understand and study.

Any helps and suggestion will be very very appreciate.

Diego

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,060 Views

If you are running Debug build, the default for the Runtime Check of: Check array subscript for out of range, is enabled. This adds considerable overhead. Once you are assured the file containing the subroutine is NOT accessing arrays with subscripts out of bounds, you can disable the array subscript checks for that file (or all files in the project). Consider having multiple Debug builds:

Debug - single threaded, all runtime checks
DebugNoBounds - single threaded, all runtime checks except for array subscript checks
DebugOpenMP - OpenMP threaded, all runtime checks
DebugOpenMPNoBounds - OpenMPthreaded, all runtime checks except for array subscript checks
...

Jim Dempsey

 

 

0 Kudos
TimP
Honored Contributor III
3,060 Views

The quoted report shows allocation and copying of a temporary array at the location where you report stalls. That's often a bad sign.  It could easily take more than twice as long as equivalent code without the temporary array.

It looks like you have potential dead code as you seem to populate a section of the array and then zero out the entire array.  The actions of the compiler in dealing with dead code aren't easily predictable, but may not be as thorough here as they would be in a potentially vectorizable outer loop.

0 Kudos
diedro
Beginner
2,969 Views

Dear all, Dear jimdempseyatthecove, Dear Tim,

now I have compiled just two times  instead of one and it becames faster again. I have compiled it before with 

 ifort -r8 *.f90 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential  -CB
 ifort -r8 *.f90 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential  -CB

After with 

ifort -r8 *.f90 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential 
ifort -r8 *.f90 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential 

Now I need just one.

I really do not understand why. It is because the subroutine is inside a module?

Do you want that I upload the report fine another time?

Thanks a lot, I am learning a lot

0 Kudos
Reply