- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear all,
I wrote a very complex code. I mean a code with many subroutines and functions. I used to worked fine. Last few days I noticed that the code became very very slow. After that I noticed that this depends by a single line. If I comment this line the code became very fast again.
To be more clear, this is a peace of may code:
CALL PDEPositivity(status,QR) IF(status.LT.0) THEN QR = QQ(:,jp) ENDIF VL(1) = QL(2)/QL(1) VL(2) = QL(3)/QL(1) VR(1) = QR(2)/QR(1) VR(2) = QR(3)/QR(1) UU = 0.5d0*(VL+VR) ! VL=0.d0 VR=0.d0 UU=0.d0 VPL=0.d0 VPR=0.d0 ! FL = PDEflux(nVar,UU,QL(:)) FR = PDEflux(nVar,UU,QR(:)) FI = PDEflux(nVar,VPL,QQ(:,ip)) Gij = 0.5d0*(FL+FR)-FI Vij = 0.5d0*(VL+VR)-VPL ! Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp)
If I comment the line
Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp)
all became fast again.
I do bot understand why, it looks a very simple line.
Someone could explain me why or where I am wrong?
Thanks alot
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am suspicious that in simplifying and extracting a portion of your code you have taken away too much. First of all, the line that you finger as being the culprit has another problem. Based on what you have shown, one can argue that that line does nothing, since VPR is still zero. Nor do we know which variables are arrays and which ones are scalars, how they were declared (static, automatic or allocatable?), etc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear all, Dear mecej4,
thanks. This is the subroutine:
SUBROUTINE MWSPH(nVar,npt,TayOrder,Ncoeff,Rbc,x0,y0,BCid,RP,LIJ,VOL,QQ,QQmls,LLn,VFN,istep) USE MOD_COMMONVARS, ONLY: FluxType USE MOD_NUMERICALFLUXES_RUSANOV USE MOD_NUMERICALFLUXES_OSHER USE MOD_KERNEL USE MOD_LISTPART USE MOD_GASCARAT USE MOD_VARS_MLS USE MOD_RECONSTRUCTION USE MOD_PDE USE MOD_PDEflux IMPLICIT NONE INTEGER ,INTENT(IN) :: nVar INTEGER ,INTENT(IN) :: npt INTEGER ,INTENT(IN) :: TayOrder INTEGER ,INTENT(IN) :: Ncoeff REAL ,INTENT(IN) :: Rbc REAL ,INTENT(IN) :: x0 REAL ,INTENT(IN) :: y0 INTEGER ,DIMENSION(npt) ,INTENT(IN) :: BCid REAL ,DIMENSION(2,npt) ,INTENT(IN) :: RP REAL ,DIMENSION(npt) ,INTENT(IN) :: Lij REAL ,DIMENSION(npt) ,INTENT(IN) :: VOL REAL ,DIMENSION(nVar,npt) ,INTENT(IN) :: QQ REAL ,DIMENSION(Ncoeff,npt,nVar) ,INTENT(IN) :: QQmls REAL ,DIMENSION(nVar+1,npt) ,INTENT(OUT) :: LLn REAL ,DIMENSION(2,npt) ,INTENT(OUT) :: VFN INTEGER ,INTENT(IN) :: istep !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ INTEGER :: ip,jp,ipc,jpc,ii,jj,status,iVar REAL :: r REAL :: dxL,dxR,dyL,dyR,rr,sumW REAL ,DIMENSION(nVar) :: QL,STAMP REAL ,DIMENSION(nVar) :: QR REAL ,DIMENSION(2) :: VL REAL ,DIMENSION(2) :: VR REAL ,DIMENSION(2) :: VPL,VPR REAL :: W REAL ,DIMENSION(2) :: dW REAL ,DIMENSION(2) :: xij REAL ,DIMENSION(2) :: Vij REAL ,DIMENSION(Ncoeff) :: VECT_L REAL ,DIMENSION(Ncoeff) :: VECT_R REAL ,DIMENSION(2) :: UU REAL ,DIMENSION(nVar,nVar) :: CIJ REAL ,DIMENSION(nVar,2) :: FL,FR,FI,Gij !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ LLn=0.d0 VFN=0.d0 ! ! OPEN(UNIT=1, FILE='./output/particle_ip.dat', STATUS='unknown', ACTION='write',RECL=800) ! DO ip=1,npt write(*,*) ip,npt sumW = 0.d0 IF(BCid(ip).EQ.0)THEN !bc ipc = PART(ip)%mcii jpc = PART(ip)%mcjj DO ii = ipc-1,ipc+1 DO jj = jpc-1,jpc+1 IF(ASSOCIATED(LISTPART(ii,jj)%head)==.true.) THEN CALL dlist_begin(LISTPART(ii,jj)) DO !jp=1,npt jp = LISTPART(ii,jj)%current%value CALL KERNEL(rp(:,ip),rp(:,jp),lij(ip),lij(jp),W,dW,xij) r = SQRT( (rp(1,ip)-rp(1,jp))**2.d0 + (rp(2,ip)-rp(2,jp))**2.d0) VPL(1) = QQ(2,ip)/QQ(1,ip) VPL(2) = QQ(3,ip)/QQ(1,ip) VPR(1) = QQ(2,jp)/QQ(1,jp) VPR(2) = QQ(3,jp)/QQ(1,jp) ! !left state dxL = (rp(1,jp)-rp(1,ip))/2.d0 dyL = (rp(2,jp)-rp(2,ip))/2.d0 CALL RECONSTRUCTION DO iVar=1,nVar QL(iVar) = QQ(iVar,ip) + DOT_PRODUCT((MLScoeff(:)*QQmls(:,ip,iVar)),VECT_L(:)) ENDDO CALL PDEPositivity(status,QL) IF(status.LT.0) THEN QL = QQ(:,ip) ENDIF ! !right state dxR = -(rp(1,jp)-rp(1,ip))/2.d0 dyR = -(rp(2,jp)-rp(2,ip))/2.d0 CALL RECONSTRUCTION DO iVar=1,nVar QR(iVar) = QQ(iVar,jp) + DOT_PRODUCT((MLScoeff(:)*QQmls(:,jp,iVar)),VECT_R(:)) ENDDO CALL PDEPositivity(status,QR) IF(status.LT.0) THEN QR = QQ(:,jp) ENDIF VL(1) = QL(2)/QL(1) VL(2) = QL(3)/QL(1) VR(1) = QR(2)/QR(1) VR(2) = QR(3)/QR(1) UU = 0.5d0*(VL+VR) ! VL=0.d0 VR=0.d0 UU=0.d0 VPL=0.d0 VPR=0.d0 ! FL = PDEflux(nVar,UU,QL(:)) FR = PDEflux(nVar,UU,QR(:)) FI = PDEflux(nVar,VPL,QQ(:,ip)) Gij = 0.5d0*(FL+FR)-FI Vij = 0.5d0*(VL+VR)-VPL ! Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp) sumW=sumW+W*VOL(jp) ! ! VFn=0.d0 ! CALL RUSANOV(QL,QR,XIJ,UU,CIJ) LLn(1:nVar,ip) = LLn(1:nVar,ip) - VOL(ip)*VOL(jp)*( MATMUL((2.d0*GIJ),dW) - MATMUL(CIJ,(QR-QL))*DOT_PRODUCT(xij,dW) ) LLn(nVar+1,ip) = LLn(nVar+1,ip) + VOL(ip)*VOL(jp)*(2.d0*DOT_PRODUCT(Vij,dW)) ! ! ! ! if(ip==8098.and.istep==2.and.ii==ipc.and.jj==jpc+1.and.jp==7167)then ! ! write(1,*) ip, ',', RP(1,ip),',',RP(2,ip),',',jp, ',', RP(1,jp),',',RP(2,jp),',',STAMP(1) ! ! write(1,*) VOL(ip),VOL(jp) ! ! write(1,*) QL ! ! write(1,*) QR ! ! write(1,*) '----' ! ! write(1,*) QQmls(:,ip,1) ! ! stop ! ! endif ! ! ! ! IF(.NOT. dlist_move_next(LISTPART(ii,jj))) EXIT ENDDO ENDIF ENDDO ENDDO Vfn(:,ip) = Vfn(:,ip)/sumW ELSE Vfn(:,ip) = [0.d0,0.d0] ENDIF ENDDO ! CLOSE(1) ! ENDSUBROUTINE
Problably, I just not remember correctly the run time
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
All right, with the newly provided subroutine code I can see that there are only two arguments with INTENT(OUT), namely, LLn and VFN. Of these LLn may contain some non-zero values after the subroutine is called and returns, but VFN will always contain zeros. VFN is not used in computing anything else in the subroutine.
Given these findings, why is VFN an argument of the subroutine? Why can you not simply remove VFN as an argument and remove the lines that contain VFN in the body of the subroutine?
Is there, possibly, something unusual in the modules that you USE?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear all, Dear mecej4,
VFN is zero because I am debugging the code. After the first debugging VFN is not going to be zero.
What do you mean for "unusual in the modules that you USE"?
I really do not understand, in another subroutine I do a lot of more operations and it is much faster.
What do you think?
thanks a lot
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If you remove VFN as an argument and remove all lines of code which are unnecessary because VFN=0, does the slowdown still occur?
The reason that I asked about the modules is because your code USEs many, and there may be declarations of attributes in those modules of the variables in the subroutine. These attributes may play a role in the behavior of the code.
I am afraid that you will have to put together a complete reproducer.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
As an interesting experiment, what happens if you replace:
Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp)
with:
Vfn(1,ip) = Vfn(1,ip)+VPR*W*VOL(jp)
Vfn(2,ip) = Vfn(2,ip)+VPR*W*VOL(jp)
Have you enabled optimization reports and if so, what do they say for this code?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear all, Dear Steve,
It does not change.
What do you mean with: "enabled optimization reports"?
What should can I do?
Thanks again
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
-qopt-report=3 will give you lots of information about optimizations performed or not performed.
Can you upload a gzipped tar file with all of the sources needed to compile this subroutine? (Sources for all of the modules.)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear All, Dear Steve,
this is the message when I try to use '-qopt-report=3'.
I compile with:
ifort -r8 *.f90 -qopt-report=3 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential
and here the message:
ifort: command line warning #10006: ignoring unknown option '-qopt-report=3
Unfourtantly, I could not share my code, at least for the moment. I have to write scientific paper on it.
Do you have some suggestions?
Thanks a lot
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In past versions of Ifort, opt-report was spelled without q or =.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I changed the 500KB of inline text to an attachment.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear Steve, Dear all,
Thanks a lot,
Are you able to see something from that file?
If yes, could you explain it to me?
I am not able to interpret it.
Thanks again
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This is not really my area of expertise - others in this thread might understand it better. I didn't see anything obvious, though we don't know if the source you quoted is the exact file or only part of it, so the line numbers may be off. It looked to me as if the compiler optimized the routine well. It would be interesting to see the report from just mwsph.f90 when you don't add the statement to compare. It might be that the compiler decided that without the assignment it could throw away a lot of code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have seen this thread wander without focus.
We do not have a single source code that can be actually compiled (because many of the modules are missing), and yet Diedro is asking us to interpret 500 kB vectorization reports. The code provided in #3 contains some usages that have dubious justification and that are likely to hinder optimization -- for example, comparing the value of a logical variable to .true., raising a real number to the power 2.0d0 when 2 would have sufficed and having variables set but never appearing to be used (r, for example). To top it off, there are subroutine calls with no information given on what happens in those subroutines.
To me the task looks fruitless. There are too many unknown factors and it is unreasonable to expect a solution to be given despite the presence of those factors.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear all, Dear mecej4 ,
thanks a lot for your comment. I have put 2 instead of 2.0. As I told you I could not post the entire code. I do not know much about optimizations.
Here the subroutine with the essential parts:
MODULE MOD_MWSPH CONTAINS !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ SUBROUTINE MWSPH(nVar,npt,TayOrder,Ncoeff,Rbc,x0,y0,BCid,RP,LIJ,VOL,QQ,QQmls,LLn,VFN,istep) USE MOD_COMMONVARS, ONLY: FluxType USE MOD_NUMERICALFLUXES_RUSANOV USE MOD_NUMERICALFLUXES_OSHER USE MOD_KERNEL USE MOD_LISTPART USE MOD_GASCARAT USE MOD_VARS_MLS USE MOD_RECONSTRUCTION USE MOD_PDE USE MOD_PDEflux IMPLICIT NONE INTEGER ,INTENT(IN) :: nVar INTEGER ,INTENT(IN) :: npt INTEGER ,INTENT(IN) :: TayOrder INTEGER ,INTENT(IN) :: Ncoeff REAL ,INTENT(IN) :: Rbc REAL ,INTENT(IN) :: x0 REAL ,INTENT(IN) :: y0 INTEGER ,DIMENSION(npt) ,INTENT(IN) :: BCid REAL ,DIMENSION(2,npt) ,INTENT(IN) :: RP REAL ,DIMENSION(npt) ,INTENT(IN) :: Lij REAL ,DIMENSION(npt) ,INTENT(IN) :: VOL REAL ,DIMENSION(nVar,npt) ,INTENT(IN) :: QQ REAL ,DIMENSION(Ncoeff,npt,nVar) ,INTENT(IN) :: QQmls REAL ,DIMENSION(nVar+1,npt) ,INTENT(OUT) :: LLn REAL ,DIMENSION(2,npt) ,INTENT(OUT) :: VFN INTEGER ,INTENT(IN) :: istep !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ INTEGER :: ip,jp,ipc,jpc,ii,jj,status,iVar REAL :: dxL,dxR,dyL,dyR,rr,sumW REAL ,DIMENSION(2) :: VPL,VPR REAL :: W !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ LLn=0.d0 VFN=0.d0 DO ip=1,npt write(*,*) ip,npt sumW = 0.d0 IF(BCid(ip).EQ.0)THEN !bc ipc = PART(ip)%mcii jpc = PART(ip)%mcjj DO ii = ipc-1,ipc+1 DO jj = jpc-1,jpc+1 IF(ASSOCIATED(LISTPART(ii,jj)%head)==.true.) THEN CALL dlist_begin(LISTPART(ii,jj)) DO !jp=1,npt jp = LISTPART(ii,jj)%current%value ! Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp) sumW=sumW+W*VOL(jp) VFn=0.d0 IF(.NOT. dlist_move_next(LISTPART(ii,jj))) EXIT ENDDO ENDIF ENDDO ENDDO Vfn(:,ip) = Vfn(:,ip)/sumW ELSE Vfn(:,ip) = [0.d0,0.d0] ENDIF ENDDO Vfn(:,:) = 0.d0 ENDSUBROUTINE !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ !++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ENDMODULE
I have remove all the subroutine inside mwsph.90 and still the subroutione mwsph is slow if i use:
Vfn(:,ip) = Vfn(:,ip)+VPR*W*VOL(jp) sumW=sumW+W*VOL(jp)
I am sorry to insist and bore you, but I am here to understand and study.
Any helps and suggestion will be very very appreciate.
Diego
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If you are running Debug build, the default for the Runtime Check of: Check array subscript for out of range, is enabled. This adds considerable overhead. Once you are assured the file containing the subroutine is NOT accessing arrays with subscripts out of bounds, you can disable the array subscript checks for that file (or all files in the project). Consider having multiple Debug builds:
Debug - single threaded, all runtime checks
DebugNoBounds - single threaded, all runtime checks except for array subscript checks
DebugOpenMP - OpenMP threaded, all runtime checks
DebugOpenMPNoBounds - OpenMPthreaded, all runtime checks except for array subscript checks
...
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The quoted report shows allocation and copying of a temporary array at the location where you report stalls. That's often a bad sign. It could easily take more than twice as long as equivalent code without the temporary array.
It looks like you have potential dead code as you seem to populate a section of the array and then zero out the entire array. The actions of the compiler in dealing with dead code aren't easily predictable, but may not be as thorough here as they would be in a potentially vectorizable outer loop.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear all, Dear jimdempseyatthecove, Dear Tim,
now I have compiled just two times instead of one and it becames faster again. I have compiled it before with
ifort -r8 *.f90 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential -CB ifort -r8 *.f90 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential -CB
After with
ifort -r8 *.f90 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential ifort -r8 *.f90 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential
Now I need just one.
I really do not understand why. It is because the subroutine is inside a module?
Do you want that I upload the report fine another time?
Thanks a lot, I am learning a lot

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page