Solved: Theoretically insignificant code ruining my performance

Daniel_Dopico · ‎05-06-2019

This small piece of code is completely ruining the performance of my whole program. The code generated by the compiler has a huge overhead when calling "SP_entry_add". The problem is not the instructions inside, but the function call itself.

After some tests, the problem seems to be associated with the call "CALL ALLOC_spMatrix(spA,nircn)". If I simply replace this call with the two calls inside ALLOC_spMatrix (that is, "CALL ALLOC_ircn(spA%irn,spA%icn,nalloc)" and "CALL moveALLOCrel_realVector(spA%M_SP,nalloc)"), the code is much faster.

Therefore my guess is that the compiler is doing a very poor job when I try to pass the derived type spA and allocate memory in a different function. I checked with a different compiler and it doesn't suffer from this drawback.

MODULE TIPOS_DERIVADOS
    TYPE spMatrix
        INTEGER::nz=0,nz_tot=0,dimRow=0,dimCol=0
        INTEGER,ALLOCATABLE,DIMENSION(:)::irn,icn
        REAL(8),ALLOCATABLE,DIMENSION(:)::M_SP
        INTEGER,DIMENSION(:,:),ALLOCATABLE::pattern
        LOGICAL::preprocesada=.FALSE.,AvoidDoubles=.FALSE.
    END TYPE spMatrix
    
CONTAINS

    SUBROUTINE moveALLOCrel_intVector(irc,nalloc,ival)
        INTEGER nircn,nalloc,ival
 	    INTEGER,ALLOCATABLE,DIMENSION(:)::irc_aux,irc
	    INTENT(IN) nalloc
        INTENT(INOUT) irc
        OPTIONAL ival
       
        if(allocated(irc)) then
            nircn=size(irc)
            ALLOCATE(irc_aux(nircn+nalloc))
            irc_aux(1:nircn)=irc
            IF(PRESENT(ival)) irc_aux(nircn+1:nircn+nalloc)=ival
            CALL MOVE_ALLOC(irc_aux,irc)
        elseif(nalloc.gt.0) then
            ALLOCATE(irc(nalloc))
            IF(PRESENT(ival)) irc=ival
        else
            ALLOCATE(irc(25))  ! Minimum allocation size of 25
            IF(PRESENT(ival)) irc=ival
        endif
    END SUBROUTINE moveALLOCrel_intVector

    SUBROUTINE moveALLOCrel_realVector(M_SP,nadd,val)
        INTEGER nM_SP,nadd
 	    REAL(8),ALLOCATABLE,DIMENSION(:)::M_SP_aux,M_SP
        REAL(8),INTENT(IN),OPTIONAL::val
	    INTENT(IN) nadd
        INTENT(INOUT) M_SP

        if(allocated(M_SP)) then
            nM_SP=size(M_SP)
            ALLOCATE(M_SP_aux(nM_SP+nadd))
            M_SP_aux(1:nM_SP)=M_SP
            IF(PRESENT(val)) M_SP_aux(nM_SP+1:nM_SP+nadd)=val
            CALL MOVE_ALLOC(M_SP_aux,M_SP)
        elseif(nadd.gt.0) then
            ALLOCATE(M_SP(nadd))
            IF(PRESENT(val)) M_SP=val
        else
            ALLOCATE(M_SP(25))  ! Minimum allocation size of 25
            IF(PRESENT(val)) M_SP=val
        endif
    END SUBROUTINE moveALLOCrel_realVector
END MODULE TIPOS_DERIVADOS
    
MODULE sparse
USE TIPOS_DERIVADOS
CONTAINS
    SUBROUTINE SP_entry_add(spA,i,j,M_val,nz)
    TYPE(spMatrix)::spA
	REAL(8),INTENT(IN)::M_val
	INTEGER,INTENT(IN)::i,j
    INTEGER,INTENT(OUT),OPTIONAL::nz
	INTEGER nircn

	spA%nz=spA%nz+1
    if(PRESENT(nz)) nz=spA%nz
    IF(.NOT.spA%preprocesada) THEN
        spA%nz_tot=spA%nz
        if(ALLOCATED(spA%irn)) then
            nircn=size(spA%irn)
            IF (spA%nz.gt.nircn) CALL ALLOC_spMatrix(spA,nircn)
        else
            CALL ALLOC_spMatrix(spA,100)
        endif
	    spA%irn(spA%nz)=i
	    spA%icn(spA%nz)=j
        spA%dimRow = max(spA%dimRow, i)
        spA%dimCol = max(spA%dimCol, j)
    ELSEIF(spA%irn(spA%nz).ne.i.OR.spA%icn(spA%nz).ne.j) THEN
	    print *, 'sparse::SP_entry_add: ASSEMBLY ERROR, THE ASSEMBLY INDICES DO NOT MATCH THE PREPROCESSING. nz=', spA%nz
	    STOP -1
    ENDIF
	spA%M_SP(spA%nz)=M_val
    END SUBROUTINE SP_entry_add

    SUBROUTINE ALLOC_spMatrix(spA,nalloc)
        CLASS(spMatrix),INTENT(INOUT)::spA
        INTEGER nircn,nalloc
 	    INTEGER,ALLOCATABLE,DIMENSION(:)::irn_aux,icn_aux
	    INTENT(IN) nalloc

        CALL ALLOC_ircn(spA%irn,spA%icn,nalloc)
        CALL moveALLOCrel_realVector(spA%M_SP,nalloc)
    END SUBROUTINE ALLOC_spMatrix

    SUBROUTINE ALLOC_ircn(irow,icol,nalloc)
        INTEGER nircn,nalloc
 	    INTEGER,ALLOCATABLE,DIMENSION(:)::irn_aux,icn_aux,irow,icol
	    INTENT(IN) nalloc
        INTENT(INOUT) irow,icol

        CALL moveALLOCrel_intVector(irow,nalloc)
        CALL moveALLOCrel_intVector(icol,nalloc)
    END SUBROUTINE ALLOC_ircn

END MODULE sparse

PROGRAM main
USE sparse
IMPLICIT NONE
TYPE(spMatrix) spA

CALL SP_entry_ADD(spA,1,1,1.d0)

END PROGRAM main

Steve_Lionel · ‎05-06-2019

When you pass a TYPE(something) value to a dummy argument that is CLASS(something), the compiler has to build a large data structure pointing to the CLASS definition. This is a lot of code. There are probably other ways of handling this, but that's what the Intel compiler does.

I'm curious as to why you used CLASS here, since the type is not extended. If you change CLASS to TYPE it should go a lot faster. I'd guess that the other compiler does a better job of recognizing the case.
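The difference described above can be reduced to a minimal sketch. The names `vec_t`, `reserve_class`, and `reserve_type` are hypothetical and stand in for `spMatrix` and `ALLOC_spMatrix`; the two routines are identical except for the declaration of the dummy argument. Comparing the disassembly of the two call sites should show whether a given compiler builds a class descriptor for the first call.

```fortran
! Minimal sketch with hypothetical names: the same routine written once
! with a CLASS dummy and once with a TYPE dummy.
module class_vs_type_demo
    implicit none
    type :: vec_t                          ! stand-in for spMatrix
        integer :: n = 0
        real(8), allocatable :: v(:)
    end type vec_t
contains
    ! Polymorphic dummy: when the actual argument is TYPE(vec_t), the
    ! compiler may have to build a class descriptor at the call site.
    subroutine reserve_class(a, nalloc)
        class(vec_t), intent(inout) :: a
        integer, intent(in) :: nalloc
        if (.not. allocated(a%v)) allocate(a%v(nalloc))
    end subroutine reserve_class

    ! Non-polymorphic dummy: the dynamic type is known at compile time.
    subroutine reserve_type(a, nalloc)
        type(vec_t), intent(inout) :: a
        integer, intent(in) :: nalloc
        if (.not. allocated(a%v)) allocate(a%v(nalloc))
    end subroutine reserve_type
end module class_vs_type_demo

program demo
    use class_vs_type_demo
    implicit none
    type(vec_t) :: a, b

    call reserve_class(a, 100)   ! call site to inspect in the disassembly
    call reserve_type(b, 100)    ! same work, TYPE dummy

    ! Both routines must leave the object in the same state.
    if (size(a%v) /= size(b%v)) error stop 'results differ'
    print *, size(a%v), size(b%v)
end program demo
```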

Daniel_Dopico · ‎05-07-2019

Thank you very much, Steve, for pointing this out. This CLASS dummy argument went completely unnoticed in my tests.

spMatrix is a type extended by other types. I guess that's why I declared the dummy argument as CLASS. Nevertheless, it is not needed for the way it is used in this small code, and I can probably change it easily in my library as well.

I will give it a try and let you know about the outcome.

Daniel.
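When a type is extended by other types, a routine with a TYPE dummy can still be called on an extended object by passing its parent component. The sketch below uses hypothetical names (`base_t`, `child_t`, `bump`) and is only one possible way to keep non-polymorphic dummies in a library with type extension; whether it fits a particular design is an assumption.

```fortran
! Sketch with hypothetical names: calling a TYPE(base_t) routine on an
! object of an extended type through its parent component.
module parent_component_demo
    implicit none
    type :: base_t                         ! stand-in for spMatrix
        integer :: nz = 0
    end type base_t
    type, extends(base_t) :: child_t       ! stand-in for an extended type
        integer :: extra = 0
    end type child_t
contains
    subroutine bump(a)
        type(base_t), intent(inout) :: a   ! non-polymorphic dummy
        a%nz = a%nz + 1
    end subroutine bump
end module parent_component_demo

program demo
    use parent_component_demo
    implicit none
    type(child_t) :: c

    ! c%base_t is the parent component of c; it has declared type base_t,
    ! so it matches the TYPE dummy without any polymorphism.
    call bump(c%base_t)

    if (c%nz /= 1) error stop 'parent component was not updated'
    print *, c%nz
end program demo
```

The trade-off is that `bump` can no longer dispatch on the dynamic type, so this only applies to routines that touch components of the parent type alone.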

Daniel_Dopico · ‎05-08-2019

Looking at the disassembly, the ugly code that was present with the CLASS dummy argument has disappeared, and the improvement is noticeable in the performance too. So thank you very much, Steve.

My comment now is: I suppose that Intel is aware of this "problem" and does not see the need for a better implementation.

Daniel.
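A rough way to check the effect without reading the disassembly is to time many calls to each variant. The following is a sketch with hypothetical names (`counter_t`, `inc_class`, `inc_type`) using the standard `system_clock` intrinsic. The timings depend on the compiler, version, and optimization flags, and an optimizer that inlines both routines may remove the difference entirely, so no particular result is implied.

```fortran
! Timing sketch with hypothetical names: many calls through a CLASS dummy
! versus the same calls through a TYPE dummy.
module timing_demo
    implicit none
    type :: counter_t
        integer :: n = 0
    end type counter_t
contains
    subroutine inc_class(c)
        class(counter_t), intent(inout) :: c
        c%n = c%n + 1
    end subroutine inc_class

    subroutine inc_type(c)
        type(counter_t), intent(inout) :: c
        c%n = c%n + 1
    end subroutine inc_type
end module timing_demo

program demo
    use timing_demo
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer, parameter :: ncalls = 10000000
    type(counter_t) :: a, b
    integer(int64) :: t0, t1, t2, rate
    integer :: k

    call system_clock(t0, rate)
    do k = 1, ncalls
        call inc_class(a)        ! TYPE actual passed to CLASS dummy
    end do
    call system_clock(t1)
    do k = 1, ncalls
        call inc_type(b)         ! TYPE actual passed to TYPE dummy
    end do
    call system_clock(t2)

    ! Use the results so the loops cannot be discarded as dead code.
    if (a%n /= ncalls .or. b%n /= ncalls) error stop 'unexpected call count'

    print *, 'CLASS dummy: ', real(t1 - t0) / real(rate), ' s'
    print *, 'TYPE dummy:  ', real(t2 - t1) / real(rate), ' s'
end program demo
```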

Barbara_P_Intel · ‎05-08-2019

There is a compiler option that may help you find those instances where the temporary storage is created.

/check:arg_temp_created 
-check arg_temp_created
Enables run-time checking on whether actual arguments are copied into temporary storage before routine calls. If a copy is made at run-time, an informative message is displayed.
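For context, the kind of call this option is meant to report looks roughly like the sketch below: a strided array section passed to an explicit-shape dummy, which typically forces the compiler to copy the section into contiguous temporary storage and copy it back afterwards. The names `arg_temp_demo` and `double_values` are hypothetical, and whether a temporary is actually created depends on the compiler.

```fortran
! Sketch with hypothetical names: a call that typically needs an
! argument temporary (copy-in/copy-out of a non-contiguous section).
module arg_temp_demo
    implicit none
contains
    subroutine double_values(x, n)
        integer, intent(in) :: n
        real(8), intent(inout) :: x(n)   ! explicit shape: contiguous storage expected
        x = 2.0d0 * x
    end subroutine double_values
end module arg_temp_demo

program demo
    use arg_temp_demo
    implicit none
    real(8) :: a(10)

    a = 1.0d0
    ! a(1:10:2) is a strided section, so it is not contiguous in memory.
    call double_values(a(1:10:2), 5)

    ! Odd-indexed elements were doubled, even-indexed ones are untouched.
    if (any(a(1:10:2) /= 2.0d0) .or. any(a(2:10:2) /= 1.0d0)) &
        error stop 'unexpected values after the call'
    print *, a
end program demo
```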

Steve_Lionel · ‎05-08-2019

This isn't an argument temp, Barbara. It's the data structure the compiler uses for CLASS objects. An AWFUL lot of code is generated for each one. I have to think that there is a better way of handling this, maybe with some compile-time template that gets pointed to, so that only the parts unknown until runtime need to be filled in. An optimization could also be made for calls where the dummy is not CLASS(*). I know the compiler team has a lot on its plate, but this is going to be a sore point as more and more users embrace polymorphism.

Daniel_Dopico · ‎05-09-2019

Thank you Steve and Barbara for your feedback.

Steve, I don't know what the solution is, but you are right that there is a better way of handling this. Other people in my team use gfortran. We compared the disassembly, and it seems that gfortran has some kind of optimization for this case, because the code is much smaller for the CLASS case.

Daniel.

Daniel_Dopico · ‎05-09-2019

By the way, if you think this issue deserves a ticket, I can file one so that it is properly reported.

Mark_Lewy · ‎05-10-2019

Hi Daniel,

I may have already reported something similar to Intel Support (#04115658). In my case, the code creating the class descriptor is resulting in a benign data race in an OpenMP loop, which is adversely affecting performance. I'd say the more people that report this, the more likely it will be addressed.

Best regards,

Mark

Daniel_Dopico · ‎05-10-2019

Thank you Mark.

I think you are right. One of the reasons we program in Fortran is that we assume our code will always be optimal in terms of efficiency, and we should be freed from worrying about these details as much as possible. I still have to report it.

Daniel.