- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This small code is completely ruining my whole program performance. The code generated by the compiler has a huge overhead when calling to "SP_entry_add". The problem are not the instructions inside, but the function call itself.
Making some tests the problem seems to be associated to the call "CALL ALLOC_spMatrix(spA,nircn) ". If I simply replace this call for the two calls inside ALLOC_spMatrix ( that is "CALL ALLOC_ircn(spA%irn,spA%icn,nalloc)" and "CALL moveALLOCrel_realVector(spA%M_SP,nalloc)"), the code is much faster.
Therefore my guess is that the compiler is doing a very poor job when I try to pass the derived type spA and allocate memory in a different function. I checked with a different compiler and it doesn't suffer from this drawback.
MODULE TIPOS_DERIVADOS
    TYPE spMatrix
        INTEGER::nz=0,nz_tot=0,dimRow=0,dimCol=0
        INTEGER,ALLOCATABLE,DIMENSION(:)::irn,icn
        REAL(8),ALLOCATABLE,DIMENSION(:)::M_SP
        INTEGER,DIMENSION(:,:),ALLOCATABLE::pattern
        LOGICAL::preprocesada=.FALSE.,AvoidDoubles=.FALSE.
    END TYPE spMatrix
    
CONTAINS
    SUBROUTINE moveALLOCrel_intVector(irc,nalloc,ival)
        INTEGER nircn,nalloc,ival
 	    INTEGER,ALLOCATABLE,DIMENSION(:)::irc_aux,irc
	    INTENT(IN) nalloc
        INTENT(INOUT) irc
        OPTIONAL ival
       
        if(allocated(irc)) then
            nircn=size(irc)
            ALLOCATE(irc_aux(nircn+nalloc))
            irc_aux(1:nircn)=irc
            IF(PRESENT(ival)) irc_aux(nircn+1:nircn+nalloc)=ival
            CALL MOVE_ALLOC(irc_aux,irc)
        elseif(nalloc.gt.0) then
            ALLOCATE(irc(nalloc))
            IF(PRESENT(ival)) irc=ival
        else
            ALLOCATE(irc(25))  ! Tamaño mínimo de reserva de 25
            IF(PRESENT(ival)) irc=ival
        endif
    END SUBROUTINE moveALLOCrel_intVector
    SUBROUTINE moveALLOCrel_realVector(M_SP,nadd,val)
        INTEGER nM_SP,nadd
 	    REAL(8),ALLOCATABLE,DIMENSION(:)::M_SP_aux,M_SP
        REAL(8),INTENT(IN),OPTIONAL::val
	    INTENT(IN) nadd
        INTENT(INOUT) M_SP
        if(allocated(M_SP)) then
            nM_SP=size(M_SP)
            ALLOCATE(M_SP_aux(nM_SP+nadd))
            M_SP_aux(1:nM_SP)=M_SP
            IF(PRESENT(val)) M_SP_aux(nM_SP+1:nM_SP+nadd)=val
            CALL MOVE_ALLOC(M_SP_aux,M_SP)
        elseif(nadd.gt.0) then
            ALLOCATE(M_SP(nadd))
            IF(PRESENT(val)) M_SP=val
        else
            ALLOCATE(M_SP(25))  ! Tamaño mínimo de reserva de 25
            IF(PRESENT(val)) M_SP=val
        endif
    END SUBROUTINE moveALLOCrel_realVector
END MODULE TIPOS_DERIVADOS
    
MODULE sparse
USE TIPOS_DERIVADOS
CONTAINS
    SUBROUTINE SP_entry_add(spA,i,j,M_val,nz)
    TYPE(spMatrix)::spA
	REAL(8),INTENT(IN)::M_val
	INTEGER,INTENT(IN)::i,j
    INTEGER,INTENT(OUT),OPTIONAL::nz
	INTEGER nircn
	spA%nz=spA%nz+1
    if(PRESENT(nz)) nz=spA%nz
    IF(.NOT.spA%preprocesada) THEN
        spA%nz_tot=spA%nz
        if(ALLOCATED(spA%irn)) then
            nircn=size(spA%irn)
            IF (spA%nz.gt.nircn) CALL ALLOC_spMatrix(spA,nircn)
        else
            CALL ALLOC_spMatrix(spA,100)
        endif
	    spA%irn(spA%nz)=i
	    spA%icn(spA%nz)=j
        spA%dimRow = max(spA%dimRow, i)
        spA%dimCol = max(spA%dimCol, j)
    ELSEIF(spA%irn(spA%nz).ne.i.OR.spA%icn(spA%nz).ne.j) THEN
	    print *, 'sparse::SP_entry_add: ERROR DE ENSAMBLAJE, LOS INDICES DE ENSAMBLAJE NO COINCIDEN CON EL PREPROCESO. nz=', spA%nz
	    STOP -1
    ENDIF
	spA%M_SP(spA%nz)=M_val
    END SUBROUTINE SP_entry_add
    SUBROUTINE ALLOC_spMatrix(spA,nalloc)
        CLASS(spMatrix),INTENT(INOUT)::spA
        INTEGER nircn,nalloc
 	    INTEGER,ALLOCATABLE,DIMENSION(:)::irn_aux,icn_aux
	    INTENT(IN) nalloc
        CALL ALLOC_ircn(spA%irn,spA%icn,nalloc)
        CALL moveALLOCrel_realVector(spA%M_SP,nalloc)
    END SUBROUTINE ALLOC_spMatrix
    SUBROUTINE ALLOC_ircn(irow,icol,nalloc)
        INTEGER nircn,nalloc
 	    INTEGER,ALLOCATABLE,DIMENSION(:)::irn_aux,icn_aux,irow,icol
	    INTENT(IN) nalloc
        INTENT(INOUT) irow,icol
        CALL moveALLOCrel_intVector(irow,nalloc)
        CALL moveALLOCrel_intVector(icol,nalloc)
    END SUBROUTINE ALLOC_ircn
END MODULE sparse
PROGRAM main
USE sparse
IMPLICIT NONE
TYPE(spMatrix) spA
CALL SP_entry_ADD(spA,1,1,1.d0)
END PROGRAM main
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When you pass a TYPE(something) value to a dummy argument that is CLASS(something), the compiler has to build a large data structure pointing to the CLASS definition. This is a lot of code. There are probably other ways of handling this, but that's what the Intel compiler does.
I'm curious as to why you used CLASS here, since the type is not extended. If you change CLASS to TYPE it should go a lot faster. I'd guess that the other compiler does a better job of recognizing the case.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When you pass a TYPE(something) value to a dummy argument that is CLASS(something), the compiler has to build a large data structure pointing to the CLASS definition. This is a lot of code. There are probably other ways of handling this, but that's what the Intel compiler does.
I'm curious as to why you used CLASS here, since the type is not extended. If you change CLASS to TYPE it should go a lot faster. I'd guess that the other compiler does a better job of recognizing the case.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you very much Steve for pointing this out. This CLASS dummy argument was completely unnoticed in my tests.
SPmatrix is a type extended by other types. I guess that's why I defined it as a CLASS dummy argument, nevertheless for the use given in this small code is not needed and I probably can change it easily also in my library.
I will give it a try and let you know about the outcome.
Daniel.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Looking at the dissasembly, that ugly code present with the CLASS dummy argument dissapeared and this is noticeable looking at the performance too. so thank you vey much, Steve.
My comment now is: I suppose that Intel is aware of this "problem" and it doesn't need for a better implementation.
Daniel.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
There is a compiler option that may help you find those instances where the temporary storage is created.
/check:arg_temp_created -check arg_temp_created Enables run-time checking on whether actual arguments are copied into temporary storage before routine calls. If a copy is made at run-time, an informative message is displayed.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This isn't an argument temp, Barbara. It's the data structure the compiler uses for CLASS objects. An AWFUL lot of code is generated for each one, I have to think that there is a better way of handling this, maybe with some compile-time template that gets pointed to so only the parts unknown until runtime need to be filled in. An optimization could also be made for calls where the dummy is not CLASS(*). I know the compiler team has a lot on its plate, but this is going to be a sore point as more and more users embrace polymorphism.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you Steve and Barbara for your feedback.
Steve, I don't know what is the solution, but you are right that there is a better way of handling this. Other people in my team use gfortran, we compared the dissasembly and it seems that gfortran has some kind of optimization for this case, because the code is much smaller for the CLASS case.
Daniel.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Btw, if you think that this issue deserves a ticket, I can put it, for a better report of it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Daniel,
I may have already reported something similar to Intel Support (#04115658). In my case, the code creating the class descriptor is resulting in a benign data race in an OpenMP loop, which is adversely affecting performance. I'd say the more people that report this, the more likely it will be addressed.
Best regards,
Mark
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you Mark.
I think you are right. One of the reasons why we program in Fortran is because we assume that our code is going to be always optimal from the efficiency point of view and we should be freed of worrying about these details as much as possible. I still have to report it.
Daniel.
 
					
				
				
			
		
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page