Hi there,
in the course of modernizing some of my code I have moved many "call by reference" routines that were previously stand-alone behind types. In addition, arrays are now passed via type-bound pointers, usually with the "contiguous" attribute. This has the big advantage that array bounds no longer need to be passed explicitly (many of my arrays start at index value zero). However, I have noticed a speed difference to the extent that the type-bound routines need up to twice as much time to process large arrays as the direct routines.
The program below mimics this structure. It implements the first part of an implicit multiplication of a 4.5 million x 100 matrix with a 4.8 million x 4.8 million structured sparse covariance matrix (the latter can be stored as four vectors and is held by the type). The routine needs about 4.8 seconds when called directly and about 6.6 seconds when called through the type. That is not a big difference in absolute numbers, but it adds up when the routine is executed several thousand times. Given that the type passes the array into the routine via a pointer with the "contiguous" attribute, this speed difference should not appear; then again, maybe I have misunderstood the standard. The speed was measured on an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 processors. The compiler was ifort 17.4. The data set can be supplied on request.
Any ideas?
Thanks a lot
Module Mod_Direct
contains
Subroutine SubDrop(ISUBound1,ISUBound2,RMInOut,ivi,ivs&
&,ivd,rvp,ISNThreads)
!$ use omp_lib
Implicit None
Integer*8, Intent(In) :: ISUbound1, ISUBound2
Real*8, Intent(InOut) :: RMInOut(0:ISUbound1,1:ISUBound2)
Integer*8, Intent(In) :: ivs(:), ivd(:), ivi(:)
Real*8, Intent(In) :: RVp(:)
Integer*4, intent(in), optional :: ISNThreads
Integer*8 :: c1, c2, ss, dd, ii
outer:block
RMInOut(0,:)=0.0D00
!$ if(present(ISNThreads)) Then
!$ if(ISNThreads>size(RMInOUt,2)) Then
!$ call omp_set_num_threads(size(RMInOut,2))
!$ else
!$ call omp_set_num_threads(ISNThreads)
!$ End if
!$ else
!$ c1=omp_get_max_threads()
!$ if(c1>size(RMInout,2)) Then
!$ call omp_set_num_threads(size(RMInout,2))
!$ else
!$ call omp_set_num_threads(int(c1,4))
!$ End if
!$ end if
!$OMP PARALLEL DO PRIVATE(ss,dd,ii,c2,c1) ! ii must be private, otherwise threads race on it
Do c1=1,size(RMInOut,2)
Do c2=1,size(IVI,1)
ss=ivs(c2)
dd=ivd(c2)
ii=ivi(c2)
RMInOut(ii,c1)=RMInOut(ii,c1)+rvp(c2)*(RMInOut(ss,c1)&
&+RMInOut(dd,c1))
End Do
End Do
!$OMP END PARALLEL DO
End block outer
End Subroutine SubDrop
end Module Mod_Direct
Module Mod_Type
Type :: Testtype
Integer*8, Allocatable :: ivi(:), ivs(:), ivd(:)
Integer*8 :: isn
Integer*4 :: ISSubStat
Real*8, Allocatable :: rvp(:)
Real*8, Pointer, contiguous :: RMInout(:,:)
Character(:), allocatable :: csmsg
contains
procedure, pass, public :: drop=>subdrop
End type Testtype
Interface
Module Subroutine SubDrop(this,ISNThreads)
Class(TestType) :: this
Integer*4, optional :: ISNThreads
end Subroutine
End Interface
Private :: SubDrop
end Module Mod_Type
SubModule(Mod_Type) Drop
contains
Module Procedure SubDrop
!$ use omp_lib
Implicit None
Integer*8 :: c1, c2, ss, dd, ii
outer:block
if(.not.associated(this%RMInOut)) Then
this%CSMSG="ERROR"
this%ISSubStat=1;exit outer
end if
if(lbound(this%RMInOut,1)/=0) Then
this%CSMSG="ERROR"
this%ISSubStat=1;exit outer
End if
if(ubound(this%RMInOut,1)/=this%isn) Then
this%CSMSG="ERROR"
this%ISSubStat=1;exit outer
End if
this%RMInOut(0,:)=0.0D0
!$ if(present(ISNThreads)) Then
!$ if(ISNThreads>size(this%RMInOUt,2)) Then
!$ call omp_set_num_threads(size(this%RMInOut,2))
!$ else
!$ call omp_set_num_threads(ISNThreads)
!$ End if
!$ else
!$ c1=omp_get_max_threads()
!$ if(c1>size(this%RMInout,2)) Then
!$ call omp_set_num_threads(size(this%RMInout,2))
!$ else
!$ call omp_set_num_threads(int(c1,4))
!$ End if
!$ end if
!$OMP PARALLEL DO PRIVATE(ss,dd,ii,c2,c1) ! ii must be private, otherwise threads race on it
Do c1=1,size(this%RMInOut,2)
Do c2=1,size(this%ivi,1)
ss=this%ivs(c2)
dd=this%ivd(c2)
ii=this%Ivi(c2)
this%RMInOut(ii,c1)=this%RMInOut(ii,c1)+this%RVP(c2)&
&*(this%RMInOut(ss,c1)+this%RMInOut(dd,c1))
End Do
End Do
!$OMP END PARALLEL DO
End block outer
end Procedure
End SubModule Drop
Program Test
use Mod_Type
use Mod_Direct
Implicit none
Type(TestType) :: TST
integer :: dim=4876565, dim3=100, c1
real*8, target, allocatable :: rmtmp(:,:)
real*8 :: t0, t1
!$ call omp_set_nested(.TRUE.)
Allocate(TST%ivi(dim),TST%ivs(dim),TST%ivd(dim),TST&
&%rvp(dim))
open(55,file="input.txt",action="read")
Do c1=1,dim
read(55,*) TST%ivi(c1),tst%ivs(c1),tst%ivd(c1),tst%rvp(c1)
end Do
tst%isn=maxval(tst%ivi)
Allocate(rmtmp(0:tst%isn,dim3),source=0.0D0)
!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
TST%RMInOut=>rmtmp
call TST%drop()
!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
!!call SubDrop(ISUBound1=Int(tst%isn,8),ISUBound2=Int(dim3,8),RMInout&
!! &=rmtmp,ivi=tst%ivi,ivs=tst%ivs,ivd=tst%ivd,rvp=tst%rvp)
End Program Test
Try using a parameter, which is visible to the compiler, as opposed to size(...), which requires a peek into an array descriptor.
Now this said, your eventual code may not necessarily use parameter-specified dimensions. The purpose of using the parameter in your test program is to aid in diagnosing a potential compiler optimization deficiency, which will help the compiler developers in addressing a solution.
Jim Dempsey
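To illustrate the point Jim is making, here is a minimal C sketch (the names `NCOLS`, `sum_fixed`, and `sum_runtime` are illustrative, not from the thread): when the loop extent is a compile-time constant, the optimizer knows the trip count and can make unrolling and vectorization decisions accordingly; when the extent is fetched at run time, as `size(RMInOut,2)` is from the array descriptor, it must assume an unknown count.

```c
#include <stddef.h>

#define NCOLS 100  /* compile-time extent, analogous to a Fortran parameter */

/* Extent known at compile time: the trip count is a constant the
   optimizer can exploit for unrolling and vectorization. */
double sum_fixed(const double *restrict col)
{
    double s = 0.0;
    for (size_t i = 0; i < NCOLS; ++i)
        s += col[i];
    return s;
}

/* Extent passed at run time (analogous to size(...) read from the
   array descriptor): the compiler must treat n as unknown. */
double sum_runtime(const double *restrict col, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += col[i];
    return s;
}
```

Both functions compute the same result; only the information available to the optimizer differs, which is exactly what the suggested test isolates.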
Are you both looking at the optimization reports?
I don't see any problem in vectorizing the addition in the latest test code. No need to change the simple array notation. The difference is that there is an additional layer, an indirect (virtual) function call, which prevents inlining. I doubt that the call overhead itself is significant for such a large array, but the lack of inlining means that less information (alignment, for example) is available to the vectorizer. Incidentally, I doubt whether -O3 brings any benefit in this simple test case.
Yes, it is possible that -ipo might help. Not because of inlining, which is already enabled within the source file, but because of "whole program optimization". If the compiler detects that it is seeing the entire program, it knows that it is seeing all possible targets of the indirect function call, which allows it to do more optimizations including inlining of particular targets. When I tried it, however, I saw an internal compiler error. When I added -xavx to the command line, the build worked, the whole program was detected and Subad was inlined, but the program did not run any faster than without -ipo and the vectorization was not as clean as when the addition was coded directly in subadab and subadcd.
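The distinction described above can be sketched in C (the names `scale_add`, `apply_direct`, and `apply_indirect` are hypothetical, chosen only for illustration): a direct call names its target, so the compiler can inline it and propagate alignment and aliasing facts into the loop, whereas a call through a pointer has a target known only at run time, which is the situation a type-bound (potentially overridden) procedure puts the compiler in unless whole-program optimization proves there is a single possible target.

```c
/* The kernel the optimizer would like to inline. */
static double scale_add(double x, double y) { return x + 2.0 * y; }

/* Direct call: the compiler sees the target and can inline it. */
double apply_direct(const double *x, const double *y, double *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = scale_add(x[i], y[i]);
    return out[n - 1];
}

/* Indirect call (analogous to type-bound/virtual dispatch): the target
   is only known at run time, so inlining -- and with it alignment and
   aliasing information -- is lost unless the compiler can prove there
   is exactly one possible target. */
double apply_indirect(double (*fn)(double, double),
                      const double *x, const double *y, double *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = fn(x[i], y[i]);
    return out[n - 1];
}
```

Both routines produce identical results; the difference is purely in what the optimizer can see, which matches the observation that the generated vector code, not the call overhead itself, accounts for the timing gap.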
Hi all,
@Jim: Your suggestion, even if it made a difference, would only be applicable for small programs that are re-compiled by the end user. That is not the case for me.
@Martyn: Yes, I looked at the optimization report. That's where I picked up that there is hardly any difference except the missing inlining. However, I did not look at the report when I tried Jim's proposal. With the -ip option (from a forum entry I later deleted) I no longer got any report, just the advice to use the "advisor". Unfortunately I did not get that to work; it crashed with "[Instrumentation Engine]: Source/pin/injector_nonmac/auxvector.cpp: CopyAux: 291: unexpected AUX VEC type 26".
However, as I understand it, the question of where the processing-time difference comes from, especially with regard to the ifort vs. gfortran comparison, is not yet solved.
cheers