Hi there,
in the course of modernizing some of my code I have moved many "call by reference" routines, which were previously stand-alone, behind types. In addition, arrays are now passed via type-bound pointers, usually with the "contiguous" attribute. This has the big advantage that array bounds no longer need to be passed explicitly (many of my arrays start at index value zero). However, I have noticed a speed difference: the type-bound routines need up to twice as much time to process large arrays as the direct routines.
The program below mimics this structure. It implements the first part of an implicit multiplication of a 4.5 million x 100 matrix with a 4.8 million x 4.8 million structured sparse covariance matrix (the latter can be stored in the form of four vectors and is held by the type). This routine needs about 4.8 seconds when called directly and about 6.6 seconds when called through the type. That is not a big difference in absolute numbers, but it adds up when the operation is performed several thousand times. Given that the type passes the array into the routine via a pointer with the "contiguous" attribute, this difference in speed should not appear; however, maybe I have misunderstood the standard. The speed was measured on an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 processors. The compiler was ifort 17.4. The data set can be supplied on request.
Any ideas?
Thanks a lot
Module Mod_Direct
contains
  Subroutine SubDrop(ISUBound1,ISUBound2,RMInOut,ivi,ivs,ivd,rvp,ISNThreads)
    !$ use omp_lib
    Implicit None
    Integer*8, Intent(In) :: ISUbound1, ISUBound2
    Real*8, Intent(InOut) :: RMInOut(0:ISUbound1,1:ISUBound2)
    Integer*8, Intent(In) :: ivs(:), ivd(:), ivi(:)
    Real*8, Intent(In) :: RVp(:)
    Integer*4, intent(in), optional :: ISNThreads
    Integer*8 :: c1, c2, ss, dd, ii
    outer: block
      RMInOut(0,:)=0.0D00
      !$ if(present(ISNThreads)) Then
      !$   if(ISNThreads>size(RMInOUt,2)) Then
      !$     call omp_set_num_threads(size(RMInOut,2))
      !$   else
      !$     call omp_set_num_threads(ISNThreads)
      !$   End if
      !$ else
      !$   c1=omp_get_max_threads()
      !$   if(c1>size(RMInout,2)) Then
      !$     call omp_set_num_threads(size(RMInout,2))
      !$   else
      !$     call omp_set_num_threads(int(c1,4))
      !$   End if
      !$ end if
      !$OMP PARALLEL DO PRIVATE(ss,dd,c2,c1)
      Do c1=1,size(RMInOut,2)
        Do c2=1,size(IVI,1)
          ss=ivs(c2)
          dd=ivd(c2)
          ii=ivi(c2)
          RMInOut(ii,c1)=RMInOut(ii,c1)+rvp(c2)*(RMInOut(ss,c1)+RMInOut(dd,c1))
        End Do
      End Do
      !$OMP END PARALLEL DO
    End block outer
  End Subroutine SubDrop
end Module Mod_Direct

Module Mod_Type
  Type :: Testtype
    Integer*8, Allocatable :: ivi(:), ivs(:), ivd(:)
    Integer*8 :: isn
    Integer*4 :: ISSubStat
    Real*8, Allocatable :: rvp(:)
    Real*8, Pointer, contiguous :: RMInout(:,:)
    Character(:), allocatable :: csmsg
  contains
    procedure, pass, public :: drop=>subdrop
  End type Testtype
  Interface
    Module Subroutine SubDrop(this,ISNThreads)
      Class(TestType) :: this
      Integer*4, optional :: ISNThreads
    end Subroutine
  End Interface
  Private :: SubDrop
end Module Mod_Type

SubModule(Mod_Type) Drop
contains
  Module Procedure SubDrop
    !$ use omp_lib
    Implicit None
    Integer*8 :: c1, c2, ss, dd, ii
    outer: block
      if(.not.associated(this%RMInOut)) Then
        this%CSMSG="ERROR"
        this%ISSubStat=1; exit outer
      end if
      if(lbound(this%RMInOut,1)/=0) Then
        this%CSMSG="ERROR"
        this%ISSubStat=1; exit outer
      End if
      if(ubound(this%RMInOut,1)/=this%isn) Then
        this%CSMSG="ERROR"
        this%ISSubStat=1; exit outer
      End if
      this%RMInOut(0,:)=0.0D0
      !$ if(present(ISNThreads)) Then
      !$   if(ISNThreads>size(this%RMInOUt,2)) Then
      !$     call omp_set_num_threads(size(this%RMInOut,2))
      !$   else
      !$     call omp_set_num_threads(ISNThreads)
      !$   End if
      !$ else
      !$   c1=omp_get_max_threads()
      !$   if(c1>size(this%RMInout,2)) Then
      !$     call omp_set_num_threads(size(this%RMInout,2))
      !$   else
      !$     call omp_set_num_threads(int(c1,4))
      !$   End if
      !$ end if
      !$OMP PARALLEL DO PRIVATE(ss,dd,c2,c1)
      Do c1=1,size(this%RMInOut,2)
        Do c2=1,size(this%ivi,1)
          ss=this%ivs(c2)
          dd=this%ivd(c2)
          ii=this%Ivi(c2)
          this%RMInOut(ii,c1)=this%RMInOut(ii,c1)+this%RVP(c2)*(this%RMInOut(ss,c1)+this%RMInOut(dd,c1))
        End Do
      End Do
      !$OMP END PARALLEL DO
    End block outer
  end Procedure
End SubModule Drop

Program Test
  use Mod_Type
  use Mod_Direct
  Implicit none
  Type(TestType) :: TST
  integer :: dim=4876565, dim3=100, c1
  real*8, target, allocatable :: rmtmp(:,:)
  real*8 :: t0, t1
  !$ call omp_set_nested(.TRUE.)
  Allocate(TST%ivi(dim),TST%ivs(dim),TST%ivd(dim),TST%rvp(dim))
  open(55,file="input.txt",action="read")
  Do c1=1,dim
    read(55,*) TST%ivi(c1),tst%ivs(c1),tst%ivd(c1),tst%rvp(c1)
  end Do
  tst%isn=maxval(tst%ivi)
  Allocate(rmtmp(0:tst%isn,dim3),source=0.0D0)
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  TST%RMInOut=>rmtmp
  call TST%drop()
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  !!call SubDrop(ISUBound1=Int(tst%isn,8),ISUBound2=Int(dim3,8),RMInout&
  !!   &=rmtmp,ivi=tst%ivi,ivs=tst%ivs,ivd=tst%ivd,rvp=tst%rvp)
End Program Test
Were you using default integer indexing previously? Could you show the comparison reports with -qopt-report=4?
The IBM360 integer*8 idiom hardly counts as modernization.
Hi,
thanks for the response. I only used the "*8" syntax for this example. In the "real" world I use a "selected real kind" function etc.
What does your first question mean?
I have the following additional facts: there is no speed difference when running on a single core, and the speed difference almost vanishes when avoiding "this%" inside the OMP loops by putting an "associate" block around the loops.
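The associate workaround looks roughly like this (a sketch only, reusing the names from the test program; the type components are hoisted into local associate-names before the parallel region so the loop body references plain arrays):

```fortran
! Sketch: hoist "this%" components into associate-names once, outside
! the hot loops. Untested fragment; names match the Testtype example.
associate (RMInOut => this%RMInOut, ivi => this%ivi, ivs => this%ivs, &
           ivd => this%ivd, rvp => this%rvp)
  !$OMP PARALLEL DO PRIVATE(ss,dd,ii,c2)
  Do c1 = 1, size(RMInOut,2)
    Do c2 = 1, size(ivi,1)
      ss = ivs(c2); dd = ivd(c2); ii = ivi(c2)
      RMInOut(ii,c1) = RMInOut(ii,c1) + rvp(c2)*(RMInOut(ss,c1)+RMInOut(dd,c1))
    End Do
  End Do
  !$OMP END PARALLEL DO
end associate
```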
I changed the code slightly before getting the optimization report, so both the code behind the report and the report itself are below. I did the compilation on an Intel(R) Core(TM) i7-6820HK CPU @ 2.70GHz with 8 processors.
thanks
Module Mod_Direct
contains
  Subroutine SubDrop(ISUBound1,ISUBound2,RMInOut,ivi,ivs,ivd,rvp,ISNThreads)
    !$ use omp_lib
    Implicit None
    Integer*8, Intent(In) :: ISUbound1, ISUBound2
    Real*8, Intent(InOut) :: RMInOut(0:ISUbound1,1:ISUBound2)
    Integer*8, Intent(In) :: ivs(:), ivd(:), ivi(:)
    Real*8, Intent(In) :: RVp(:)
    Integer*4, intent(in), optional :: ISNThreads
    Integer*8 :: c1, c2, ss, dd, ii
    outer: block
      RMInOut(0,:)=0.0D00
      !$ if(present(ISNThreads)) Then
      !$   if(ISNThreads>size(RMInOUt,2)) Then
      !$     call omp_set_num_threads(size(RMInOut,2))
      !$   else
      !$     call omp_set_num_threads(ISNThreads)
      !$   End if
      !$ else
      !$   c1=omp_get_max_threads()
      !$   if(c1>size(RMInout,2)) Then
      !$     call omp_set_num_threads(size(RMInout,2))
      !$   else
      !$     call omp_set_num_threads(int(c1,4))
      !$   End if
      !$ end if
      !$OMP PARALLEL DO PRIVATE(ss,dd,c2,c1)
      Do c1=1,size(RMInOut,2)
        Do c2=1,size(IVI,1)
          ss=ivs(c2)
          dd=ivd(c2)
          ii=ivi(c2)
          RMInOut(ii,c1)=RMInOut(ii,c1)+rvp(c2)*(RMInOut(ss,c1)+RMInOut(dd,c1))
        End Do
      End Do
      !$OMP END PARALLEL DO
    End block outer
  End Subroutine SubDrop
end Module Mod_Direct

Module Mod_Type
  Type :: Testtype
    Integer*8, Allocatable :: ivi(:), ivs(:), ivd(:)
    Integer*8 :: isn
    Integer*4 :: ISSubStat
    Real*8, Allocatable :: rvp(:)
    Real*8, Pointer, contiguous :: RMInout(:,:)
    Character(:), allocatable :: csmsg
  contains
    procedure, pass, public :: drop=>subdrop
  End type Testtype
  Interface
    Module Subroutine SubDrop(this,ISNThreads)
      Class(TestType) :: this
      Integer*4, optional :: ISNThreads
    end Subroutine
  End Interface
  Private :: SubDrop
end Module Mod_Type

SubModule(Mod_Type) Drop
contains
  Module Procedure SubDrop
    !$ use omp_lib
    Implicit None
    Integer*8 :: c1, c2, ss, dd, ii
    outer: block
      this%RMInOut(0,:)=0.0D0
      !$ if(present(ISNThreads)) Then
      !$   if(ISNThreads>size(this%RMInOUt,2)) Then
      !$     call omp_set_num_threads(size(this%RMInOut,2))
      !$   else
      !$     call omp_set_num_threads(ISNThreads)
      !$   End if
      !$ else
      !$   c1=omp_get_max_threads()
      !$   if(c1>size(this%RMInout,2)) Then
      !$     call omp_set_num_threads(size(this%RMInout,2))
      !$   else
      !$     call omp_set_num_threads(int(c1,4))
      !$   End if
      !$ end if
      !$OMP PARALLEL DO PRIVATE(ss,dd,c2,c1)
      Do c1=1,size(this%RMInOut,2)
        Do c2=1,size(this%ivi,1)
          ss=this%ivs(c2)
          dd=this%ivd(c2)
          ii=this%Ivi(c2)
          this%RMInOut(ii,c1)=this%RMInOut(ii,c1)+this%RVP(c2)*(this%RMInOut(ss,c1)+this%RMInOut(dd,c1))
        End Do
      End Do
      !$OMP END PARALLEL DO
    End block outer
  end Procedure
End SubModule Drop

Program Test
  use Mod_Type
  use Mod_Direct
  Implicit none
  Type(TestType) :: TST
  integer :: dim=4876565, dim3=500, c1
  real*8, target, allocatable :: rmtmp(:,:)
  real*8 :: t0, t1
  Character(len=10) :: time
  !$ call omp_set_nested(.TRUE.)
  Allocate(TST%ivi(dim),TST%ivs(dim),TST%ivd(dim),TST%rvp(dim))
  open(55,file="input.txt",action="read")
  Do c1=1,dim
    read(55,*) TST%ivi(c1),tst%ivs(c1),tst%ivd(c1),tst%rvp(c1)
  end Do
  tst%isn=maxval(tst%ivi)
  Allocate(rmtmp(0:tst%isn,dim3),source=0.0D0)
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  write(*,*) is_contiguous(rmtmp)
  TST%RMInOut=>rmtmp
  write(*,*) is_contiguous(TST%RMInOut)
  call date_and_time(time=time)
  write(*,*) time
  call TST%drop()
  call date_and_time(time=time)
  write(*,*) time
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  call date_and_time(time=time)
  write(*,*) time
  call SubDrop(ISUBound1=Int(tst%isn,8),ISUBound2=Int(dim3,8),rminout&
     &=rmtmp,ivi=tst%ivi,ivs=tst%ivs,ivd=tst%ivd,rvp=tst%rvp)
  call date_and_time(time=time)
  write(*,*) time
End Program Test
Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code. See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.4.196 Build 20170411
Compiler options: -O3 -qopenmp -static -qopt-report=4

Report from: Interprocedural optimizations [ipo]

WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
  "sz" refers to the "size" of the routine. The smaller a routine's size, the more likely it is to be inlined.
  "isz" refers to the "inlined size" of the routine. This is the amount the calling routine will grow if the called routine is inlined into it. The compiler generally limits the amount a routine can grow by having routines inlined into it.
Begin optimization report for: TEST

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (TEST) [1/6=16.7%] Test.f90(98,9)
  -> EXTERN: (98,9) for_set_reentrancy
  -> EXTERN: (107,11) omp_set_nested_
  -> EXTERN: (108,3) for_alloc_allocatable
  -> EXTERN: (108,3) for_check_mult_overflow64
  -> EXTERN: (108,3) for_alloc_allocatable
  -> EXTERN: (108,3) for_check_mult_overflow64
  -> EXTERN: (108,3) for_alloc_allocatable
  -> EXTERN: (108,3) for_check_mult_overflow64
  -> EXTERN: (108,3) for_alloc_allocatable
  -> EXTERN: (108,3) for_check_mult_overflow64
  -> EXTERN: (110,3) for_open
  -> EXTERN: (112,5) for_read_seq_lis_xmit
  -> EXTERN: (112,5) for_read_seq_lis_xmit
  -> EXTERN: (112,5) for_read_seq_lis_xmit
  -> EXTERN: (112,5) for_read_seq_lis
  -> EXTERN: (115,3) for_alloc_allocatable
  -> EXTERN: (115,3) for_check_mult_overflow64
  -> EXTERN: (117,3) for_write_seq_lis
  -> EXTERN: (119,3) for_write_seq_lis
  -> EXTERN: (120,8) for_date_and_time
  -> EXTERN: (121,3) for_write_seq_lis
  -> (122,8) MOD_TYPE^SUBDROP (isz = 261) (sz = 268) [[ Unable to inline callsite ]]
  -> EXTERN: (123,8) for_date_and_time
  -> EXTERN: (124,3) for_write_seq_lis
  -> EXTERN: (126,8) for_date_and_time
  -> EXTERN: (127,3) for_write_seq_lis
  -> (128,8) SUBDROP (isz = 199) (sz = 218) [[ Unable to inline callsite ]]
  -> EXTERN: (130,8) for_date_and_time
  -> EXTERN: (131,3) for_write_seq_lis

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at Test.f90(111,3)
   remark #15382: vectorization support: call to function for_read_seq_lis cannot be vectorized   [ Test.f90(112,5) ]
   remark #15382: vectorization support: call to function for_read_seq_lis_xmit cannot be vectorized   [ Test.f90(112,5) ]
   remark #15382: vectorization support: call to function for_read_seq_lis_xmit cannot be vectorized   [ Test.f90(112,5) ]
   remark #15382: vectorization support: call to function for_read_seq_lis_xmit cannot be vectorized   [ Test.f90(112,5) ]
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed OUTPUT dependence between at (112:5) and at (112:5)
LOOP END

LOOP BEGIN at Test.f90(114,11)
   remark #15388: vectorization support: reference TST(:) has aligned access
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15309: vectorization support: normalized vectorization overhead 0.528
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 6
   remark #15477: vector cost: 4.500
   remark #15478: estimated potential speedup: 1.300
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Test.f90(114,11)
<Remainder loop for vectorization>
LOOP END

LOOP BEGIN at Test.f90(115,12)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at Test.f90(115,12)
      remark #25408: memset generated
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at Test.f90(115,12)
         remark #15389: vectorization support: reference RMTMP(:,:) has unaligned access
         remark #15381: vectorization support: unaligned access used inside loop body
         remark #15305: vectorization support: vector length 2
         remark #15399: vectorization support: unroll factor set to 2
         remark #15309: vectorization support: normalized vectorization overhead 0.300
         remark #15300: LOOP WAS VECTORIZED
         remark #15451: unmasked unaligned unit stride stores: 1
         remark #15475: --- begin vector cost summary ---
         remark #15476: scalar cost: 4
         remark #15477: vector cost: 2.500
         remark #15478: estimated potential speedup: 1.450
         remark #15488: --- end vector cost summary ---
         remark #25015: Estimate of max trip count of loop=3
      LOOP END

      LOOP BEGIN at Test.f90(115,12)
      <Remainder loop for vectorization>
         remark #25015: Estimate of max trip count of loop=12
      LOOP END
   LOOP END
LOOP END

Report from: Code generation optimizations [cg]

Test.f90(98,9):remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (32, 0), and destination (alignment, offset): (16, 0)
Test.f90(115,12):remark #34014: optimization advice for memset: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
Test.f90(115,12):remark #34026: call to memset implemented as a call to optimized library version
Test.f90(118,3):remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (32, 0), and destination (alignment, offset): (16, 0)
Test.f90(98,9):remark #34051: REGISTER ALLOCATION : [MAIN__] Test.f90:98

    Hardware registers
        Reserved    :  2[ rsp rip]
        Available   : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save :  6[ rbx rbp r12-r15]
        Assigned    : 30[ rax rdx rcx rbx rsi rdi r8-r15 zmm0-zmm15]

    Routine temporaries
        Total       : 430
        Global      :  58
        Local       : 372
        Regenerable : 179
        Spilled     :   7

    Routine stack
        Variables   : 894 bytes*
            Reads   :  31 [6.34e+01 ~ 4.3%]
            Writes  :  91 [1.51e+02 ~ 10.2%]
        Spills      :  16 bytes*
            Reads   :   2 [5.00e+00 ~ 0.3%]
            Writes  :   2 [3.85e+00 ~ 0.3%]

    Notes
        *Non-overlapping variables and spills may share stack space, so the total stack size might be less than this.
===========================================================================

Begin optimization report for: MOD_TYPE^SUBDROP

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (MOD_TYPE^SUBDROP) [2/6=33.3%] Test.f90(63,1)
  -> EXTERN: (72,19) omp_set_num_threads
  -> EXTERN: (74,19) omp_set_num_threads
  -> EXTERN: (77,15) omp_get_max_threads
  -> EXTERN: (79,19) omp_set_num_threads
  -> EXTERN: (81,19) omp_set_num_threads

Report from: OpenMP optimizations [openmp]

Test.f90(84:13-84:13):OMP:mod_type_mp_subdrop_: OpenMP DEFINED LOOP WAS PARALLELIZED

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at Test.f90(69,7)
   remark #15329: vectorization support: non-unit strided store was emulated for the variable <at (69:7)>, stride is unknown to compiler
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15300: LOOP WAS VECTORIZED
   remark #15453: unmasked strided stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 4
   remark #15477: vector cost: 3.000
   remark #15478: estimated potential speedup: 1.320
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Test.f90(69,7)
<Remainder loop for vectorization>
LOOP END

LOOP BEGIN at Test.f90(85,7)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between this(c2) (87:11) and this(this(c2),c1) (90:11)

   LOOP BEGIN at Test.f90(86,9)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed ANTI dependence between this(c2) (87:11) and this(this(c2),c1) (90:11)
   LOOP END
LOOP END

Report from: Code generation optimizations [cg]

Test.f90(63,1):remark #34051: REGISTER ALLOCATION : [mod_type_mp_subdrop_] Test.f90:63

    Hardware registers
        Reserved    :  2[ rsp rip]
        Available   : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save :  6[ rbx rbp r12-r15]
        Assigned    : 16[ rax rdx rcx rbx rbp rsi rdi r8-r15 zmm0]

    Routine temporaries
        Total       : 177
        Global      :  49
        Local       : 128
        Regenerable :  41
        Spilled     :   3

    Routine stack
        Variables   :  88 bytes*
            Reads   :   5 [2.97e-01 ~ 0.0%]
            Writes  :   8 [2.51e+01 ~ 2.2%]
        Spills      :  72 bytes*
            Reads   :  15 [1.76e+01 ~ 1.5%]
            Writes  :  15 [1.93e+01 ~ 1.7%]

    Notes
        *Non-overlapping variables and spills may share stack space, so the total stack size might be less than this.

===========================================================================

Begin optimization report for: SUBDROP

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (SUBDROP) [3/6=50.0%] Test.f90(3,14)
  -> EXTERN: (17,19) omp_set_num_threads
  -> EXTERN: (19,19) omp_set_num_threads
  -> EXTERN: (22,15) omp_get_max_threads
  -> EXTERN: (24,19) omp_set_num_threads
  -> EXTERN: (26,19) omp_set_num_threads

Report from: OpenMP optimizations [openmp]

Test.f90(29:13-29:13):OMP:mod_direct_mp_subdrop_: OpenMP DEFINED LOOP WAS PARALLELIZED

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at Test.f90(14,7)
   remark #15329: vectorization support: non-unit strided store was emulated for the variable <RMINOUT(0,:)>, stride is unknown to compiler
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15300: LOOP WAS VECTORIZED
   remark #15453: unmasked strided stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 4
   remark #15477: vector cost: 3.000
   remark #15478: estimated potential speedup: 1.320
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Test.f90(14,7)
<Remainder loop for vectorization>
LOOP END

LOOP BEGIN at Test.f90(30,7)
<Multiversioned v1>
   remark #25233: Loop multiversioned for stride tests on Assumed shape arrays
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between rvp(c2) (35:11) and rminout(ivi(c2),c1) (35:11)

   LOOP BEGIN at Test.f90(31,9)
      remark #25084: Preprocess Loopnests: Moving Out Store   [ Test.f90(34,11) ]
      remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed ANTI dependence between rvp(c2) (35:11) and rminout(ivi(c2),c1) (35:11)
   LOOP END
LOOP END

LOOP BEGIN at Test.f90(30,7)
<Multiversioned v2>
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning

   LOOP BEGIN at Test.f90(31,9)
      remark #25084: Preprocess Loopnests: Moving Out Store   [ Test.f90(34,11) ]
      remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed ANTI dependence between rminout(dd,c1) (35:11) and rminout(ivi(c2),c1) (35:11)
   LOOP END
LOOP END

Report from: Code generation optimizations [cg]

Test.f90(3,14):remark #34051: REGISTER ALLOCATION : [mod_direct_mp_subdrop_] Test.f90:3

    Hardware registers
        Reserved    :  2[ rsp rip]
        Available   : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save :  6[ rbx rbp r12-r15]
        Assigned    : 16[ rax rdx rcx rbx rbp rsi rdi r8-r15 zmm0]

    Routine temporaries
        Total       : 210
        Global      :  61
        Local       : 149
        Regenerable :  60
        Spilled     :  11

    Routine stack
        Variables   : 296 bytes*
            Reads   :   7 [2.97e-01 ~ 0.0%]
            Writes  :  27 [4.41e+01 ~ 5.3%]
        Spills      : 136 bytes*
            Reads   :  26 [4.19e+01 ~ 5.0%]
            Writes  :  23 [2.24e+01 ~ 2.7%]

    Notes
        *Non-overlapping variables and spills may share stack space, so the total stack size might be less than this.

===========================================================================

Begin optimization report for: mod_direct._

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (mod_direct._) [4/6=66.7%] Test.f90(1,8)

Report from: Code generation optimizations [cg]

Test.f90(1,8):remark #34051: REGISTER ALLOCATION : [mod_direct._] Test.f90:1

    Hardware registers
        Reserved    :  2[ rsp rip]
        Available   : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save :  6[ rbx rbp r12-r15]
        Assigned    :  0[ reg_null]

    Routine temporaries
        Total       :   6
        Global      :   0
        Local       :   6
        Regenerable :   0
        Spilled     :   0

    Routine stack
        Variables   :   0 bytes*
            Reads   :   0 [0.00e+00 ~ 0.0%]
            Writes  :   0 [0.00e+00 ~ 0.0%]
        Spills      :   0 bytes*
            Reads   :   0 [0.00e+00 ~ 0.0%]
            Writes  :   0 [0.00e+00 ~ 0.0%]

    Notes
        *Non-overlapping variables and spills may share stack space, so the total stack size might be less than this.

===========================================================================

Begin optimization report for: mod_type._

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (mod_type._) [5/6=83.3%] Test.f90(43,8)

Report from: Code generation optimizations [cg]

Test.f90(43,8):remark #34051: REGISTER ALLOCATION : [mod_type._] Test.f90:43

    Hardware registers
        Reserved    :  2[ rsp rip]
        Available   : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save :  6[ rbx rbp r12-r15]
        Assigned    :  0[ reg_null]

    Routine temporaries
        Total       :   6
        Global      :   0
        Local       :   6
        Regenerable :   0
        Spilled     :   0

    Routine stack
        Variables   :   0 bytes*
            Reads   :   0 [0.00e+00 ~ 0.0%]
            Writes  :   0 [0.00e+00 ~ 0.0%]
        Spills      :   0 bytes*
            Reads   :   0 [0.00e+00 ~ 0.0%]
            Writes  :   0 [0.00e+00 ~ 0.0%]

    Notes
        *Non-overlapping variables and spills may share stack space, so the total stack size might be less than this.

===========================================================================

Begin optimization report for: mod_type.drop._

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (mod_type.drop._) [6/6=100.0%] Test.f90(62,21)

Report from: Code generation optimizations [cg]

Test.f90(62,21):remark #34051: REGISTER ALLOCATION : [mod_type.drop._] Test.f90:62

    Hardware registers
        Reserved    :  2[ rsp rip]
        Available   : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save :  6[ rbx rbp r12-r15]
        Assigned    :  0[ reg_null]

    Routine temporaries
        Total       :   6
        Global      :   0
        Local       :   6
        Regenerable :   0
        Spilled     :   0

    Routine stack
        Variables   :   0 bytes*
            Reads   :   0 [0.00e+00 ~ 0.0%]
            Writes  :   0 [0.00e+00 ~ 0.0%]
        Spills      :   0 bytes*
            Reads   :   0 [0.00e+00 ~ 0.0%]
            Writes  :   0 [0.00e+00 ~ 0.0%]

    Notes
        *Non-overlapping variables and spills may share stack space, so the total stack size might be less than this.

===========================================================================
Apparently (not knowing your original version), you have inadvertently prevented the compiler from assuming that rvp and rminout don't overlap. If those were originally separate dummy arguments, the rules of Fortran would require them not to overlap, and the compiler could take advantage of that. You could try throwing !$omp simd on the inner loop to persuade it to ignore the possible overlap.
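A sketch of that suggestion applied to the inner loop of the type-bound version (note: !$omp simd asserts independence to the compiler, so it is only correct if the index vectors guarantee that no inner-loop iteration reads an element another iteration writes):

```fortran
!$OMP PARALLEL DO PRIVATE(ss,dd,ii,c2)
Do c1 = 1, size(this%RMInOut,2)
  ! Tell the compiler the c2 iterations are independent, which
  ! suppresses the "assumed ANTI dependence" diagnostics above.
  !$OMP SIMD PRIVATE(ss,dd,ii)
  Do c2 = 1, size(this%ivi,1)
    ss = this%ivs(c2)
    dd = this%ivd(c2)
    ii = this%ivi(c2)
    this%RMInOut(ii,c1) = this%RMInOut(ii,c1) &
         + this%rvp(c2)*(this%RMInOut(ss,c1)+this%RMInOut(dd,c1))
  End Do
  !$OMP END SIMD
End Do
!$OMP END PARALLEL DO
```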
Hi. Thanks for the reply.
I tried to read up on simd, but from what I understood it is about loop vectorization. Could you specify how to use it to get the compiler to assume independence between the two arrays?
Thanks
Hi, another question: which part of the compiler output reveals the problem you mentioned?
Thanks
Hi,
For the OpenMP version, you are missing the variable ii from the private clauses on lines 29 and 84, which for some reason impacts the performance of the type bound version more than the direct one. But since the results are presumably incorrect unless ii is declared private, this is unimportant.
With ii in the private clause, speeds are comparable (and much faster).
Another difference is that the dummy argument of SubDrop in the direct version is an adjustable array, whereas the derived type component is a pointer. In some circumstances, pointers may be harder to optimize because of the possibilities of aliasing or non-unit strides. That doesn't look to be the case here, though.
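For clarity, a corrected form of the directive (same loop as in the posted code, with ii added to the private list):

```fortran
! ii must be private: if it is shared, all threads race on a single
! variable, which corrupts results and also hurts performance.
!$OMP PARALLEL DO PRIVATE(ss,dd,ii,c2)
Do c1 = 1, size(RMInOut,2)
  Do c2 = 1, size(ivi,1)
    ss = ivs(c2); dd = ivd(c2); ii = ivi(c2)
    RMInOut(ii,c1) = RMInOut(ii,c1) + rvp(c2)*(RMInOut(ss,c1)+RMInOut(dd,c1))
  End Do
End Do
!$OMP END PARALLEL DO
```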
Incidentally, Intel Inspector is an excellent tool for finding errors like this in threaded programs. Spotting them "by hand" (or by eye?) gets harder as the programs get bigger.
Thanks Martyn. Luckily, the bug was only introduced when putting the example together.
However, my problem is not really solved. The two blocks of code below do essentially the same thing, but the second one spares me from writing extra routines. In the first block, the compiler knows that all involved arrays have unique memory addresses. In the second block, the compiler does not know where the pointers point, which may affect optimization. Is there any way to tell the compiler that all pointers involved in the addition point to different memory locations?
Module TestModule
  Type :: TestType
    Real, Allocatable, dimension(:,:) :: a, b, c, d, e
  contains
    Procedure, Pass :: adab => subadab
    Procedure, Pass :: adcd => subadcd
  End type TestType
contains
  Subroutine Subadab(this)
    Class(testtype), intent(inout) :: this
    this%e=this%a+this%b
  End Subroutine Subadab
  Subroutine Subadcd(this)
    Class(testtype), intent(inout) :: this
    this%e=this%c+this%d
  End Subroutine Subadcd
End Module TestModule

Program Test
  Use Testmodule
  Type(testtype) :: tt
  call tt%adab()
  call tt%adcd()
End Program Test
Module TestModule
  Type :: TestType
    Real, Allocatable :: a(:,:), b(:,:), c(:,:), d(:,:), e(:,:)
    Real, Pointer, contiguous :: inA(:,:), inB(:,:), out(:,:)
  contains
    Procedure, Pass :: ad => subad
  End type TestType
contains
  Subroutine Subad(this)
    Class(testtype), intent(inout) :: this
    this%out=this%inA+this%inB
  End Subroutine Subad
End Module TestModule

Program Test
  Use Testmodule
  Type(testtype), Target :: tt
  tt%out=>tt%e
  tt%inA=>tt%a; tt%inB=>tt%b
  call tt%ad()
  tt%out=>tt%e
  tt%inA=>tt%c; tt%inB=>tt%d
  call tt%ad()
End Program Test
Maybe Tim P. has already answered this question above and I didn't get it. Any comment is much appreciated.
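One common way to give the compiler that guarantee (a sketch, not from the thread: the helper SubAddKernel is a hypothetical name) is to forward the pointer components to an inner routine as separate dummy arguments; Fortran's argument-aliasing rules then let the compiler assume the arrays are distinct:

```fortran
Module TestModule
  Type :: TestType
    Real, Allocatable :: a(:,:), b(:,:), c(:,:), d(:,:), e(:,:)
    Real, Pointer, contiguous :: inA(:,:), inB(:,:), out(:,:)
  contains
    Procedure, Pass :: ad => subad
  End type TestType
contains
  Subroutine Subad(this)
    Class(testtype), intent(inout) :: this
    ! Forward the components as dummy arguments: inside SubAddKernel
    ! the standard forbids x, y and z from overlapping, so the compiler
    ! may optimize without the aliasing checks that a direct
    ! "this%out = this%inA + this%inB" through pointers requires.
    call SubAddKernel(this%inA, this%inB, this%out)
  End Subroutine Subad

  Subroutine SubAddKernel(x, y, z)
    Real, intent(in),  contiguous :: x(:,:), y(:,:)
    Real, intent(out), contiguous :: z(:,:)
    z = x + y
  End Subroutine SubAddKernel
End Module TestModule
```

The cost is one extra (tiny) routine per kernel, but the pointer juggling in the caller stays unchanged.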
Try
!dir$ vector always
this%out=this%inA+this%inB
Jim Dempsey
Hi. Thanks for the quick response. But the -qopt-report=5 report shows hardly any difference with or without !dir$ vector always.
Thanks
Did you examine the assembly code?
Advisor and Amplifier each have a button for expanding the statement to assembly.
If the code is vectorized you should see instructions ending in ps (e.g. vmovaps, vmovups, vaddps); ps == packed single, pd == packed double. If you see instructions ending in ss (or sd), these are non-vector scalar instructions.
Jim Dempsey
Vector always doesn't suspend dependency analysis, as ivdep and simd directives do. Omp simd is the most portable of those.
>>Vector always doesn't suspend dependency analysis
Then the directive is misnamed (and the documentation is incomplete).
!dir$ vector, without always, should not suspend dependency analysis; for code determined to be non-dependent, it should vectorize even loops that do not look efficient for vectorization to the compiler.
I can see using !dir$ simd, but not necessarily !$omp simd... because the code may not necessarily be compiled with -openmp
IMHO !$omp simd, when used when compiling for OpenMP, should not only invoke simd code, but additionally enforce partitioning of the loop at cache line boundaries (or at least at SIMD boundaries).
Jim Dempsey
What I see in your new examples is that if inlining is disabled with -fno-inline, the loops get vectorized whether or not they are pointers. If inlining is allowed, then for the second (pointer) example, two versions are created for the loop: one (not vectorized) for when the pointees overlap, and one (vectorized) for when they do not. So in this simple case, pointers are not really hurting. If the (implied) loops were much more complex, e.g. involving several more pointers, the compiler might not be able to generate multiple loop versions, and you might need to help with a directive as suggested. In the first example, with allocatable arrays, the loop is not vectorized at all if it is inlined. I don't know why this is.
!DIR$ VECTOR ALWAYS simply overrides the vectorizer’s internal cost model. From the Reference Guide:
“The ALWAYS clause overrides efficiency heuristics of the vectorizer, but it only works if the loop can actually be vectorized … You should use the IVDEP directive to ignore assumed dependences.”
The IVDEP directive asserts that there are no backward potential dependencies (forward dependencies make threading unsafe, but not vectorization). If the compiler is able to prove such a dependency, it still will not vectorize.
With !$OMP SIMD the compiler may not do any dependency analysis. It will vectorize, even if a dependency could be proven, but then you’d likely get incorrect results. Note that -qopenmp-simd activates OMP SIMD directives without enabling the threading part of OpenMP. Directives such as !DIR$ SIMD mentioned by Jim are part of Intel Cilk Plus and are deprecated, since OpenMP now has comparable functionality.
Note that the directives !DIR$ IVDEP and !$OMP SIMD cannot be applied to array assignments such as in your code. They would have to be written as DO loops – which is what you’re trying to avoid, because then you’d need to extract the upper and lower array bounds. Perhaps this may be addressed in a future compiler.
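If one does bite the bullet and writes the assignment as DO loops, the bounds need not be passed explicitly: with POINTER dummy arguments the original lower and upper bounds travel with the argument and can be recovered via lbound/ubound. A hedged sketch (routine name and shapes are invented, not from this thread):

```fortran
! Sketch only: pointer dummies preserve the actual bounds (e.g. a
! lower bound of 0), so the loop limits come from lbound/ubound.
! Note an assumed-shape (non-pointer) dummy would re-base at 1.
subroutine add_loops(out, inA, inB)
  implicit none
  real, pointer, contiguous, intent(inout) :: out(:,:)
  real, pointer, contiguous, intent(in)    :: inA(:,:), inB(:,:)
  integer :: i, j
  do j = lbound(out,2), ubound(out,2)
    !$omp simd
    do i = lbound(out,1), ubound(out,1)
      out(i,j) = inA(i,j) + inB(i,j)
    end do
  end do
end subroutine add_loops
```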
Note also that your 2D array assignments represent two nested loops. If a directive is placed before such an assignment, there is an ambiguity as to which of the nested loops it should apply to.
I'm not sure why the inlined loop is not vectorized in your allocatable array example, so I've escalated that to the compiler developers.
Thanks for all the comments, but I may have oversimplified the example. An example getting closer to my real world problem is this:
Module TestModule
  Type :: TestType
    Real, Allocatable :: a(:,:), b(:,:), c(:,:), d(:,:), e(:,:)
    Real, Pointer, contiguous :: inA(:,:), inB(:,:), out(:,:)
  contains
    Procedure, Pass :: ad => subad
    Procedure, Pass :: adab => subadab
    Procedure, Pass :: adcd => subadcd
  End type TestType
  Interface
    Module Subroutine Subad(this)
      Class(testtype), intent(inout) :: this
    End Subroutine
    Module Subroutine Subadab(this)
      Class(testtype), intent(inout), Target :: this
    End Subroutine
    Module Subroutine Subadcd(this)
      Class(testtype), intent(inout), Target :: this
    End Subroutine
  End Interface
End Module TestModule

SubModule(TestModule) add
contains
  Module Procedure SubAdab
    this%out => this%e
    this%inA => this%a; this%inB => this%b
    call this%ad()
  End Procedure
  Module Procedure SubAdcd
    this%out => this%e
    this%inA => this%c; this%inB => this%d
    call this%ad()
  End Procedure
  Module Procedure Subad
    this%out = this%inA + this%inB
  End Procedure
End SubModule add

Program Test
  Use Testmodule
  Type(testtype), Target :: tt
  call tt%adab()
  call tt%adcd()
End Program Test
The idea is that the operations in "subad" do not change for most of the child classes, but the arrays used change in the parent class as well as in all child classes. Thus "subad" is nicely inherited and only the calling subroutines are overridden. This saves me from writing and maintaining several thousand lines of code that differ only in the arrays used. The key difference to the second example in #10 seems to be the compile-time output "[[ Unable to inline indirect callsite <1>]]". In my real-world application the test program itself is not an executable but a routine, and everything sits in a static library. When I moved from the calling sequence in #10 to this calling sequence I lost about 15% in speed.
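One workaround sometimes used when an indirect (type-bound) callsite cannot be inlined is to keep the thin type-bound wrappers for array selection but put the arithmetic in a plain module routine that receives the arrays as explicit arguments, so the hot call is direct. A hedged sketch, with hypothetical names not taken from the example above; whether this recovers the lost 15% depends on the compiler's interprocedural optimization:

```fortran
! Hypothetical sketch: the type-bound method only selects which
! arrays to use; the computation lives in a plain routine that the
! compiler sees as a direct (potentially inlinable) call.
module worker_mod
contains
  subroutine add_worker(out, inA, inB)
    implicit none
    real, contiguous, intent(out) :: out(:,:)
    real, contiguous, intent(in)  :: inA(:,:), inB(:,:)
    out = inA + inB
  end subroutine add_worker
end module worker_mod
```

The type-bound "subad" would then reduce to `call add_worker(this%out, this%inA, this%inB)`.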
Thanks
I initialized the arrays in all three examples above with
integer :: i=1000000, j=1000
Allocate(tt%a(i,j), tt%b(i,j), tt%c(i,j), tt%d(i,j), tt%e(i,j), source=0.0)
When compiling with -O3 I get the following timings:
ifort: #1 4.4sec, #2 5.9sec, #3 5.9sec
gfortran: #1 4.4sec, #2 4.4sec, #3 4.4sec
Martyn and May,
I recently had an issue where a 2D array access failed to vectorize because the array slice indices were not explicitly specified. In my case I had something like:
real(8) :: temp(6),a(6,6),b(6,6)
...
temp = a(:,j) * b(:,j)
Using compiler directives would not get this to vectorize. On a whim, can't recall why, I changed the line to
do j=1,6
...
temp = a(1:6,j) * b(1:6,j)
and this finally vectorized. This was on KNL.
May, try adding
integer, parameter :: iDim = 1000000, jDim=1000
Use those in your allocations (or place into TestType as variables). Then use
this%out(1:iDim,1:jDim) = this%inA(1:iDim,1:jDim) + this%inB(1:iDim,1:jDim)
Jim Dempsey
Hi Jim,
thanks for looking into this. I tried
this%out(1:size(this%Out,1),1:size(this%Out,2)) = &
    this%inA(1:size(this%Out,1),1:size(this%Out,2)) + &
    this%inB(1:size(this%Out,1),1:size(this%Out,2))
but that did nothing to the processing time differences I found above.
Cheers