Intel® Fortran Compiler

speed difference contiguous pointers vs call_by_reference

may_ka
Beginner

Hi there,

in the course of modernizing some of my code I have moved many "call by reference" routines, which were previously stand-alone, behind types. In addition, arrays are now passed using type-bound pointers, usually with the "contiguous" attribute. This has the big advantage that array bounds no longer have to be passed explicitly (many of my arrays start at index value zero). However, I have noticed a speed difference, to the extent that the type-bound routines need up to twice as much time to process large arrays as the direct routines.

The program below mimics the above structure. It implements the first part of an implicit multiplication of a 4.5 million x 100 matrix with a 4.8 million x 4.8 million structured sparse covariance matrix (the latter matrix can be stored in the form of four vectors, which are held by the type). This routine needs about 4.8 seconds when called directly and about 6.6 seconds when called through the type. That is not a big difference in absolute numbers, but it adds up when the routine is called several thousand times. Given that the type transports the array into the routine via a pointer with the "contiguous" attribute, this difference in speed should not appear; then again, maybe I have misunderstood the standard. The speed was measured on an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 processors. The compiler was ifort 17.4. The data set can be supplied on request.

Any ideas?

Thanks a lot

Module Mod_Direct
contains
  Subroutine SubDrop(ISUBound1,ISUBound2,RMInOut,ivi,ivs&
    &,ivd,rvp,ISNThreads)
    !$ use omp_lib
    Implicit None
    Integer*8, Intent(In) :: ISUbound1, ISUBound2
    Real*8, Intent(InOut) :: RMInOut(0:ISUbound1,1:ISUBound2)
    Integer*8, Intent(In) :: ivs(:), ivd(:), ivi(:)
    Real*8, Intent(In) :: RVp(:)
    Integer*4, intent(in), optional :: ISNThreads
    Integer*8 :: c1, c2, ss, dd, ii
    outer:block
      RMInOut(0,:)=0.0D00
      !$ if(present(ISNThreads)) Then
      !$   if(ISNThreads>size(RMInOUt,2)) Then
      !$     call omp_set_num_threads(size(RMInOut,2))
      !$   else
      !$     call omp_set_num_threads(ISNThreads)
      !$   End if
      !$ else
      !$   c1=omp_get_max_threads()
      !$   if(c1>size(RMInout,2)) Then
      !$     call omp_set_num_threads(size(RMInout,2))
      !$   else
      !$     call omp_set_num_threads(int(c1,4))
      !$   End if
      !$ end if
      !$OMP PARALLEL DO PRIVATE(ss,dd,c2,c1)
      Do c1=1,size(RMInOut,2)
        Do c2=1,size(IVI,1)
          ss=ivs(c2)
          dd=ivd(c2)
          ii=ivi(c2)
          RMInOut(ii,c1)=RMInOut(ii,c1)+rvp(c2)*(RMInOut(ss,c1)&
            &+RMInOut(dd,c1))
        End Do
      End Do
      !$OMP END PARALLEL DO
    End block outer
  End Subroutine SubDrop
end Module Mod_Direct
Module Mod_Type
  Type :: Testtype
    Integer*8, Allocatable :: ivi(:), ivs(:), ivd(:)
    Integer*8 :: isn
    Integer*4 :: ISSubStat
    Real*8, Allocatable :: rvp(:)
    Real*8, Pointer, contiguous :: RMInout(:,:)
    Character(:), allocatable :: csmsg
  contains
    procedure, pass, public :: drop=>subdrop
  End type Testtype
  Interface
    Module Subroutine SubDrop(this,ISNThreads)
      Class(TestType) :: this
      Integer*4, optional :: ISNThreads
    end Subroutine
  End Interface
  Private :: SubDrop
end Module Mod_Type
SubModule(Mod_Type) Drop
contains
  Module Procedure SubDrop
  !$ use omp_lib
    Implicit None
    Integer*8 :: c1, c2, ss, dd, ii
    outer:block
      if(.not.associated(this%RMInOut)) Then
        this%CSMSG="ERROR"
        this%ISSubStat=1;exit outer
      end if
      if(lbound(this%RMInOut,1)/=0) Then
        this%CSMSG="ERROR"
        this%ISSubStat=1;exit outer
      End if
      if(ubound(this%RMInOut,1)/=this%isn) Then
        this%CSMSG="ERROR"
        this%ISSubStat=1;exit outer
      End if
      this%RMInOut(0,:)=0.0D0
      !$ if(present(ISNThreads)) Then
      !$   if(ISNThreads>size(this%RMInOUt,2)) Then
      !$     call omp_set_num_threads(size(this%RMInOut,2))
      !$   else
      !$     call omp_set_num_threads(ISNThreads)
      !$   End if
      !$ else
      !$   c1=omp_get_max_threads()
      !$   if(c1>size(this%RMInout,2)) Then
      !$     call omp_set_num_threads(size(this%RMInout,2))
      !$   else
      !$     call omp_set_num_threads(int(c1,4))
      !$   End if
      !$ end if
      !$OMP PARALLEL DO PRIVATE(ss,dd,c2,c1)
      Do c1=1,size(this%RMInOut,2)
        Do c2=1,size(this%ivi,1)
          ss=this%ivs(c2)
          dd=this%ivd(c2)
          ii=this%Ivi(c2)
          this%RMInOut(ii,c1)=this%RMInOut(ii,c1)+this%RVP(c2)&
            &*(this%RMInOut(ss,c1)+this%RMInOut(dd,c1))
        End Do
      End Do
      !$OMP END PARALLEL DO
    End block outer
  end Procedure
End SubModule Drop
Program Test
  use Mod_Type
  use Mod_Direct
  Implicit none
  Type(TestType) :: TST
  integer :: dim=4876565, dim3=100, c1
  real*8, target, allocatable :: rmtmp(:,:)
  real*8 :: t0, t1
  !$ call omp_set_nested(.TRUE.)
  Allocate(TST%ivi(dim),TST%ivs(dim),TST%ivd(dim),TST&
    &%rvp(dim))
  open(55,file="input.txt",action="read")
  Do c1=1,dim
    read(55,*) TST%ivi(c1),tst%ivs(c1),tst%ivd(c1),tst%rvp(c1)
  end Do
  tst%isn=maxval(tst%ivi)
  Allocate(rmtmp(0:tst%isn,dim3),source=0.0D0)
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  TST%RMInOut=>rmtmp
  call TST%drop()
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  !!call SubDrop(ISUBound1=Int(tst%isn,8),ISUBound2=Int(dim3,8),RMInout&
  !!  &=rmtmp,ivi=tst%ivi,ivs=tst%ivs,ivd=tst%ivd,rvp=tst%rvp)
End Program Test

 

TimP
Honored Contributor III

Were you using default integer indexing previously? Could you show the comparison reports with -qopt-report=4?

The IBM360 integer*8 idiom hardly counts as modernization.

may_ka
Beginner

Hi,

thanks for the response. I only used the "*8" syntax for this example; in the "real world" I have a "selected real kind" function etc.

What does your first question mean?

I have the following additional observations: there is no speed difference when running on a single core, and the speed difference almost vanishes when "this%" is avoided inside the OMP loops by putting an "associate" construct around the loops.
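
Roughly, that workaround looks as follows (a sketch only, not the measured code; the loop temporaries are all listed as private here):

      ! Hoist the type components into local names with ASSOCIATE so that
      ! no this% dereference remains inside the hot OpenMP loop.
      associate(RM=>this%RMInOut,ivi=>this%ivi,ivs=>this%ivs,&
          &ivd=>this%ivd,rvp=>this%rvp)
        !$OMP PARALLEL DO PRIVATE(ss,dd,ii,c2,c1)
        Do c1=1,size(RM,2)
          Do c2=1,size(ivi,1)
            ss=ivs(c2); dd=ivd(c2); ii=ivi(c2)
            RM(ii,c1)=RM(ii,c1)+rvp(c2)*(RM(ss,c1)+RM(dd,c1))
          End Do
        End Do
        !$OMP END PARALLEL DO
      end associate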

I changed the code slightly before getting the optimization report, so the code behind the report is shown together with the report below. I did the compilation on an Intel(R) Core(TM) i7-6820HK CPU @ 2.70GHz with 8 processors.

thanks

Module Mod_Direct
contains
  Subroutine SubDrop(ISUBound1,ISUBound2,RMInOut,ivi,ivs&
    &,ivd,rvp,ISNThreads)
    !$ use omp_lib
    Implicit None
    Integer*8, Intent(In) :: ISUbound1, ISUBound2
    Real*8, Intent(InOut) :: RMInOut(0:ISUbound1,1:ISUBound2)
    Integer*8, Intent(In) :: ivs(:), ivd(:), ivi(:)
    Real*8, Intent(In) :: RVp(:)
    Integer*4, intent(in), optional :: ISNThreads
    Integer*8 :: c1, c2, ss, dd, ii
    outer:block
      RMInOut(0,:)=0.0D00
      !$ if(present(ISNThreads)) Then
      !$   if(ISNThreads>size(RMInOUt,2)) Then
      !$     call omp_set_num_threads(size(RMInOut,2))
      !$   else
      !$     call omp_set_num_threads(ISNThreads)
      !$   End if
      !$ else
      !$   c1=omp_get_max_threads()
      !$   if(c1>size(RMInout,2)) Then
      !$     call omp_set_num_threads(size(RMInout,2))
      !$   else
      !$     call omp_set_num_threads(int(c1,4))
      !$   End if
      !$ end if
      !$OMP PARALLEL DO PRIVATE(ss,dd,c2,c1)
      Do c1=1,size(RMInOut,2)
        Do c2=1,size(IVI,1)
          ss=ivs(c2)
          dd=ivd(c2)
          ii=ivi(c2)
          RMInOut(ii,c1)=RMInOut(ii,c1)+rvp(c2)*(RMInOut(ss,c1)&
            &+RMInOut(dd,c1))
        End Do
      End Do
      !$OMP END PARALLEL DO
    End block outer
  End Subroutine SubDrop
end Module Mod_Direct
Module Mod_Type
  Type :: Testtype
    Integer*8, Allocatable :: ivi(:), ivs(:), ivd(:)
    Integer*8 :: isn
    Integer*4 :: ISSubStat
    Real*8, Allocatable :: rvp(:)
    Real*8, Pointer, contiguous:: RMInout(:,:)
    Character(:), allocatable :: csmsg
  contains
    procedure, pass, public :: drop=>subdrop
  End type Testtype
  Interface
    Module Subroutine SubDrop(this,ISNThreads)
      Class(TestType) :: this
      Integer*4, optional :: ISNThreads
    end Subroutine
  End Interface
  Private :: SubDrop
end Module Mod_Type
SubModule(Mod_Type) Drop
contains
  Module Procedure SubDrop
  !$ use omp_lib
    Implicit None
    Integer*8 :: c1, c2, ss, dd, ii
    outer:block
      this%RMInOut(0,:)=0.0D0
      !$ if(present(ISNThreads)) Then
      !$   if(ISNThreads>size(this%RMInOUt,2)) Then
      !$     call omp_set_num_threads(size(this%RMInOut,2))
      !$   else
      !$     call omp_set_num_threads(ISNThreads)
      !$   End if
      !$ else
      !$   c1=omp_get_max_threads()
      !$   if(c1>size(this%RMInout,2)) Then
      !$     call omp_set_num_threads(size(this%RMInout,2))
      !$   else
      !$     call omp_set_num_threads(int(c1,4))
      !$   End if
      !$ end if
      !$OMP PARALLEL DO PRIVATE(ss,dd,c2,c1)
      Do c1=1,size(this%RMInOut,2)
        Do c2=1,size(this%ivi,1)
          ss=this%ivs(c2)
          dd=this%ivd(c2)
          ii=this%Ivi(c2)
          this%RMInOut(ii,c1)=this%RMInOut(ii,c1)+this%RVP(c2)&
            &*(this%RMInOut(ss,c1)+this%RMInOut(dd,c1))
        End Do
      End Do
      !$OMP END PARALLEL DO
    End block outer
  end Procedure
End SubModule Drop
Program Test
  use Mod_Type
  use Mod_Direct
  Implicit none
  Type(TestType) :: TST
  integer :: dim=4876565, dim3=500, c1
  real*8, target, allocatable :: rmtmp(:,:)
  real*8 :: t0, t1
  Character(len=10) :: time
  !$ call omp_set_nested(.TRUE.)
  Allocate(TST%ivi(dim),TST%ivs(dim),TST%ivd(dim),TST&
    &%rvp(dim))
  open(55,file="input.txt",action="read")
  Do c1=1,dim
    read(55,*) TST%ivi(c1),tst%ivs(c1),tst%ivd(c1),tst%rvp(c1)
  end Do
  tst%isn=maxval(tst%ivi)
  Allocate(rmtmp(0:tst%isn,dim3),source=0.0D0)
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  write(*,*) is_contiguous(rmtmp)
  TST%RMInOut=>rmtmp
  write(*,*) is_contiguous(TST%RMInOut)
  call date_and_time(time=time)
  write(*,*) time
  call TST%drop()
  call date_and_time(time=time)
  write(*,*) time
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  call date_and_time(time=time)
  write(*,*) time
  call SubDrop(ISUBound1=Int(tst%isn,8),ISUBound2=Int(dim3,8),rminout&
    &=rmtmp,ivi=tst%ivi,ivs=tst%ivs,ivd=tst%ivd,rvp=tst%rvp)
  call date_and_time(time=time)
  write(*,*) time
End Program Test

 

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.4.196 Build 20170411

Compiler options: -O3 -qopenmp -static -qopt-report=4

    Report from: Interprocedural optimizations [ipo]

  WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
  WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
  WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.
      The compiler generally limits the amount a routine can grow by having
      routines inlined into it.

Begin optimization report for: TEST

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (TEST) [1/6=16.7%] Test.f90(98,9)
  -> EXTERN: (98,9) for_set_reentrancy
  -> EXTERN: (107,11) omp_set_nested_
  -> EXTERN: (108,3) for_alloc_allocatable
  -> EXTERN: (108,3) for_check_mult_overflow64
  -> EXTERN: (108,3) for_alloc_allocatable
  -> EXTERN: (108,3) for_check_mult_overflow64
  -> EXTERN: (108,3) for_alloc_allocatable
  -> EXTERN: (108,3) for_check_mult_overflow64
  -> EXTERN: (108,3) for_alloc_allocatable
  -> EXTERN: (108,3) for_check_mult_overflow64
  -> EXTERN: (110,3) for_open
  -> EXTERN: (112,5) for_read_seq_lis_xmit
  -> EXTERN: (112,5) for_read_seq_lis_xmit
  -> EXTERN: (112,5) for_read_seq_lis_xmit
  -> EXTERN: (112,5) for_read_seq_lis
  -> EXTERN: (115,3) for_alloc_allocatable
  -> EXTERN: (115,3) for_check_mult_overflow64
  -> EXTERN: (117,3) for_write_seq_lis
  -> EXTERN: (119,3) for_write_seq_lis
  -> EXTERN: (120,8) for_date_and_time
  -> EXTERN: (121,3) for_write_seq_lis
  -> (122,8) MOD_TYPE^SUBDROP (isz = 261) (sz = 268)
     [[ Unable to inline callsite ]]
  -> EXTERN: (123,8) for_date_and_time
  -> EXTERN: (124,3) for_write_seq_lis
  -> EXTERN: (126,8) for_date_and_time
  -> EXTERN: (127,3) for_write_seq_lis
  -> (128,8) SUBDROP (isz = 199) (sz = 218)
     [[ Unable to inline callsite ]]
  -> EXTERN: (130,8) for_date_and_time
  -> EXTERN: (131,3) for_write_seq_lis


    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at Test.f90(111,3)
   remark #15382: vectorization support: call to function for_read_seq_lis cannot be vectorized   [ Test.f90(112,5) ]
   remark #15382: vectorization support: call to function for_read_seq_lis_xmit cannot be vectorized   [ Test.f90(112,5) ]
   remark #15382: vectorization support: call to function for_read_seq_lis_xmit cannot be vectorized   [ Test.f90(112,5) ]
   remark #15382: vectorization support: call to function for_read_seq_lis_xmit cannot be vectorized   [ Test.f90(112,5) ]
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed OUTPUT dependence between at (112:5) and at (112:5)
LOOP END

LOOP BEGIN at Test.f90(114,11)
   remark #15388: vectorization support: reference TST(:) has aligned access
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15309: vectorization support: normalized vectorization overhead 0.528
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 6 
   remark #15477: vector cost: 4.500 
   remark #15478: estimated potential speedup: 1.300 
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Test.f90(114,11)
<Remainder loop for vectorization>
LOOP END

LOOP BEGIN at Test.f90(115,12)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at Test.f90(115,12)
      remark #25408: memset generated
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at Test.f90(115,12)
         remark #15389: vectorization support: reference RMTMP(:,:) has unaligned access
         remark #15381: vectorization support: unaligned access used inside loop body
         remark #15305: vectorization support: vector length 2
         remark #15399: vectorization support: unroll factor set to 2
         remark #15309: vectorization support: normalized vectorization overhead 0.300
         remark #15300: LOOP WAS VECTORIZED
         remark #15451: unmasked unaligned unit stride stores: 1 
         remark #15475: --- begin vector cost summary ---
         remark #15476: scalar cost: 4 
         remark #15477: vector cost: 2.500 
         remark #15478: estimated potential speedup: 1.450 
         remark #15488: --- end vector cost summary ---
         remark #25015: Estimate of max trip count of loop=3
      LOOP END

      LOOP BEGIN at Test.f90(115,12)
      <Remainder loop for vectorization>
         remark #25015: Estimate of max trip count of loop=12
      LOOP END
   LOOP END
LOOP END

    Report from: Code generation optimizations [cg]

Test.f90(98,9):remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (32, 0), and destination (alignment, offset): (16, 0)
Test.f90(115,12):remark #34014: optimization advice for memset: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
Test.f90(115,12):remark #34026: call to memset implemented as a call to optimized library version
Test.f90(118,3):remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (32, 0), and destination (alignment, offset): (16, 0)
Test.f90(98,9):remark #34051: REGISTER ALLOCATION : [MAIN__] Test.f90:98

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :   30[ rax rdx rcx rbx rsi rdi r8-r15 zmm0-zmm15]
        
    Routine temporaries
        Total         :     430
            Global    :      58
            Local     :     372
        Regenerable   :     179
        Spilled       :       7
        
    Routine stack
        Variables     :     894 bytes*
            Reads     :      31 [6.34e+01 ~ 4.3%]
            Writes    :      91 [1.51e+02 ~ 10.2%]
        Spills        :      16 bytes*
            Reads     :       2 [5.00e+00 ~ 0.3%]
            Writes    :       2 [3.85e+00 ~ 0.3%]
    
    Notes
    
        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.
    

===========================================================================

Begin optimization report for: MOD_TYPE^SUBDROP

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (MOD_TYPE^SUBDROP) [2/6=33.3%] Test.f90(63,1)
  -> EXTERN: (72,19) omp_set_num_threads
  -> EXTERN: (74,19) omp_set_num_threads
  -> EXTERN: (77,15) omp_get_max_threads
  -> EXTERN: (79,19) omp_set_num_threads
  -> EXTERN: (81,19) omp_set_num_threads


    Report from: OpenMP optimizations [openmp]

Test.f90(84:13-84:13):OMP:mod_type_mp_subdrop_:  OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at Test.f90(69,7)
   remark #15329: vectorization support: non-unit strided store was emulated for the variable <at (69:7)>, stride is unknown to compiler
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15300: LOOP WAS VECTORIZED
   remark #15453: unmasked strided stores: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 4 
   remark #15477: vector cost: 3.000 
   remark #15478: estimated potential speedup: 1.320 
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Test.f90(69,7)
<Remainder loop for vectorization>
LOOP END

LOOP BEGIN at Test.f90(85,7)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between this(c2) (87:11) and this(this(c2),c1) (90:11)

   LOOP BEGIN at Test.f90(86,9)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed ANTI dependence between this(c2) (87:11) and this(this(c2),c1) (90:11)
   LOOP END
LOOP END

    Report from: Code generation optimizations [cg]

Test.f90(63,1):remark #34051: REGISTER ALLOCATION : [mod_type_mp_subdrop_] Test.f90:63

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :   16[ rax rdx rcx rbx rbp rsi rdi r8-r15 zmm0]
        
    Routine temporaries
        Total         :     177
            Global    :      49
            Local     :     128
        Regenerable   :      41
        Spilled       :       3
        
    Routine stack
        Variables     :      88 bytes*
            Reads     :       5 [2.97e-01 ~ 0.0%]
            Writes    :       8 [2.51e+01 ~ 2.2%]
        Spills        :      72 bytes*
            Reads     :      15 [1.76e+01 ~ 1.5%]
            Writes    :      15 [1.93e+01 ~ 1.7%]
    
    Notes
    
        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.
    

===========================================================================

Begin optimization report for: SUBDROP

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (SUBDROP) [3/6=50.0%] Test.f90(3,14)
  -> EXTERN: (17,19) omp_set_num_threads
  -> EXTERN: (19,19) omp_set_num_threads
  -> EXTERN: (22,15) omp_get_max_threads
  -> EXTERN: (24,19) omp_set_num_threads
  -> EXTERN: (26,19) omp_set_num_threads


    Report from: OpenMP optimizations [openmp]

Test.f90(29:13-29:13):OMP:mod_direct_mp_subdrop_:  OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at Test.f90(14,7)
   remark #15329: vectorization support: non-unit strided store was emulated for the variable <RMINOUT(0,:)>, stride is unknown to compiler
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15300: LOOP WAS VECTORIZED
   remark #15453: unmasked strided stores: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 4 
   remark #15477: vector cost: 3.000 
   remark #15478: estimated potential speedup: 1.320 
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Test.f90(14,7)
<Remainder loop for vectorization>
LOOP END

LOOP BEGIN at Test.f90(30,7)
<Multiversioned v1>
   remark #25233: Loop multiversioned for stride tests on Assumed shape arrays
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between rvp(c2) (35:11) and rminout(ivi(c2),c1) (35:11)

   LOOP BEGIN at Test.f90(31,9)
      remark #25084: Preprocess Loopnests: Moving Out Store    [ Test.f90(34,11) ]
      remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed ANTI dependence between rvp(c2) (35:11) and rminout(ivi(c2),c1) (35:11)
   LOOP END
LOOP END

LOOP BEGIN at Test.f90(30,7)
<Multiversioned v2>
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning

   LOOP BEGIN at Test.f90(31,9)
      remark #25084: Preprocess Loopnests: Moving Out Store    [ Test.f90(34,11) ]
      remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed ANTI dependence between rminout(dd,c1) (35:11) and rminout(ivi(c2),c1) (35:11)
   LOOP END
LOOP END

    Report from: Code generation optimizations [cg]

Test.f90(3,14):remark #34051: REGISTER ALLOCATION : [mod_direct_mp_subdrop_] Test.f90:3

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :   16[ rax rdx rcx rbx rbp rsi rdi r8-r15 zmm0]
        
    Routine temporaries
        Total         :     210
            Global    :      61
            Local     :     149
        Regenerable   :      60
        Spilled       :      11
        
    Routine stack
        Variables     :     296 bytes*
            Reads     :       7 [2.97e-01 ~ 0.0%]
            Writes    :      27 [4.41e+01 ~ 5.3%]
        Spills        :     136 bytes*
            Reads     :      26 [4.19e+01 ~ 5.0%]
            Writes    :      23 [2.24e+01 ~ 2.7%]
    
    Notes
    
        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.
    

===========================================================================

Begin optimization report for: mod_direct._

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (mod_direct._) [4/6=66.7%] Test.f90(1,8)


    Report from: Code generation optimizations [cg]

Test.f90(1,8):remark #34051: REGISTER ALLOCATION : [mod_direct._] Test.f90:1

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :    0[ reg_null]
        
    Routine temporaries
        Total         :       6
            Global    :       0
            Local     :       6
        Regenerable   :       0
        Spilled       :       0
        
    Routine stack
        Variables     :       0 bytes*
            Reads     :       0 [0.00e+00 ~ 0.0%]
            Writes    :       0 [0.00e+00 ~ 0.0%]
        Spills        :       0 bytes*
            Reads     :       0 [0.00e+00 ~ 0.0%]
            Writes    :       0 [0.00e+00 ~ 0.0%]
    
    Notes
    
        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.
    

===========================================================================

Begin optimization report for: mod_type._

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (mod_type._) [5/6=83.3%] Test.f90(43,8)


    Report from: Code generation optimizations [cg]

Test.f90(43,8):remark #34051: REGISTER ALLOCATION : [mod_type._] Test.f90:43

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :    0[ reg_null]
        
    Routine temporaries
        Total         :       6
            Global    :       0
            Local     :       6
        Regenerable   :       0
        Spilled       :       0
        
    Routine stack
        Variables     :       0 bytes*
            Reads     :       0 [0.00e+00 ~ 0.0%]
            Writes    :       0 [0.00e+00 ~ 0.0%]
        Spills        :       0 bytes*
            Reads     :       0 [0.00e+00 ~ 0.0%]
            Writes    :       0 [0.00e+00 ~ 0.0%]
    
    Notes
    
        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.
    

===========================================================================

Begin optimization report for: mod_type.drop._

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (mod_type.drop._) [6/6=100.0%] Test.f90(62,21)


    Report from: Code generation optimizations [cg]

Test.f90(62,21):remark #34051: REGISTER ALLOCATION : [mod_type.drop._] Test.f90:62

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :    0[ reg_null]
        
    Routine temporaries
        Total         :       6
            Global    :       0
            Local     :       6
        Regenerable   :       0
        Spilled       :       0
        
    Routine stack
        Variables     :       0 bytes*
            Reads     :       0 [0.00e+00 ~ 0.0%]
            Writes    :       0 [0.00e+00 ~ 0.0%]
        Spills        :       0 bytes*
            Reads     :       0 [0.00e+00 ~ 0.0%]
            Writes    :       0 [0.00e+00 ~ 0.0%]
    
    Notes
    
        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.
    

===========================================================================

 

TimP
Honored Contributor III

Apparently (not knowing your original version) you have inadvertently prevented the compiler from assuming that rvp and rminout don't overlap. If those were originally separate dummy arguments, the rules of Fortran would require them not to overlap, and the compiler could take advantage of that. You could try throwing !$omp simd on the inner loop to persuade it to ignore the possible overlap.

may_ka
Beginner

Hi. Thanks for the reply.

I tried to read up on simd, but from what I understood it is about loop vectorization. Could you specify how to use it to get the compiler to assume independence between the two arrays?

Thanks

may_ka
Beginner

Hi, another question: which part of the compilation output reveals the problem you mentioned?

Thanks

TimP
Honored Contributor III
The report shows an anti dependence between rvp and rminout. If you wish to try optimization without vectorization, you could set !dir$ ivdep and !dir$ novector.
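
For illustration, placed on the inner loop of the direct SubDrop it would look like this (a sketch; ivdep is only safe if ivs and ivd never index an element written earlier in the same loop, which only the programmer can guarantee):

      !$OMP PARALLEL DO PRIVATE(ss,dd,ii,c2,c1)
      Do c1=1,size(RMInOut,2)
        !dir$ ivdep     ! assert: no assumed loop-carried dependences
        !dir$ novector  ! request scalar code, isolating the dependence assertion
        Do c2=1,size(ivi,1)
          ss=ivs(c2)
          dd=ivd(c2)
          ii=ivi(c2)
          RMInOut(ii,c1)=RMInOut(ii,c1)+rvp(c2)*(RMInOut(ss,c1)&
            &+RMInOut(dd,c1))
        End Do
      End Do
      !$OMP END PARALLEL DO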
Martyn_C_Intel
Employee

Hi,

   For the OpenMP version, you are missing the variable ii in the private clauses on lines 29 and 84 of Test.f90, which for some reason impacts the performance of the type-bound version more than the direct one. But since the results are presumably incorrect unless ii is declared private, that difference is moot.

With ii in the private clause, speeds are comparable (and much faster).

Another difference is that the dummy argument of SubDrop in the direct version is an adjustable array, whereas the derived type component is a pointer. In some circumstances, pointers may be harder to optimize because of the possibilities of aliasing or non-unit strides. That doesn't look to be the case here, though. 

Martyn_C_Intel
Employee

Incidentally, Intel Inspector is an excellent tool for finding errors like this in threaded programs. Spotting them "by hand" (or by eye?) gets harder as the programs get bigger.

may_ka
Beginner

Thanks Martyn. Luckily, the bug was introduced only when putting the example together.

However, my problem is not really solved. The two blocks of code below do essentially the same thing, but the second one spares me from writing a separate routine for each array combination. In the first block, the compiler knows that all involved arrays have unique memory addresses. In the second block, the compiler does not know where the pointers point, which may affect optimization. Is there any way to tell the compiler that all pointers involved in the addition point to different memory locations? (One possible workaround is sketched after the second block.)

Module TestModule
  Type :: TestType
    Real, Allocatable, dimension(:,:) :: a,b,c,d,e
  contains
    Procedure, Pass :: adab => subadab
    Procedure, Pass :: adcd => subadcd
  End type TestType
contains
  Subroutine Subadab(this)
    Class(testtype), intent(inout) :: this
    this%e=this%a+this%b
  End Subroutine Subadab
  Subroutine Subadcd(this)
    Class(testtype), intent(inout) :: this
    this%e=this%c+this%d
  End Subroutine Subadcd
End Module TestModule
Program Test
  Use Testmodule
  Type(testtype) :: tt
  call tt%adab()
  call tt%adcd()
End Program Test

 

Module TestModule
  Type :: TestType
    Real, Allocatable :: a(:,:),b(:,:),c(:,:), d(:,:), e(:,:)
    Real, Pointer, contiguous :: inA(:,:), inB(:,:), out(:,:)
  contains
    Procedure, Pass :: ad => subad
  End type TestType
contains
  Subroutine Subad(this)
    Class(testtype), intent(inout) :: this
    this%out=this%inA+this%inB
  End Subroutine Subad
End Module TestModule
Program Test
  Use Testmodule
  Type(testtype), Target :: tt
  tt%out=>tt%e
  tt%inA=>tt%a;tt%inB=>tt%b
  call tt%ad()
  tt%out=>tt%e
  tt%inA=>tt%c;tt%inB=>tt%d
  call tt%ad()
End Program Test
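
One possible workaround (a sketch, not from the thread; SubAddKernel is a made-up name): because Fortran forbids a dummy argument that is modified from aliasing any other dummy argument, routing the pointer components through an ordinary worker routine restores the no-overlap guarantee of the first block.

Module TestModule
  Type :: TestType
    Real, Allocatable :: a(:,:),b(:,:),c(:,:), d(:,:), e(:,:)
    Real, Pointer, contiguous :: inA(:,:), inB(:,:), out(:,:)
  contains
    Procedure, Pass :: ad => subad
  End type TestType
contains
  Subroutine Subad(this)
    Class(testtype), intent(inout) :: this
    ! Forward the pointer components as dummy arguments; inside the
    ! kernel the compiler may assume they do not overlap.
    call SubAddKernel(this%out,this%inA,this%inB)
  End Subroutine Subad
  Subroutine SubAddKernel(out,inA,inB)
    Real, intent(out), contiguous :: out(:,:)
    Real, intent(in), contiguous :: inA(:,:), inB(:,:)
    out=inA+inB
  End Subroutine SubAddKernel
End Module TestModule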

 

Maybe Tim P. has already answered the above question and I didn't get it. Any comment is much appreciated.
 

jimdempseyatthecove
Honored Contributor III

Try

!dir$ vector always
this%out=this%inA+this%inB

Jim Dempsey

may_ka
Beginner

Hi. Thanks for the quick response. But the -qopt-report=5 output shows hardly any difference with and without !dir$ vector always.

Thanks

jimdempseyatthecove
Honored Contributor III

Did you examine the assembly code?

Advisor and Amplifier each have a button for expanding a statement to assembly.

If the code is vectorized you should see instructions ending in ps (e.g. vmovaps, vmovups, vaddps); ps means packed single, pd packed double. If you see instructions ending in ss (or sd), these are non-vector scalar instructions.

Jim Dempsey

TimP
Honored Contributor III

Vector always doesn't suspend dependency analysis; the ivdep and simd directives do. Omp simd is the most portable of those.

jimdempseyatthecove
Honored Contributor III

>>Vector always doesn't suspend dependency analysis

Then the directive is misnamed (and the documentation is incomplete).
!dir$ vector, without always, should not suspend dependency analysis, and should, for code determined to be non-dependent, vectorize loops that do not look efficient for vectorization to the compiler.

I can see using !dir$ simd, but not necessarily !$omp simd..., because the code may not necessarily be compiled with -openmp.

IMHO !$omp simd, when compiling for OpenMP, should not only produce simd code but additionally enforce partitioning of the loop at cache-line boundaries (or at least at SIMD boundaries).

Jim Dempsey

 

Martyn_C_Intel
Employee

What I see in your new examples is that if inlining is disabled with -fno-inline, the loops get vectorized whether or not pointers are involved. If inlining is allowed, then for the second (pointer) example two versions of the loop are created: one (not vectorized) for when the pointees overlap, and one (vectorized) for when they do not. So in this simple case, pointers are not really hurting. If the (implied) loops were much more complex, e.g. involving several more pointers, the compiler might not be able to generate multiple loop versions, and you might need to help with a directive as suggested. In the first example, with allocatable arrays, the loop is not vectorized at all if it is inlined. I don't know why this is.

 

!DIR$ VECTOR ALWAYS simply overrides the vectorizer’s internal cost model. From the Reference Guide:

“The ALWAYS clause overrides efficiency heuristics of the vectorizer, but it only works if the loop can actually be vectorized … You should use the IVDEP directive to ignore assumed dependences.”

The IVDEP directive asserts that there are no backward potential dependencies (forward dependencies make threading unsafe, but not vectorization). If the compiler is able to prove such a dependency, it still will not vectorize.

With !$OMP SIMD the compiler may not do any dependency analysis. It will vectorize even if a dependency could be proven, but then you'd likely get incorrect results. Note that -qopenmp-simd activates OMP SIMD directives without enabling the threading part of OpenMP. Directives such as !DIR$ SIMD mentioned by Jim are part of Intel Cilk Plus and are deprecated, since OpenMP now has comparable functionality.

Note that the directives !DIR$ IVDEP and !$OMP SIMD cannot be applied to array assignments such as those in your code. They would have to be written as DO loops, which is what you're trying to avoid, because then you'd need to extract the upper and lower array bounds. Perhaps this may be addressed in a future compiler.
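
For illustration, here is the assignment in Subad rewritten as explicit loops so that a directive has a DO loop to attach to (a sketch; the bounds are recovered with size, and the !$OMP SIMD directive takes effect when compiling with -qopenmp or -qopenmp-simd):

  Subroutine Subad(this)
    Class(testtype), intent(inout) :: this
    Integer :: i, j
    Do j=1,size(this%out,2)
      ! Vectorize the inner loop without dependence analysis.
      !$OMP SIMD
      Do i=1,size(this%out,1)
        this%out(i,j)=this%inA(i,j)+this%inB(i,j)
      End Do
    End Do
  End Subroutine Subad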

Martyn_C_Intel
Employee

Note also that your 2D array assignments represent two nested loops. If a directive is placed before such an assignment, there is an ambiguity as to which of the nested loops it should apply to.

I'm not sure why the inlined loop is not vectorized in your allocatable array example, so I've escalated that to the compiler developers.

may_ka
Beginner

Thanks for all the comments, but I may have oversimplified the example. An example closer to my real-world problem is this:

Module TestModule
  Type :: TestType
    Real, Allocatable :: a(:,:),b(:,:),c(:,:), d(:,:), e(:,:)
    Real, Pointer, contiguous :: inA(:,:), inB(:,:), out(:,:)
  contains
    Procedure, Pass :: ad => subad
    Procedure, Pass :: adab => subadab
    Procedure, Pass :: adcd => subadcd
  End type TestType
  Interface
    Module Subroutine Subad(this)
      Class(testtype), intent(inout) :: this
    End Subroutine
    Module Subroutine Subadab(this)
      Class(testtype), intent(inout), Target :: this
    End Subroutine
    Module Subroutine Subadcd(this)
      Class(testtype), intent(inout), Target :: this
    End Subroutine
  End Interface
End Module TestModule
SubModule(TestModule) add
contains
  Module Procedure SubAdab
    this%out=>this%e
    this%inA=>this%a;this%inB=>this%b
    call this%ad()
  End Procedure
  Module Procedure SubAdcd
    this%out=>this%e
    this%inA=>this%c;this%inB=>this%d
    call this%ad()
  End Procedure
  Module Procedure Subad
    this%out=this%inA+this%inB
  End Procedure
End SubModule add
Program Test
  Use Testmodule
  Type(testtype), Target :: tt
  call tt%adab()
  call tt%adcd()
End Program Test

 

The idea is that the operations in "subad" do not change for most of the child classes, but the arrays used do change, in the parent class as well as in all child classes. Thus "subad" is nicely inherited and only the calling subroutines are overridden. This saves me from writing/maintaining several thousand lines of code which differ only in the arrays used. Now the key difference from the second example in #10 seems to be the compile-time output "[[ Unable to inline indirect callsite  <1>]]". In my real-world application the test program itself is not an executable but a routine, and everything sits in a static library. When I moved from the calling sequence in #10 to this calling sequence I lost about 15% in speed.

Thanks

may_ka
Beginner

I initialized the arrays in all three examples above with

 integer :: i=1000000,j=1000
  Allocate(tt%a(i,j),tt%b(i,j),tt%c(i,j),tt%d(i,j),tt%e(i,j),source=0.0)

When compiling with -O3 I get the following timings:

ifort: #1 4.4sec, #2 5.9sec, #3 5.9sec

gfortran: #1 4.4sec, #2 4.4sec, #3 4.4sec

 

 

 

jimdempseyatthecove
Honored Contributor III

Martyn and May,

I recently had an issue with failure to vectorize a 2D array operation, where the array slice indices were not explicitly specified. In my case I had something like:

real(8) :: temp(6),a(6,6),b(6,6)
...
temp = a(:,j) * b(:,j)

Using compiler directives would not get this to vectorize. On a whim (I can't recall why) I changed the line to

do j=1,6
...
temp = a(1:6,j) * b(1:6,j)

and this finally vectorized. This was on KNL.

May, try adding

integer, parameter :: iDim = 1000000, jDim=1000

Use those in your allocations (or place them into TestType as variables). Then use

this%out(1:iDim,1:jDim) = this%inA(1:iDim,1:jDim) + this%inB(1:iDim,1:jDim)

Jim Dempsey

may_ka
Beginner

Hi Jim,

thanks for looking into this. I tried

this%out(1:size(this%Out,1),1:size(this%Out,2))=&
      &this%inA(1:size(this%Out,1),1:size(this%Out,2))+&
      &this%inB(1:size(this%Out,1),1:size(this%Out,2))

but that did nothing to the processing time differences I found above.

Cheers
