Performance penalty for types (type matrix versus arrays)

KNAnagnostopoulos · ‎05-31-2020

I define a very simple type for a matrix:

 type               MyMatrix
  integer                               :: n=0
  complex(dp), allocatable              :: v(:,:)
 end type           MyMatrix

Then, operations turn out to have a large overhead for small N x N matrices. For example, matrix multiplication (which I am mostly interested in)

type(MyMatrix) ::  m1, m2, m3

complex(dp), allocatable ::    a1(:,:), a2(:,:), a3(:,:)

m3 = m1 * m2;  m3 = MyMatmul(m1,m2); a3 = MyMatmul(a2,a3)

for N=60 gives about 1.6 performance penalty. For the more complex matrix type that I have in my real program, containing also procedure components, the penalty can be up to 2.5! The comparison is done using an interface that is simply passing m1 % v and m2 % v to BLAS's zgemm . The operator(*) is overloaded to MyMatmul (same performance as the function call).

The penalty reduces for large matrices, reaching 10% for N=1000 and 5% for N=4000. In my real program, the penalty remains 10% up to N=16000.

The performance is compared using wall time.

Is this unavoidable?

-------------------------------------------------------------------------------------------------------------

I am attaching the simplified source code that focuses on this particular problem

Benchmarks on i7-4790 CPU @ 3.60GHz, Ubuntu 18.04,

Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.5.281 Build 20190815

ifort -O3 -heap-arrays -xAVX  -parallel benchmarkMyMatrix.f90 -mkl=parallel 

export  OMP_NUM_THREADS=8
export  MKL_NUM_THREADS=8

and on the ARIS supercomputer (Intel Xeon E5-2680v2, https://hpc.grnet.gr/hardware/#hardware-overview )

compiler_version: Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.0.128 Build 20170811
compiler_options: -I/apps/compilers/intel/18.0.0/mkl/include -c -O3 -heap-arrays -xCORE-AVX-I -parallel -mkl=parallel

export  OMP_NUM_THREADS=20
export  MKL_NUM_THREADS=20

Indicative wall times for each operation: (times are in wall time seconds per matrix multiplication averaged over several thousand random matrices - less statistics for N=4000 - on the i7 mentioned above).

N     MatmulMyMatrix  MatmulArray2  ratio
30    0.63192E-05     0.38982E-05   1.62
60    0.21967E-04     0.13776E-04   1.59
120   0.10932E-03     0.79098E-04   1.38
190   0.36823E-03     0.30644E-03   1.20
250   0.78471E-03     0.65853E-03   1.19
500   0.64193E-02     0.53616E-02   1.20
1000  0.45260E-01     0.41173E-01   1.10
2000  0.34830         0.32136       1.08
4000  2.5440          2.4220        1.05

jimdempseyatthecove · ‎05-31-2020

Konstantinos,

Why do you use a user defined type that contains a 2D square array, and an integer holding the size of each dimension...
... as opposed to simply allocating the 2D square array and then using size(theArray,dim=1)?

Also note:

 function           mymatrix_construct(n) result(MAT)
  integer, intent(in)                   :: n
  type(MyMatrix)                        :: MAT

  allocate( MAT % v(n,n) )

  MAT % v = ZERO
 end function       mymatrix_construct

Is not initializing MAT % n = n

Then:

 subroutine         test_MatmulMyMatrix(N,delt,nops,dt_test)
  ...
  m1   = MyMatrix(N); m2 = MyMatrix(N); m3 = MyMatrix(N)
  ... **** m1%n == 0, m2%n==0, m3%n == 0 ***
   ...
   m3  = m1 * m2 ! *** m1%n == 0, m2%n==0 ***
   tf  = wtime(); dt = tf - ti; nops = nops + 1
  ...
 end subroutine     test_MatmulMyMatrix

And now the operator(*) function

 function           mymatrix_mult_mymatrix (MATA,MATB) result (MATC)
  class(MyMatrix), intent(in)           :: MATA, MATB
  type (MyMatrix)                       :: MATC

  MATC % n = MATA % n ! *** n == 0 ***

  allocate( MATC % v( MATA % n , MATA % n) ) ! *** zero sized array ***

  MATC % v = LMATMUL( MATA % v , MATB % v) ! *** likely perform a realloc_lhs without complaining

 end function       mymatrix_mult_mymatrix

Jim Dempsey

jimdempseyatthecove · ‎05-31-2020

my bad

>>m1 = MyMatrix(N); m2 = MyMatrix(N); m3 = MyMatrix(N) ...
>>**** m1%n == undefined, m2%n==undefined, m3%n == undefined ***

Jim Dempsey

KNAnagnostopoulos · ‎05-31-2020

Dear Jim, thank you for your comment. I uploaded the correct file, the results are correct in both versions, the error is harmless (now MAT %n is defined). The code was written for educational reasons, but I got curious about the performance penalty of OO features.

I tried to trim down the code of my program to focus on the question posed in the forum, the error does not exist in the original program. The original definition of the type is

 type               MatrixClass                                                                                                     !:.:.:
  integer                               :: m=0, n =0                                                                                !:.:.:! an m x n matrix M_ij;  m=0 indicates that it is not initiated
  integer                               :: is=0,ie=0,js=0,je=0                                                                      !:.:.:! i= is, is+1, ..., ie   j= js, js+1, ..., je, ie=is+m-1, je=js+n-1
  character(mtype_len)                  :: mtype='GG'                                                                               !:.:.:! Lapack convention: ZG, ZH, DG, DS, ... if GG: Not initialized
  character( name_len)                  :: name=''                                                                                  !:.:.:! name, further info
 contains
  procedure, private, pass(MAT)         :: matrix_save, matrix_read, matrix_print, matrix_random_set, matrix_gaussian_set                            
  generic                               :: save           => matrix_save                                                            !:.:.:! call MAT % save (unit) makes a raw save      in (optional) unit (default f_mout)  
  generic                               :: read           => matrix_read                                                            !:.:.:! call MAT % read (unit) reads matrix        from (optional) unit (default f_minput)    
  generic                               :: print          => matrix_print                                                           !:.:.:! call MAT % print(unit) makes pretty printing in (optional) unit (default f_mout)
  generic                               :: random         => matrix_random_set                                                      !:.:.:! call MAT % random      makes matrix random.  If mtype='ZH'/'DS', it is hermitian/symmetric
  generic                               :: gaussian       => matrix_gaussian_set                                                    !:.:.:
 end type           MatrixClass
!-----------------------------------------------------------------------------------------------------------------------------------
 type,      extends(MatrixClass)        :: Matrix                                                                                   !:.:.:! Complex(dp) matrices 
  complex(dp), allocatable              :: v(:,:)                                                                                   !:.:.:! MAT % v(is:ie, js:je): values of matrices
 contains
  procedure, private, pass(MAT)         :: matrix_hermitian_set, matrix_traceless_set
  procedure, private, pass(MATA)        :: matrix_return_transpose,   matrix_return_hermitian, matrix_return_conjg
  procedure, private, pass(MATA)        :: matrix_return_real_dmatrix,matrix_return_imag_dmatrix
  generic                               :: hermitian_set  => matrix_hermitian_set                                                   !:.:.:! call MAT % hermitian_set makes matrix hermitian and mtype 'ZH'
  generic                               :: traceless_set  => matrix_traceless_set
  generic                               :: conjg          => matrix_return_conjg                                                    !:.:.:! m2 = m1  % conjg    () returns the complex conjugate of matrix m1
  generic                               :: transpose      => matrix_return_transpose                                                !:.:.:! m2 = m1  % transpose() returns the transpose         of matrix m1
  generic                               :: hermitian      => matrix_return_hermitian                                                !:.:.:! m2 = m1  % hermitian() returns the hermitian         of matrix m1
  generic                               :: dagger         => matrix_return_hermitian                                                !:.:.:! m2 = m1  % hermitian() returns the hermitian         of matrix m1
  generic                               :: re             => matrix_return_real_dmatrix                                             !:.:.:
  generic                               :: im             => matrix_return_imag_dmatrix                                             !:.:.:
 end type           Matrix
 interface          Matrix                                                                                                          ! MAT=Matrix(m,n,is,js,mtype='ZG',name='MMAT')
  module procedure                      :: matrix_construct_zero  , matrix_construct_real, matrix_construct_complex                 ! MAT=Matrix(n,mtype='ZH'), MAT=Matrix(PI,n), MAT=Matrix(IMU,n)
  module procedure                      :: matrix_construct_random, matrix_construct_array2                                         ! MAT=Matrix('uniform',n,mtype='ZH'), MAT=Matrix('gaussian',n,sigma=PI), MAT=Matrix(C(:,:))
 end interface      Matrix

andrew_4619 · ‎05-31-2020

MATC % v = LMATMUL( MATA % v , MATB % v)

It would not surprise me if the compiler copies MATA%v and MATB%v into temporary arrays so the data is in contiguous memory for the LMATMUL. This could be the cause of the slowdown with a derived type. You would need to look at the assembly to see what instructions are generated.

KNAnagnostopoulos · ‎05-31-2020

Hi Andrew, thank you for your reply. The only thing that baffles me is, shouldn't this be the same for the allocatable arrays?

FortranFan · ‎05-31-2020

Konstantinos A. wrote:
.. The code was written for educational reasons, but I got curious about the performance penalty of OO features...

@Konstantinos A.,

For educational purposes, I contend no Fortran code for linear algebra is complete without a consideration for parameterized derived types (PDTs) introduced in Fortran standard since the 2003 revision. Besides, the code you show is needlessly complex. I suggest a simpler setup which first focuses on the intrinsic capabilities of the language itself as guided by its ISO IEC standard. Then it evaluate its performance. An example is shown below..

Floating-point precision:

module mykinds_m

   integer, parameter :: DP = selected_real_kind( p=12 )

end module

Wrap the matrix multiplication workhorse procedure of your choice that optimizes its handling such as conveying contiguous data, avoiding stack overflow, etc. Here I show with the intrinsic MATMUL:

module mult_m
   use mykinds_m, only : DP
contains
   pure subroutine mymult( a, b, c )
      real(DP), contiguous, intent(in) :: a(:,:)
      real(DP), contiguous, intent(in) :: b(:,:)
      real(DP), contiguous, intent(inout) :: c(:,:)
      ! Checks elided
      c = matmul( a, b )
   end subroutine
end module

A 'container class' for the types of your matrix and the methods that operate on the matrix elements, here I show with a parameterized derived type (PDT) for a square matrix with matrix multiplication:

module mat_m
   use mykinds_m, only : DP
   use mult_m, only : mymult
   type :: square_matrix_t(n)
      integer, len :: n = 2 ! 2x2 matrix by default
      real(DP) :: x(n,n)
   contains
      private
      procedure, pass(a) :: mult_mat
      generic, public :: operator(*) => mult_mat
   end type
contains
   elemental function mult_mat( a, b ) result( r )
      class(square_matrix_t(n=*)), intent(in)  :: a
      type(square_matrix_t(n=a%n)), intent(in) :: b
      type(square_matrix_t(n=a%n)) :: r
      call mymult( a%x, b%x, r%x )
   end function
   elemental subroutine mult_mat_t( a, b, c )
      type(square_matrix_t(n=*)), intent(in)  :: a
      type(square_matrix_t(n=a%n)), intent(in) :: b
      type(square_matrix_t(n=a%n)), intent(inout) :: c
      call mymult( a%x, b%x, c%x )
   end subroutine
end module

And a unit test with simple tests for CPU performance:

program p

   use mykinds_m, only : DP

   integer, parameter :: n = 500
   integer, parameter :: ntrial = 12
   integer :: i

   real(DP) :: cput(ntrial), t1, t2

   call random_seed()

   blk1: block
      use mult_m, only : mymult
      real(DP), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate( a(n,n), b(n,n), c(n,n) )
      print *, "Block 1: Arrays with wrapper to intrinsic MATMUL"
      do i = 1, ntrial
         call random_number( a )
         call random_number( b )
         call my_cpu_time( t1 )
         call mymult( a, b, c)
         !print *, "a(1,1), b(1,1), c(1,1): ", a(1,1), b(1,1), c(1,1)
         call my_cpu_time( t2 )
         cput(i) = t2-t1
      end do
      ! Calc average time excluding best and worst times
      t1 = sum(cput) - maxval(cput) - minval(cput)
      print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP)
   end block blk1
   print *

   blk2: block
      use mat_m, only : square_matrix_t
      type(square_matrix_t(n=:)), allocatable :: a, b, c
      print *, "Block 2: PDT for a square matrix with a type-bound operator for multiplication"
      allocate( square_matrix_t(n=n) :: a, b, c )
      do i = 1, ntrial
         call random_number( a%x )
         call random_number( b%x )
         call my_cpu_time( t1 )
         c = a*b
         call my_cpu_time( t2 )
         !print *, "a%x(1,1), b%x(1,1), c%x(1,1): ", a%x(1,1), b%x(1,1), c%x(1,1)
         cput(i) = t2-t1
      end do
      ! Calc average time excluding best and worst times
      t1 = sum(cput) - maxval(cput) - minval(cput)
      print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP)
   end block blk2
   print *

   blk3: block
      use mat_m, only : square_matrix_t, mult_mat_t
      type(square_matrix_t(n=:)), allocatable :: a, b, c
      print *, "Block 3: PDT for a square matrix with type-specific MATMUL procedure"
      allocate( square_matrix_t(n=n) :: a, b, c )
      do i = 1, ntrial
         call random_number( a%x )
         call random_number( b%x )
         call my_cpu_time( t1 )
         call mult_mat_t( a, b, c )
         call my_cpu_time( t2 )
         !print *, "a%x(1,1), b%x(1,1), c%x(1,1): ", a%x(1,1), b%x(1,1), c%x(1,1)
         cput(i) = t2-t1
      end do
      ! Calc average time excluding best and worst times
      t1 = sum(cput) - maxval(cput) - minval(cput)
      print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP)
   end block blk3

   stop

contains

   subroutine my_cpu_time( time )

      use, intrinsic :: iso_fortran_env, only : I8 => int64

      !.. Argument list
      real(DP), intent(inout) :: time

      !.. Local variables

      integer(I8) :: tick
      integer(I8) :: rate

      call system_clock (tick, rate)

      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )

      return

   end subroutine my_cpu_time

end program p

You can try this on your hardware environment(s) and report results. With Windows and an i7 laptop, the results are:

C:\Temp>ifort /O3 /standard-semantics /heap-arrays- /stand:f18 /warn:all p.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.1.216 Build 20200306
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.25.28612.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\Temp>p.exe
 Block 1: Arrays with wrapper to intrinsic MATMUL
 Average CPU Time:  2.209999561309815E-002

 Block 2: PDT for a square matrix with a type-bound operator for multiplication
 Average CPU Time:  2.599999904632568E-002

 Block 3: PDT for a square matrix with type-specific MATMUL procedure
 Average CPU Time:  2.079999446868896E-002

C:\Temp>p.exe
 Block 1: Arrays with wrapper to intrinsic MATMUL
 Average CPU Time:  2.169995307922363E-002

 Block 2: PDT for a square matrix with a type-bound operator for multiplication
 Average CPU Time:  2.600002288818359E-002

 Block 3: PDT for a square matrix with type-specific MATMUL procedure
 Average CPU Time:  2.220001220703125E-002

C:\Temp>

So you will notice the use of PDT with a type-specific procedure toward the operation of interest is similar to the intrinsic data type, a floating-point array of an arbitrary precision. You will notice the use of the aesthetic 'blackboard abstraction' in the form of 'c = a*b' incurs some performance penalty. This penalty is small in absolute terms but can be appreciable on a relative basis. My observation is the root cause has to do with the 'c = ..' copy operation involving the derived type as well as some overhead with the use of the type-bound procedure for the generic binding toward operator(*).

When such a penalty is unacceptable, don't use the fancy stuff which allows the so-called ADT calculus with user-defined types such as 'c = a*b'. Use simple subroutines with intrinsic data types such as REAL instead.

KNAnagnostopoulos · ‎06-01-2020

Dear Fortran Fan, thank you for taking the time to give me such a detailed response. I will try it in the next couple of days and let you know. One technical matter is that you use elemental/pure procedures and I want to use zgemm, which is not pure. Therefore I might have to try the difference of not using such procedures.

Steve_Lionel · ‎06-01-2020

I would like to see FortranFan's example run with more trials and some way of making sure that elements of the result c are used outside the timing. The times reported are a couple of hundredths of a second, which is at most one or two OS clock ticks, and the compiler could have eliminated the work since it didn't see it being used. It could be something as simple as adding an element of c to a variable that is then printed outside the loop.

I'm also skeptical of timing just the call to the various multiply operations. It would be better if the random_number calls were done outside the loop to prepopulate the a and b arrays, then the loop which accumulates results for c is timed. Even then I'd check the assembly to make sure that I understand what is being timed.

JohnNichols · ‎06-01-2020

Even then I'd check the assembly to make sure that I understand what is being timed.

Even when you are timing the same operations -- 10000 times per day -- the results will move all over the place, we track the loop time for our main routines and load the results to a SQL server, the results show the type of distribution one would expect. The computer is a machine but it does vary.

FortranFan · ‎06-01-2020

Steve Lionel (Ret.) (Blackbelt) wrote:
I would like to see FortranFan's example run with more trials and some way of making sure that elements of the result c are used outside the timing. The times reported are a couple of hundredths of a second, which is at most one or two OS clock ticks, and the compiler could have eliminated the work since it didn't see it being used. It could be something as simple as adding an element of c to a variable that is then printed outside the loop.
I'm also skeptical of timing just the call to the various multiply operations. It would be better if the random_number calls were done outside the loop to prepopulate the a and b arrays, then the loop which accumulates results for c is timed. Even then I'd check the assembly to make sure that I understand what is being timed.

It'll be interesting to see what OP (and any other reader) find(s) and report(s) here. I had instinctively and immediately thought about and factored all the aspects mentioned in the above post and I deliberately left the example the way I did to see how OP (and any other reader) build on it.

Theoretically the FLOP instructions with matrix multiplication are O(n**3) and as such, the operation timed here is indeed a good case for the topic of interest to OP, "Performance penalty for type matrix versus arrays."

KNAnagnostopoulos · ‎06-02-2020

I tried to run FortranFan's suggestion but I failed. I am getting (random) segmentation faults. Once I was able to run it and I got the (admittedly limited) results:

#-----------------------------------------------------------------------------------------------------
#my:       N      MatmulMyMatrix    MatmulArray2       FixedArray2      MatmulMatrix
#-----------------------------------------------------------------------------------------------------
my        30       0.13910E-04       0.65755E-05       0.65784E-05       0.21848E-04
my        60       0.56022E-04       0.31077E-04       0.30943E-04       0.88976E-04
my       120       0.29360E-03       0.19568E-03       0.19771E-03       0.41697E-03
my       190       0.10149E-02       0.74705E-03       0.76242E-03       0.13023E-02
my       250       0.22897E-02       0.17427E-02       0.17313E-02       0.27372E-02

The last column is FortranFan's suggestion, which seems to be even slower than the naive type(MyMatrix). In order to conduct more tests, I need to overcome the SIGSEGV fault. The FixedArray2 concerns statically allocated arrays and a direct call to zgemm.

When I am examining the core file, the problem seems to be the deallocation of the types:

#14 0x00000000004419bd in for_dealloc_allocatable ()
#15 0x00000000004174f0 in do_deallocate_all ()
#16 0x000000000041723d in do_deallocate_all ()
#17 0x000000000040c1ed in matrix_benchmark::test_matmulmatrix (n=30, delt=1, nops=44314, dt_test=1) at benchmarkMyMatrix.f90:188

I don't know if I am doing something trivially wrong, or if this a bug of the ifort version that I am using.

I am compiling with

ifort -g -heap-arrays -standard-semantics -stand:f18 -nowarn -mkl benchmarkMyMatrixUutilities.f90 benchmarkMyMatrix.f90

(and various other trivial changes). The details of the error stack, as well as my environment and source code, can be found in the attached README file.

NOTE: Please don't compare the above mentioned times with the ones I reported on my post originally. The times were obtained on my laptop and not on the machine used originally. I was not able to get such a report from the other machines due to the SEGV.

Concerning letting the overhead of random_number add to the wall time:

a. the calls are only 5, independent of N, so the contribution becomes negligible for large N

b. the times should be used for comparison between the different methods. The contribution of random_number is the same for every method (Statistically speaking).

FortranFan · ‎06-02-2020

Konstantinos A. wrote:
I tried to run FortranFan's suggestion but I failed. ..

Please first can you take the example as-is and run on your system and report results here?

Shown below is a somewhat modified case, can you please try that? Please make no code changes, you'll note N = 500. It'll be great if you can first try this one case.

Please use the equivalent of Intel Fortran compiler options toward O3 optimization and standard-semantics, so the command might be ifort -O3 -standard-semantics <filename>

module mykinds_m

   integer, parameter :: DP = selected_real_kind( p=12 )

end module
 
module mult_m
   use mykinds_m, only : DP
contains
   pure subroutine mymult( a, b, c )
      real(DP), contiguous, intent(in) :: a(:,:)
      real(DP), contiguous, intent(in) :: b(:,:)
      real(DP), contiguous, intent(inout) :: c(:,:)
      ! Checks elided
      c = matmul( a, b )
   end subroutine
end module

module mat_m
   use mykinds_m, only : DP
   use mult_m, only : mymult
   type :: square_matrix_t(n)
      integer, len :: n = 2 ! 2x2 matrix by default
      real(DP) :: x(n,n)
   contains
      private
      procedure, pass(a) :: mult_mat
      generic, public :: operator(*) => mult_mat
   end type
contains
   elemental function mult_mat( a, b ) result( r )
      class(square_matrix_t(n=*)), intent(in)  :: a
      type(square_matrix_t(n=a%n)), intent(in) :: b
      type(square_matrix_t(n=a%n)) :: r
      call mymult( a%x, b%x, r%x )
   end function
   elemental subroutine mult_mat_t( a, b, c )
      type(square_matrix_t(n=*)), intent(in)  :: a
      type(square_matrix_t(n=a%n)), intent(in) :: b
      type(square_matrix_t(n=a%n)), intent(inout) :: c
      call mymult( a%x, b%x, c%x )
   end subroutine
end module

program p

   use mykinds_m, only : DP

   integer, parameter :: n = 500
   integer, parameter :: ntrial = 12
   integer :: i

   real(DP) :: cput(ntrial), t1, t2

   call random_seed()

   blk1: block
      use mult_m, only : mymult
      real(DP), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate( a(n,n), b(n,n), c(n,n) )
      print *, "Block 1: Arrays with wrapper to intrinsic MATMUL"
      do i = 1, ntrial
         if ( i == ntrial ) then
            call imat( a )
            call imat( b )
         else
            call random_number( a )
            call random_number( b )
         end if
         call my_cpu_time( t1 )
         call mymult( a, b, c)
         call my_cpu_time( t2 )
         cput(i) = t2-t1
         if ( i == 1 ) then
            print *, "First iteration: "
            print *, "a(1,1), b(1,1), c(1,1): ", a(1,1), b(1,1), c(1,1)
            t2 = sum( c )
            print *, "sum(c) = ", t2
         end if
      end do
      ! Calc average time excluding best and worst times
      t1 = sum(cput) - maxval(cput) - minval(cput)
      print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP), new_line("") 
      print *, "Last iteration: check using identity matrix"
      print *, "a(n,n), b(n,n), c(n,n): ", a(n,n), b(n,n), c(n,n)
   end block blk1
   print *

   blk2: block
      use mat_m, only : square_matrix_t
      type(square_matrix_t(n=:)), allocatable :: a, b, c
      print *, "Block 2: PDT for a square matrix with a type-bound operator for multiplication"
      allocate( square_matrix_t(n=n) :: a, b, c )
      do i = 1, ntrial
         if ( i == ntrial ) then
            call imat( a%x )
            call imat( b%x )
         else
            call random_number( a%x )
            call random_number( b%x )
         end if
         call my_cpu_time( t1 )
         c = a*b
         call my_cpu_time( t2 )
         cput(i) = t2-t1
         if ( i == 1 ) then
            print *, "First iteration: "
            print *, "a%x(1,1), b%x(1,1), c%x(1,1): ", a%x(1,1), b%x(1,1), c%x(1,1)
            t2 = sum( c%x )
            print *, "sum(c%x) = ", t2
         end if
      end do
      ! Calc average time excluding best and worst times
      t1 = sum(cput) - maxval(cput) - minval(cput)
      print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP), new_line("") 
      print *, "Last iteration: check using identity matrix"
      print *, "a%x(n,n), b%x(n,n), c%x(n,n): ", a%x(n,n), b%x(n,n), c%x(n,n)
   end block blk2
   print *

   blk3: block
      use mat_m, only : square_matrix_t, mult_mat_t
      type(square_matrix_t(n=:)), allocatable :: a, b, c
      print *, "Block 3: PDT for a square matrix with type-specific MATMUL procedure"
      allocate( square_matrix_t(n=n) :: a, b, c )
      do i = 1, ntrial
         if ( i == ntrial ) then
            call imat( a%x )
            call imat( b%x )
         else
            call random_number( a%x )
            call random_number( b%x )
         end if
         call my_cpu_time( t1 )
         call mult_mat_t( a, b, c )
         call my_cpu_time( t2 )
         cput(i) = t2-t1
         if ( i == 1 ) then
            print *, "First iteration: "
            print *, "a%x(1,1), b%x(1,1), c%x(1,1): ", a%x(1,1), b%x(1,1), c%x(1,1)
            t2 = sum( c%x )
            print *, "sum(c%x) = ", t2
         end if
      end do
      ! Calc average time excluding best and worst times
      t1 = sum(cput) - maxval(cput) - minval(cput)
      print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP), new_line("") 
      print *, "Last iteration: check using identity matrix"
      print *, "a%x(n,n), b%x(n,n), c%x(n,n): ", a%x(n,n), b%x(n,n), c%x(n,n)
   end block blk3

   stop

contains

   subroutine my_cpu_time( time )

      use, intrinsic :: iso_fortran_env, only : I8 => int64

      !.. Argument list
      real(DP), intent(inout) :: time

      !.. Local variables

      integer(I8) :: tick
      integer(I8) :: rate

      call system_clock (tick, rate)

      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )

      return

   end subroutine my_cpu_time

   subroutine imat( x )
      real(DP), intent(inout) :: x(:,:)
      integer :: i
      x = 0.0_dp
      do i = 1, size(x,dim=1)
         x(i,i) = 1.0_dp
      end do
   end subroutine 

end program p

Sample output based on this code is as follows:

C:\Temp>ifort /standard-semantics /O3 /warn:all /stand:f18 p.f90 -o p.exe
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.1.216 Build 20200306
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.25.28612.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\Temp>p.exe
 Block 1: Arrays with wrapper to intrinsic MATMUL
 First iteration:
 a(1,1), b(1,1), c(1,1):  0.427986685828709 0.256966050640733
 120.717640411730
 sum(c) =  31217750.4781149
 Average CPU Time:  2.570002079010010E-002

 Last iteration: check using identity matrix
 a(n,n), b(n,n), c(n,n):  1.00000000000000 1.00000000000000
 1.00000000000000

 Block 2: PDT for a square matrix with a type-bound operator for multiplication
 First iteration:
 a%x(1,1), b%x(1,1), c%x(1,1):  0.748272345682208 0.580942401839468
 121.718967899213
 sum(c%x) =  31198625.6667619
 Average CPU Time:  2.649998664855957E-002

 Last iteration: check using identity matrix
 a%x(n,n), b%x(n,n), c%x(n,n):  1.00000000000000 1.00000000000000
 1.00000000000000

 Block 3: PDT for a square matrix with type-specific MATMUL procedure
 First iteration:
 a%x(1,1), b%x(1,1), c%x(1,1):  0.549035960188161 0.920795003076816
 127.775432983094
 sum(c%x) =  31241430.1548635
 Average CPU Time:  1.960003376007080E-002

 Last iteration: check using identity matrix
 a%x(n,n), b%x(n,n), c%x(n,n):  1.00000000000000 1.00000000000000
 1.00000000000000

C:\Temp>

KNAnagnostopoulos · ‎06-02-2020

Dear FortranFan, your code runs seemlessly:

 Block 1: Arrays with wrapper to intrinsic MATMUL
 First iteration: 
 a(1,1), b(1,1), c(1,1):  0.414588091541075 0.573363950353085
 131.731703981318
 sum(c) =  31299857.5604336
 Average CPU Time:  2.420210838317871E-002 

 Last iteration: check using identity matrix
 a(n,n), b(n,n), c(n,n):  1.00000000000000 1.00000000000000
 1.00000000000000
 
 Block 2: PDT for a square matrix with a type-bound operator for multiplication
 First iteration: 
 a%x(1,1), b%x(1,1), c%x(1,1):  0.976950374450897 0.904038053398594
 128.281468064703
 sum(c%x) =  31216491.4142846
 Average CPU Time:  2.745497226715088E-002 

 Last iteration: check using identity matrix
 a%x(n,n), b%x(n,n), c%x(n,n):  1.00000000000000 1.00000000000000
 1.00000000000000
 
 Block 3: PDT for a square matrix with type-specific MATMUL procedure
 First iteration: 
 a%x(1,1), b%x(1,1), c%x(1,1):  0.168759306587530 0.847564522196995
 126.620520759212
 sum(c%x) =  31288743.1630736
 Average CPU Time:  2.389459609985351E-002 

 Last iteration: check using identity matrix
 a%x(n,n), b%x(n,n), c%x(n,n):  1.00000000000000 1.00000000000000
 1.00000000000000

jimdempseyatthecove · ‎06-02-2020

FF>

elemental function mult_mat( a, b ) result( r )
   class(square_matrix_t(n=*)), intent(in) :: a
   type(square_matrix_t(n=a%n)), intent(in) :: b
   type(square_matrix_t(n=a%n)) :: r
 ...
    c = a*b

As used by operator(*), a, b, and c are scalars of type square_matrix (not an array of type square_matrix).

Jim Dempsey

FortranFan · ‎06-02-2020

jimdempseyatthecove (Blackbelt) wrote:
FF>
elemental function mult_mat( a, b ) result( r )
   class(square_matrix_t(n=*)), intent(in) :: a
   type(square_matrix_t(n=a%n)), intent(in) :: b
   type(square_matrix_t(n=a%n)) :: r
 ...
    c = a*b
As used by operator(*), a, b, and c are scalars of type square_matrix (not an array of type square_matrix).

Jim Dempsey

Jim, not sure I understand - note you can try the operation with arrays if you like, here's a unit test as a starting point:

program p

   use mykinds_m, only : DP

   real(DP), parameter :: EPS_DP = epsilon( 0.0_dp )
   integer, parameter :: n = 3
   integer :: i

   call random_seed()

   blk: block
      use mat_m, only : square_matrix_t
      type(square_matrix_t(n=:)), allocatable :: a(:), b(:), c(:)
      print *, "Elemental C = AB multiplication for arrays of PDT toward a square matrix"
      allocate( square_matrix_t(n=n) :: a(n), b(n), c(n) )
      do i = 1, n
         call imat( a(i)%x )
         call random_number( b(i)%x )
      end do

      c = a*b !<-- Elemental multiplication of PDT arrays 

      print *, "First element: "
      print *, "a(1)%x(1,1), b(1)%x(1,1), c(1)%x(1,1): ", a(1)%x(1,1), b(1)%x(1,1), c(1)%x(1,1)

      chk: do i = 1, n
         if ( any( abs(c(i)%x-b(i)%x) > EPS_DP ) ) then
            print *, "Error: C /= B following C = IB multiplication for b(", i, ")"
            exit chk
         end if 
      end do chk

   end block blk
   print *

   stop

contains

   subroutine imat( x )
      real(DP), intent(inout) :: x(:,:)
      integer :: i
      x = 0.0_dp
      do i = 1, size(x,dim=1)
         x(i,i) = 1.0_dp
      end do
   end subroutine 

end program p

This example goes with the same "library" code:

module mykinds_m

   integer, parameter :: DP = selected_real_kind( p=12 )

end module
 
module mult_m
   use mykinds_m, only : DP
contains
   pure subroutine mymult( a, b, c )
      real(DP), contiguous, intent(in) :: a(:,:)
      real(DP), contiguous, intent(in) :: b(:,:)
      real(DP), contiguous, intent(inout) :: c(:,:)
      ! Checks elided
      c = matmul( a, b )
   end subroutine
end module

module mat_m
   use mykinds_m, only : DP
   use mult_m, only : mymult
   type :: square_matrix_t(n)
      integer, len :: n = 2 ! 2x2 matrix by default
      real(DP) :: x(n,n)
   contains
      private
      procedure, pass(a) :: mult_mat
      generic, public :: operator(*) => mult_mat
   end type
contains
   elemental function mult_mat( a, b ) result( r )
      class(square_matrix_t(n=*)), intent(in)  :: a
      type(square_matrix_t(n=a%n)), intent(in) :: b
      type(square_matrix_t(n=a%n)) :: r
      call mymult( a%x, b%x, r%x )
   end function
   elemental subroutine mult_mat_t( a, b, c )
      type(square_matrix_t(n=*)), intent(in)  :: a
      type(square_matrix_t(n=a%n)), intent(in) :: b
      type(square_matrix_t(n=a%n)), intent(inout) :: c
      call mymult( a%x, b%x, c%x )
   end subroutine
end module

Here's sample output:

C:\Temp>ifort /standard-semantics /warn:all /stand:f18 p.f90 -o p.exe
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.1.216 Build 20200306
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.25.28612.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\Temp>p.exe
 Elemental C = AB multiplication for arrays of PDT toward a square matrix
 First element:
 a(1)%x(1,1), b(1)%x(1,1), c(1)%x(1,1):  1.00000000000000
 0.294981195625570 0.294981195625570


C:\Temp>

FortranFan · ‎06-02-2020

Konstantinos A. wrote:
Dear FortranFan, your code runs seemlessly: ..

Ok, good. Now you can try a few things in this second example if you like:

Try a few larger sizes of N up to a reasonable size depending on your system configuration e.g., whether you've 64 GB or 32 GB environment vs 16 or 8 GB. As you know, if N becomes too large, you'll be dealing with virtual memory management and its impact on performance which can have a lot more variance from run to run.
Different values of ntrial and whether they make a difference, of course as checked over multiple runs at same values of n and ntrial,
Swapping intrinsic MATMUL with an algorithm of your choice; you may need to remove ELEMENTAL on 'mult_mat*' or apply "IMPURE",

You can then compare the times of the 3 approaches: with arrays, with derived types and a subroutine, and derived types with defined operations.

jimdempseyatthecove · ‎06-02-2020

FF>> Your post #16 does show use of an array of type square_matrix_t, and thus an appropriate use of elemental attribute of the function.

The original post showed no need for an array of type square_matrix_t, but he did state a preference for using zgemm. In which case use of both ELEMENTAL and IMPURE might be desired.

Jim Dempsey

KNAnagnostopoulos · ‎06-02-2020

Dear FortranFan,

I have had some problems with your code:
1. When compiling with
ifort -g -heap-arrays -standard-semantics -stand:f18 -nowarn
with n,ntrial = 500, 10000
It ran for 103 minutes (!!) I got an error that produced a core file (see attached benchmarkFortranFan_v01.out for errors)

2. When compiling ifort -O3 -heap-arrays -standard-semantics -xAVX -parallel
It ran for about 2 minutes and terminated with signal SIGSEGV (see attached benchmarkFortranFan_v02.out for errors)

3. When compiled with
ifort -O3 -heap-arrays -standard-semantics benchmarkFortranFan.f90 -o benchff
it finished normally after about 6 minutes (output in benchmarkFortranFan_v03.out)

4. I repeated 3. using ntrial = 100000 and it finished normally 109 minutes
(output benchmarkFortranFan_v04.out)

FortranFan · ‎06-02-2020

Konstantinos A. wrote:
.. I have had some problems with your code:
1. When compiling with
ifort -g -heap-arrays -standard-semantics -stand:f18 -nowarn
with n,ntrial = 500, 10000
It ran for 103 minutes (!!) I got an error that produced a core file (see attached benchmarkFortranFan_v01.out for errors)
2. When compiling ifort -O3 -heap-arrays -standard-semantics -xAVX -parallel
It ran for about 2 minutes and terminated with signal SIGSEGV (see attached benchmarkFortranFan_v02.out for errors) ..

1. When compiling with -g compiler option to generate debug information and which disables optimization by default, setting 'n' and 'ntrial' to such large numbers makes little sense. If your purpose here is to step through the code in debug mode, why not try smaller n and ntrial? Even with n and ntrial in single digits will give you a good idea of the code while allowing you to complete the instructions in a reasonable timeframe. Note with -g option, the timing results are mostly meaningless.

2. Not sure what -parallel really does for you in this case, you need to consult Intel documentation and perhaps their online support if your terms allow. Same with -xAVX. You may want to look into -xHost option instead, or better yet consult with Intel re: this option and what it does with MATMUL intrinsic or Intel MKL ZGEMM, etc.

Thanks for your followup, glad you got 3 and 4 working. However I do think a reasonably small value for ntrial, say in the 12 to 24 range, is sufficient to get a good estimate of the median performance demand with a given compute approach. I don't agree with Quote #9. Anyways, a value of 10,000 you're using for ntrial looks like an overkill.

Your results with 3 and 4 suggest the same as before, a derived type whose components are arrays can be coded up to yield about the same performance as the most ordinary of the 'containers' in Fortran i.e., the arrays themselves.

The key point is to be careful with data fetches on the components of the derived type, as shown in the example here. The advantage with the derived type, as you know, is the consumer of said type is working with a higher level abstraction that can be 'far closer' than with ordinary arrays to 'domain knowledge' and expertise, in this case linear algebra. This is in addition to other advantages in terms of data encapsulation, information hiding, improved maintenance, readability and ease-of-use of working with the 'library' code.

But now, if one takes this further and brings the abstractions even closer to the ideal of 'For(mula) Tran(slation)' via the use of defined operations that allows writing compact 'equations' such as 'C = A*B', there is a bit of a price to be paid. But this is assuming the library is designed well; note poorly constructed code involving defined operations can really impact performance negatively Library developers and library consumers have to decide on this performance aspect relative to their needs.

KNAnagnostopoulos · ‎06-05-2020

Some updates:

1. With Fortran 19.1.1.217, I got rid of the segmentation faults for the parameterized derived types I reported above (ifort 19.0.5.281, -O3 -heap-arrays -xHost -parallel -mkl=parallel). The -g option is problematic as before.
2. I made a comparison of parameterized derived types (FortranFan's suggestion) versus ordinary types, allocatable arrays and statically allocated arrays. When using matrix multiplication via a subroutine, there is no noticeable time difference for N>=30.
3. If one uses operator overloading, the simple type is faster than the parameterized type for, more or less, N<=1000.
4. For ordinary allocatable arrays, using a function or subroutine for zgemm is the same, as is for the statically allocated arrays.
5. The intrinsic MATMUL is slower for matrix-matrix multiplication than using zgemm for N < 300. The overhead max ranges from 50% to 80% depending on processor and ifort version (see attached array_benchark.README)
6. Overloading multiplication for allocatable arrays (operator(.mm.), such that m3 = m1 .mm. m2) for arrays puts a less than 10% overhead for N<=500 (desktop) and N<= 2000 (supercomputer)
Maximum overhead is 20% for desktop (N=60) and 40% for supercomputer (N=120) (see attached array_benchark.README)
7. Expected scaling (~ N**3 for matrix-matrix ~N**2 for matrix-vector) sets in for N>=N0, where N0=120-750 for matrix-matrix and N0=1000-2000 for matrix-vector depending on compiler version and processor (see attached array_benchark.README)

The following table was produced by using the attached program: (see also attached file table.txt)

----------------------------------------------------------------------------------------------------------------------------------------
         N          (1)               (2)               (3)               (4)               (5)               (6)               (7)
----------------------------------------------------------------------------------------------------------------------------------------
        30       0.61541E-05       0.36631E-05       0.37419E-05       0.16168E-04       0.36311E-05       0.36501E-05       0.36273E-05
        60       0.22306E-04       0.13665E-04       0.13636E-04       0.59970E-04       0.13441E-04       0.13573E-04       0.13446E-04
       120       0.10719E-03       0.77839E-04       0.77894E-04       0.26082E-03       0.78003E-04       0.78610E-04       0.77942E-04
       190       0.36490E-03       0.28585E-03       0.30193E-03       0.74019E-03       0.28684E-03       0.28714E-03       0.28760E-03
       250       0.79697E-03       0.65101E-03       0.64590E-03       0.14293E-02       0.64645E-03       0.64529E-03       0.64487E-03
       500       0.63319E-02       0.53301E-02       0.53227E-02       0.83781E-02       0.52751E-02       0.53102E-02       0.53149E-02
      1000       0.45123E-01       0.41333E-01       0.41729E-01       0.52862E-01       0.40963E-01       0.40964E-01       0.40444E-01
      2000       0.36451           0.33073           0.33932           0.37915           0.33874           0.33724           0.33281    
      4000        2.5571            2.4191            2.4375            2.6045            2.4057            2.4154            2.4033    

Compiled with:
ifort -O3 -heap-arrays -xAVX  -parallel benchmarkMyMatrixUutilities.f90 benchmarkMyMatrix.f90 -mkl=parallel

(1) type MyMatrix:         m3 = m1 * m2           (2) allocatable array:  m3 = lmatmul(m1,m2)      (3) static arrays: direct call to zgemm
(4) parameterized type:    m3 = m1 * m2           (5) parameterized type:  all smatmul(m1,m2,m3)   (6) type MyMatrix: call smatmul(m1,m2,m3)
---------------------------------------------------------------------------------------------------------
(1) Matrix       -> (*)         -> matrix_matmul              (f) -> sub_array_matmul_array  (s) -> zgemm
(2) array        -> lmatmul (f) -> array_matmul_array         (f) -> zgemm
(3) static array -> zgemm
(4) MyMatrix     -> (*)         -> mymatrix_mult_mymatrix     (f) -> lmatmul                 (f) -> array_matmul_array (f) -> zgemm
(5) Matrix       -> smatmul (s) -> sub_matrix_matmul          (s) -> sub_array_matmul_array  (s) -> zgemm
(6) MyMatrix     -> smatmul (s) -> mymatrix_mult_mymatrix_sub (s) -> sub_array_matmul_array  (s) -> zgemm
(7) array        -> smatmul (s) -> sub_array_matmul_array     (s) -> zgemm