- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I define a very simple type for a matrix:

type MyMatrix integer :: n=0 complex(dp), allocatable :: v(:,:) end type MyMatrix

Then, operations turn out to have a large overhead for small N x N matrices. For example, matrix multiplication (which I am mostly interested in)

type(MyMatrix) :: m1, m2, m3 complex(dp), allocatable :: a1(:,:), a2(:,:), a3(:,:) m3 = m1 * m2; m3 = MyMatmul(m1,m2); a3 = MyMatmul(a2,a3)

for N=60 gives about 1.6 performance penalty. For the more complex matrix type that I have in my real program, containing also procedure components, the penalty can be up to 2.5! The comparison is done using an interface that is simply passing m1 % v and m2 % v to BLAS's zgemm . The operator(*) is overloaded to MyMatmul (same performance as the function call).

The penalty reduces for large matrices, reaching 10% for N=1000 and 5% for N=4000. In my real program, the penalty remains 10% up to N=16000.

The performance is compared using wall time.

Is this unavoidable?

-------------------------------------------------------------------------------------------------------------

I am attaching the simplified source code that focuses on this particular problem

Benchmarks on i7-4790 CPU @ 3.60GHz, Ubuntu 18.04,

Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.5.281 Build 20190815

ifort -O3 -heap-arrays -xAVX -parallel benchmarkMyMatrix.f90 -mkl=parallel export OMP_NUM_THREADS=8 export MKL_NUM_THREADS=8

and on the ARIS supercomputer (Intel Xeon E5-2680v2, https://hpc.grnet.gr/hardware/#hardware-overview )

compiler_version: Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.0.128 Build 20170811 compiler_options: -I/apps/compilers/intel/18.0.0/mkl/include -c -O3 -heap-arrays -xCORE-AVX-I -parallel -mkl=parallel export OMP_NUM_THREADS=20 export MKL_NUM_THREADS=20

Indicative wall times for each operation: (times are in wall time seconds per matrix multiplication averaged over several thousand random matrices - less statistics for N=4000 - on the i7 mentioned above).

N MatmulMyMatrix MatmulArray2 ratio 30 0.63192E-05 0.38982E-05 1.62 60 0.21967E-04 0.13776E-04 1.59 120 0.10932E-03 0.79098E-04 1.38 190 0.36823E-03 0.30644E-03 1.20 250 0.78471E-03 0.65853E-03 1.19 500 0.64193E-02 0.53616E-02 1.20 1000 0.45260E-01 0.41173E-01 1.10 2000 0.34830 0.32136 1.08 4000 2.5440 2.4220 1.05

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Konstantinos,

Why do you use a user defined type that contains a 2D square array, and an integer holding the size of each dimension...

... as opposed to simply allocating the 2D square array and then using size(theArray,dim=1)?

Also note:

function mymatrix_construct(n) result(MAT) integer, intent(in) :: n type(MyMatrix) :: MAT allocate( MAT % v(n,n) ) MAT % v = ZERO end function mymatrix_construct

Is not initializing MAT % n = n

Then:

subroutine test_MatmulMyMatrix(N,delt,nops,dt_test) ... m1 = MyMatrix(N); m2 = MyMatrix(N); m3 = MyMatrix(N) ... **** m1%n == 0, m2%n==0, m3%n == 0 *** ... m3 = m1 * m2 ! *** m1%n == 0, m2%n==0 *** tf = wtime(); dt = tf - ti; nops = nops + 1 ... end subroutine test_MatmulMyMatrix

And now the operator(*) function function mymatrix_mult_mymatrix (MATA,MATB) result (MATC) class(MyMatrix), intent(in) :: MATA, MATB type (MyMatrix) :: MATC MATC % n = MATA % n ! *** n == 0 *** allocate( MATC % v( MATA % n , MATA % n) ) ! *** zero sized array *** MATC % v = LMATMUL( MATA % v , MATB % v) ! *** likely perform a realloc_lhs without complaining end function mymatrix_mult_mymatrix

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

my bad

>>m1 = MyMatrix(N); m2 = MyMatrix(N); m3 = MyMatrix(N) ...

>>**** m1%n == undefined, m2%n==undefined, m3%n == undefined ***

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Jim, thank you for your comment. I uploaded the correct file, the results are correct in both versions, the error is harmless (now MAT %n is defined). The code was written for educational reasons, but I got curious about the performance penalty of OO features.

I tried to trim down the code of my program to focus on the question posed in the forum, the error does not exist in the original program. The original definition of the type is

type MatrixClass !:.:.: integer :: m=0, n =0 !:.:.:! an m x n matrix M_ij; m=0 indicates that it is not initiated integer :: is=0,ie=0,js=0,je=0 !:.:.:! i= is, is+1, ..., ie j= js, js+1, ..., je, ie=is+m-1, je=js+n-1 character(mtype_len) :: mtype='GG' !:.:.:! Lapack convention: ZG, ZH, DG, DS, ... if GG: Not initialized character( name_len) :: name='' !:.:.:! name, further info contains procedure, private, pass(MAT) :: matrix_save, matrix_read, matrix_print, matrix_random_set, matrix_gaussian_set generic :: save => matrix_save !:.:.:! call MAT % save (unit) makes a raw save in (optional) unit (default f_mout) generic :: read => matrix_read !:.:.:! call MAT % read (unit) reads matrix from (optional) unit (default f_minput) generic :: print => matrix_print !:.:.:! call MAT % print(unit) makes pretty printing in (optional) unit (default f_mout) generic :: random => matrix_random_set !:.:.:! call MAT % random makes matrix random. If mtype='ZH'/'DS', it is hermitian/symmetric generic :: gaussian => matrix_gaussian_set !:.:.: end type MatrixClass !----------------------------------------------------------------------------------------------------------------------------------- type, extends(MatrixClass) :: Matrix !:.:.:! Complex(dp) matrices complex(dp), allocatable :: v(:,:) !:.:.:! MAT % v(is:ie, js:je): values of matrices contains procedure, private, pass(MAT) :: matrix_hermitian_set, matrix_traceless_set procedure, private, pass(MATA) :: matrix_return_transpose, matrix_return_hermitian, matrix_return_conjg procedure, private, pass(MATA) :: matrix_return_real_dmatrix,matrix_return_imag_dmatrix generic :: hermitian_set => matrix_hermitian_set !:.:.:! call MAT % hermitian_set makes matrix hermitian and mtype 'ZH' generic :: traceless_set => matrix_traceless_set generic :: conjg => matrix_return_conjg !:.:.:! m2 = m1 % conjg () returns the complex conjugate of matrix m1 generic :: transpose => matrix_return_transpose !:.:.:! m2 = m1 % transpose() returns the transpose of matrix m1 generic :: hermitian => matrix_return_hermitian !:.:.:! m2 = m1 % hermitian() returns the hermitian of matrix m1 generic :: dagger => matrix_return_hermitian !:.:.:! m2 = m1 % hermitian() returns the hermitian of matrix m1 generic :: re => matrix_return_real_dmatrix !:.:.: generic :: im => matrix_return_imag_dmatrix !:.:.: end type Matrix interface Matrix ! MAT=Matrix(m,n,is,js,mtype='ZG',name='MMAT') module procedure :: matrix_construct_zero , matrix_construct_real, matrix_construct_complex ! MAT=Matrix(n,mtype='ZH'), MAT=Matrix(PI,n), MAT=Matrix(IMU,n) module procedure :: matrix_construct_random, matrix_construct_array2 ! MAT=Matrix('uniform',n,mtype='ZH'), MAT=Matrix('gaussian',n,sigma=PI), MAT=Matrix(C(:,:)) end interface Matrix

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

MATC % v = LMATMUL( MATA % v , MATB % v)

It would not surprise me if the compiler copies MATA%v and MATB%v into temporary arrays so the data is in contiguous memory for the LMATMUL. This could be the cause of the slowdown with a derived type. You would need to look at the assembly to see what instructions are generated.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi Andrew, thank you for your reply. The only thing that baffles me is, shouldn't this be the same for the allocatable arrays?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Konstantinos A. wrote:.. The code was written for educational reasons, but I got curious about the performance penalty of OO features...

@Konstantinos A.,

For educational purposes, I contend no Fortran code for linear algebra is complete without a consideration for parameterized derived types (PDTs) introduced in Fortran standard since the 2003 revision. Besides, the code you show is needlessly complex. I suggest a simpler setup which first focuses on the intrinsic capabilities of the language itself as guided by its ISO IEC standard. Then it evaluate its performance. An example is shown below..

Floating-point precision:

module mykinds_m integer, parameter :: DP = selected_real_kind( p=12 ) end module

Wrap the matrix multiplication workhorse procedure of your choice that optimizes its handling such as conveying contiguous data, avoiding stack overflow, etc. Here I show with the intrinsic MATMUL:

module mult_m use mykinds_m, only : DP contains pure subroutine mymult( a, b, c ) real(DP), contiguous, intent(in) :: a(:,:) real(DP), contiguous, intent(in) :: b(:,:) real(DP), contiguous, intent(inout) :: c(:,:) ! Checks elided c = matmul( a, b ) end subroutine end module

A 'container class' for the types of your matrix and the methods that operate on the matrix elements, here I show with a parameterized derived type (PDT) for a square matrix with matrix multiplication:

module mat_m use mykinds_m, only : DP use mult_m, only : mymult type :: square_matrix_t(n) integer, len :: n = 2 ! 2x2 matrix by default real(DP) :: x(n,n) contains private procedure, pass(a) :: mult_mat generic, public :: operator(*) => mult_mat end type contains elemental function mult_mat( a, b ) result( r ) class(square_matrix_t(n=*)), intent(in) :: a type(square_matrix_t(n=a%n)), intent(in) :: b type(square_matrix_t(n=a%n)) :: r call mymult( a%x, b%x, r%x ) end function elemental subroutine mult_mat_t( a, b, c ) type(square_matrix_t(n=*)), intent(in) :: a type(square_matrix_t(n=a%n)), intent(in) :: b type(square_matrix_t(n=a%n)), intent(inout) :: c call mymult( a%x, b%x, c%x ) end subroutine end module

And a unit test with simple tests for CPU performance:

program p use mykinds_m, only : DP integer, parameter :: n = 500 integer, parameter :: ntrial = 12 integer :: i real(DP) :: cput(ntrial), t1, t2 call random_seed() blk1: block use mult_m, only : mymult real(DP), allocatable :: a(:,:), b(:,:), c(:,:) allocate( a(n,n), b(n,n), c(n,n) ) print *, "Block 1: Arrays with wrapper to intrinsic MATMUL" do i = 1, ntrial call random_number( a ) call random_number( b ) call my_cpu_time( t1 ) call mymult( a, b, c) !print *, "a(1,1), b(1,1), c(1,1): ", a(1,1), b(1,1), c(1,1) call my_cpu_time( t2 ) cput(i) = t2-t1 end do ! Calc average time excluding best and worst times t1 = sum(cput) - maxval(cput) - minval(cput) print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP) end block blk1 print * blk2: block use mat_m, only : square_matrix_t type(square_matrix_t(n=:)), allocatable :: a, b, c print *, "Block 2: PDT for a square matrix with a type-bound operator for multiplication" allocate( square_matrix_t(n=n) :: a, b, c ) do i = 1, ntrial call random_number( a%x ) call random_number( b%x ) call my_cpu_time( t1 ) c = a*b call my_cpu_time( t2 ) !print *, "a%x(1,1), b%x(1,1), c%x(1,1): ", a%x(1,1), b%x(1,1), c%x(1,1) cput(i) = t2-t1 end do ! Calc average time excluding best and worst times t1 = sum(cput) - maxval(cput) - minval(cput) print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP) end block blk2 print * blk3: block use mat_m, only : square_matrix_t, mult_mat_t type(square_matrix_t(n=:)), allocatable :: a, b, c print *, "Block 3: PDT for a square matrix with type-specific MATMUL procedure" allocate( square_matrix_t(n=n) :: a, b, c ) do i = 1, ntrial call random_number( a%x ) call random_number( b%x ) call my_cpu_time( t1 ) call mult_mat_t( a, b, c ) call my_cpu_time( t2 ) !print *, "a%x(1,1), b%x(1,1), c%x(1,1): ", a%x(1,1), b%x(1,1), c%x(1,1) cput(i) = t2-t1 end do ! Calc average time excluding best and worst times t1 = sum(cput) - maxval(cput) - minval(cput) print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP) end block blk3 stop contains subroutine my_cpu_time( time ) use, intrinsic :: iso_fortran_env, only : I8 => int64 !.. Argument list real(DP), intent(inout) :: time !.. Local variables integer(I8) :: tick integer(I8) :: rate call system_clock (tick, rate) time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) ) return end subroutine my_cpu_time end program p

You can try this on your hardware environment(s) and report results. With Windows and an i7 laptop, the results are:

C:\Temp>ifort /O3 /standard-semantics /heap-arrays- /stand:f18 /warn:all p.f90 Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.1.216 Build 20200306 Copyright (C) 1985-2020 Intel Corporation. All rights reserved. Microsoft (R) Incremental Linker Version 14.25.28612.0 Copyright (C) Microsoft Corporation. All rights reserved. -out:p.exe -subsystem:console p.obj C:\Temp>p.exe Block 1: Arrays with wrapper to intrinsic MATMUL Average CPU Time: 2.209999561309815E-002 Block 2: PDT for a square matrix with a type-bound operator for multiplication Average CPU Time: 2.599999904632568E-002 Block 3: PDT for a square matrix with type-specific MATMUL procedure Average CPU Time: 2.079999446868896E-002 C:\Temp>p.exe Block 1: Arrays with wrapper to intrinsic MATMUL Average CPU Time: 2.169995307922363E-002 Block 2: PDT for a square matrix with a type-bound operator for multiplication Average CPU Time: 2.600002288818359E-002 Block 3: PDT for a square matrix with type-specific MATMUL procedure Average CPU Time: 2.220001220703125E-002 C:\Temp>

So you will notice the use of PDT with a type-specific procedure toward the operation of interest is similar to the intrinsic data type, a floating-point array of an arbitrary precision. You will notice the use of the aesthetic 'blackboard abstraction' in the form of 'c = a*b' incurs some performance penalty. This penalty is small in absolute terms but can be appreciable on a relative basis. My observation is the root cause has to do with the 'c = ..' copy operation involving the derived type as well as some overhead with the use of the type-bound procedure for the generic binding toward operator(*).

When such a penalty is unacceptable, don't use the fancy stuff which allows the so-called ADT calculus with user-defined types such as 'c = a*b'. Use simple subroutines with intrinsic data types such as REAL instead.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Fortran Fan, thank you for taking the time to give me such a detailed response. I will try it in the next couple of days and let you know. One technical matter is that you use elemental/pure procedures and I want to use zgemm, which is not pure. Therefore I might have to try the difference of not using such procedures.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I would like to see FortranFan's example run with more trials and some way of making sure that elements of the result c are used outside the timing. The times reported are a couple of hundredths of a second, which is at most one or two OS clock ticks, and the compiler could have eliminated the work since it didn't see it being used. It could be something as simple as adding an element of c to a variable that is then printed outside the loop.

I'm also skeptical of timing just the call to the various multiply operations. It would be better if the random_number calls were done outside the loop to prepopulate the a and b arrays, then the loop which accumulates results for c is timed. Even then I'd check the assembly to make sure that I understand what is being timed.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Even then I'd check the assembly to make sure that I understand what is being timed.

Even when you are timing the same operations -- 10000 times per day -- the results will move all over the place, we track the loop time for our main routines and load the results to a SQL server, the results show the type of distribution one would expect. The computer is a machine but it does vary.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Steve Lionel (Ret.) (Blackbelt) wrote:I would like to see FortranFan's example run with more trials and some way of making sure that elements of the result c are used outside the timing. The times reported are a couple of hundredths of a second, which is at most one or two OS clock ticks, and the compiler could have eliminated the work since it didn't see it being used. It could be something as simple as adding an element of c to a variable that is then printed outside the loop.

I'm also skeptical of timing just the call to the various multiply operations. It would be better if the random_number calls were done outside the loop to prepopulate the a and b arrays, then the loop which accumulates results for c is timed. Even then I'd check the assembly to make sure that I understand what is being timed.

It'll be interesting to see what OP (and any other reader) find(s) and report(s) here. I had instinctively and immediately thought about and factored all the aspects mentioned in the above post and I deliberately left the example the way I did to see how OP (and any other reader) build on it.

Theoretically the FLOP instructions with matrix multiplication are O(n**3) and as such, the operation timed here is indeed a good case for the topic of interest to OP, "Performance penalty for type matrix versus arrays."

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I tried to run FortranFan's suggestion but I failed. I am getting (random) segmentation faults. Once I was able to run it and I got the (admittedly limited) results:

#----------------------------------------------------------------------------------------------------- #my: N MatmulMyMatrix MatmulArray2 FixedArray2 MatmulMatrix #----------------------------------------------------------------------------------------------------- my 30 0.13910E-04 0.65755E-05 0.65784E-05 0.21848E-04 my 60 0.56022E-04 0.31077E-04 0.30943E-04 0.88976E-04 my 120 0.29360E-03 0.19568E-03 0.19771E-03 0.41697E-03 my 190 0.10149E-02 0.74705E-03 0.76242E-03 0.13023E-02 my 250 0.22897E-02 0.17427E-02 0.17313E-02 0.27372E-02

The last column is FortranFan's suggestion, which seems to be even slower than the naive type(MyMatrix). In order to conduct more tests, I need to overcome the SIGSEGV fault. The FixedArray2 concerns statically allocated arrays and a direct call to zgemm.

When I am examining the core file, the problem seems to be the deallocation of the types:

#14 0x00000000004419bd in for_dealloc_allocatable () #15 0x00000000004174f0 in do_deallocate_all () #16 0x000000000041723d in do_deallocate_all () #17 0x000000000040c1ed in matrix_benchmark::test_matmulmatrix (n=30, delt=1, nops=44314, dt_test=1) at benchmarkMyMatrix.f90:188

I don't know if I am doing something trivially wrong, or if this a bug of the ifort version that I am using.

I am compiling with

ifort -g -heap-arrays -standard-semantics -stand:f18 -nowarn -mkl benchmarkMyMatrixUutilities.f90 benchmarkMyMatrix.f90

(and various other trivial changes). The details of the error stack, as well as my environment and source code, can be found in the attached README file.

NOTE: Please don't compare the above mentioned times with the ones I reported on my post originally. The times were obtained on my laptop and not on the machine used originally. I was not able to get such a report from the other machines due to the SEGV.

Concerning letting the overhead of random_number add to the wall time:

a. the calls are only 5, independent of N, so the contribution becomes negligible for large N

b. the times should be used for comparison between the different methods. The contribution of random_number is the same for every method (Statistically speaking).

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Konstantinos A. wrote:I tried to run FortranFan's suggestion but I failed. ..

Please first can you take the example as-is and run on your system and report results here?

Shown below is a somewhat modified case, can you please try that? Please make no code changes, you'll note N = 500. It'll be great if you can first try this one case.

Please use the equivalent of Intel Fortran compiler options toward O3 optimization and standard-semantics, so the command might be ifort -O3 -standard-semantics <filename>

module mykinds_m integer, parameter :: DP = selected_real_kind( p=12 ) end module module mult_m use mykinds_m, only : DP contains pure subroutine mymult( a, b, c ) real(DP), contiguous, intent(in) :: a(:,:) real(DP), contiguous, intent(in) :: b(:,:) real(DP), contiguous, intent(inout) :: c(:,:) ! Checks elided c = matmul( a, b ) end subroutine end module module mat_m use mykinds_m, only : DP use mult_m, only : mymult type :: square_matrix_t(n) integer, len :: n = 2 ! 2x2 matrix by default real(DP) :: x(n,n) contains private procedure, pass(a) :: mult_mat generic, public :: operator(*) => mult_mat end type contains elemental function mult_mat( a, b ) result( r ) class(square_matrix_t(n=*)), intent(in) :: a type(square_matrix_t(n=a%n)), intent(in) :: b type(square_matrix_t(n=a%n)) :: r call mymult( a%x, b%x, r%x ) end function elemental subroutine mult_mat_t( a, b, c ) type(square_matrix_t(n=*)), intent(in) :: a type(square_matrix_t(n=a%n)), intent(in) :: b type(square_matrix_t(n=a%n)), intent(inout) :: c call mymult( a%x, b%x, c%x ) end subroutine end module program p use mykinds_m, only : DP integer, parameter :: n = 500 integer, parameter :: ntrial = 12 integer :: i real(DP) :: cput(ntrial), t1, t2 call random_seed() blk1: block use mult_m, only : mymult real(DP), allocatable :: a(:,:), b(:,:), c(:,:) allocate( a(n,n), b(n,n), c(n,n) ) print *, "Block 1: Arrays with wrapper to intrinsic MATMUL" do i = 1, ntrial if ( i == ntrial ) then call imat( a ) call imat( b ) else call random_number( a ) call random_number( b ) end if call my_cpu_time( t1 ) call mymult( a, b, c) call my_cpu_time( t2 ) cput(i) = t2-t1 if ( i == 1 ) then print *, "First iteration: " print *, "a(1,1), b(1,1), c(1,1): ", a(1,1), b(1,1), c(1,1) t2 = sum( c ) print *, "sum(c) = ", t2 end if end do ! Calc average time excluding best and worst times t1 = sum(cput) - maxval(cput) - minval(cput) print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP), new_line("") print *, "Last iteration: check using identity matrix" print *, "a(n,n), b(n,n), c(n,n): ", a(n,n), b(n,n), c(n,n) end block blk1 print * blk2: block use mat_m, only : square_matrix_t type(square_matrix_t(n=:)), allocatable :: a, b, c print *, "Block 2: PDT for a square matrix with a type-bound operator for multiplication" allocate( square_matrix_t(n=n) :: a, b, c ) do i = 1, ntrial if ( i == ntrial ) then call imat( a%x ) call imat( b%x ) else call random_number( a%x ) call random_number( b%x ) end if call my_cpu_time( t1 ) c = a*b call my_cpu_time( t2 ) cput(i) = t2-t1 if ( i == 1 ) then print *, "First iteration: " print *, "a%x(1,1), b%x(1,1), c%x(1,1): ", a%x(1,1), b%x(1,1), c%x(1,1) t2 = sum( c%x ) print *, "sum(c%x) = ", t2 end if end do ! Calc average time excluding best and worst times t1 = sum(cput) - maxval(cput) - minval(cput) print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP), new_line("") print *, "Last iteration: check using identity matrix" print *, "a%x(n,n), b%x(n,n), c%x(n,n): ", a%x(n,n), b%x(n,n), c%x(n,n) end block blk2 print * blk3: block use mat_m, only : square_matrix_t, mult_mat_t type(square_matrix_t(n=:)), allocatable :: a, b, c print *, "Block 3: PDT for a square matrix with type-specific MATMUL procedure" allocate( square_matrix_t(n=n) :: a, b, c ) do i = 1, ntrial if ( i == ntrial ) then call imat( a%x ) call imat( b%x ) else call random_number( a%x ) call random_number( b%x ) end if call my_cpu_time( t1 ) call mult_mat_t( a, b, c ) call my_cpu_time( t2 ) cput(i) = t2-t1 if ( i == 1 ) then print *, "First iteration: " print *, "a%x(1,1), b%x(1,1), c%x(1,1): ", a%x(1,1), b%x(1,1), c%x(1,1) t2 = sum( c%x ) print *, "sum(c%x) = ", t2 end if end do ! Calc average time excluding best and worst times t1 = sum(cput) - maxval(cput) - minval(cput) print *, "Average CPU Time: ", t1/real(ntrial-2,kind=DP), new_line("") print *, "Last iteration: check using identity matrix" print *, "a%x(n,n), b%x(n,n), c%x(n,n): ", a%x(n,n), b%x(n,n), c%x(n,n) end block blk3 stop contains subroutine my_cpu_time( time ) use, intrinsic :: iso_fortran_env, only : I8 => int64 !.. Argument list real(DP), intent(inout) :: time !.. Local variables integer(I8) :: tick integer(I8) :: rate call system_clock (tick, rate) time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) ) return end subroutine my_cpu_time subroutine imat( x ) real(DP), intent(inout) :: x(:,:) integer :: i x = 0.0_dp do i = 1, size(x,dim=1) x(i,i) = 1.0_dp end do end subroutine end program p

**Sample **output based on this code is as follows:

C:\Temp>ifort /standard-semantics /O3 /warn:all /stand:f18 p.f90 -o p.exe Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.1.216 Build 20200306 Copyright (C) 1985-2020 Intel Corporation. All rights reserved. Microsoft (R) Incremental Linker Version 14.25.28612.0 Copyright (C) Microsoft Corporation. All rights reserved. -out:p.exe -subsystem:console p.obj C:\Temp>p.exe Block 1: Arrays with wrapper to intrinsic MATMUL First iteration: a(1,1), b(1,1), c(1,1): 0.427986685828709 0.256966050640733 120.717640411730 sum(c) = 31217750.4781149 Average CPU Time: 2.570002079010010E-002 Last iteration: check using identity matrix a(n,n), b(n,n), c(n,n): 1.00000000000000 1.00000000000000 1.00000000000000 Block 2: PDT for a square matrix with a type-bound operator for multiplication First iteration: a%x(1,1), b%x(1,1), c%x(1,1): 0.748272345682208 0.580942401839468 121.718967899213 sum(c%x) = 31198625.6667619 Average CPU Time: 2.649998664855957E-002 Last iteration: check using identity matrix a%x(n,n), b%x(n,n), c%x(n,n): 1.00000000000000 1.00000000000000 1.00000000000000 Block 3: PDT for a square matrix with type-specific MATMUL procedure First iteration: a%x(1,1), b%x(1,1), c%x(1,1): 0.549035960188161 0.920795003076816 127.775432983094 sum(c%x) = 31241430.1548635 Average CPU Time: 1.960003376007080E-002 Last iteration: check using identity matrix a%x(n,n), b%x(n,n), c%x(n,n): 1.00000000000000 1.00000000000000 1.00000000000000 C:\Temp>

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear FortranFan, your code runs seemlessly:

Block 1: Arrays with wrapper to intrinsic MATMUL First iteration: a(1,1), b(1,1), c(1,1): 0.414588091541075 0.573363950353085 131.731703981318 sum(c) = 31299857.5604336 Average CPU Time: 2.420210838317871E-002 Last iteration: check using identity matrix a(n,n), b(n,n), c(n,n): 1.00000000000000 1.00000000000000 1.00000000000000 Block 2: PDT for a square matrix with a type-bound operator for multiplication First iteration: a%x(1,1), b%x(1,1), c%x(1,1): 0.976950374450897 0.904038053398594 128.281468064703 sum(c%x) = 31216491.4142846 Average CPU Time: 2.745497226715088E-002 Last iteration: check using identity matrix a%x(n,n), b%x(n,n), c%x(n,n): 1.00000000000000 1.00000000000000 1.00000000000000 Block 3: PDT for a square matrix with type-specific MATMUL procedure First iteration: a%x(1,1), b%x(1,1), c%x(1,1): 0.168759306587530 0.847564522196995 126.620520759212 sum(c%x) = 31288743.1630736 Average CPU Time: 2.389459609985351E-002 Last iteration: check using identity matrix a%x(n,n), b%x(n,n), c%x(n,n): 1.00000000000000 1.00000000000000 1.00000000000000

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

FF>

elemental function mult_mat( a, b ) result( r ) class(square_matrix_t(n=*)), intent(in) :: a type(square_matrix_t(n=a%n)), intent(in) :: b type(square_matrix_t(n=a%n)) :: r ... c = a*b

As used by operator(*), a, b, and c are scalars of type square_matrix (not an array of type square_matrix).

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

jimdempseyatthecove (Blackbelt) wrote:FF>

elemental function mult_mat( a, b ) result( r ) class(square_matrix_t(n=*)), intent(in) :: a type(square_matrix_t(n=a%n)), intent(in) :: b type(square_matrix_t(n=a%n)) :: r ... c = a*bAs used by operator(*), a, b, and c are scalars of type square_matrix (not an array of type square_matrix).

Jim Dempsey

Jim, not sure I understand - note you can try the operation with arrays if you like, here's a unit test as a starting point:

program p use mykinds_m, only : DP real(DP), parameter :: EPS_DP = epsilon( 0.0_dp ) integer, parameter :: n = 3 integer :: i call random_seed() blk: block use mat_m, only : square_matrix_t type(square_matrix_t(n=:)), allocatable :: a(:), b(:), c(:) print *, "Elemental C = AB multiplication for arrays of PDT toward a square matrix" allocate( square_matrix_t(n=n) :: a(n), b(n), c(n) ) do i = 1, n call imat( a(i)%x ) call random_number( b(i)%x ) end do c = a*b !<-- Elemental multiplication of PDT arrays print *, "First element: " print *, "a(1)%x(1,1), b(1)%x(1,1), c(1)%x(1,1): ", a(1)%x(1,1), b(1)%x(1,1), c(1)%x(1,1) chk: do i = 1, n if ( any( abs(c(i)%x-b(i)%x) > EPS_DP ) ) then print *, "Error: C /= B following C = IB multiplication for b(", i, ")" exit chk end if end do chk end block blk print * stop contains subroutine imat( x ) real(DP), intent(inout) :: x(:,:) integer :: i x = 0.0_dp do i = 1, size(x,dim=1) x(i,i) = 1.0_dp end do end subroutine end program p

This example goes with the same "library" code:

module mykinds_m integer, parameter :: DP = selected_real_kind( p=12 ) end module module mult_m use mykinds_m, only : DP contains pure subroutine mymult( a, b, c ) real(DP), contiguous, intent(in) :: a(:,:) real(DP), contiguous, intent(in) :: b(:,:) real(DP), contiguous, intent(inout) :: c(:,:) ! Checks elided c = matmul( a, b ) end subroutine end module module mat_m use mykinds_m, only : DP use mult_m, only : mymult type :: square_matrix_t(n) integer, len :: n = 2 ! 2x2 matrix by default real(DP) :: x(n,n) contains private procedure, pass(a) :: mult_mat generic, public :: operator(*) => mult_mat end type contains elemental function mult_mat( a, b ) result( r ) class(square_matrix_t(n=*)), intent(in) :: a type(square_matrix_t(n=a%n)), intent(in) :: b type(square_matrix_t(n=a%n)) :: r call mymult( a%x, b%x, r%x ) end function elemental subroutine mult_mat_t( a, b, c ) type(square_matrix_t(n=*)), intent(in) :: a type(square_matrix_t(n=a%n)), intent(in) :: b type(square_matrix_t(n=a%n)), intent(inout) :: c call mymult( a%x, b%x, c%x ) end subroutine end module

Here's sample output:

C:\Temp>ifort /standard-semantics /warn:all /stand:f18 p.f90 -o p.exe Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.1.216 Build 20200306 Copyright (C) 1985-2020 Intel Corporation. All rights reserved. Microsoft (R) Incremental Linker Version 14.25.28612.0 Copyright (C) Microsoft Corporation. All rights reserved. -out:p.exe -subsystem:console p.obj C:\Temp>p.exe Elemental C = AB multiplication for arrays of PDT toward a square matrix First element: a(1)%x(1,1), b(1)%x(1,1), c(1)%x(1,1): 1.00000000000000 0.294981195625570 0.294981195625570 C:\Temp>

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Konstantinos A. wrote:Dear FortranFan, your code runs seemlessly: ..

Ok, good. Now you can try a few things in this second example if you like:

- Try a few larger sizes of N up to a reasonable size depending on your system configuration e.g., whether you've 64 GB or 32 GB environment vs 16 or 8 GB. As you know, if N becomes too large, you'll be dealing with virtual memory management and its impact on performance which can have a lot more variance from run to run.
- Different values of ntrial and whether they make a difference, of course as checked over multiple runs at same values of n and ntrial,
- Swapping intrinsic MATMUL with an algorithm of your choice; you may need to remove ELEMENTAL on 'mult_mat*' or apply "IMPURE",

You can then compare the times of the 3 approaches: with arrays, with derived types and a subroutine, and derived types with defined operations.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

FF>> Your post #16 does show use of an array of type square_matrix_t, and thus an appropriate use of elemental attribute of the function.

The original post showed no need for an array of type square_matrix_t, but he did state a preference for using zgemm. In which case use of both ELEMENTAL and IMPURE might be desired.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear FortranFan,

I have had some problems with your code:

1. When compiling with

ifort -g -heap-arrays -standard-semantics -stand:f18 -nowarn

with n,ntrial = 500, 10000

It ran for 103 minutes (!!) I got an error that produced a core file (see attached benchmarkFortranFan_v01.out for errors)

2. When compiling ifort -O3 -heap-arrays -standard-semantics -xAVX -parallel

It ran for about 2 minutes and terminated with signal SIGSEGV (see attached benchmarkFortranFan_v02.out for errors)

3. When compiled with

ifort -O3 -heap-arrays -standard-semantics benchmarkFortranFan.f90 -o benchff

it finished normally after about 6 minutes (output in benchmarkFortranFan_v03.out)

4. I repeated 3. using ntrial = 100000 and it finished normally 109 minutes

(output benchmarkFortranFan_v04.out)

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Konstantinos A. wrote:.. I have had some problems with your code:

1. When compiling with

ifort -g -heap-arrays -standard-semantics -stand:f18 -nowarn

with n,ntrial = 500, 10000

It ran for 103 minutes (!!) I got an error that produced a core file (see attached benchmarkFortranFan_v01.out for errors)2. When compiling ifort -O3 -heap-arrays -standard-semantics -xAVX -parallel

It ran for about 2 minutes and terminated with signal SIGSEGV (see attached benchmarkFortranFan_v02.out for errors) ..

1. When compiling with -g compiler option to generate debug information and which disables optimization by default, setting 'n' and 'ntrial' to such large numbers makes little sense. If your purpose here is to step through the code in debug mode, why not try smaller n and ntrial? Even with n and ntrial in single digits will give you a good idea of the code while allowing you to complete the instructions in a reasonable timeframe. Note with -g option, the timing results are mostly meaningless.

2. Not sure what -parallel really does for you in this case, you need to consult Intel documentation and perhaps their online support if your terms allow. Same with -xAVX. You may want to look into** -xHost option instead**, or better yet consult with Intel re: this option and what it does with MATMUL intrinsic or Intel MKL ZGEMM, etc.

Thanks for your followup, glad you got 3 and 4 working. However I do think a reasonably small value for ntrial, say in the 12 to 24 range, is sufficient to get a good estimate of the median performance demand with a given compute approach. I don't agree with Quote #9. Anyways, a value of 10,000 you're using for ntrial looks like an overkill.

Your results with 3 and 4 suggest the same as before, a derived type whose components are arrays can be coded up to yield about the same performance as the most ordinary of the 'containers' in Fortran i.e., the arrays themselves.

The key point is to be careful with data fetches on the components of the derived type, as shown in the example here. The advantage with the derived type, as you know, is the consumer of said type is working with a higher level abstraction that can be 'far closer' than with ordinary arrays to 'domain knowledge' and expertise, in this case linear algebra. This is in addition to other advantages in terms of data encapsulation, information hiding, improved maintenance, readability and ease-of-use of working with the 'library' code.

But now, if one takes this further and brings the abstractions even closer to the ideal of 'For(mula) Tran(slation)' via the use of defined operations that allows writing compact 'equations' such as 'C = A*B', there is a bit of a price to be paid. But this is assuming the library is designed well; note poorly constructed code involving defined operations can really impact performance negatively Library developers and library consumers have to decide on this performance aspect relative to their needs.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Some updates:

1. With Fortran 19.1.1.217, I got rid of the segmentation faults for the parameterized derived types I reported above (ifort 19.0.5.281, -O3 -heap-arrays -xHost -parallel -mkl=parallel). The -g option is problematic as before.

2. I made a comparison of parameterized derived types (FortranFan's suggestion) versus ordinary types, allocatable arrays and statically allocated arrays. When using matrix multiplication via a subroutine, there is no noticeable time difference for N>=30.

3. If one uses operator overloading, the simple type is faster than the parameterized type for, more or less, N<=1000.

4. For ordinary allocatable arrays, using a function or subroutine for zgemm is the same, as is for the statically allocated arrays.

5. The intrinsic MATMUL is slower for matrix-matrix multiplication than using zgemm for N < 300. The overhead max ranges from 50% to 80% depending on processor and ifort version (see attached array_benchark.README)

6. Overloading multiplication for allocatable arrays (operator(.mm.), such that m3 = m1 .mm. m2) for arrays puts a less than 10% overhead for N<=500 (desktop) and N<= 2000 (supercomputer)

Maximum overhead is 20% for desktop (N=60) and 40% for supercomputer (N=120) (see attached array_benchark.README)

7. Expected scaling (~ N**3 for matrix-matrix ~N**2 for matrix-vector) sets in for N>=N0, where N0=120-750 for matrix-matrix and N0=1000-2000 for matrix-vector depending on compiler version and processor (see attached array_benchark.README)

The following table was produced by using the attached program: (see also attached file table.txt)

---------------------------------------------------------------------------------------------------------------------------------------- N (1) (2) (3) (4) (5) (6) (7) ---------------------------------------------------------------------------------------------------------------------------------------- 30 0.61541E-05 0.36631E-05 0.37419E-05 0.16168E-04 0.36311E-05 0.36501E-05 0.36273E-05 60 0.22306E-04 0.13665E-04 0.13636E-04 0.59970E-04 0.13441E-04 0.13573E-04 0.13446E-04 120 0.10719E-03 0.77839E-04 0.77894E-04 0.26082E-03 0.78003E-04 0.78610E-04 0.77942E-04 190 0.36490E-03 0.28585E-03 0.30193E-03 0.74019E-03 0.28684E-03 0.28714E-03 0.28760E-03 250 0.79697E-03 0.65101E-03 0.64590E-03 0.14293E-02 0.64645E-03 0.64529E-03 0.64487E-03 500 0.63319E-02 0.53301E-02 0.53227E-02 0.83781E-02 0.52751E-02 0.53102E-02 0.53149E-02 1000 0.45123E-01 0.41333E-01 0.41729E-01 0.52862E-01 0.40963E-01 0.40964E-01 0.40444E-01 2000 0.36451 0.33073 0.33932 0.37915 0.33874 0.33724 0.33281 4000 2.5571 2.4191 2.4375 2.6045 2.4057 2.4154 2.4033 Compiled with: ifort -O3 -heap-arrays -xAVX -parallel benchmarkMyMatrixUutilities.f90 benchmarkMyMatrix.f90 -mkl=parallel (1) type MyMatrix: m3 = m1 * m2 (2) allocatable array: m3 = lmatmul(m1,m2) (3) static arrays: direct call to zgemm (4) parameterized type: m3 = m1 * m2 (5) parameterized type: all smatmul(m1,m2,m3) (6) type MyMatrix: call smatmul(m1,m2,m3) --------------------------------------------------------------------------------------------------------- (1) Matrix -> (*) -> matrix_matmul (f) -> sub_array_matmul_array (s) -> zgemm (2) array -> lmatmul (f) -> array_matmul_array (f) -> zgemm (3) static array -> zgemm (4) MyMatrix -> (*) -> mymatrix_mult_mymatrix (f) -> lmatmul (f) -> array_matmul_array (f) -> zgemm (5) Matrix -> smatmul (s) -> sub_matrix_matmul (s) -> sub_array_matmul_array (s) -> zgemm (6) MyMatrix -> smatmul (s) -> mymatrix_mult_mymatrix_sub (s) -> sub_array_matmul_array (s) -> zgemm (7) array -> smatmul (s) -> sub_array_matmul_array (s) -> zgemm

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page