matmul speed test in 4 situations

alsoran · ‎01-08-2012

Test result:

matmul(fixed,fixed) costs 0.4524 sec

matmul(alloc,alloc) costs 0.5460 sec

matmul(mytype.fixed,mytype.fixed) costs 0.4680 sec

matmul(mytype.alloc,mytype.alloc) costs 26.6762 sec

Can you tell me why matmul on the allocatable matrices of the custom data type (derived type) takes such a long time?

[fortran]MODULE MOD
implicit none
    integer,parameter:: N=1000
    type:: myt
        real(8):: fixed_mat1(N,N)
        real(8):: fixed_mat2(N,N)
        real(8),allocatable:: alloc_mat1(:,:)
        real(8),allocatable:: alloc_mat2(:,:)
    endtype myt
    real(8):: fixed_mat1(N,N)
    real(8):: fixed_mat2(N,N)
    real(8),allocatable:: alloc_mat1(:,:)
    real(8),allocatable:: alloc_mat2(:,:)
    real(8):: resu_mat(N,N)
ENDMODULE MOD

!!##########main############################
PROGRAM MAIN
USE MOD
implicit none
    type(myt):: mytype
    real(4):: t1,t2
!!#####matrix elements initialize######################    
    call random_number(fixed_mat1)
    call random_number(fixed_mat2)
    mytype.fixed_mat1 = fixed_mat1
    mytype.fixed_mat2 = fixed_mat2
    allocate(alloc_mat1,source=fixed_mat1)
    allocate(alloc_mat2,source=fixed_mat2)
    allocate(mytype.alloc_mat1,source=fixed_mat1)
    allocate(mytype.alloc_mat2,source=fixed_mat2)
!!#####measure the time and compare them####################
    call cpu_time(t1)
        resu_mat = matmul(fixed_mat1,fixed_mat2)
    call cpu_time(t2)
    write(*,'(A,F8.4,A)') 'matmul(fixed,fixed) costs',t2-t1,' sec'
!--------------------------------------------------------------------   
    call cpu_time(t1)
        resu_mat = matmul(alloc_mat1,alloc_mat2)
    call cpu_time(t2)
    write(*,'(A,F8.4,A)') 'matmul(alloc,alloc) costs',t2-t1,' sec'
!--------------------------------------------------------------------    
    call cpu_time(t1)
        resu_mat = matmul(mytype.fixed_mat1,mytype.fixed_mat2)
    call cpu_time(t2)
    write(*,'(A,F8.4,A)') 'matmul(mytype.fixed,mytype.fixed) costs',t2-t1,' sec'
!----------------------------------------------------------------------    
    call cpu_time(t1)
        resu_mat = matmul(mytype.alloc_mat1,mytype.alloc_mat2)
    call cpu_time(t2)
    write(*,'(A,F8.4,A)') 'matmul(mytype.alloc,mytype.alloc) costs',t2-t1,' sec'
    
ENDPROGRAM[/fortran]

TimP · ‎01-09-2012

On my early Core i7 desktop model, using current ifort, I get no consistent increase in time for your last case, once I get the stack limit adjusted. I have 6GB RAM with an unsupported combination of DIMM types, presumably effectively DDR3-1066. If you have only a small amount of RAM, maybe you should deallocate the arrays when you are done with them.
Do you have a reason for using non-standard syntax? It runs faster for me when the VAX/VMS structure notation (the "." component selector, as in mytype.alloc_mat1) is changed to standard syntax (mytype%alloc_mat1), except that the last case speeds up only when running on an increased stack allocation (not with /heap-arrays).
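
As an illustration of both suggestions (standard component syntax and deallocating when done), here is a minimal sketch. The module name standard_syntax_demo and the subroutine time_component_matmul are made up for this example and are not part of the original test program; whether this changes the timing on a given compiler version would need to be measured.

```fortran
module standard_syntax_demo
    implicit none
    integer, parameter :: dp = kind(1.0d0)

    ! Same allocatable components as in the original type, fixed ones omitted.
    type :: myt
        real(dp), allocatable :: alloc_mat1(:,:)
        real(dp), allocatable :: alloc_mat2(:,:)
    end type myt

contains

    ! Hypothetical helper: times matmul on allocatable components selected
    ! with the standard "%" operator, then frees the components.
    subroutine time_component_matmul(n, resu_mat, seconds)
        integer,  intent(in)  :: n
        real(dp), intent(out) :: resu_mat(n, n)
        real,     intent(out) :: seconds
        type(myt) :: mytype
        real :: t1, t2

        allocate(mytype%alloc_mat1(n, n), mytype%alloc_mat2(n, n))
        call random_number(mytype%alloc_mat1)
        call random_number(mytype%alloc_mat2)

        call cpu_time(t1)
        resu_mat = matmul(mytype%alloc_mat1, mytype%alloc_mat2)
        call cpu_time(t2)
        seconds = t2 - t1

        ! Release the memory as soon as the matrices are no longer needed.
        deallocate(mytype%alloc_mat1, mytype%alloc_mat2)
    end subroutine time_component_matmul

end module standard_syntax_demo
```

From a main program this could be used as call time_component_matmul(N, resu_mat, t) with N=1000 and resu_mat declared as in the original module.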

IDZ_A_Intel · ‎01-09-2012

In addition to TimP's suggestion of deallocation, reverse the order in which you allocate and test the various combinations. This should eliminate virtual memory paging issues (assuming your type can be fully resident in RAM as opposed to in the page file).

Jim Dempsey

TimP · ‎01-09-2012

For what it's worth, I note that it runs faster yet with one of the options which replace matmul by MKL. For example,
gfortran -O3 -fexternal-blas -L/opt/xeon/composer_xe_2011_sp1.8.273/mkl/lib/intel64/ ar1.f90 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential
(using e.g. gfortran 4.7), but the gfortran 4.5 versions available for Windows are failing, besides not being compatible with MKL.
I thought there should be an equivalent ifort option.

Steven_L_Intel1 · ‎01-09-2012

There is such an option: /Qopt-matmul

alsoran · ‎01-12-2012

I then switched "Interprocedural Optimization" to "Multi-file" (/Qipo) or "Single-file" (/Qip). The last case ran faster than before, and there is no longer a big difference compared with the other 3 cases.

However, when I change the 4th case from a scalar derived-type variable to an array of the derived type, for instance:

mytype.alloc_mat1 ---> mytype(1).alloc_mat1, mytype.alloc_mat2 ---> mytype(1).alloc_mat2, ...

then it slows down again, and I have no idea why.
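
One thing that might be worth trying for the array-of-derived-type case is to hand the components to a small routine whose dummy arguments are plain arrays, so that matmul itself never sees a derived-type reference. This is only a sketch of a possible workaround, not something confirmed in this thread; the names plain_array_wrapper and multiply_plain are invented for the example.

```fortran
module plain_array_wrapper
    implicit none
    integer, parameter :: dp = kind(1.0d0)

    type :: myt
        real(dp), allocatable :: alloc_mat1(:,:)
        real(dp), allocatable :: alloc_mat2(:,:)
    end type myt

contains

    ! Hypothetical wrapper: the dummies are explicit-shape arrays, so inside
    ! this routine the compiler works with ordinary contiguous arrays rather
    ! than with components of a derived-type array element.
    subroutine multiply_plain(n, a, b, c)
        integer,  intent(in)  :: n
        real(dp), intent(in)  :: a(n, n), b(n, n)
        real(dp), intent(out) :: c(n, n)
        c = matmul(a, b)
    end subroutine multiply_plain

end module plain_array_wrapper
```

With type(myt):: mytype(1) declared in the main program, the 4th case would then read call multiply_plain(N, mytype(1)%alloc_mat1, mytype(1)%alloc_mat2, resu_mat). Timing it against the direct matmul call would show whether it makes any difference with your compiler and options.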

PS: all the tests were run with /O3 optimization.

William_Gray · ‎01-28-2012

you said that the CALCULATION TIME is different (for each case).

i was just wondering, is the "ANSWER" (stored in the matrix RESU_MAT) also different ?

the reason i ask this is :-

years ago, i was using "Visual Fortran" software ("Digital Visual Fortran", i think). it had the MATMUL intrinsic function. anyway, although i "think" all of my fortran code was correct, i would sometimes get incorrect results -- which seemed to be caused by the results calculated by MATMUL. but, at the time, i think the fortran software had a few bugs. anyway, ever since then, i tend to write my own matrix manipulation code (e.g. matrix multiplication) -- just to be safe.

of course, all of this happened long before Intel took over the reins (of "Visual Fortran"). so, i'm sure the (Intel Visual Fortran) version of MATMUL that you are using works correctly :)

TimP · ‎01-28-2012

Matmul results would change slightly (within the bounds of roundoff error) when you enable or disable /Qopt-matmul, possibly also when you change -O optimization level.
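
To check whether two results agree to within roundoff, one could compare them element by element against a hand-written reference product. The following is a minimal sketch; the module matmul_check and the routines loop_matmul and max_rel_diff are hypothetical helpers written for this illustration, and the tolerance you accept is your own choice.

```fortran
module matmul_check
    implicit none
    integer, parameter :: dp = kind(1.0d0)

contains

    ! Reference product with explicit loops (j-k-i order suits column-major
    ! storage). c must have shape (size(a,1), size(b,2)).
    subroutine loop_matmul(a, b, c)
        real(dp), intent(in)  :: a(:,:), b(:,:)
        real(dp), intent(out) :: c(:,:)
        integer :: i, j, k
        c = 0.0_dp
        do j = 1, size(b, 2)
            do k = 1, size(a, 2)
                do i = 1, size(a, 1)
                    c(i, j) = c(i, j) + a(i, k) * b(k, j)
                end do
            end do
        end do
    end subroutine loop_matmul

    ! Largest element-wise relative difference between two same-shape arrays.
    function max_rel_diff(x, y) result(d)
        real(dp), intent(in) :: x(:,:), y(:,:)
        real(dp) :: d, scale
        integer :: i, j
        d = 0.0_dp
        do j = 1, size(x, 2)
            do i = 1, size(x, 1)
                ! tiny() guards against division by zero when both are 0.
                scale = max(abs(x(i, j)), abs(y(i, j)), tiny(1.0_dp))
                d = max(d, abs(x(i, j) - y(i, j)) / scale)
            end do
        end do
    end function max_rel_diff

end module matmul_check
```

For example, after call loop_matmul(fixed_mat1, fixed_mat2, ref_mat), a value of max_rel_diff(resu_mat, ref_mat) not far above epsilon(1.0_dp) would be consistent with roundoff differences only, whereas a much larger value would point to a real discrepancy. Here ref_mat is an assumed extra N-by-N array, not one declared in the original program.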