I am writing some Fortran code to compute the dot product of two matrices, which follows the same idea as the dot product of two vectors. First, it computes the element-wise product of the two matrices. Second, it sums all the elements of the matrix produced in the first step. I thought of two ways to do this.
In the first approach, I would convert the two matrices into vectors and then call the vector dot routine in BLAS (MKL) directly. However, for reasons specific to my problem, I prefer working with matrices rather than vectors. In the second, preferred approach, I directly compute the element-by-element product of the two matrices and then add up all the elements of the resulting product matrix using the SUM intrinsic.
However, I doubt that either of these two solutions is the most efficient one, since the matrices are extremely large.
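To make the two approaches concrete, here is a minimal sketch on a tiny 2x3 example. It is only an illustration: the program name and data are made up, and the intrinsic DOT_PRODUCT stands in for the BLAS vector dot routine so that the example needs no library.

```fortran
program two_ways
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: a(2,3), b(2,3)
  real(dp) :: s_vector, s_matrix

  ! Illustrative data: a holds 1..6 in column-major order, b is all 2.
  a = reshape([1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp, 5.0_dp, 6.0_dp], [2, 3])
  b = 2.0_dp

  ! Way 1: flatten both matrices to vectors, then take a vector dot product.
  ! (reshape makes a copy here; a BLAS call could avoid that.)
  s_vector = dot_product(reshape(a, [size(a)]), reshape(b, [size(b)]))

  ! Way 2: element-wise product of the matrices, then sum all elements.
  s_matrix = sum(a*b)

  ! Mathematically both equal 2*(1+2+...+6) = 42.
  print *, s_vector, s_matrix
end program two_ways
```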
Could you please provide some more information, such as:
How are the matrices stored, contiguously or not? And what is their size?
Your OS, CPU, and compiler (Intel Fortran compiler?).
Do you use threaded MKL (mkl_intel_thread.x) or sequential MKL (mkl_sequential)?
In general, a contiguous matrix in Fortran is laid out in memory the same way as a vector, so you can use the MKL vector dot routine on it.
The MKL dot routine is also threaded, so it may run in parallel on a multi-core machine.
Threaded BLAS Level1 and Level2 Routines
In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.
The following routines are threaded with OpenMP* for Intel® Core™2 Duo and Intel® Core™ i7 processors:
• Level1 BLAS:
?axpy, ?copy, ?swap, ddot/sdot, cdotc, drot/srot
• Level2 BLAS:
?gemv, ?trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv
Regarding "directly calculate the element-by-element product of the two matrices, and then sum up all the elements of the resulting product matrix using the sum subroutine": do you mean loops for the element-by-element product, compiled with the Intel Fortran compiler, followed by the sum function?
The Intel Fortran compiler can optimize such loop code, whether it is written as a Fortran routine or in array notation (see https://software.intel.com/en-us/articles/explicit-vector-programming-in-fortran). Whatever your matrices look like, you can compare the two implementations and pick the one with better performance.
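One possible way to run such a comparison is sketched below. It times an explicit triple loop against the array expression sum(a*b) with SYSTEM_CLOCK. The array shape, the repetition count nrep, and all names are assumptions for illustration; real measurements should be taken with your own data, compiler flags, and enough repetitions to get stable numbers.

```fortran
program compare_matrix_dot
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n1 = 100, n2 = 100, n3 = 80
  integer, parameter :: nrep = 50          ! hypothetical repetition count
  real(dp), allocatable :: a(:,:,:), b(:,:,:)
  real(dp) :: s, total_loop, total_array
  integer :: i, j, k, r
  integer(int64) :: t0, t1, t2, rate

  allocate(a(n1,n2,n3), b(n1,n2,n3))
  call random_number(a)
  call random_number(b)

  ! Variant 1: explicit loops, innermost index first (column-major order).
  total_loop = 0.0_dp
  call system_clock(t0, rate)
  do r = 1, nrep
    s = 0.0_dp
    do k = 1, n3
      do j = 1, n2
        do i = 1, n1
          s = s + a(i,j,k)*b(i,j,k)
        end do
      end do
    end do
    total_loop = total_loop + s
  end do
  call system_clock(t1)

  ! Variant 2: array notation, element-wise product followed by SUM.
  total_array = 0.0_dp
  do r = 1, nrep
    total_array = total_array + sum(a*b)
  end do
  call system_clock(t2)

  ! The totals should agree up to floating-point rounding.
  print *, 'loop : ', total_loop,  real(t1 - t0, dp)/real(rate, dp), ' s'
  print *, 'array: ', total_array, real(t2 - t1, dp)/real(rate, dp), ' s'
end program compare_matrix_dot
```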
Intel MKL Support
Thanks for your suggestions, and very sorry for the late response.
Here is the supplementary information you requested.
Most of the matrices are stored in a non-contiguous way. The matrix size might be up to, for example, 100*100*80, such as REAL(8) :: A(100,100,80). So does that mean the optimization for the loop won't work?
At present I only use the sequential MKL. In the future I might have to switch to the parallel MKL if the efficiency turns out to be too low.
My program will run on Windows, and I built it with Intel® Parallel Studio XE Cluster Edition for students (https://software.intel.com/en-us/qualify-for-free-software/student).
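For illustration only, here is a small sketch of what contiguous and non-contiguous storage can look like for an array of this shape. The strided section a(1:n1:2,:,:) is an assumed example of non-contiguous data, not necessarily how the matrices in the real program are laid out. IS_CONTIGUOUS is a Fortran 2008 intrinsic, so a reasonably recent compiler is assumed.

```fortran
program strided_section_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n1 = 100, n2 = 100, n3 = 80
  real(dp), allocatable :: a(:,:,:), b(:,:,:)
  real(dp) :: s_section, s_loop
  integer :: i, j, k

  allocate(a(n1,n2,n3), b(n1,n2,n3))
  call random_number(a)
  call random_number(b)

  ! A whole allocated (or explicit-shape) array is contiguous;
  ! a section with stride 2 in the first dimension is not.
  print *, 'whole array contiguous:    ', is_contiguous(a)
  print *, 'strided section contiguous:', is_contiguous(a(1:n1:2,:,:))

  ! Array notation works directly on the strided sections.
  s_section = sum(a(1:n1:2,:,:) * b(1:n1:2,:,:))

  ! The equivalent explicit loop, stepping by 2 in the first index.
  s_loop = 0.0_dp
  do k = 1, n3
    do j = 1, n2
      do i = 1, n1, 2
        s_loop = s_loop + a(i,j,k)*b(i,j,k)
      end do
    end do
  end do

  ! Both results should agree up to floating-point rounding.
  print *, s_section, s_loop
end program strided_section_demo
```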
Many thanks in advance.
Thank you for the information. As I understand it, you want to evaluate sum(A1*B1) for a number of array pairs declared as REAL(8) :: A1(100,100,80) and REAL(8) :: B1(100,100,80), right? In that case I suggest trying an MKL function (there are examples in the MKL install directory), for example,
ddot(100*100*80, A1(:,:,:), 1, B1(:,:,:), 1)
and if these operations are batched, you can consider combining these matrices to do
or batched dgemm.
The optimization for the loop can still work, and both sequential and parallel MKL will work as well. If you run on a multi-core CPU, the parallel MKL may be more efficient.
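As a rough sketch of the ddot call suggested above, the following program passes two contiguous rank-3 arrays to the Fortran 77 BLAS interface and compares the result with sum(a1*b1). It assumes the program is linked against a BLAS library such as MKL with the default 32-bit integer (LP64) interface. The program name and the random test data are made up for illustration.

```fortran
program ddot_matrix_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n1 = 100, n2 = 100, n3 = 80
  real(dp), allocatable :: a1(:,:,:), b1(:,:,:)
  real(dp), external :: ddot     ! BLAS level 1 routine, F77-style interface
  real(dp) :: s_blas, s_intrinsic

  allocate(a1(n1,n2,n3), b1(n1,n2,n3))
  call random_number(a1)
  call random_number(b1)

  ! The arrays are contiguous, so ddot can treat each one as a single
  ! vector of n1*n2*n3 elements with unit stride.
  s_blas = ddot(size(a1), a1, 1, b1, 1)

  ! Reference value from array notation.
  s_intrinsic = sum(a1*b1)

  ! The two values should agree up to floating-point rounding; the
  ! summation order may differ between the two methods.
  print *, 'ddot:     ', s_blas
  print *, 'sum(a*b): ', s_intrinsic
end program ddot_matrix_demo
```

With the threaded MKL the same call may use several cores, while the sequential MKL runs it on one.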