Intel® oneAPI Math Kernel Library

Is there a more efficient way to do element-by-element multiplication of matrices?


Hi, everyone.

I am writing some Fortran code to compute the "dot product" of two matrices, by analogy with the dot product of two vectors. First, it computes the element-wise product of the two matrices. Second, it sums all the elements of the matrix returned by the first step. I have thought of two ways to do this.

In the first approach, I convert the two matrices to vectors and then call the vector dot routine in BLAS (MKL) directly. For some specific problems, though, I prefer matrix calculations to vector calculations. In the second, preferred approach, I directly compute the element-by-element product of the two matrices and then sum all the elements of the resulting product matrix using the "sum" function.
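For reference, the second approach can be sketched in plain Fortran as follows. This is only a minimal illustration: the array names, the 100 x 100 x 80 shape, and the random test data are assumptions, and no MKL routine is involved. It compares the array-syntax form with a single fused loop that never stores the intermediate product.

```fortran
program frobenius_dot
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n1 = 100, n2 = 100, n3 = 80   ! illustrative shape
  real(dp), allocatable :: a(:,:,:), b(:,:,:)
  real(dp) :: s_array, s_loop
  integer :: i, j, k

  allocate(a(n1,n2,n3), b(n1,n2,n3))
  call random_number(a)
  call random_number(b)

  ! Array syntax: element-wise product followed by SUM.
  ! Whether a temporary for a*b is created is up to the compiler.
  s_array = sum(a*b)

  ! Fused loop: multiply and accumulate in one pass.
  ! The first index is innermost because Fortran is column-major.
  s_loop = 0.0_dp
  do k = 1, n3
    do j = 1, n2
      do i = 1, n1
        s_loop = s_loop + a(i,j,k)*b(i,j,k)
      end do
    end do
  end do

  print *, 'sum(a*b)   =', s_array
  print *, 'fused loop =', s_loop

  ! The summation order may differ, so compare with a tolerance.
  if (abs(s_array - s_loop) > 1.0e-9_dp*abs(s_loop)) error stop 'results differ'
end program frobenius_dot
```

Both forms give the same value up to rounding; which one is faster depends on the compiler and optimization flags, so it is worth timing both.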

I doubt whether either of these two approaches is the most efficient, since the matrices are extremely large.

Any suggestions?

4 Replies


Could you please provide some more information:

How is the matrix stored, contiguously or not, and what is its size?

Which OS, CPU, and Intel Fortran compiler version are you using?

Do you use threaded MKL (mkl_intel_thread.x) or sequential MKL (mkl_sequential)?

In general, a contiguous matrix in Fortran is laid out in memory the same way as a vector, so you can use the MKL vector dot routine on it.

The MKL dot routines are also threaded, so they may run in parallel on a multi-core machine.
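As a rough sketch of that idea, a contiguous multi-dimensional array can be passed to ddot as if it were a vector of size(a) elements. The array names and the 50 x 40 x 30 shape are made up for illustration, and the program has to be linked against MKL or another BLAS library that provides ddot with the default (LP64) integer interface.

```fortran
program ddot_on_array
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), external :: ddot          ! BLAS level-1 routine, supplied by MKL/BLAS at link time
  real(dp), allocatable :: a(:,:,:), b(:,:,:)
  real(dp) :: s_blas, s_intrinsic

  allocate(a(50,40,30), b(50,40,30))  ! illustrative shape
  call random_number(a)
  call random_number(b)

  ! The whole allocated arrays are contiguous, so they can be treated
  ! as vectors with size(a) elements and unit stride.
  s_blas = ddot(size(a), a, 1, b, 1)

  ! Reference value from the intrinsics, for comparison only.
  s_intrinsic = sum(a*b)

  print *, 'ddot     =', s_blas
  print *, 'sum(a*b) =', s_intrinsic
end program ddot_on_array
```

No reshaping or copying is needed here; the array is simply handed to the routine as is.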

Threaded BLAS Level1 and Level2 Routines

In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the
value of s, d, c, or z.
The following routines are threaded with OpenMP* for Intel® Core™2 Duo and Intel® Core™ i7 processors:
• Level1 BLAS:
?axpy, ?copy, ?swap, ddot/sdot, cdotc, drot/srot
• Level2 BLAS:
?gemv, ?trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv

Regarding "directly calculate the element-by-element product of the two matrix, and then sum up all the elements of resulting product matrix using the sum subroutine": do you mean two steps, first loops that compute the element-by-element product with the Intel Fortran compiler, and then the sum function?

The Intel Fortran compiler can optimize such code, whether it is written as explicit loops or in array notation. Whatever your matrix looks like, you can compare the two implementations and pick the one that performs better.

Best Regards,

Ying H.

Intel MKL Support 


Hi, Ying.

Thanks for your suggestions, and very sorry for the late response.

Here is the additional information you asked for.

Most of the matrices are stored in a discontiguous way. The matrix size might be up to, for example, 100*100*80, such as REAL(8) :: A(100,100,80). So does that mean the optimization for the loop won't work?
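To make the storage question concrete, here is a small sketch. It assumes, purely for illustration, that "discontiguous" means a strided section of a larger array; the array names and shapes are hypothetical. The is_contiguous intrinsic requires a Fortran 2008 compiler.

```fortran
program contiguity_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), allocatable :: big_a(:,:,:), big_b(:,:,:)
  real(dp) :: s_section, s_loop
  integer :: i, j, k

  allocate(big_a(200,100,80), big_b(200,100,80))
  call random_number(big_a)
  call random_number(big_b)

  ! A whole allocated array is contiguous; a strided section is not.
  print *, 'whole array contiguous?    ', is_contiguous(big_a)
  print *, 'strided section contiguous?', is_contiguous(big_a(1:200:2,:,:))

  ! Array syntax works on the strided 100 x 100 x 80 section directly.
  s_section = sum(big_a(1:200:2,:,:) * big_b(1:200:2,:,:))

  ! Equivalent explicit loop with stride 2 in the first index.
  s_loop = 0.0_dp
  do k = 1, 80
    do j = 1, 100
      do i = 1, 200, 2
        s_loop = s_loop + big_a(i,j,k)*big_b(i,j,k)
      end do
    end do
  end do

  ! The summation order may differ, so compare with a tolerance.
  if (abs(s_section - s_loop) > 1.0e-9_dp*abs(s_loop)) error stop 'results differ'
  print *, 'sum over section =', s_section
end program contiguity_check
```

Both the array expression and the loop remain valid for non-contiguous data; what changes is the memory access pattern, which can affect how well the code vectorizes.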

At present I am only considering sequential MKL, but in the future I might have to use parallel MKL if the efficiency turns out to be too low.

My program will run on Windows, and I built it with Intel® Parallel Studio XE Cluster Edition for students.

Many thanks in advance.



Hi Rubin,

Thank you for the information. As I understand it, you want to compute sum(AI*BI) for a number of arrays declared like REAL(8) :: AI(100,100,80) and REAL(8) :: BI(100,100,80), right? Then I suggest you try some MKL functions (there are examples in the MKL install directory), for example,


ddot(100*100*80, AI, 1, BI, 1)

If these operations are batched, you can consider combining the matrices and doing

one dgemm

or a batched dgemm.
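One possible reading of the dgemm suggestion, sketched here with the matmul intrinsic so that it runs without MKL: store each array as one column of a tall matrix, then a single matrix product yields the dot product of every array in one batch with every array in the other. The small 4 x 4 x 3 shape, the batch size of 5, and all names are assumptions made for the example.

```fortran
program batched_dots
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4*4*3   ! elements per array (kept small for illustration)
  integer, parameter :: m = 5       ! number of arrays in each batch
  real(dp) :: a(4,4,3,m), b(4,4,3,m)
  real(dp) :: g(m,m)
  integer :: i, j

  call random_number(a)
  call random_number(b)

  ! View each 4 x 4 x 3 array as a column of length n, then form A^T * B.
  ! g(i,j) is the element-wise "dot product" of a(:,:,:,i) and b(:,:,:,j).
  g = matmul(transpose(reshape(a, [n, m])), reshape(b, [n, m]))

  ! Check every entry against the direct sum(a*b) formulation.
  do j = 1, m
    do i = 1, m
      if (abs(g(i,j) - sum(a(:,:,:,i)*b(:,:,:,j))) > 1.0e-12_dp) error stop 'mismatch'
    end do
  end do
  print *, 'all', m*m, 'pairwise products match'
end program batched_dots
```

With MKL linked in, the matmul line would correspond to call dgemm('T', 'N', m, m, n, 1.0d0, a, n, b, n, 0.0d0, g, m). Note that this computes all m*m pairings; if only the m matching pairs are needed, m separate ddot calls do less arithmetic.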



The loop optimization can still work, and both sequential and parallel MKL will work as well. If you are running on a multi-core CPU, parallel MKL may be more efficient.

Best Regards,





Thanks. I will try this: ddot(100*100*80, AI, 1, BI, 1).