Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

matrix multiplication speedup

Bowen_M_
Beginner

Hi,

I'm using cblas_dgemm for matrix multiplication. For a randomly generated matrix X of size N * N (N could be 100), I calculate Y = X^T * X (X^T is the transpose of X). I can do it in two ways: (1) call cblas_dgemm to calculate Y directly, or (2) use a for loop, for i = 1:N, Y += x_i * x_i^T, where x_i is the i-th column of X.

In theory, both ways should have the same complexity of O(N^3). In practice, way (2) can take about 4 times longer than way (1). Could you help me understand why?

Thanks 
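
For reference, the two ways described above can be sketched in plain C as follows. This is only a minimal illustration, not the poster's code and not MKL: `gram_direct` and `gram_rank1` are hypothetical names, and the matrices are assumed to be stored row-major. The loop version sums outer products of the rows of X, because that sum equals X^T * X; summing over the columns, as the post words it, would give X * X^T instead. Both functions perform about 2N^3 floating-point operations, but way (2) reads and rewrites all N^2 entries of Y on every iteration, so it moves far more memory per operation.

```c
#include <assert.h>
#include <stddef.h>

/* Way (1): Y = X^T * X computed as one product (row-major n x n arrays). */
void gram_direct(size_t n, const double *X, double *Y)
{
    assert(n > 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += X[k * n + i] * X[k * n + j]; /* (X^T)[i][k] * X[k][j] */
            Y[i * n + j] = sum;
        }
    }
}

/* Way (2): accumulate n rank-1 updates Y += x_k * x_k^T,
 * where x_k is row k of X (an assumption made so the result is X^T * X). */
void gram_rank1(size_t n, const double *X, double *Y)
{
    assert(n > 0);
    for (size_t i = 0; i < n * n; ++i)
        Y[i] = 0.0;
    for (size_t k = 0; k < n; ++k) {
        const double *x = &X[k * n];
        /* Every pass over k touches the whole of Y. */
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                Y[i * n + j] += x[i] * x[j];
    }
}
```

With CBLAS, way (2) would presumably correspond to N separate rank-1 updates (for example via cblas_dger, or cblas_dgemm with an inner dimension of 1), so the library never sees the whole problem at once and cannot block it for cache the way a single cblas_dgemm call can.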

VipinKumar_E_Intel

In the first case, dgemm applies multiple optimization techniques, including loop reordering, loop unrolling, subdividing into blocks, vectorization, and parallelization. These help keep frequently used data in cache, reduce branch instructions, and exploit DLP (data-level parallelism) and TLP (thread-level parallelism). Many other optimizations are also applied in various MKL routines.

--Vipin
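
To make "loop reordering" and "subdividing into blocks" concrete, here is a rough sketch in plain C. It is not MKL's implementation, which is far more elaborate; `matmul_naive`, `matmul_blocked`, and the tile size `BLOCK` are hypothetical names and values chosen for illustration, and row-major storage is assumed. The blocked version works on small tiles so that the data it touches is more likely to stay in cache, and it orders the inner loops as i, k, j so that the innermost loop walks B and C with unit stride.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32 /* hypothetical tile size; a real library tunes this per CPU */

/* Textbook i-j-k product C = A * B for row-major n x n matrices. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    assert(n > 0);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j]; /* strided access to B */
            C[i * n + j] = sum;
        }
}

/* Same product, tiled into BLOCK x BLOCK pieces with i-k-j loop order. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    assert(n > 0);
    for (size_t i = 0; i < n * n; ++i)
        C[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        const size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            const size_t k_end = (kk + BLOCK < n) ? kk + BLOCK : n;
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                const size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t k = kk; k < k_end; ++k) {
                        const double a = A[i * n + k];
                        /* Unit-stride inner loop over one tile row. */
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
        }
    }
}
```

Both functions compute the same product; any speed difference comes only from the order in which memory is visited. The gain from tiling tends to show up once the matrices no longer fit in cache, and a tuned library adds vectorization and threading on top of this.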

 

TimP
Honored Contributor III

Comparing your code with the reference BLAS source, and considering your compile options (and choice of compiler), would also be relevant to understanding these performance questions.
