Intel® oneAPI Math Kernel Library

What is gemm doing what I can't do

may_ka
Beginner

Hi there

I am using gemm to multiply a matrix A with its transpose. I wouldn't bother thinking about what it is doing if A didn't have four specific properties: a) it contains only 0, 1 or 2, so it is stored as the smallest possible integer type; b) the result AA' is clearly a symmetric matrix, so operations on n(n-1)/2 cells are redundant; c) A will be of size (100000+x : 10000000+y) in the not too distant future; and d) since AA' is dense but symmetric, only one triangle needs to be stored. Therefore, I tried to mimic gemm by studying the gemm source code in the public BLAS. While I could outperform the ifort-built matmul (even when not exploiting the symmetry and filling all cells), I couldn't get anywhere near the gemm loaded from MKL. Here is a piece of code filling the lower triangle only:

Integer*8:: ng,ns,c1,c2,c3
Integer*8,allocatable::m(:,:)
Integer*1,allocatable::g(:,:),gt(:,:)
Integer*1::zero=0,tmp
Real*8,allocatable::mr(:,:)

ng=500
ns=145294

Allocate(g(ng,ns),gt(ns,ng),m(ng,ng),mr(ng,ng))

m=0 ! the loop below accumulates into m, so it must start from zero

Do c1=1,ng
  Do c2=1,ns
    if(g(c1,c2).NE.ZERO) Then
      tmp=g(c1,c2)
      Do c3=c1,ng
        m(c3,c1)=m(c3,c1)+g(c3,c2)*tmp
      End Do
    end if
  End Do
End Do

 

Compiler options were:

ifort -mkl -warn nounused -warn declarations -static -O3 -parallel

the gemm call was:

call gemm(A=real(g,8),B=real(g,8),C=mr,transb="T")

the matmul call was:

gt=transpose(g)
m=matmul(int(g,8),int(gt,8))

The time for the loop approach was 3.6 real seconds, for the gemm approach 0.8 real seconds, and for matmul 10 real seconds.

Any idea??

Thanks a lot.

Karl

Gennady_F_Intel
Moderator

Karl, MKL applies many different optimizations to ?gemm computations in order to exploit all available hardware features, and more besides. You may look at this article

https://web.a.ebscohost.com/ehost/pdfviewer/pdfviewer?sid=06c21672-6da4-43ec-aefd-22e389210961%40sessionmgr4002&vid=0&hid=4201&preview=false 

for a short description of what BLAS does in that regard.

may_ka
Beginner

Hi,

I guessed that MKL optimizes something, but that does not solve the problem that AA' will soon become too large to be stored in full. Maybe it would be easiest if Intel simply added a crude function to tackle this, because AA' arises in many statistical applications, and from a research perspective it is not always desirable to process the content of A further before forming AA' (e.g. when using VSL).
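In case it helps until such a routine exists, here is a rough sketch (not an existing MKL function, just an illustration assuming the LP64/default-integer interface) of how the lower-triangular blocks of C = Gr*Gr' could be formed panel by panel with dgemm, so that the full ng x ng result never has to sit in memory at once; process_block is a hypothetical placeholder for whatever packing or streaming to disk the application needs:

Subroutine lower_blocks(ng, ns, gr, nb)
  ! gr is a Real*8 copy of g (ng x ns); nb is the block size
  Implicit None
  Integer, Intent(in) :: ng, ns, nb
  Real*8,  Intent(in) :: gr(ng, ns)
  Real*8, Allocatable :: blk(:,:)
  Integer :: i0, j0, mb, nbw
  Allocate(blk(nb, nb))
  Do j0 = 1, ng, nb                        ! block column of C
    nbw = Min(nb, ng - j0 + 1)
    Do i0 = j0, ng, nb                     ! block rows at and below the diagonal
      mb = Min(nb, ng - i0 + 1)
      ! blk(1:mb,1:nbw) = gr(i0:i0+mb-1, :) * gr(j0:j0+nbw-1, :)**T
      Call dgemm('N', 'T', mb, nbw, ns, 1.0d0, gr(i0,1), ng, &
                 gr(j0,1), ng, 0.0d0, blk, nb)
      ! Call process_block(i0, j0, mb, nbw, blk)   ! hypothetical consumer
    End Do
  End Do
  Deallocate(blk)
End Subroutine lower_blocks

The diagonal blocks are computed in full, so there is a little redundant work there, but each off-diagonal block is only computed once.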

McCalpinJohn
Honored Contributor III

I have tried to get the compiler to replicate the optimizations used in Intel's DGEMM implementation, but have had no luck.

Loop blocking for linear algebra codes often has three levels: register blocking, L2 cache blocking, and L3 cache (or TLB) blocking. The purpose of register blocking is to expose adequate concurrency; the purpose of L2 cache blocking is to increase the number of arithmetic operations for each data item loaded (or stored); and the purpose of L3 (or TLB) blocking is partly to increase the number of arithmetic operations per data transfer to/from memory and partly to decrease the latency of the required memory accesses.

Register blocking has to have enough partial sums to keep the pipelines full.  For a Haswell core there are 2 FMA units with 5 cycle latency, so a minimum of 10 partial sums is needed.  To get full performance each of the partial sums must be a 256-bit SIMD vector, which will contain 4 64-bit elements for double precision, 8 32-bit elements for single precision, etc.

For double-precision matrix multiplication written in "accumulate" form:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         c[i][j] += a[i][k] * b[k][j];
      }
   }
}

my analysis suggests that the optimum register blocking is to unroll the "i" loop by 4, the "j" loop by 12, and the "k" loop by 4. The number of partial sums is the product of the "i" and "j" unrolling factors, but the "j" direction is vectorized, so this ends up using 12 (4*12/4) registers. For float data everything is the same except that the "j" loop is unrolled by 24 instead of 12. This still fits in 12 (4*24/8) registers, and the latencies are all the same.
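To make the register-blocking idea concrete, here is a much-simplified sketch that uses only a 2x2 block of C (a real kernel would use the 4x12 blocking described above and rely on the compiler or intrinsics to keep the block in SIMD registers); the point is that the four accumulators are independent, so the FMA pipelines are not serialized on a single partial sum:

Subroutine dgemm_2x2_block(n, a, b, c)
  ! Illustrative register-blocked multiply; assumes n is even
  Implicit None
  Integer, Intent(in)  :: n
  Real*8,  Intent(in)  :: a(n,n), b(n,n)
  Real*8,  Intent(out) :: c(n,n)
  Integer :: i, j, k
  Real*8  :: c11, c21, c12, c22, a1, a2, b1, b2
  Do j = 1, n, 2
    Do i = 1, n, 2
      c11 = 0.d0; c21 = 0.d0; c12 = 0.d0; c22 = 0.d0
      Do k = 1, n
        a1 = a(i,k);   a2 = a(i+1,k)
        b1 = b(k,j);   b2 = b(k,j+1)
        c11 = c11 + a1*b1;   c21 = c21 + a2*b1
        c12 = c12 + a1*b2;   c22 = c22 + a2*b2
      End Do
      c(i,j)   = c11;   c(i+1,j)   = c21
      c(i,j+1) = c12;   c(i+1,j+1) = c22
    End Do
  End Do
End Subroutine dgemm_2x2_block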

Some discussions of my efforts to get the compiler to generate optimal code are at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/545040

Blocking for Level-1 cache is not usually done -- it is just too small to be useful.

Blocking for the Level-2 cache is the first level that is generally useful.  For matrix multiplication you want to hold blocks of each of the three matrices, so the maximum size of each block is about 104x104 double-precision elements with a standard 256KiB L2 cache.  Lots of codes round this down to 96x96.  Square is not necessarily the best answer -- 192x48 may provide better cache and TLB utilization.  A block size of 100x100 reduces the amount of memory traffic by a factor of 50 relative to the original algorithm above.
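As an illustration only (a sketch, not MKL's actual code), the same multiplication blocked for a 256KiB L2 cache might look like the following, with NB = 96 as discussed above; a real implementation would call a register-blocked kernel on each tile instead of the naive inner loops, and the caller is expected to have zeroed c:

Subroutine dgemm_l2_blocked(n, a, b, c)
  Implicit None
  Integer, Parameter     :: NB = 96
  Integer, Intent(in)    :: n
  Real*8,  Intent(in)    :: a(n,n), b(n,n)
  Real*8,  Intent(inout) :: c(n,n)       ! accumulated into; zero it first
  Integer :: ib, jb, kb, i, j, k
  Do jb = 1, n, NB
    Do kb = 1, n, NB
      Do ib = 1, n, NB
        ! work on one NB x NB tile of each matrix while it is L2-resident
        Do j = jb, Min(jb+NB-1, n)
          Do k = kb, Min(kb+NB-1, n)
            Do i = ib, Min(ib+NB-1, n)
              c(i,j) = c(i,j) + a(i,k)*b(k,j)
            End Do
          End Do
        End Do
      End Do
    End Do
  End Do
End Subroutine dgemm_l2_blocked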

Blocking for the Level-3 cache is not always done, but it is getting more common as memory hierarchies get deeper and more unbalanced. (I.e., sometimes reducing the memory traffic by a factor of 50 is not enough!) Most codes are set up to have each core work on a completely separate block, so the issue is the amount of L3 cache per core.  The server processors generally have 2.5 MiB of L3 cache per core.  For a variety of reasons you don't want to try to use all of the L3 cache -- ~2/3 of the available cache is reasonable.  An example would be three blocks of 256x256 elements each for each core.  An alternate approach is to use the whole L3 cache for one block and parallelize the processing of the block across the cores.  This allows a bigger block size, somewhere in the range of 1000x1000, on chips with 25 MiB or larger L3 caches.

For very large caches you have to worry about the reach of the TLB because some of the memory accesses have a large stride.  On Sandy Bridge the level 2 TLB can map 512 4KiB pages, or 2MiB.  This is slightly less than the size of the core's share of the L3 cache (2.5 MiB/core).   Haswell increases the level 2 TLB to 1024 entries, so each core can map 4MiB with the default 4KiB pages.   Using large (2MiB) pages may eliminate this issue, especially on Haswell where the STLB can also hold large page mappings.

Petros_Mamales
Beginner

Maybe you should be using ?syrk ?
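A sketch of what that could look like for the original example, assuming the LP64 (default-integer) interface and that g is promoted to Real*8 once, since ?syrk has no integer input variant; dsyrk computes C := alpha*A*A' + beta*C and only touches the requested triangle of C:

Subroutine gg_transpose(ng, ns, g, mr)
  Implicit None
  Integer,   Intent(in)  :: ng, ns
  Integer*1, Intent(in)  :: g(ng, ns)
  Real*8,    Intent(out) :: mr(ng, ng)
  Real*8, Allocatable    :: gr(:,:)

  Allocate(gr(ng, ns))
  gr = Real(g, 8)          ! promote the 0/1/2 codes once
  mr = 0.0d0

  ! lower triangle of mr = gr * gr**T  (uplo='L', trans='N', n=ng, k=ns)
  Call dsyrk('L', 'N', ng, ns, 1.0d0, gr, ng, 0.0d0, mr, ng)

  Deallocate(gr)
End Subroutine gg_transpose

With uplo='L' this matches the lower-triangle loop in the original post, and roughly half the floating-point work of the gemm call should be needed.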
