mkl_sparse_s_mm slower for BSR format than for CSR

bozavlado · ‎12-15-2020

I am testing sparse matrix multiplication with the BSR format and found that it is 3x slower than with the CSR format (e.g. for matrices of shape 256x256, with a sparse matrix of block size 4 and 4096 nonzero entries). I expected BSR to be faster than CSR for the same number of nonzero entries.

I am compiling the code with the following command (I also tried icpx, with the same results):

`g++ -o sparse_bsr_simp sparse_bsr_simp.cpp -O3 -march=native -DMKL_LP64 -m64 -I/opt/intel/oneapi/mkl/2021.1.1//include -Wl,--start-group /opt/intel/oneapi/mkl/2021.1.1//lib/intel64/libmkl_intel_lp64.a /opt/intel/oneapi/mkl/2021.1.1//lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2021.1.1//lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl`

And running via:

`./sparse_bsr_simp 256 256 4096 4`

With the BSR format the benchmark runs in 0.13s; with the CSR format it runs in 0.044s.
(The format can be switched by uncommenting the corresponding convert function in the attached code.)
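
For comparing the two formats it helps to time only the multiplication itself, after a few untimed warm-up calls, so that format conversion and first-call setup do not end up in the measurement. The sketch below is one way to do that with `std::chrono`. The helper name `best_time_seconds` is hypothetical and is not taken from the attached code, and how the attached benchmark actually measures time is not shown in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: calls `work` `warmup` times without timing it,
// then `repeats` times with timing, and returns the best (smallest)
// wall-clock time of a single call, in seconds.
template <typename F>
double best_time_seconds(F&& work, int warmup, int repeats) {
    assert(warmup >= 0 && repeats > 0);
    for (int i = 0; i < warmup; ++i) {
        work();  // untimed: lets caches and any lazy setup settle
    }
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

The callable passed as `work` would wrap just the `mkl_sparse_s_mm` call, with the CSR or BSR handle created beforehand.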

What am I doing wrong?

Gennady_F_Intel · ‎12-15-2020

What is the CPU type?

bozavlado · ‎12-15-2020

Sorry, I forgot to include that and cannot edit the original post:

My CPU is: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz (this has AVX2)

The same thing also happens on: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (also has AVX2)

And also on Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz (this has AVX512, but I have only 2020.1 MKL on that machine).

Are there any public benchmarks or guidelines for BSR matrix multiplication? For example, which block sizes and sparsity levels are needed for BSR to show any improvement over CSR?
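
One quantity that is easy to check for a given matrix is how full its blocks are. BSR stores every touched block as a dense block_size x block_size tile, so a low fill ratio means BSR stores and multiplies many explicit zeros that CSR skips. The sketch below computes that ratio from zero-based CSR index arrays. `BlockStats` and `block_stats` are hypothetical names used only for illustration; they are not part of MKL or of the attached code.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical summary of how a CSR pattern maps onto bs x bs blocks.
struct BlockStats {
    std::size_t nnz;         // scalar nonzeros stored by CSR
    std::size_t blocks;      // distinct bs x bs blocks that contain a nonzero
    std::size_t bsr_values;  // values BSR would store: blocks * bs * bs
    double fill;             // nnz / bsr_values; 1.0 means fully dense blocks
};

// Assumes zero-based CSR: row_ptr has rows + 1 entries, col_idx has nnz entries.
BlockStats block_stats(const std::vector<int>& row_ptr,
                       const std::vector<int>& col_idx, int bs) {
    assert(bs > 0 && !row_ptr.empty());
    std::set<std::pair<int, int>> occupied;  // (block row, block column)
    const int rows = static_cast<int>(row_ptr.size()) - 1;
    for (int r = 0; r < rows; ++r) {
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k) {
            occupied.insert({r / bs, col_idx[k] / bs});
        }
    }
    BlockStats s{};
    s.nnz = col_idx.size();
    s.blocks = occupied.size();
    s.bsr_values = s.blocks * static_cast<std::size_t>(bs) * bs;
    s.fill = s.bsr_values ? static_cast<double>(s.nnz) / s.bsr_values : 0.0;
    return s;
}
```

A fill close to 1.0 means the nonzeros really are grouped into dense blocks. A fill well below 1.0 means the BSR copy holds noticeably more values than the CSR copy of the same matrix.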

Gennady_F_Intel · ‎12-15-2020

Thanks Vladimir, we will check.

Gennady_F_Intel · ‎12-16-2020

I see similar numbers on my end:

$ icc -std=c++11 -mkl sparse_bsr_simp.cpp -o bsr.x

$ icc -std=c++11 -mkl sparse_csr_simp.cpp -o csr.x

$ echo $MKLROOT

/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl

$ export KMP_AFFINITY=granularity=fine,compact,1,0

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0131839 -5539.07

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0132945 -5539.07

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0133272 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0332802 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0327158 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0337939 -5539.07

Model name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

We will check the problem and keep this thread informed.

-Gennady

Gennady_F_Intel · ‎12-16-2020

There is some performance gap when the AVX-512 code branch is chosen:

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.00720631 -5539.07

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.00816685 -5539.07

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.00833476 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.015087 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0148415 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0127416 -5539.07

CPU: 4 x Platinum 8286 2.9GHz
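
Averaging the three runs of each binary, the BSR/CSR slowdown works out to roughly 2.5x for the Xeon E5-2699 v4 timings posted earlier and roughly 1.8x for the Platinum 8286 timings above. A small sketch of that arithmetic, with hypothetical helper names `mean` and `slowdown`:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of timings (seconds).
double mean(const std::vector<double>& v) {
    assert(!v.empty());
    return std::accumulate(v.begin(), v.end(), 0.0) /
           static_cast<double>(v.size());
}

// Ratio of mean BSR time to mean CSR time; values above 1 mean BSR is slower.
double slowdown(const std::vector<double>& bsr, const std::vector<double>& csr) {
    return mean(bsr) / mean(csr);
}
```

For example, `slowdown({0.015087, 0.0148415, 0.0127416}, {0.00720631, 0.00816685, 0.00833476})` gives the ratio for the Platinum 8286 runs.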

Gennady_F_Intel · ‎10-05-2021

Vladimir,

some improvements were made in MKL 2021.4, which is available for download.

Gennady_F_Intel · ‎10-09-2021

The thread is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
