topic Re: mkl_sparse_s_mm slower for BSR format than for CSR in Intel® oneAPI Math Kernel Library

mkl_sparse_s_mm slower for BSR format than for CSR

bozavlado — Tue, 15 Dec 2020 12:46:30 GMT

I am testing sparse matrix multiplication with the BSR format and found that it is 3x slower than with the CSR format (e.g. for 256x256 matrices, where the sparse matrix has block size 4 and 4096 nonzero entries). I expected BSR to be faster than CSR for the same number of nonzero entries.
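For readers unfamiliar with the format, the expectation above comes from how BSR is laid out: it keeps one column index per block and stores each block densely, so the inner loops run over small contiguous tiles. The sketch below is a plain reference kernel for C = A * B with A in BSR and B, C dense row-major. It is not MKL's implementation and makes no claim about why `mkl_sparse_s_mm` behaves as reported; `BsrMatrix` and `bsr_mm_reference` are hypothetical names introduced only for illustration, with 0-based indexing and row-major blocks assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for a square-block BSR matrix
// (assumed: 0-based indices, row-major storage inside each block).
struct BsrMatrix {
    int block_rows;             // number of block rows
    int block;                  // block edge length, e.g. 4
    std::vector<int> row_ptr;   // size block_rows + 1
    std::vector<int> col_idx;   // one block-column index per stored block
    std::vector<float> values;  // block * block floats per stored block
};

// Reference C = A * B. B and C are dense row-major with n_cols columns.
// C is resized to (block_rows * block) x n_cols and overwritten.
void bsr_mm_reference(const BsrMatrix& a, const std::vector<float>& b,
                      int n_cols, std::vector<float>& c) {
    const int bs = a.block;
    assert(bs > 0 && n_cols > 0);
    assert(a.row_ptr.size() == static_cast<std::size_t>(a.block_rows) + 1);
    c.assign(static_cast<std::size_t>(a.block_rows) * bs * n_cols, 0.0f);
    for (int br = 0; br < a.block_rows; ++br) {
        for (int k = a.row_ptr[br]; k < a.row_ptr[br + 1]; ++k) {
            const int bc = a.col_idx[k];  // single index lookup per block
            const float* blk = &a.values[static_cast<std::size_t>(k) * bs * bs];
            for (int i = 0; i < bs; ++i) {
                float* c_row = &c[static_cast<std::size_t>(br * bs + i) * n_cols];
                for (int j = 0; j < bs; ++j) {
                    const float v = blk[i * bs + j];
                    const float* b_row =
                        &b[static_cast<std::size_t>(bc * bs + j) * n_cols];
                    for (int n = 0; n < n_cols; ++n) c_row[n] += v * b_row[n];
                }
            }
        }
    }
}
```

A kernel like this can also serve as a cross-check for the results of either format on small inputs.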

I compile the code with the following command (icpx gives the same results):

`g++ -o sparse_bsr_simp sparse_bsr_simp.cpp -O3 -march=native -DMKL_LP64 -m64 -I/opt/intel/oneapi/mkl/2021.1.1//include -Wl,--start-group /opt/intel/oneapi/mkl/2021.1.1//lib/intel64/libmkl_intel_lp64.a /opt/intel/oneapi/mkl/2021.1.1//lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2021.1.1//lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl`

I run it with:

`./sparse_bsr_simp 256 256 4096 4`

With the BSR format the benchmark runs in 0.13s; with the CSR format it runs in 0.044s.
(The format can be switched by uncommenting the corresponding convert function in the attached code.)
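The thread does not show how the attached benchmark takes its timings. For a problem this small, one-time costs (for example the format conversion) can distort a comparison if they fall inside the timed region. The helper below is a minimal sketch of a warm-up plus best-of-N measurement; `best_time_seconds` is a hypothetical name, and the callable passed in is assumed to contain only the multiplication call being compared.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Runs `work` `warmup` times untimed, then returns the best wall-clock
// time in seconds over `repeats` timed runs.
double best_time_seconds(const std::function<void()>& work,
                         int warmup, int repeats) {
    assert(warmup >= 0 && repeats > 0);
    for (int i = 0; i < warmup; ++i) work();  // first-call costs land here
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (s < best) best = s;
    }
    return best;
}
```

Timing both formats through the same helper, with the conversion done beforehand, keeps the comparison limited to the multiplication itself.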

What am I doing wrong?

Re: mkl_sparse_s_mm slower for BSR format than for CSR

Gennady_F_Intel — Tue, 15 Dec 2020 18:45:08 GMT

What is the CPU type?

Re: mkl_sparse_s_mm slower for BSR format than for CSR

bozavlado — Tue, 15 Dec 2020 19:23:47 GMT

Sorry, I forgot to include that and cannot edit the original post:

My CPU is: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz (this has AVX2)

Also same thing happens on: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (also has AVX2)

And also on Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz (this has AVX512, but I have only 2020.1 MKL on that machine).

Are there any public benchmarks or guidelines for BSR matrix multiplication? For example, what block_size and matrix sparsity are needed to see an improvement over CSR?
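This is not an official guideline, but one quantity worth checking before choosing a block size is how full the blocks are: BSR stores every touched block densely, explicit zeros included, so a sparsity pattern that does not align with the block grid inflates both storage and work. The sketch below computes that fill ratio from a 0-based CSR pattern; `bsr_fill_ratio` is a hypothetical helper, not an MKL function.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Fraction of the values BSR would store that are true nonzeros of the
// given 0-based CSR pattern. 1.0 means every touched block is full.
double bsr_fill_ratio(int rows, const std::vector<int>& row_ptr,
                      const std::vector<int>& col_idx, int block) {
    assert(block > 0 && rows >= 0);
    assert(row_ptr.size() == static_cast<std::size_t>(rows) + 1);
    std::set<std::pair<int, int>> blocks;  // distinct (block row, block col)
    for (int r = 0; r < rows; ++r)
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            blocks.emplace(r / block, col_idx[k] / block);
    if (blocks.empty()) return 1.0;
    const double nnz = static_cast<double>(row_ptr[rows]);
    const double stored = static_cast<double>(blocks.size()) * block * block;
    return nnz / stored;
}
```

If the 4096 nonzeros in this benchmark fill complete 4x4 blocks, which the thread does not confirm, the ratio is 1.0 and fill-in would not account for the gap reported here.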

Re:mkl_sparse_s_mm slower for BSR format than for CSR

Gennady_F_Intel — Wed, 16 Dec 2020 03:18:21 GMT

Thanks Vladimir, we will check.

Re:mkl_sparse_s_mm slower for BSR format than for CSR

Gennady_F_Intel — Wed, 16 Dec 2020 08:07:33 GMT

I see similar numbers on my end:

$ icc -std=c++11 -mkl sparse_bsr_simp.cpp -o bsr.x

$ icc -std=c++11 -mkl sparse_csr_simp.cpp -o csr.x

$ echo $MKLROOT

/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl

$ export KMP_AFFINITY=granularity=fine,compact,1,0

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0131839 -5539.07

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0132945 -5539.07

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0133272 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0332802 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0327158 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0337939 -5539.07

Model name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

We will check the problem and keep this thread informed.

-Gennady

Re:mkl_sparse_s_mm slower for BSR format than for CSR

Gennady_F_Intel — Wed, 16 Dec 2020 11:20:49 GMT

There is some performance gap when the AVX-512 code branch is chosen:

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.00720631 -5539.07

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.00816685 -5539.07

$ ./csr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.00833476 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.015087 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0148415 -5539.07

$ ./bsr.x 256 256 4096 4

blocksparse 4 256 256 4096 0.0127416 -5539.07

CPU: 4 x Platinum 8286 2.9GHz

Re:mkl_sparse_s_mm slower for BSR format than for CSR

Gennady_F_Intel — Tue, 05 Oct 2021 09:24:01 GMT

Vladimir,

Some improvements were made in MKL 2021.4, which is available for download.

Re:mkl_sparse_s_mm slower for BSR format than for CSR

Gennady_F_Intel — Sun, 10 Oct 2021 05:59:47 GMT

This thread is being closed and we will no longer respond to it. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.