Intel® oneAPI Math Kernel Library

Increasing block size makes BSR sparse multiplication slower

bozavlado
Beginner

Followup on: https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-sparse-s-mm-slower-for-BSR-format-than-for-CSR/m-p/1237472

 

I am using exactly the same code as in the mentioned post. BSR is now faster than CSR, which is great, at least for block size 8.

 

But it seems that increasing the BSR block size from 8 to 16 leads to much slower performance.

E.g., when running with a 1024x1024 sparse matrix with 131072 nonzeros and block size 8, multiplied by a 1024x256 dense matrix:

`./sparse_bsr_simp 256 1024 131072 8`, it takes 0.8s.

 

But when running the same with block size 16:

`./sparse_bsr_simp 256 1024 131072 16`, it takes 3s.

 

Also, it does not seem to be an input issue. When using an input with 16x16 blocks, but setting the block size to 8 when converting to BSR, the whole matmul runs in 0.8s.

This is quite strange.
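
For reference, the core of the reproducer (the same code as in the linked thread) presumably boils down to something like the sketch below. This is a minimal illustration, assuming the standard inspector-executor calls `mkl_sparse_s_create_bsr` and `mkl_sparse_s_mm`; the actual fill of the BSR arrays is elided, so the matrix here is left empty:

```cpp
#include <mkl_spblas.h>
#include <vector>

int main() {
    // Sizes mirroring the reproducer: a 1024x1024 sparse matrix with
    // 131072 nonzeros, multiplied by a 1024x256 dense matrix.
    const MKL_INT rows = 1024, cols = 1024, dense_cols = 256;
    const MKL_INT block_size = 8;  // 8 takes ~0.8s, 16 takes ~3s
    const MKL_INT block_rows = rows / block_size;
    const MKL_INT nnz_blocks = 131072 / (block_size * block_size);

    // BSR arrays; the real reproducer fills these with an actual block
    // pattern and values (left zeroed here just to show the calls).
    std::vector<MKL_INT> row_start(block_rows + 1, 0), col_indx(nnz_blocks, 0);
    std::vector<float> values(nnz_blocks * block_size * block_size, 0.0f);

    sparse_matrix_t A;
    mkl_sparse_s_create_bsr(&A, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR,
                            block_rows, cols / block_size, block_size,
                            row_start.data(), row_start.data() + 1,
                            col_indx.data(), values.data());

    std::vector<float> B(cols * dense_cols, 1.0f), C(rows * dense_cols, 0.0f);
    matrix_descr descr{};
    descr.type = SPARSE_MATRIX_TYPE_GENERAL;

    // The multiply whose runtime changes with block_size.
    mkl_sparse_s_mm(SPARSE_OPERATION_NON_TRANSPOSE, 1.0f, A, descr,
                    SPARSE_LAYOUT_ROW_MAJOR, B.data(), dense_cols, dense_cols,
                    0.0f, C.data(), dense_cols);

    mkl_sparse_destroy(A);
    return 0;
}
```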

 

The CPU I am currently using is:

Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz

 

But this also happens on an older AVX2 CPU:

Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz

 

The MKL version is 2022.1.0 (I tried reinstalling today; the installer installed version 2022.1).

Gajanan_Choudhary

Hello,

 

Thanks for reaching out to us about this. I'd like to know the following (and I may have follow-up questions later):

1. Is your application going to use only block sizes that are powers of 2 (4, 8, 16, 32, etc.)?

2. What is the range of matrix sizes your application uses (is it relatively small, like tens to hundreds of rows, or thousands and more)? Would you mind running your reproducer/tests for BSR matrices with a larger number of rows (maybe 10000+) and letting us know if it happens in that situation? My rationale for asking is the following:

For a fixed, small number of rows in the matrix (like the 256 I think you are using), there is less parallelism available for larger BSR block sizes compared to smaller ones, even for the same matrix. Here's an example: on the Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz that you are using, assuming a 2-socket machine, you would have up to 48 threads available for use. For a BSR matrix with 256 rows, there are 256/block_size "block rows" in the matrix (and its BSR representation). For block sizes 8 and 16, that translates to 32 and 16 block rows, respectively. With each thread assigned to one block row, 32 threads perform the BSR MM for block size 8, compared to just 16 threads for block size 16. This becomes especially pronounced if you compare the same MM operation with a CSR matrix, which would launch and utilize all 48 threads.

For matrices with a small number of rows, it would not surprise me that you are running into this issue. If your BSR matrix has at least block_size*num_threads rows in it, you would see the full parallelism of the machine being used. That's why it is important to know whether your application uses matrices with a "small" or "large" number of rows.
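
To make the arithmetic above concrete, here is a minimal sketch of block-row count versus thread count, assuming the one-thread-per-block-row scheduling just described (`mkl_get_max_threads()` is the MKL service call for the available thread count):

```cpp
#include <mkl.h>
#include <cstdio>

int main() {
    // With one thread per block row, a BSR matrix exposes only
    // rows / block_size independent units of parallel work.
    const int num_threads = mkl_get_max_threads();
    const int rows = 256;  // rows of the sparse matrix
    const int sizes[] = {8, 16, 32};
    for (int block_size : sizes) {
        const int block_rows = rows / block_size;
        std::printf("block_size=%2d: %2d block rows for %d threads%s\n",
                    block_size, block_rows, num_threads,
                    block_rows < num_threads ? " (some threads idle)" : "");
    }
    // Rule of thumb: full utilization needs rows >= block_size * num_threads.
    return 0;
}
```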

 

Regards,

Gajanan Choudhary

Developer in the oneMKL team

bozavlado
Beginner

Hello,

my use case is mostly around neural networks.

That means the sparse matrix has dimensions between 256 and 4096 (mostly powers of 2, sometimes things like 768); the matrices are mostly square, sometimes not.

The last dimension can be tuned to whatever works best with the MKL library (e.g., if 256 gives the best performance, I will use that).

 

Also I am compiling using:

`g++ -o sparse_bsr_simp sparse_bsr_simp.cc -O3 -march=native -DMKL_LP64 -m64 -I/opt/intel/oneapi/mkl/2022.1.0//include -Wl,--start-group /opt/intel/oneapi/mkl/2022.1.0//lib/intel64/libmkl_intel_lp64.a /opt/intel/oneapi/mkl/2022.1.0//lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2022.1.0//lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl -I ../libxsmm/include -DNDEBUG --std=c++17`

 

And setting `export OMP_NUM_THREADS=1` (although it should not matter here).

So I am always targeting one core.
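
A quick way to confirm the run really is single-threaded, as a minimal check; with `libmkl_sequential.a` linked, `mkl_get_max_threads()` should report 1 regardless of `OMP_NUM_THREADS`:

```cpp
#include <mkl.h>
#include <cstdio>

int main() {
    // With the sequential MKL library linked, this should print 1,
    // independent of the OMP_NUM_THREADS environment variable.
    std::printf("MKL max threads: %d\n", mkl_get_max_threads());
    return 0;
}
```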

 

As for the block size, the desired values are mostly powers of two like 8, 16, and maybe 32. If other multiples of 8 or 16 work, that would be great, but it is not mandatory.

 

The main surprising thing for me is that increasing the block size degrades performance, which does not make any sense in my opinion.
(And also that using the same sparse matrix with just a smaller block size leads to better performance.)

Gajanan_Choudhary

@bozavlado wrote:

The main surprising thing for me is that increasing the block size degrades performance, which does not make any sense in my opinion.
(And also that using the same sparse matrix with just a smaller block size leads to better performance.)

So if you were running the MM in parallel using the full set of threads on the machine, I would not be surprised by the degraded performance for the larger block size at small matrix sizes, as I explained in my previous comment (because for your sizes, block_size=8 would have double the parallelism available compared to block_size=16 for the same matrix).

 


@bozavlado wrote:

And setting `export OMP_NUM_THREADS=1` (although it should not matter here).

So I am always targeting one core.

However, since you are setting OMP_NUM_THREADS=1 and linking to libmkl_sequential.a (instead of libmkl_intel_thread.a), which should make the BSR MM run on a single thread, this slow behavior with block size 16 compared to 8 is unexpected. Please give us some time to reproduce the behavior on our end; we will report back with a confirmation or further questions.
