Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use the Intel® Math Kernel Library.

Intel MKL blas_cgemv implementation?

brandon
New Contributor I

Hi all, I'm new to the community and have searched around, but still hoping maybe someone has some insight on MKL's single-precision, floating-point complex matrix-vector multiply (cgemv) implementation. Essentially, I want to replicate its algorithm but with the int16_t data type instead of float.

I'm hoping to speed up cgemv by implementing my own version using a fixed-point int16 data type with AVX-512 SIMD intrinsics. The idea is that with a 16-bit data type (int16_t) instead of a 32-bit one (float), each 512-bit register holds twice as many elements, giving 2x the data-level parallelism and, hopefully, faster execution, while still providing enough precision for my use case (signal processing for 5G MU-MIMO).

Currently, MKL's JIT-compiled cgemm is the fastest implementation I've benchmarked for matrix-vector multiplication. When I look at the assembly of a call to the normal (non-JIT) cblas_cgemv, I find what looks like the AVX-512 implementation, <mkl_blas_avx512_xcgemv>, which is ~2919 lines long and full of vaddps, vmulps, vshufps, and vfmaddsub instructions. The last one, fused multiply with alternating add/subtract, seems to be useful for complex multiplication when the complex numbers are stored interleaved in memory, i.e. real, imag, real, imag... (http://cacs.usc.edu/education/cs653/Popovici-ComplexSIMD-HPEC17.pdf#page=2)
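For reference, here is roughly the pattern I believe those shuffle/fmaddsub instructions implement -- my own reconstruction from the paper above, not MKL's actual code:

#include <immintrin.h>

/* My reconstruction (not MKL's code): multiply 8 pairs of interleaved
   complex floats per __m512 register using the shuffle + fmaddsub trick.
   Layout: re, im, re, im, ... */
static inline __m512 cmul8_ps(__m512 x, __m512 y)
{
    __m512 y_re = _mm512_moveldup_ps(y);          // y.re duplicated across each pair
    __m512 y_im = _mm512_movehdup_ps(y);          // y.im duplicated across each pair
    __m512 x_sw = _mm512_shuffle_ps(x, x, 0xB1);  // swap re/im within each pair
    __m512 t    = _mm512_mul_ps(x_sw, y_im);      // (x.im*y.im, x.re*y.im, ...)
    // even lanes: x.re*y.re - x.im*y.im  (real part)
    // odd lanes:  x.im*y.re + x.re*y.im  (imaginary part)
    return _mm512_fmaddsub_ps(x, y_re, t);
}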

Is anyone familiar with MKL's cgemv implementation, and does this seem like a reasonable approach? Thanks so much in advance!

Assembly instructions breakdown for <mkl_blas_avx512_xcgemv>: https://docs.google.com/spreadsheets/d/17VSrOo5CGGkcxz_wn_xkYJC43rYAaKuOvDdw0RzGFbA/edit?usp=sharing

1 Solution
Peter_C_Intel
Employee

Hi @brandon, thanks for your question -- glad to hear MKL JIT cgemm is working well for you. I think the performance difference between integer and floating point will depend to some degree on your problem size.

In general, GEMV is a memory-bound operation, meaning that a good implementation's performance is determined by cache or main-memory bandwidth rather than FMA throughput. For an m-by-n cgemv, reading the matrix alone moves 8mn bytes while only ~8mn flops are performed -- about one flop per byte, far below what the FMA units can sustain. Of course, if your GEMV is so small that all the data can already be resident in registers before the computation, this is not a consideration.

Given that GEMV is memory-bound, using the smaller int16 data type will save memory bandwidth. The underlying int16 FMA instruction is VPMADDWD (intrinsic: _mm512_madd_epi16), which accumulates into int32. This means keeping your intermediate results as int32 (which you probably want anyway for better accuracy) before converting back to your int16 fixed-point format.
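As a minimal illustration (a sketch I'm writing here, not MKL code), an int16 dot product built on VPMADDWD, accumulating into int32 lanes, could look like this -- assuming n is a multiple of 32:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: int16 dot product with VPMADDWD, accumulating in int32 to
   avoid overflow. Assumes n is a multiple of 32. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 32) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        // VPMADDWD: a[2k]*b[2k] + a[2k+1]*b[2k+1] -> int32 lane k
        acc = _mm512_add_epi32(acc, _mm512_madd_epi16(va, vb));
    }
    return _mm512_reduce_add_epi32(acc);  // horizontal sum of 16 int32 lanes
}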

Note that VPMADDWD performs two multiply-adds per int32 vector slot, which is actually convenient for complex multiplication. To match VPMADDWD's input layout, I'd strongly recommend interleaving real and imaginary parts (the usual format for cgemm/cgemv). Also, for best performance I would recommend storing the matrix column-major, to avoid horizontal reductions in the vector registers until the very end of the GEMV.
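Putting the two ideas together, one multiply-accumulate step for interleaved complex int16 data could look like the sketch below (my own construction, not MKL's kernel). With a column-major matrix you would broadcast one element of x per column, keep these int32 accumulators live across the whole column loop, and convert back to int16 only once at the end:

#include <immintrin.h>

/* Sketch: accumulate a * x for 16 interleaved (re, im) complex int16
   values in `a`, with `x` pre-broadcast as (re, im) pairs. Requires
   AVX-512BW for the masked subtract and VPMADDWD. */
static inline void cmadd16_i16(__m512i a, __m512i x,
                               __m512i *acc_re, __m512i *acc_im)
{
    // x with imaginary lanes negated: (x.re, -x.im); odd int16 lanes hold imag
    __m512i x_conj = _mm512_mask_sub_epi16(x, 0xAAAAAAAAu,
                                           _mm512_setzero_si512(), x);
    // x with re/im swapped within each 32-bit pair: (x.im, x.re)
    __m512i x_swap = _mm512_rol_epi32(x, 16);

    // VPMADDWD sums adjacent int16 products into one int32 lane:
    //   real: a.re*x.re - a.im*x.im    imag: a.re*x.im + a.im*x.re
    *acc_re = _mm512_add_epi32(*acc_re, _mm512_madd_epi16(a, x_conj));
    *acc_im = _mm512_add_epi32(*acc_im, _mm512_madd_epi16(a, x_swap));
}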


22 Replies
Peter_C_Intel
Employee

Hi @brandon -- apologies for the very late reply! For some reason I didn't get an email notification of your last two messages.

To use Xbyak and allocate your own memory for the generated code, you can take a look at this section of the documentation.
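If it helps, the basic Xbyak pattern is just to subclass Xbyak::CodeGenerator and call getCode() -- here's a bare-bones sketch (a generic example I'm making up here, unrelated to your matvec kernel):

#include <xbyak/xbyak.h>

// Generic sketch: generate a function returning its argument plus a constant.
struct AddConst : Xbyak::CodeGenerator {
    explicit AddConst(int c) {
        // System V x86-64: first integer argument in edi, return value in eax
        mov(eax, edi);
        add(eax, c);
        ret();
    }
};

int main() {
    AddConst gen(5);
    auto fn = gen.getCode<int (*)(int)>();
    return fn(10);  // returns 15
}

If I remember right, the CodeGenerator constructor also accepts an optional user-supplied buffer pointer, which is what the documentation section above covers.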

Can you explain more about the problems (perhaps compiler errors?) you're seeing with creating multiple JitInt16MatVec objects in the same function?

brandon
New Contributor I

Hi @Peter_C_Intel, I'm actually no longer actively working on this, but I truly appreciate you looping back with those resources and offering to help!

Thanks again for your help with Xbyak and MKL.
