Showing results for 
Search instead for 
Did you mean: 

Intel MKL blas_cgemv implementation?

Hi all, I'm new to the community and have searched around, but still hoping maybe someone has some insight on MKL's single-precision, floating-point complex matrix-vector multiply (cgemv) implementation. Essentially, I want to replicate its algorithm but with the int16_t data type instead of float.

I'm hoping to increase the speed of cgemv by implementing my own version using a fixed point, int16 data type with AVX 512 SIMD intrinsics. The idea is with a 16-bit data type (int16_t) vs a 32-bit data type (float), there will be 2x more data-level parallelism and execute faster, with still enough precision for my use case (signal processing for 5G MU-MIMO).

Currently, the MKL's JIT-compiled cgemm is the fastest implementation I've benchmarked for matrix-vector multiplication. When I look at the assembly of a call to normal (non-JIT) cblas_cgemv, I found what looks like the AVX 512 implementation, <mkl_blas_avx512_xcgemv>, which is ~2919 lines long and full of vaddps, vmulps, vshufps, and vfmaddsub instructions -- the last one, fused multiply alternate add subtract seems to be useful for complex multiply when the complex numbers are stored in an interleaved format in memory, i.e. real, imag, real, imag... (

Is anyone familiar with MKL's cgemv implementation and does this seem like a good idea? Thanks so much in advance!

Assembly instructions breakdown for <mkl_blas_avx512_xcgemv>:



0 Kudos