Intel® oneAPI Math Kernel Library

Intel MKL blas_cgemv implementation?

brandon
New Contributor I
4,058 Views

Hi all, I'm new to the community and have searched around, but still hoping maybe someone has some insight on MKL's single-precision, floating-point complex matrix-vector multiply (cgemv) implementation. Essentially, I want to replicate its algorithm but with the int16_t data type instead of float.

I'm hoping to increase the speed of cgemv by implementing my own version using a fixed-point int16 data type with AVX-512 SIMD intrinsics. The idea is that a 16-bit data type (int16_t) offers twice the data-level parallelism of a 32-bit float, so the kernel should execute faster while still providing enough precision for my use case (signal processing for 5G MU-MIMO).

Currently, MKL's JIT-compiled cgemm is the fastest implementation I've benchmarked for matrix-vector multiplication. When I look at the assembly of a call to the normal (non-JIT) cblas_cgemv, I find what looks like the AVX-512 implementation, <mkl_blas_avx512_xcgemv>, which is ~2919 lines long and full of vaddps, vmulps, vshufps, and vfmaddsub instructions. The last one, fused multiply with alternating add/subtract, seems to be useful for complex multiplication when the complex numbers are stored in an interleaved format in memory, i.e. real, imag, real, imag... (http://cacs.usc.edu/education/cs653/Popovici-ComplexSIMD-HPEC17.pdf#page=2)
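Here's my rough mental model of that pattern (just a sketch of the general interleaved-complex trick from that paper, not MKL's actual code):

#include <immintrin.h>

// a = [ar0, ai0, ar1, ai1, ...], b = [br0, bi0, br1, bi1, ...], acc accumulates products
static inline __m512 cmul_accum(__m512 acc, __m512 a, __m512 b) {
    __m512 b_re = _mm512_moveldup_ps(b);             // [br0, br0, br1, br1, ...]
    __m512 b_im = _mm512_movehdup_ps(b);             // [bi0, bi0, bi1, bi1, ...]
    __m512 a_sw = _mm512_shuffle_ps(a, a, 0xB1);     // [ai0, ar0, ai1, ar1, ...]
    // vfmaddsub: even lanes get ar*br - ai*bi (real), odd lanes get ai*br + ar*bi (imag)
    return _mm512_add_ps(acc, _mm512_fmaddsub_ps(a, b_re, _mm512_mul_ps(a_sw, b_im)));
}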

Is anyone familiar with MKL's cgemv implementation, and does this seem like a good idea? Thanks so much in advance!

Assembly instructions breakdown for <mkl_blas_avx512_xcgemv>: https://docs.google.com/spreadsheets/d/17VSrOo5CGGkcxz_wn_xkYJC43rYAaKuOvDdw0RzGFbA/edit?usp=sharing

0 Kudos

22 Replies
Gennady_F_Intel
Moderator
3,646 Views

MKL is a proprietary library, so we can't share the implementation.


0 Kudos
brandon
New Contributor I
3,634 Views

Thanks for your reply @Gennady_F_Intel. In your opinion, would it still make sense to pursue an integer implementation? In particular, do you suspect the benefit of using the floating-point FMA units outweighs the savings from switching to integer multiply/add instructions?

0 Kudos
Peter_C_Intel
Employee
3,976 Views

Hi @brandon, thanks for your question -- glad to hear MKL JIT cgemm is working well for you. I think the performance difference between integer and floating point will depend to some degree on your problem size.

In general, GEMV is a memory-bound operation, meaning that a good implementation's performance will be determined by cache or main-memory bandwidth rather than FMA throughput. Of course, if your GEMV is so small that all the data can already be resident in registers before the computation, this is not a consideration.

Given that GEMV is memory-bound, using the smaller int16 data type will save on memory bandwidth. The underlying int16 FMA instruction is VPMADDWD (intrinsic: _mm512_madd_epi16), which accumulates in int32. This means keeping your intermediate results as int32 (which you probably want anyway for better accuracy) before converting back to your int16 fixed-point format.

Note that VPMADDWD performs two FMAs per int32 vector slot, which is actually nice for doing complex multiplication. To accommodate the input layout for VPMADDWD, I'd strongly recommend interleaving real and imaginary parts (the usual format for cgemm/cgemv). Also, for best performance I would recommend storing the matrix column-major to avoid horizontal reductions in the vector registers at the completion of the GEMV.
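A rough sketch of what I mean in intrinsics (not MKL code; the names are just for illustration). One 512-bit load covers 16 interleaved complex int16 values of a matrix column, and the two VPMADDWD operands are built from the complex scalar b = br + i*bi:

#include <immintrin.h>
#include <cstdint>

// b_re_negim = [br, -bi, br, -bi, ...] -> vpmaddwd yields ar*br - ai*bi (real parts)
// b_im_re    = [bi,  br, bi,  br, ...] -> vpmaddwd yields ar*bi + ai*br (imag parts)
void accumulate_column(const int16_t* col, __m512i b_re_negim, __m512i b_im_re,
                       __m512i& acc_re, __m512i& acc_im) {
    __m512i a = _mm512_loadu_si512(col);   // 16 complex int16 values, interleaved
    acc_re = _mm512_add_epi32(acc_re, _mm512_madd_epi16(a, b_re_negim));
    acc_im = _mm512_add_epi32(acc_im, _mm512_madd_epi16(a, b_im_re));
}

acc_re and acc_im then each hold 16 deinterleaved int32 partial sums.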

brandon
New Contributor I
3,621 Views

Hi @Peter_C_Intel, I really appreciate your thoughtful reply. My problem size is generally an MxK matrix where either M or K is 64 and the other is less than or equal to 64, so I imagine that's enough data for the smaller data type to save on memory bandwidth, right?

Your tip and explanation of VPMADDWD / _mm512_madd_epi16 was hugely helpful. I had been looking at the AVX-512 VNNI instructions here https://en.wikichip.org/wiki/x86/avx512_vnni, but I realize now I wasn't thinking about them in the right way; VPDPWSSD might actually be helpful too. I've already been using an interleaved, column-major layout for the reasons you mention, so I hope that switching to the fused int16 operations will speed up my code.

0 Kudos
Peter_C_Intel
Employee
3,617 Views

Yes, I think your problem size and data layout look like a good candidate for acceleration with int16. If the AVX512-VNNI instruction set is available on your processor, definitely go with VPDPWSSD to save a separate VPADDD.
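Continuing the sketch from my earlier reply (names still just illustrative), the column step then collapses to one fused instruction per accumulator:

// Same step as before, but with the AVX512-VNNI fused int16 multiply-add.
void accumulate_column_vnni(const int16_t* col, __m512i b_re_negim, __m512i b_im_re,
                            __m512i& acc_re, __m512i& acc_im) {
    __m512i a = _mm512_loadu_si512(col);
    acc_re = _mm512_dpwssd_epi32(acc_re, a, b_re_negim);   // was vpmaddwd + vpaddd
    acc_im = _mm512_dpwssd_epi32(acc_im, a, b_im_re);
}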

0 Kudos
brandon
New Contributor I
3,611 Views

@Peter_C_Intel, I'm running into an error saying the intrinsic isn't declared. Is it possible that I need another compiler flag to enable VNNI? The intrinsic seems to be present in my avx512vnniintrin.h file. My processor is an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (Cascade Lake), which supports VNNI as far as I can tell.

matvec_int16.cpp:119:9: error: ‘_mm512_dpwssds_epi32’ was not declared in this scope

The same goes for _mm512_dpwssd_epi32. Thanks so much in advance

0 Kudos
Peter_C_Intel
Employee
3,605 Views

@brandon, a compiler flag would be my first guess. Which compiler are you using?

0 Kudos
brandon
New Contributor I
3,601 Views

@Peter_C_Intel, I compile with this command

g++ -m64 -I/usr/local/include/Zydis -I${MKLROOT}/include -o matvec matvec_int16.cpp -std=c++17 -Wl,--no-as-needed -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group,--no-as-needed -lpthread -lm -ldl /usr/local/lib/libZydis.a -O3 -march=native

And my g++ version is below:

g++ --version
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.

0 Kudos
Peter_C_Intel
Employee
3,599 Views

It seems gcc 7.5 is too old -- can you try gcc 8.1 or a later version?

0 Kudos
brandon
New Contributor I
3,596 Views

That did the trick, thanks so much. Once I get things working, fingers crossed the performance is there!

0 Kudos
Peter_C_Intel
Employee
3,594 Views

Great! Let us know how it works.

0 Kudos
brandon
New Contributor I
3,552 Views

Hi @Peter_C_Intel, thanks so much for your advice earlier. Just wanted to pass along some updates. With my int16 implementation using Intel intrinsics, I do see some speed improvement for (64x16) * (16x1)

100000 iterations, (64x16) * (16x1)
MKL JIT cgemm: 0.0422210898 µs per iteration
 int16 matvec: 0.0341402902 µs per iteration
  1.24x MKL JIT cgemm

This speed increase widens for K values > 16 with M fixed at 64, which makes sense given the larger number of memory operations needed. However, for a 16x64 matrix size, my performance is slower than MKL's.

100000 iterations, (16x64) * (64x1)
MKL JIT cgemm: 0.0526496896 µs per iteration
 int16 matvec: 0.0847489305 µs per iteration
  0.62x MKL JIT cgemm

Oddly, this gap does not improve when K>64 and M is fixed at 16. So, just to update you, I'm trying to get finer-grained control of the assembly instructions for each kernel by using Xbyak to generate my code at runtime, which I saw in some slides is what Intel's JIT cgemm uses under the hood.

If you might have any insight as to why the 16x64 matrix size may be slower than 64x16, and/or advice for Xbyak, I would definitely really appreciate it!

0 Kudos
Peter_C_Intel
Employee
3,549 Views

Hi @brandon -- thanks for the update! Glad to hear you were able to see some speedup for the 64x16 case.

For your 16x64 case, it's hard to say too much without looking at the generated assembly from the compiler. Actually, for that size, there's not a lot of flexibility in how you can implement gemv -- you have only one loop (over k) and no data reuse -- but you might try experimenting with the timing of the matrix and vector loads. Xbyak would definitely come in handy here so you can control these more carefully.

If you've coded in x64 assembly before, Xbyak should be fairly straightforward. If not, you might start by translating the compiler-generated assembly for your gemv code into Xbyak syntax. Make sure to generate Intel/MASM style assembly (for gcc, use the -masm=intel flag), since that's the basic format Xbyak follows.
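For example, something along these lines to dump the assembly (adjust to your own source file and flags):

g++ -S -masm=intel -O3 -march=native matvec_int16.cpp -o matvec_int16.s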

I would have wondered if you were seeing overhead in the final fixed-point downconversion from int32 to int16, but since, as you mentioned, the situation doesn't change for larger k and you're already seeing a speedup for 64x16, that's definitely not the case.

0 Kudos
brandon
New Contributor I
3,541 Views

@Peter_C_Intel , thanks for such a speedy response. You're totally right, I didn't have too many options for implementing the 16x64 case for the reasons you mentioned.

Downcasting int32 to int16 did introduce overhead, and I ended up using these intrinsics to basically truncate, interleave, then store the values. real_res and imag_res are the __m512i deinterleaved int32 results after the computation, and res is the output pointer (interleaved int16):

// Shift imag into the high 16 bits of each int32 lane, then blend: mask 0xAAAAAAAA
// picks the odd int16 lanes (imag), even lanes keep the low 16 bits of real_res.
_mm512_store_si512(res, _mm512_mask_blend_epi16(0xAAAAAAAA, real_res, _mm512_slli_epi32(imag_res, 16)));

I had first tried this, which accomplished the same thing, but the two memory stores per vector seemed slightly costlier than the version above.

// Masked stores: real parts to the even int16 slots, imag parts (offset by 2 bytes) to the odd slots.
_mm512_mask_storeu_epi16((void*)(res), 0x55555555, real_res);
_mm512_mask_storeu_epi16((void*)((uintptr_t)res+2), 0x55555555, imag_res);


0 Kudos
brandon
New Contributor I
3,458 Views

Hi @Peter_C_Intel, some good news: I was able to use Xbyak to implement cgemv as I wanted, and I also figured out what was holding back my 16x64 kernel. Essentially, for a small number of rows (i.e. 16) I was only using one accumulator, so a data dependency on my VPDPWSSD instruction kept me from reaching max throughput. I solved it by using multiple accumulators and then combining the final result with VPADDDs at the end. With int16_t storing integers, my code runs at 1.5x - 5.5x the speed of MKL for problem sizes MxK where M < 208 and a multiple of 16 and K is any number.
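For reference, the gist of it looks roughly like this in intrinsics (my real version emits the equivalent instructions with Xbyak; load_col()/load_vec() are just stand-ins for whatever fetches the j-th 512-bit chunk of matrix and vector data):

#include <immintrin.h>

__m512i accumulate_unrolled(int k) {
    __m512i acc0 = _mm512_setzero_si512(), acc1 = _mm512_setzero_si512();
    __m512i acc2 = _mm512_setzero_si512(), acc3 = _mm512_setzero_si512();
    for (int j = 0; j + 4 <= k; j += 4) {
        // Four independent accumulators, so the vpdpwssd dependency chains overlap.
        acc0 = _mm512_dpwssd_epi32(acc0, load_col(j + 0), load_vec(j + 0));
        acc1 = _mm512_dpwssd_epi32(acc1, load_col(j + 1), load_vec(j + 1));
        acc2 = _mm512_dpwssd_epi32(acc2, load_col(j + 2), load_vec(j + 2));
        acc3 = _mm512_dpwssd_epi32(acc3, load_col(j + 3), load_vec(j + 3));
    }
    // Combine the partial sums with VPADDDs at the end.
    return _mm512_add_epi32(_mm512_add_epi32(acc0, acc1), _mm512_add_epi32(acc2, acc3));
}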

However, I realize in hindsight that making my integer code work with a fixed-point representation might be troublesome. For example, one thing I'm thinking is that after each multiplication I would need to divide the result by 2^(# of fractional bits) to scale it back. That would do two things:

- break up all of my VPDPWSSD fused multiply adds into VPMADDWD and VPADDD, and

- require shifts or integer division in the vector registers (which I'm having trouble getting right)

Both of these things would slow my code down significantly, right?

0 Kudos
Peter_C_Intel
Employee
3,449 Views

Hi @brandon, great to hear you're achieving good speedups!

I'd definitely advise doing all the accumulations in 32-bits with vpdpwssd as usual, and only converting down to your original 16-bit fixed point format at the end of all the accumulations. That way you only need m downconversions, and the result will also be more accurate.

Without knowing the exact number of fractional bits in your fixed-point representation, I think you can do the downconversion with vpsrad (to shift out the extra fractional bits) + vpackssdw (to reduce to 16 bits).
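In intrinsics it would look roughly like this (FRAC_BITS and the accumulator names are placeholders for your actual format):

constexpr int FRAC_BITS = 12;   // placeholder: number of fractional bits in your Q format

__m512i to_fixed16(__m512i acc_re, __m512i acc_im) {
    __m512i re = _mm512_srai_epi32(acc_re, FRAC_BITS);   // vpsrad: shift out the fractional bits
    __m512i im = _mm512_srai_epi32(acc_im, FRAC_BITS);
    // vpackssdw saturates int32 -> int16, but packs within 128-bit lanes, so the result
    // still needs a shuffle afterwards to get back to a plain interleaved layout.
    return _mm512_packs_epi32(re, im);
}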

brandon
New Contributor I
3,408 Views

@Peter_C_Intel, I just have to say thank you again -- your advice has been so, so helpful. Those two instructions were just what I was looking for; I then used vpshufb to shuffle and interleave the values correctly at the end. Now, for my particular problem:

Would you happen to know how to use global variables or fields inside assembly instructions in an Xbyak code generator?

To be a bit clearer (hopefully): I have some static data that I use in my JIT code generator, and I'm not sure how to access it other than by passing void* pointers to the data as parameters to the generated kernel, so that those pointers end up in registers I can access with assembly instructions. That's what I've been doing so far, but ideally, since this static data never changes, I would rather the user not have to pass it in as a parameter every time (abstracting implementation details away).

For example, I notice in MKL jit_cgemm that, in the first few lines of generated assembly, an instruction broadcasts {1.0 -1.0} from memory into a vector register, and the address of that data was written directly into the instruction rather than held in a register, so I know it wasn't passed in as a parameter like I'm doing right now. I'm just not sure how to do it that way, and what I've tried so far gets syntax or runtime errors. I'm totally new to assembly programming and Xbyak, so any advice is appreciated!


0 Kudos
Peter_C_Intel
Employee
3,399 Views

Hi @brandon,

Good question. x64 assembly offers RIP-relative memory addressing, which makes it easy to mix constant data with your code. In Xbyak, you'd do something like this:


Xbyak::Label my_data;
// ...

auto data_start = getCurr();
L(my_data);
dd(0x1234);    // 32-bit value; there's also db/dw/dq -- db can also handle data of arbitrary size
dd(0x5678);

auto code_start = getCurr();
// asm code here
mov(ebx, dword[rip + my_data]);   // RIP-relative load of the first 32-bit value

After you're done, the function pointer will of course point to the beginning of the data, so you'll want to advance the address by (code_start - data_start). Or you can put the constant data at the end instead.
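i.e., something like this (sketch):

// If the data block comes first, step past it to get the real entry point.
const uint8_t *base = getCode();
auto entry = reinterpret_cast<void (*)()>(const_cast<uint8_t *>(base) + (code_start - data_start));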

brandon
New Contributor I
3,384 Views

Hi @Peter_C_Intel, thanks again, that was a very clear explanation of something that eluded me for a while now.

Is it possible to use dynamically allocated memory with Xbyak? I'm trying to use aligned_alloc() / free() and the example from here to let the user of my code generate the kernel code and then destroy it later with an easy-to-use API, like MKL JIT gemm does, but I must be doing something wrong. Thanks in advance!

0 Kudos
brandon
New Contributor I
3,025 Views

Hey @Peter_C_Intel , it's been a while, but I just wanted to loop back in about this question!

Right now, I use my CodeGenerator like this:

JitInt16MatVec jit16(m, k);
jit16.readyRE();
using MatvecFn = void (*)(const Complex_int16*, const Complex_int16*, Complex_int16*);
MatvecFn matvec16 = jit16.getCode<MatvecFn>();
matvec16(mat16, vec16, res16);
 
And I noticed that I cannot instantiate two different instances of my CodeGenerator class for different m and k values within the same function, and I'm not quite sure why. The way I get around this is just by making sure any given function generates at most one JitInt16MatVec.

Ideally, I would want to have an API to create the matvec kernel and destroy it. Might you be able to give any insight on how one could do that with Xbyak? Thanks so much in advance!

0 Kudos