- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi all, I'm new to the community and have searched around, but still hoping maybe someone has some insight on MKL's single-precision, floating-point complex matrix-vector multiply (cgemv) implementation. Essentially, I want to replicate its algorithm but with the int16_t data type instead of float.
I'm hoping to increase the speed of cgemv by implementing my own version using a fixed point, int16 data type with AVX 512 SIMD intrinsics. The idea is with a 16-bit data type (int16_t) vs a 32-bit data type (float), there will be 2x more data-level parallelism and execute faster, with still enough precision for my use case (signal processing for 5G MU-MIMO).
Currently, the MKL's JIT-compiled cgemm is the fastest implementation I've benchmarked for matrix-vector multiplication. When I look at the assembly of a call to normal (non-JIT) cblas_cgemv, I found what looks like the AVX 512 implementation, <mkl_blas_avx512_xcgemv>, which is ~2919 lines long and full of vaddps, vmulps, vshufps, and vfmaddsub instructions -- the last one, fused multiply alternate add subtract seems to be useful for complex multiply when the complex numbers are stored in an interleaved format in memory, i.e. real, imag, real, imag... (http://cacs.usc.edu/education/cs653/Popovici-ComplexSIMD-HPEC17.pdf#page=2)
Is anyone familiar with MKL's cgemv implementation and does this seem like a good idea? Thanks so much in advance!
Assembly instructions breakdown for <mkl_blas_avx512_xcgemv>: https://docs.google.com/spreadsheets/d/17VSrOo5CGGkcxz_wn_xkYJC43rYAaKuOvDdw0RzGFbA/edit?usp=sharing
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @brandon, thanks for your question -- glad to hear MKL JIT cgemm is working well for you. I think the performance difference between integer and floating point will depend in some degree on your problem size.
In general, GEMV is a memory-bound operation, meaning that a good implementation's performance will be determined by cache or main memory bandwidth, rather than FMA throughput. Of course, if your GEMV is so small all the data can be already resident in registers before the computation, this is not a consideration.
Given that GEMV is memory-bound, using the smaller int16 data types will save on memory bandwidth. The underlying int16 FMA instruction is VPMADDWD (intrinsic: _mm512_madd_epi16), which accumulates in int32. This means keeping your intermediate results as int32 (which you probably want anyway for better accuracy) before converting back to your int16 fixed point format.
Note that VPMADDWD performs two FMAs per int32 vector slot, which is actually nice for doing complex multiplication. To accommodate the input layout for VPMADDWD, I'd strongly recommend interleaving real and imaginary parts (the usual format for cgemm/cgemv). Also, for best performance I would recommend storing the matrix column-major to avoid horizontal reductions in the vector registers at the completion of the GEMV.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
MKL is the proprietary library and we couldn't share the implementation.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for your reply @Gennady_F_Intel. In your opinion, would it still make sense to pursue an integer implementation? In particular, do you suspect the benefit of using the floating-point FMA units is greater than the savings of switching to integer multiply, add instructions?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @brandon, thanks for your question -- glad to hear MKL JIT cgemm is working well for you. I think the performance difference between integer and floating point will depend in some degree on your problem size.
In general, GEMV is a memory-bound operation, meaning that a good implementation's performance will be determined by cache or main memory bandwidth, rather than FMA throughput. Of course, if your GEMV is so small all the data can be already resident in registers before the computation, this is not a consideration.
Given that GEMV is memory-bound, using the smaller int16 data types will save on memory bandwidth. The underlying int16 FMA instruction is VPMADDWD (intrinsic: _mm512_madd_epi16), which accumulates in int32. This means keeping your intermediate results as int32 (which you probably want anyway for better accuracy) before converting back to your int16 fixed point format.
Note that VPMADDWD performs two FMAs per int32 vector slot, which is actually nice for doing complex multiplication. To accommodate the input layout for VPMADDWD, I'd strongly recommend interleaving real and imaginary parts (the usual format for cgemm/cgemv). Also, for best performance I would recommend storing the matrix column-major to avoid horizontal reductions in the vector registers at the completion of the GEMV.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Peter_C_Intel, I really appreciate your thoughtful reply. My problem size is generally an MxK matrix where either M or K is 64 and the other is less than or equal to 64, so I imagine that would be enough data where the smaller data type would save on memory bandwidth, right?
Your tip and explanation of VPMADDWD /_mm512_madd_epi16 was hugely helpful. I was just looking at the AVX-512 VNNI instructions here https://en.wikichip.org/wiki/x86/avx512_vnni but I realize now I wasn't thinking of them in the right way. VPDPWSSD might actually be helpful too. I've already been sticking with interleaved, column-major layout for the reasons you mention so I hope that switching to the fused int16 operations will speed up my code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes, I think your problem and data layout looks like a good candidate for acceleration with int16. If you have the AVX512-VNNI instruction set available on your processor, definitely go with VPDPWSSD to save a separate VPADDD.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Peter_C_Intel, I'm running into an error where it says the intrinsic isn't declared. Is it possible that I need another compiler flag to enable VNNI? The intrinsic seems to be present in my avx512vnniintrin.h file. My processor is Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, Cascade Lake, which looks like it supports VNNI as far as I can tell.
matvec_int16.cpp:119:9: error: ‘_mm512_dpwssds_epi32’ was not declared in this scope
The same goes for _mm512_dpwssd_epi32. Thanks so much in advance
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@brandon, compiler flag would be my first guess. Which compiler are you using?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Peter_C_Intel, I compile with this command
g++ -m64 -I/usr/local/include/Zydis -I${MKLROOT}/include -o matvec matvec_int16.cpp -std=c++17 -Wl,--no-as-needed -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group,--no-as-needed -lpthread -lm -ldl /usr/local/lib/libZydis.a -O3 -march=native
And my g++ version is below:
g++ --version
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It seems gcc 7.5 is too old -- can you try gcc 8.1 or a later version?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That did the trick, thanks so much. Once I get things working fingers crossed the performance is there!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Peter_C_Intel, thanks so much for your advice earlier. Just wanted to pass along some updates. With my int16 implementation using Intel intrinsics, I do see some speed improvement for (64x16) * (16x1)
100000 iterations, (64x16) * (16x1)
MKL JIT cgemm: 0.0422210898 µs per iteration
int16 matvec: 0.0341402902 µs per iteration
1.24x MKL JIT cgemm
This speed increase widens for K values >16 with M fixed at 64, which makes sense with the larger number of memory operations needed. However, for a 16x64 matrix size, my performance is slower than MKL.
100000 iterations, (16x64) * (64x1)
MKL JIT cgemm: 0.0526496896 µs per iteration
int16 matvec: 0.0847489305 µs per iteration
0.62x MKL JIT cgemm
Oddly, this gap does not improve when K>64 and M is fixed at 16. So, just to update you, I'm trying to get finer-grained control of the assembly instructions for each kernel by using Xbyak to generate my code at runtime, which I saw in some slides is what Intel's JIT cgemm uses under the hood.
If you might have any insight as to why the 16x64 matrix size may be slower than 64x16, and/or advice for Xbyak, I would definitely really appreciate it!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @brandon -- thanks for the update! Glad to hear you were able to see some speedup for the 64x16 case.
For your 16x64 case, it's hard to say too much without looking at the generated assembly from the compiler. Actually, for that size, there's not a lot of flexibility in how you can implement gemv -- you have only one loop (over k) and no data reuse -- but you might try experimenting with the timing of the matrix and vector loads. Xbyak would definitely come in handy here so you can control these more carefully.
If you've coded in x64 assembly before, Xbyak should be fairly straightforward. If not, you might start by translating the compiler-generated assembly for your gemv code into Xbyak syntax. Make sure to generate Intel/MASM style assembly (for gcc, use the -masm=intel flag), since that's the basic format Xbyak follows.
I would have wondered if you were seeing overhead in the final fixed point downconversion from int32 to int16, but as you mentioned the situation doesn't change for larger k and you already are seeing speedup for 64x16 that's definitely not the case.
- Tags:
- Hi @
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Peter_C_Intel , thanks for such a speedy response. You're totally right, I didn't have too many options for implementing the 16x64 case for the reasons you mentioned.
Downcasting int32 to int16 did introduce overhead, and I ended up using these intrinsics to basically truncate, interleave, then store the values. real_res and imag_res are the __m512i deinterleaved int32 results after computation. res is the output pointer (interleaved int16)
_mm512_store_si512(res, _mm512_mask_blend_epi16(0xAAAAAAAA, real_res, _mm512_slli_epi32(imag_res, 16)));
I first had tried this which accomplished the same thing, but the two memory stores per vector seemed to be slightly costlier than the above version.
_mm512_mask_storeu_epi16((void*)(res), 0x55555555, real_res);
_mm512_mask_storeu_epi16((void*)((uintptr_t)res+2), 0x55555555, imag_res);
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Peter_C_Intel, so some good news was that I was able to use Xbyak to implement cgemv as I wanted, and I also figured out what was holding back my 16x64 kernel. Essentially, it was that for small number of rows (i.e. 16) I was only using one accumulator, so there was a data dependency with my VPDPWSSD instruction so I wasn't reaching max throughput. I solved it by using multiple accumulators then accumulating the final result with VPADDD's at the end. With int16_t storing integers my code runs 1.5x - 5.5x MKL for problem sizes MxK where M < 208 and a multiple of 16 and K is any number.
However, I realize in hindsight that making my integer code work with fixed-point representation might be troublesome? For example, one thing I'm thinking is that after each multiplication, I would need to divide the result by 2^(# of fractional bits) in order to scale my result back. That would do 2 things:
- break up all of my VPDPWSSD fused multiply adds into VPMADDWD and VPADDD, and
- require shifts or integer division in the vector registers (which I'm having trouble getting right)
Both of these things would slow my code down significantly, right?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @brandon, great to hear you're achieving good speedups!
I'd definitely advise doing all the accumulations in 32-bits with vpdpwssd as usual, and only converting down to your original 16-bit fixed point format at the end of all the accumulations. That way you only need m downconversions, and the result will also be more accurate.
Without knowing the exact number of fractional bits in your fixed-point representation, I think you can do the downconversion with vpsrad (to shift out the extra fractional bits) + vpackssdw (to reduce to 16 bits).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Peter_C_Intel, just have to say thank you again -- your advice has been so so helpful. Those two instructions were just what I was looking for, then I used vpshufb to shuffle and interleave the values correctly at the end. Now for my particular problem,
Would you happen to know how to use global variables or fields inside assembly instruction in an Xbyak code generator?
To be a bit clearer (hopefully), I have some static data that I use in my JIT code generator, and I'm not sure how I can access it other than by passing in void* pointers of the data as parameters to the generated kernel so that those pointers are then in registers which I can access with assembly instructions. That's what I've been doing so far, but ideally, since this static data never changes, I would rather the user not have to pass them in as parameters every time (abstracting implementation details away).
For example, I notice in MKL jit_cgemm, in the first few lines of generated assembly, an instruction broadcasts {1.0 -1.0} from memory into a vector register, and that address of the struct was just written directly and wasn't inside a register, so I know it wasn't passed in as a parameter like I'm doing right now. I'm just now sure how to do it like that, and what I've tried so far seems to get syntax or runtime errors. I'm totally new to assembly programming and Xbyak, so any advice is appreciated!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @brandon,
Good question. x64 assembly offers RIP-relative memory addressing, which makes it easy to mix constant data with your code. In Xbyak, you'd do something like this:
Xbyak::Label my_data;
// ...
auto data_start = getCurr();
L(my_data);
dd(0x1234); // 32-bit value; there's also db/dw/dq -- db can also handle data of arbitrary size
dd(0x5678);
auto code_start = getCurr();
// asm code here
mov(ebx, dword_ptr[rip + my_data]);
After you're done the function pointer will of course point to the beginning of the data, so you'll want to advance the address by (code_start - data_start). Or you can put the constant data at the end.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Peter_C_Intel, thanks again, that was a very clear explanation of something that eluded me for a while now.
Is it possible to use dynamically allocated memory with Xbyak? I'm trying to use aligned_alloc() / free() and the example from here to allow the user of my code to generate the kernel code and then destroy it later with an easy to use API like MKL JIT gemm does but I must be doing something wrong. Thanks in advance!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hey @Peter_C_Intel , it's been a while, but I just wanted to loop back in about this question!
Right now, I use my CodeGenerator like this:
And I noticed that I cannot instantiate two different instances of my CodeGenerator class for different m and k values within the same function. I'm not quite sure why. The way I get around this is just by making sure any given function only generates maximum one JitInt16MatVec.
Ideally, I would want to have an API to create the matvec kernel and destroy it. Might you be able to give any insight on how one could do that with Xbyak? Thanks so much in advance!
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page