Hello,
I have some questions on cblas_gemm_s8u8s32.
1. What is the reasoning behind requiring one side to be signed and the other unsigned?
2. When I do matrix multiplication with cblas_gemm_s8u8s32, I find that with column-major layout, the result is wrong whenever values in the second operand (the unsigned int8 matrix) exceed 128. What is the reason? And how can I compute the product of two signed int8 matrices?
3. I tried to use DNNL's (formerly MKL-DNN) dnnl_gemm_s8s8s32, but unfortunately it was much slower than MKL's cblas_sgemm for some problem sizes.
4. I tested the efficiency of int8 GEMM (using cblas_gemm_s8u8s32) against float GEMM on my machine and found that int8 GEMM is only about as fast as float GEMM. Why? Do you have efficiency test results for the two interfaces?
Thanks,
Jingjing Wang
Hello Jingjing,
The signed/unsigned requirement comes from the AVX-512 VNNI hardware instruction set underneath the software interface: for example, a single vpdpbusd [1] instruction, which multiplies unsigned bytes by signed bytes, replaces the vpmaddubsw, vpmaddwd, and vpaddd sequence.
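To make the hardware constraint concrete, here is a plain-C sketch (a scalar model of the arithmetic, not the actual intrinsic) of what one vpdpbusd lane computes: a four-element dot product of unsigned bytes against signed bytes, accumulated into a signed 32-bit lane. The u8 x s8 pairing is baked into the instruction, which is why the CBLAS interface mirrors it.

```c
#include <stdint.h>

/* Scalar model of one vpdpbusd lane: multiply four unsigned 8-bit
 * values by four signed 8-bit values, sum the products, and add the
 * sum into a signed 32-bit accumulator. */
static int32_t dpbusd_lane(const uint8_t a[4], const int8_t b[4], int32_t acc)
{
    for (int i = 0; i < 4; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

Note that the worst-case partial sum, 4 * 255 * (-128), fits comfortably in 32 bits, so this non-saturating form needs no widening beyond int32.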
Could you provide more information about particular matrix sizes you are interested in testing?
Even better, it would help expedite things if you could provide a concise reproducer (application source code with minimal dependencies) for each of issues 2, 3, and 4.
Thank you for your good questions about cblas_gemm_s8u8s32!
Aaron
[1] https://www.intel.ai/vnni-enables-inference/
Here are two discussions that may shed light on your questions.
Incorrect result of s8s8s32 gemm? https://github.com/intel/mkl-dnn/issues/476
Best instruction set for s8s8s32 gemm? https://github.com/intel/mkl-dnn/issues/532
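For question 2's "two signed matrices" case, a common workaround is to shift the signed B operand into unsigned range by adding 128 and then subtract the resulting bias (128 times the row sums of A) from C. The sketch below shows the arithmetic in plain C with small fixed sizes; the function names are illustrative, not MKL API. If my reading of the interface is right, the bo and oc/offsetc parameters of cblas_gemm_s8u8s32 let the library apply the same compensation for you.

```c
#include <stdint.h>

enum { M = 2, N = 2, K = 3 };

/* Reference s8 x s8 GEMM, row-major, C = A*B. */
static void gemm_s8s8_ref(const int8_t *A, const int8_t *B, int32_t *C)
{
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k)
                acc += (int32_t)A[i*K + k] * (int32_t)B[k*N + j];
            C[i*N + j] = acc;
        }
}

/* Same product computed through a u8 x s8 core plus compensation:
 * A * (B + 128) = A*B + 128 * rowsum(A), so subtract the bias. */
static void gemm_s8s8_via_u8(const int8_t *A, const int8_t *B, int32_t *C)
{
    uint8_t Bu[K*N];
    for (int i = 0; i < K*N; ++i)
        Bu[i] = (uint8_t)((int32_t)B[i] + 128);   /* shift s8 -> u8 */

    int32_t asum[M];                              /* row sums of A */
    for (int i = 0; i < M; ++i) {
        asum[i] = 0;
        for (int k = 0; k < K; ++k)
            asum[i] += A[i*K + k];
    }

    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k)           /* u8 x s8 kernel */
                acc += (int32_t)A[i*K + k] * (int32_t)Bu[k*N + j];
            C[i*N + j] = acc - 128 * asum[i];     /* undo the shift */
        }
}
```

Both paths should agree exactly, since the shift and the compensation cancel in integer arithmetic.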
Let me know if you have further questions or a reproducer,
Aaron
Hi Jingjing,
For #3 and #4, can you also provide information on the CPU you used when checking performance? If you're running on an AVX2 machine, then the performance behavior you're seeing is expected.
Best,
Peter
Caday, Peter (Intel) wrote:
Hi Jingjing,
For #3 and #4, can you also provide information on the CPU you used when checking performance? If you're running on an AVX2 machine, then the performance behavior you're seeing is expected.
Best,
Peter

Thank you for your reply. I checked the performance on an Intel Xeon CPU E5-2667 v3 @ 3.2 GHz; it may only support AVX2. That is to say, DNNL int8 GEMM will only perform better when the CPU supports AVX-512 or higher instruction sets?
Hi Jingjing,
We recently added AVX2 support for int8 GEMM in DNNL (around the end of November; see commit 35b39a8d). In any case, int8 performance shouldn't be much better than single precision on an AVX2 platform.