<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Intel Math Kernel Library Cblas int8 gemm and dnnl int8 gemm in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151801#M27186</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have some questions on cblas_gemm_s8u8s32.&lt;/P&gt;&lt;P&gt;1. What is the reasoning behind requiring one operand to be signed and the other unsigned?&lt;/P&gt;&lt;P&gt;2. When I do matrix multiplication with cblas_gemm_s8u8s32, I find that with column-major layout, when values in the second operand (the unsigned int8 matrix) exceed 128, the result is wrong. What is the reason? And how do I compute the product of two signed int8 matrices?&lt;/P&gt;&lt;P&gt;3. I tried MKL-DNN's (DNNL's) dnnl_gemm_s8s8s32, but unfortunately it was much slower than MKL's cblas_sgemm at some problem sizes.&lt;/P&gt;&lt;P&gt;4. I tested the performance of int8 GEMM (using cblas_gemm_s8u8s32) and float GEMM on my machine and found that int8 GEMM is only about as fast as float GEMM. Why? Do you have performance results for the two interfaces?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Jingjing Wang&lt;/P&gt;</description>
    <pubDate>Thu, 05 Dec 2019 07:59:45 GMT</pubDate>
    <dc:creator>jingjing__wang</dc:creator>
    <dc:date>2019-12-05T07:59:45Z</dc:date>
    <item>
      <title>Intel Math Kernel Library Cblas int8 gemm and dnnl int8 gemm</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151801#M27186</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have some questions on cblas_gemm_s8u8s32.&lt;/P&gt;&lt;P&gt;1. What is the reasoning behind requiring one operand to be signed and the other unsigned?&lt;/P&gt;&lt;P&gt;2. When I do matrix multiplication with cblas_gemm_s8u8s32, I find that with column-major layout, when values in the second operand (the unsigned int8 matrix) exceed 128, the result is wrong. What is the reason? And how do I compute the product of two signed int8 matrices?&lt;/P&gt;&lt;P&gt;3. I tried MKL-DNN's (DNNL's) dnnl_gemm_s8s8s32, but unfortunately it was much slower than MKL's cblas_sgemm at some problem sizes.&lt;/P&gt;&lt;P&gt;4. I tested the performance of int8 GEMM (using cblas_gemm_s8u8s32) and float GEMM on my machine and found that int8 GEMM is only about as fast as float GEMM. Why? Do you have performance results for the two interfaces?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Jingjing Wang&lt;/P&gt;</description>
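      <!-- Editor's note: a minimal, hedged sketch (not from the thread) of one way
           to answer question 2: multiplying two signed int8 matrices with
           cblas_gemm_s8u8s32, which computes
           C = alpha*(op(A)+ao)*(op(B)+bo) + beta*C + oc and requires the second
           operand to be unsigned. The idea is to bias the signed B into uint8
           range by adding 128 and pass bo = -128 so the library undoes the bias
           internally. Function and buffer names here are illustrative.

      #include <stdint.h>
      #include <mkl.h>

      /* C = A * B for signed int8 A (m x k) and B (k x n), row-major. */
      void gemm_s8s8(int m, int n, int k,
                     const int8_t *A, const int8_t *B_s8, int32_t *C,
                     uint8_t *B_u8 /* caller-provided k*n scratch buffer */)
      {
          for (long i = 0; i < (long)k * n; ++i)
              B_u8[i] = (uint8_t)(B_s8[i] + 128);   /* shift into [0,255] */

          const int32_t oc = 0;                     /* no offset added to C */
          cblas_gemm_s8u8s32(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                             CblasFixOffset, m, n, k, 1.0f,
                             A,    k, 0,     /* ao = 0: A stays signed         */
                             B_u8, n, -128,  /* bo = -128 undoes the +128 bias */
                             0.0f, C, n, &oc);
      }
      -->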
      <pubDate>Thu, 05 Dec 2019 07:59:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151801#M27186</guid>
      <dc:creator>jingjing__wang</dc:creator>
      <dc:date>2019-12-05T07:59:45Z</dc:date>
    </item>
    <item>
      <title>Hello Jingjing,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151802#M27187</link>
      <description>&lt;P&gt;Hello Jingjing,&lt;/P&gt;&lt;P&gt;The signed/unsigned requirement has to do with the AVX-512 VNNI hardware instruction set underneath the software interface: for example, using &lt;A href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3146,2195,2198,2210,2205&amp;amp;techs=AVX2&amp;amp;avx512techs=AVX512_VNNI&amp;amp;text=vpdpbusd"&gt;vpdpbusd&lt;/A&gt; [1] instead of &lt;A href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3146,2195,2198,2210,2205,2201,97,100,98,3536&amp;amp;text=vpmaddubsw"&gt;vpmaddubsw&lt;/A&gt;, &lt;A href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3146,2195,2198,2210,2205,2201&amp;amp;text=vpmaddwd"&gt;vpmaddwd&lt;/A&gt;, and &lt;A href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3146,2195,2198,2210,2205,2201,97,100,98&amp;amp;text=vpaddd"&gt;vpaddd&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;Could you provide more information about the particular matrix sizes you are interested in testing?&lt;/P&gt;&lt;P&gt;Even better, it would help expedite things if you could provide a concise reproducer (application source code with minimal dependencies) for each of issues 2, 3, and 4.&lt;/P&gt;&lt;P&gt;Thank you for your good questions about cblas_gemm_s8u8s32!&lt;/P&gt;&lt;P&gt;Aaron&lt;/P&gt;&lt;P&gt;[1] https://www.intel.ai/vnni-enables-inference/&lt;/P&gt;</description>
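      <!-- Editor's note: a hedged illustration (not from the thread) of the point
           above. The AVX-512 VNNI instruction vpdpbusd multiplies Unsigned bytes
           by Signed bytes and accumulates into int32 lanes; no s8*s8 variant
           exists, which is why the API fixes one operand as unsigned. Assumes a
           compiler and CPU with AVX512_VNNI support.

      #include <immintrin.h>

      /* For each of the 16 int32 lanes: acc += sum of four u8(a) * s8(b)
         products. This is the inner kernel that s8u8s32 GEMM builds on. */
      static inline __m512i dot_u8s8(__m512i acc, __m512i a_u8, __m512i b_s8)
      {
          return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
      }
      -->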
      <pubDate>Fri, 06 Dec 2019 18:00:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151802#M27187</guid>
      <dc:creator>Aaron_J_Intel2</dc:creator>
      <dc:date>2019-12-06T18:00:00Z</dc:date>
    </item>
    <item>
      <title>Here are two discussions that</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151803#M27188</link>
      <description>&lt;P&gt;Here are two discussions that may shed light on your questions.&lt;/P&gt;&lt;P&gt;Incorrect result of s8s8s32 gemm? &lt;A href="https://github.com/intel/mkl-dnn/issues/476" target="_blank"&gt;https://github.com/intel/mkl-dnn/issues/476&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Best instruction set for s8s8s32 gemm? https://github.com/intel/mkl-dnn/issues/532&lt;/P&gt;&lt;P&gt;Let me know if you have further questions or a reproducer.&lt;/P&gt;&lt;P&gt;Aaron&lt;/P&gt;</description>
      <pubDate>Fri, 06 Dec 2019 18:47:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151803#M27188</guid>
      <dc:creator>Aaron_J_Intel2</dc:creator>
      <dc:date>2019-12-06T18:47:53Z</dc:date>
    </item>
    <item>
      <title>Hi Jingjing,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151804#M27189</link>
      <description>&lt;P&gt;Hi Jingjing,&lt;/P&gt;&lt;P&gt;For #3 and #4, can you also provide information on the CPU you used when checking performance? If you're running on an AVX2 machine, then the performance behavior you're seeing is expected.&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Peter&lt;/P&gt;</description>
      <pubDate>Fri, 06 Dec 2019 19:22:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151804#M27189</guid>
      <dc:creator>Peter_C_Intel</dc:creator>
      <dc:date>2019-12-06T19:22:12Z</dc:date>
    </item>
    <item>
      <title>Quote: Caday, Peter (Intel)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151805#M27190</link>
      <description>&lt;BLOCKQUOTE&gt;Caday, Peter (Intel) wrote:&lt;BR /&gt;&lt;P&gt;Hi Jingjing,&lt;/P&gt;&lt;P&gt;For #3 and #4, can you also provide information on the CPU you used when checking performance? If you're running on an AVX2 machine, then the performance behavior you're seeing is expected.&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Peter&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Thank you for your reply. I checked the performance on an Intel Xeon CPU E5-2667 v3 @ 3.2 GHz; it may support only AVX2.&lt;/P&gt;&lt;P&gt;That is to say, DNNL int8 GEMM will only perform better than float when the CPU supports AVX-512 or newer instruction sets?&lt;/P&gt;</description>
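      <!-- Editor's note: a hedged helper (not from the thread) for confirming what
           the CPU reports, using the GCC/Clang builtin __builtin_cpu_supports;
           the "avx512vnni" feature name assumes a reasonably recent compiler.
           The Xeon E5-2667 v3 is a Haswell part, so it should report AVX2 only.

      #include <stdio.h>

      int main(void)
      {
          printf("avx2:       %d\n", __builtin_cpu_supports("avx2") != 0);
          printf("avx512f:    %d\n", __builtin_cpu_supports("avx512f") != 0);
          printf("avx512vnni: %d\n", __builtin_cpu_supports("avx512vnni") != 0);
          return 0;
      }
      -->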
      <pubDate>Sat, 07 Dec 2019 02:11:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151805#M27190</guid>
      <dc:creator>jingjing__wang</dc:creator>
      <dc:date>2019-12-07T02:11:47Z</dc:date>
    </item>
    <item>
      <title>Hi Jingjing,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151806#M27191</link>
      <description>&lt;P&gt;Hi Jingjing,&lt;/P&gt;&lt;P&gt;We recently added support for AVX2 in DNNL for int8 GEMM (around the end of November; see commit &lt;A href="https://github.com/intel/mkl-dnn/commit/35b39a8dd2ad7f708f9456ed3f787ad8b9817973"&gt;35b39a8d&lt;/A&gt;). In any case, int8 performance shouldn't be much better than single precision on an AVX2 platform.&lt;/P&gt;</description>
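      <!-- Editor's note: a hedged sketch (not from the thread) of the DNNL v1.x
           C API call discussed above. Unlike cblas_gemm_s8u8s32,
           dnnl_gemm_s8s8s32 takes two signed int8 operands directly; per the
           oneDNN docs its matrices are row-major. Sizes and names here are
           illustrative.

      #include <stdint.h>
      #include <dnnl.h>

      /* C = A * B for signed int8 A (M x K) and B (K x N), row-major. */
      int gemm_s8s8_dnnl(int64_t M, int64_t N, int64_t K,
                         const int8_t *A, const int8_t *B, int32_t *C)
      {
          const int32_t co = 0;   /* 'F': one fixed offset added to all of C */
          dnnl_status_t st = dnnl_gemm_s8s8s32('N', 'N', 'F', M, N, K,
                                               1.0f, A, K, 0, B, N, 0,
                                               0.0f, C, N, &co);
          return st == dnnl_success ? 0 : 1;
      }
      -->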
      <pubDate>Tue, 17 Dec 2019 18:37:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Intel-Math-Kernel-Library-Cblas-int8-gemm-and-dnnl-int8-gemm/m-p/1151806#M27191</guid>
      <dc:creator>Arthur_A_Intel</dc:creator>
      <dc:date>2019-12-17T18:37:25Z</dc:date>
    </item>
  </channel>
</rss>

