Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

FP16 GEMM using AVX512 on Sapphire Rapids

DerrickQuinn
Beginner

Is there any way to use AVX512-FP16 instructions on Intel Sapphire Rapids Xeon CPUs via the GEMM routines in mkl_cblas.h? When I use "cblas_gemm_f16f16f32", my system uses AVX512-FP32 instructions, as verified by PCM. Is there any way to use lower-precision floats directly in MKL? I'm using MKL 2024.0.
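For context on the data types involved: cblas_gemm_f16f16f32 takes its A and B matrices as MKL_F16, which holds the raw IEEE binary16 bit pattern in a 16-bit integer, while C comes back as plain float. A minimal sketch of preparing such inputs from float data (the helper names are mine, not MKL's; subnormal results are flushed to zero for brevity):

```c
#include <stdint.h>
#include <string.h>

/* Convert a float to an IEEE binary16 bit pattern (round-to-nearest-even).
   Results below the smallest normal half are flushed to signed zero. */
static uint16_t float_to_f16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    uint32_t absx = x & 0x7FFFFFFFu;
    if (absx >= 0x7F800000u)                       /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (absx > 0x7F800000u ? 0x200u : 0u));
    if (absx >= 0x477FF000u)                       /* too large: round to Inf */
        return (uint16_t)(sign | 0x7C00u);
    if (absx < 0x38800000u)                        /* below smallest normal half */
        return sign;
    int32_t  exp  = (int32_t)((absx >> 23) & 0xFFu) - 127 + 15;
    uint32_t mant = absx & 0x7FFFFFu;
    uint16_t h    = (uint16_t)(sign | (uint16_t)(exp << 10) | (uint16_t)(mant >> 13));
    uint32_t rem  = mant & 0x1FFFu;                /* the 13 truncated low bits */
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u)))
        h++;                                       /* round to nearest even */
    return h;
}

/* Convert an IEEE binary16 bit pattern back to float (always exact). */
static float f16_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t x;
    if (exp == 0) {
        x = sign;                                  /* signed zero */
        if (mant) {                                /* half subnormal: normalize */
            uint32_t e = 127 - 15 + 1;
            while (!(mant & 0x400u)) { mant <<= 1; e--; }
            x = sign | (e << 23) | ((mant & 0x3FFu) << 13);
        }
    } else if (exp == 31) {
        x = sign | 0x7F800000u | (mant << 13);     /* Inf or NaN */
    } else {
        x = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}
```

The MKL_F16 buffers passed to cblas_gemm_f16f16f32 would then be filled with float_to_f16 values; the float C output needs no conversion on the way out.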

2 Replies
Gennady_F_Intel
Moderator

It happens automatically, without any specific options.

If PCM doesn't recognize the use of avx512_bf16 instructions on SPR, that looks like a problem on PCM's side.


You might look at the main oneMKL product page to see the performance results of the cblas_gemm_f16f16f32 routine.


Specifically, running this routine on my end on SPR (lscpu reports Model name: Intel(R) Xeon(R) Platinum 8480+), I see the following performance results:

export KMP_AFFINITY=granularity=fine,compact,1,0

size == 4000 x 4000, GEMM bf16 performance == 53314.2 GFlops


$ echo $MKLROOT/

/opt/intel/oneapi/mkl/2024.0/


You can see that ~54 TFlops is far beyond the FP32 theoretical performance peak, which means that BF16 instructions have been used by default.


--Gennady
Gennady_F_Intel
Moderator

Forgot to add the verbose-mode output, just as an example:

...verbosing ....
MKL_VERBOSE oneMKL 2024.0 Product build 20231011 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support for INT8, BF16, FP16 (limited) instructions, and Intel(R) Advanced Matrix Extensions (Intel(R) AMX) with INT8 and BF16, Lnx 2.93GHz lp64 intel_thread

MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 53.17ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.15ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.14ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.96ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.12ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.12ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.10ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112

size == 4000 x 4000, GEMM bf16 performance == 56674.9 GFlops
