Is there any way to use AVX512-FP16 instructions on Intel Sapphire Rapids Xeon CPUs via the GEMM routines in mkl_cblas.h? When I use "cblas_gemm_f16f16f32", my system uses AVX512-FP32 instructions, as verified by PCM. Is there any way to use lower-precision floats directly in MKL? I'm using MKL 2024.0.
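For reference, a minimal call looks roughly like the sketch below (illustrative only: the matrix size is arbitrary, the 0x3C00 values assume MKL_F16 is the raw 16-bit storage type from mkl_types.h, and the prototype is the documented cblas_gemm_f16f16f32 signature):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 4000;                                  /* M = N = K */
    MKL_F16 *a = (MKL_F16 *)mkl_malloc(sizeof(MKL_F16) * n * n, 64);
    MKL_F16 *b = (MKL_F16 *)mkl_malloc(sizeof(MKL_F16) * n * n, 64);
    float   *c = (float *)mkl_malloc(sizeof(float) * n * n, 64);

    /* 0x3C00 is the IEEE binary16 encoding of 1.0 */
    for (MKL_INT i = 0; i < n * n; ++i) { a[i] = 0x3C00; b[i] = 0x3C00; c[i] = 0.0f; }

    /* C = alpha*A*B + beta*C with FP16 inputs and FP32 accumulation/output */
    cblas_gemm_f16f16f32(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                         n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);

    printf("c[0] = %.1f (expected %d)\n", c[0], (int)n);

    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}

Running with MKL_VERBOSE=1 set in the environment makes MKL print which instruction-set extensions each call dispatches to.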
This is done automatically, without any specific options.
If PCM doesn't recognize the use of avx512_bf16 instructions on SPR, that looks like a PCM problem.
You might look at the main oneMKL product page to see the performance results of the cblas_gemm_f16f16f32 routine.
Specifically, running this routine on my end on SPR (lscpu | grep Model gives: Model name: Intel(R) Xeon(R) Platinum 8480+), I see the following performance results:
export KMP_AFFINITY=granularity=fine,compact,1,0
size == 4000 v 4000, GEMM bf16 performance == 53314.2 ,GFlops
$ echo $MKLROOT/
/opt/intel/oneapi/mkl/2024.0/
You can see that ~54 TFlops is far beyond the FP32 theoretical performance peak, which means that BF16 instructions have been used by default.
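As a sanity check on that number: a 4000x4000x4000 GEMM is 2*4000^3 = 1.28e11 floating-point operations, so 53314.2 GFlops corresponds to roughly 2.4 ms per call, consistent with the ~2.1-2.4 ms times in the verbose log below.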
--Gennady
Forgot to add the verbose-mode output earlier; here it is as an example:
MKL_VERBOSE oneMKL 2024.0 Product build 20231011 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support for INT8, BF16, FP16 (limited) instructions, and Intel(R) Advanced Matrix Extensions (Intel(R) AMX) with INT8 and BF16, Lnx 2.93GHz lp64 intel_thread
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 53.17ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.15ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.14ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.96ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.12ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.12ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.10ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
size == 4000, GEMM bf16 performance == 56674.9 ,GFlops
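For completeness, the verbose log above can be reproduced with a small test along these lines (a sketch, not the exact benchmark: the size, iteration count, and BF16 bit pattern are illustrative, and mkl_verbose(1) is equivalent to setting MKL_VERBOSE=1 in the environment):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 4000;
    MKL_BF16 *a = (MKL_BF16 *)mkl_malloc(sizeof(MKL_BF16) * n * n, 64);
    MKL_BF16 *b = (MKL_BF16 *)mkl_malloc(sizeof(MKL_BF16) * n * n, 64);
    float    *c = (float *)mkl_malloc(sizeof(float) * n * n, 64);

    /* 0x3F80 is the bfloat16 encoding of 1.0 (upper 16 bits of the FP32 pattern) */
    for (MKL_INT i = 0; i < n * n; ++i) { a[i] = 0x3F80; b[i] = 0x3F80; c[i] = 0.0f; }

    mkl_verbose(1);                               /* print one MKL_VERBOSE line per call */

    for (int it = 0; it < 10; ++it) {             /* the first call includes warm-up cost */
        double t0 = dsecnd();
        cblas_gemm_bf16bf16f32(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                               n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
        double t = dsecnd() - t0;
        printf("iter %d: %.2f ms, %.1f GFlops\n", it, t * 1e3,
               2.0 * (double)n * n * n / t * 1e-9);
    }

    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}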