Is there any way to use AVX512-FP16 instructions on Intel Sapphire Rapids Xeon CPUs via the GEMM routines in mkl_cblas.h? When I use "cblas_gemm_f16f16f32", my system uses AVX512-FP32 instructions, as verified by PCM. Is there any way to use lower-precision floats directly in MKL? I'm using MKL 2024.0.
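For reference, a minimal call looks roughly like the sketch below (illustrative only: the matrix size is arbitrary, the 0x3C00 values assume MKL_F16 is the raw 16-bit storage type from mkl_types.h, and the prototype is the documented cblas_gemm_f16f16f32 signature):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 4000;                                  /* M = N = K */
    MKL_F16 *a = (MKL_F16 *)mkl_malloc(sizeof(MKL_F16) * n * n, 64);
    MKL_F16 *b = (MKL_F16 *)mkl_malloc(sizeof(MKL_F16) * n * n, 64);
    float   *c = (float *)mkl_malloc(sizeof(float) * n * n, 64);

    /* 0x3C00 is the IEEE binary16 encoding of 1.0 */
    for (MKL_INT i = 0; i < n * n; ++i) { a[i] = 0x3C00; b[i] = 0x3C00; c[i] = 0.0f; }

    /* C = alpha*A*B + beta*C with FP16 inputs and FP32 accumulation/output */
    cblas_gemm_f16f16f32(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                         n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);

    printf("c[0] = %.1f (expected %d)\n", c[0], (int)n);

    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}

Running with MKL_VERBOSE=1 set in the environment makes MKL print which instruction-set extensions each call dispatches to.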
This is done automatically, without any specific options.
If PCM doesn't recognize the use of avx512_bf16 instructions on SPR, that looks like a PCM problem.
You might look at the main oneMKL product page to see the performance results of the cblas_gemm_f16f16f32 routine.
Specifically, running this routine on my end on SPR (lscpu | grep Model gives: Model name: Intel(R) Xeon(R) Platinum 8480+), I see the following performance results:
export KMP_AFFINITY=granularity=fine,compact,1,0
size == 4000 v 4000, GEMM bf16 performance == 53314.2 ,GFlops
$ echo $MKLROOT/
/opt/intel/oneapi/mkl/2024.0/
You can see that ~54 TFlops is far beyond the FP32 theoretical performance peak, which means that BF16 instructions have been used by default.
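As a sanity check on that number: a 4000x4000x4000 GEMM is 2*4000^3 = 1.28e11 floating-point operations, so 53314.2 GFlops corresponds to roughly 2.4 ms per call, consistent with the ~2.1-2.4 ms times in the verbose log below.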
--Gennady
Forgot to add the verbose-mode output earlier; here it is as an example:
MKL_VERBOSE oneMKL 2024.0 Product build 20231011 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support for INT8, BF16, FP16 (limited) instructions, and Intel(R) Advanced Matrix Extensions (Intel(R) AMX) with INT8 and BF16, Lnx 2.93GHz lp64 intel_thread
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 53.17ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.15ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.14ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.96ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.12ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.12ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
MKL_VERBOSE GEMM_BF16BF16F32(N,N,4000,4000,4000,0x7ffca87fa7b8,0x1490820cd080,4000,0x149083f52080,4000,0x7ffca87fa7c0,0x14907e3c3080,4000) 2.10ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:112
size == 4000, GEMM bf16 performance == 56674.9 ,GFlops
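For completeness, the verbose log above can be reproduced with a small test along these lines (a sketch, not the exact benchmark: the size, iteration count, and BF16 bit pattern are illustrative, and mkl_verbose(1) is equivalent to setting MKL_VERBOSE=1 in the environment):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 4000;
    MKL_BF16 *a = (MKL_BF16 *)mkl_malloc(sizeof(MKL_BF16) * n * n, 64);
    MKL_BF16 *b = (MKL_BF16 *)mkl_malloc(sizeof(MKL_BF16) * n * n, 64);
    float    *c = (float *)mkl_malloc(sizeof(float) * n * n, 64);

    /* 0x3F80 is the bfloat16 encoding of 1.0 (upper 16 bits of the FP32 pattern) */
    for (MKL_INT i = 0; i < n * n; ++i) { a[i] = 0x3F80; b[i] = 0x3F80; c[i] = 0.0f; }

    mkl_verbose(1);                               /* print one MKL_VERBOSE line per call */

    for (int it = 0; it < 10; ++it) {             /* the first call includes warm-up cost */
        double t0 = dsecnd();
        cblas_gemm_bf16bf16f32(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                               n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
        double t = dsecnd() - t0;
        printf("iter %d: %.2f ms, %.1f GFlops\n", it, t * 1e3,
               2.0 * (double)n * n * n / t * 1e-9);
    }

    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}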