AMX extensions use in bfloat16 matrix multiplication


I have access to a Sapphire Rapids machine and I want to multiply two bfloat16 matrices A and B, computing C = A*B by exploiting the AMX_BF16 extensions. I am happy with C being stored in single precision. What is the recommended way of doing this with current Intel software resources?


OPTION 1: Directly using Intel intrinsics by following . However, the author of this code example states that it should not be used as a basis for production code and that it was written for demonstration purposes only.


OPTION 2: Intel MKL cblas_gemm_bf16bf16f32:


OPTION 3: Let the compiler handle it. I tried compiling a matrix-matrix multiplication example such as the following



#include <stdlib.h>
#include <stdio.h>

int main(){

   __bf16 A[1024];   /* 16 x 64, row-major */
   __bf16 B[1024];   /* 16 x 64, row-major, indexed as B[j][k] below */
   float C[256]={};  /* 16 x 16 result in single precision */

   for(int i=0; i<1024; i++){
       A[i] = 1.;
       B[i] = 2.;
   }

   /* C[i][j] += A[i][k] * B[j][k] */
   for (int k=0; k < 64; k++){
       for (int i=0; i < 16; i++){
           for (int j=0; j < 16; j++){
               C[i*16 + j] += ((float) A[i*64 + k])*((float) B[j*64 + k]);
           }
       }
   }

   for (int i = 0; i < 16; i++){
     for (int j = 0; j < 16; j++){
         printf("%f ", (float) C[i*16 + j]);
     }
   }

   return 0;
}


with gcc -march=sapphirerapids -O3 -mamx-bf16 (the icx compiler just crashes due to a bug). However, looking at the generated assembly code, it does not seem like any AMX instructions are used at all.

Your issue has been escalated to our engineers. We will work on it internally and reply with an update when we have a solution.

Did you encounter a compiler error or a runtime error?

The code compiled successfully with oneAPI2024.1.0


$ icx -march=sapphirerapids -O3 -mamx-bf16 t3.c -c -V
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240308
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.



Hi there! I had used oneAPI2024.0.2. The bug is fixed in oneAPI2024.1.0 and the code now compiles fine, which is good.


However, this is not my question: the latest icx does not use AMX-BF16 (nor AVX512-BF16) assembly instructions when compiling the above code. You can check with the -S compiler flag that the instructions tdpbf16ps and vdpbf16ps DO NOT appear in the output, which indicates that neither AMX-BF16 nor AVX512-BF16 instructions are used by the compiler.
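For readers unfamiliar with vdpbf16ps, here is a minimal scalar sketch of what one such instruction computes per fp32 lane: each lane accumulates the products of one pair of adjacent bf16 elements. This is only a model for illustration, not Intel's implementation; the function names are hypothetical, bf16 values are held as raw uint16_t bit patterns, and the instruction's exact rounding and denormal handling are ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* bf16 is the upper 16 bits of an IEEE-754 single precision value. */
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: float -> bf16 by truncation (hardware rounds instead). */
uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Scalar model of a vdpbf16ps-style update on `lanes` fp32 lanes.
   a and b each hold 2*lanes bf16 values; lane i consumes elements 2i and 2i+1. */
void dpbf16ps_model(float *acc, const uint16_t *a, const uint16_t *b, int lanes) {
    assert(lanes > 0);
    for (int i = 0; i < lanes; i++) {
        acc[i] += bf16_to_f32(a[2*i])     * bf16_to_f32(b[2*i]);
        acc[i] += bf16_to_f32(a[2*i + 1]) * bf16_to_f32(b[2*i + 1]);
    }
}
```

A 512-bit register corresponds to lanes = 16, so a single instruction performs 32 bf16 multiplications, which is where the advantage over converting each element and calling vmulss comes from.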


It might be me, but I am finding it impossible (with oneAPI2024.1.0 and the latest gcc and llvm compilers) to use these BF16 instruction sets (AMX-BF16 and AVX512-BF16) without directly calling Intel intrinsics. The problem with the intrinsics is that it is very hard for non-experts to achieve performance that is competitive with compiler-generated instructions. If there were an easier way it would be extremely interesting to know!


Thanks a lot!

The compiler should be able to generate AMX-BF16 and AVX512-BF16 instructions with -march=sapphirerapids.

I used your test case posted on the other thread and I see vcvtneps2bf16 in the generated code:

$ cat t2.c
#include <stdlib.h>
#include <stdio.h>
#define M 4
__bf16 bf16test(__bf16* a){
    __bf16 r = a[0]*a[1] - a[1]*a[2];
    return r;
}
int main()
{
    __bf16 ab[M];
    for (long i = 0; i<M; i++){
        ab[i] = (__bf16) (random()/((double) RAND_MAX));
    }
    __bf16 rb = bf16test(ab);
    printf("Value bf16: %f\n", (double) rb);
    return 0;
}

$ icx -march=sapphirerapids -O3 t2.c -S

$ vi t2.s

bf16test: #
        .cfi_startproc
# %bb.0:
        movzwl (%rdi), %eax
        shll $16, %eax
        vmovd %eax, %xmm0
        movzwl 2(%rdi), %eax
        shll $16, %eax
        vmovd %eax, %xmm1
        movzwl 4(%rdi), %eax
        shll $16, %eax
        vmovd %eax, %xmm2
        vsubss %xmm2, %xmm0, %xmm0
        vmulss %xmm1, %xmm0, %xmm0
        vcvtneps2bf16 %xmm0, %xmm0
        vmovw %xmm0, %eax
        vmovw %eax, %xmm0
        retq
        ...
        callq __truncdfbf2@PLT
        vmovw %xmm0, %eax
        shll $16, %eax
        vmovd %eax, %xmm0
        shll $16, %ebx
        vmovd %ebx, %xmm1
        vsubss %xmm1, %xmm0, %xmm0
        vmulss 12(%rsp), %xmm0, %xmm0 # 4-byte Folded Reload
        vcvtneps2bf16 %xmm0, %xmm0
        vmovw %xmm0, %eax
        shll $16, %eax
        vmovd %eax, %xmm0
        vcvtss2sd %xmm0, %xmm0, %xmm0
        movl $.L.str, %edi
        movb $1, %al


Unfortunately, oneAPI2024.1.0 still crashes on t2.c, so you will not be able to reproduce this generated code yourself yet.

I used an internal compiler build to generate the code above.

t2.c should be able to compile with the upcoming oneAPI2024.* release. I don't have an ETA to share but will let you know when it's available.
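To make the listing above easier to read: the movzwl / shll $16 / vmovd sequences widen a bf16 value to single precision by placing its 16 bits in the upper half of a 32-bit word, and vcvtneps2bf16 narrows a single precision result back to bf16. A minimal C sketch of both directions follows, assuming round-to-nearest-even for the narrowing step. The function names are hypothetical, and the hardware instruction's treatment of denormals is not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen bf16 (raw bits) to float: shift into the upper 16 bits. */
float bf16_bits_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow float to bf16 (raw bits) with round-to-nearest-even. */
uint16_t float_to_bf16_bits(float f) {
    uint32_t bits;
    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)        /* NaN: return a quiet NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    uint32_t lsb = (bits >> 16) & 1u;              /* lowest bit that is kept */
    bits += 0x7FFFu + lsb;                         /* ties go to the even value */
    return (uint16_t)(bits >> 16);
}
```

Only the conversions touch bf16 here. The arithmetic in between (vsubss, vmulss) is ordinary scalar single precision.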



Thanks for looking into this!


You are right, I was not being precise enough. The vcvtneps2bf16 instructions only convert single precision results back to bf16. Dot products and matrix multiplications are still always done using scalar AVX single precision instructions (e.g. vmulss) rather than using AMX tiles or AVX512-BF16 dot products. What I would like to see is the instructions tdpbf16ps (matrix multiplication) or vdpbf16ps (dot product) being used, since these are what make BF16 advantageous and efficient. Instead, all the compiler does is cast everything to single precision.
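To illustrate what the compiler would have to recognise, here is a scalar sketch of the tile update performed by tdpbf16ps, as I understand it from the instruction's documented pseudocode. The A tile holds M rows of 2K bf16 values, the B tile holds K rows of 2N bf16 values with pairs of consecutive k interleaved, and C holds M x N floats. The function names are hypothetical, bf16 values are raw uint16_t bit patterns, and exact rounding is ignored. The pack_b helper assumes B is stored as in my example above, i.e. indexed B[j*2K + k].

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Scalar model of a tdpbf16ps-style tile update: C (M x N) += A (M x 2K) * B.
   B is K rows of 2N values: row k holds the pair (2k, 2k+1) for each column n. */
void tdpbf16ps_model(float *C, const uint16_t *A, const uint16_t *B,
                     int M, int N, int K) {
    /* A tile has at most 16 rows of 64 bytes. */
    assert(M <= 16 && N <= 16 && K <= 16);
    for (int m = 0; m < M; m++)
        for (int k = 0; k < K; k++)
            for (int n = 0; n < N; n++) {
                C[m*N + n] += bf16_to_f32(A[m*2*K + 2*k])
                            * bf16_to_f32(B[k*2*N + 2*n]);
                C[m*N + n] += bf16_to_f32(A[m*2*K + 2*k + 1])
                            * bf16_to_f32(B[k*2*N + 2*n + 1]);
            }
}

/* Hypothetical helper: repack B stored as N rows of 2K values (B[j*2K + k])
   into the pair-interleaved layout expected by the model above. */
void pack_b(uint16_t *Bp, const uint16_t *B, int N, int K) {
    for (int k = 0; k < K; k++)
        for (int n = 0; n < N; n++)
            for (int r = 0; r < 2; r++)
                Bp[k*2*N + 2*n + r] = B[n*2*K + 2*k + r];
}
```

With 16 x 64 inputs, one tile covers only 32 values of k, so the full product would take two such updates over the two halves of k. The required repacking of B and this blocking are part of why a plain triple loop is hard for a compiler to map onto AMX automatically.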

Thanks for the clarification. Let me work with our code gen team and get back to you.

