Intel® oneAPI DPC++/C++ Compiler

AMX extensions use in bfloat16 matrix multiplication

crocix
Novice

I have access to a Sapphire Rapids machine and I want to multiply two bfloat16 matrices A and B, computing C = A*B, by exploiting the AMX-BF16 extensions. I am happy with C being stored in single precision. What is the recommended way of doing this with current Intel software resources?

 

OPTION 1: Directly using Intel intrinsics by following https://www.intel.com/content/www/us/en/developer/articles/code-sample/advanced-matrix-extensions-intrinsics-functions.html#gs.67r3za . However, the author of that code sample states that it should not be used as a basis for production code and was made for demonstration purposes only.
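
For concreteness, here is a minimal sketch of what OPTION 1 looks like, modeled on that code sample (demonstration only, not production code; Linux-only, since the kernel must grant tile-data permission first; with uniform inputs the VNNI pair-interleaving that tdpbf16ps expects for B does not matter here, but real code must repack B, see later in the thread). Compile with icx or gcc and -march=sapphirerapids (or -mamx-tile -mamx-bf16):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_REQ_XCOMP_PERM 0x1023   /* from asm/prctl.h */
#define XFEATURE_XTILEDATA  18

/* 64-byte tile configuration: palette 1, then bytes-per-row and rows per tile. */
struct tile_config {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];
    uint8_t  rows[16];
};

int main(void){
    /* Linux requires a one-time permission request before tile registers can be used. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)){
        perror("AMX permission request failed");
        return 1;
    }

    /* tmm0 = C (16x16 f32), tmm1 = A (16x32 bf16), tmm2 = B (16x32 bf16). */
    struct tile_config cfg = {0};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 16 * sizeof(float);
    cfg.rows[1] = 16; cfg.colsb[1] = 32 * sizeof(__bf16);
    cfg.rows[2] = 16; cfg.colsb[2] = 32 * sizeof(__bf16);
    _tile_loadconfig(&cfg);

    __bf16 A[16 * 32], B[16 * 32];
    float  C[16 * 16] = {0};
    for (int i = 0; i < 16 * 32; i++){ A[i] = 1.f; B[i] = 2.f; }

    _tile_loadd(1, A, 32 * sizeof(__bf16));  /* stride is in bytes */
    _tile_loadd(2, B, 32 * sizeof(__bf16));
    _tile_loadd(0, C, 16 * sizeof(float));
    _tile_dpbf16ps(0, 1, 2);                 /* C += A * B, bf16 pair dot products */
    _tile_stored(0, C, 16 * sizeof(float));
    _tile_release();

    printf("C[0] = %f\n", C[0]);             /* expect 64.0 = 32 * 1 * 2 */
    return 0;
}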

 

OPTION 2: Intel MKL cblas_gemm_bf16bf16f32: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-0/cblas-gemm-bf16bf16f32.html
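
A minimal sketch of what the OPTION 2 call could look like (the f32_to_bf16 helper is my own; oneMKL stores bf16 operands as MKL_BF16, i.e. the top 16 bits of a float in an unsigned short). Build with something like icx -qmkl gemm_bf16.c:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <mkl.h>

/* Helper of mine: truncate a float to bf16 by keeping its top 16 bits. */
static MKL_BF16 f32_to_bf16(float f){
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return (MKL_BF16)(bits >> 16);
}

int main(void){
    const MKL_INT M = 16, N = 16, K = 64;
    MKL_BF16 A[16 * 64], B[64 * 16];
    float C[16 * 16] = {0};

    for (int i = 0; i < M * K; i++) A[i] = f32_to_bf16(1.f);
    for (int i = 0; i < K * N; i++) B[i] = f32_to_bf16(2.f);

    /* C = 1.0 * A*B + 0.0 * C, all matrices row-major. */
    cblas_gemm_bf16bf16f32(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                           M, N, K, 1.f, A, K, B, N, 0.f, C, N);

    printf("C[0] = %f\n", C[0]);  /* expect 128.0 = 64 * 1 * 2 */
    return 0;
}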

 

OPTION 3: Let the compiler handle it. I tried compiling a matrix-matrix multiplication example such as the following:

 

 

#include <stdlib.h>
#include <stdio.h>

int main(){

   /* A and B hold 16x64 bf16 matrices; C accumulates the 16x16 product
      in single precision. */
   __bf16 A[1024];
   __bf16 B[1024];
   float C[256] = {0};

   for(int i=0; i<1024; i++){
       A[i] = 1.;
       B[i] = 2.;
   }

   /* C[i][j] += A[i][k] * B[j][k]: B is accessed transposed, and each
      product is widened to float before accumulation. */
   for (int k=0; k < 64; k++){
       for (int i=0; i < 16; i++){
           for (int j=0; j < 16; j++){
               C[i*16 + j] += ((float) A[i*64 + k])*((float) B[j*64 + k]);
           }
       }
   }

   for (int i = 0; i < 16; i++){
     for (int j = 0; j < 16; j++){
         printf("%f ", C[i*16 + j]);
     }
     printf("\n");
   }
   printf("\n");
}

 

 

with gcc -march=sapphirerapids -O3 -mamx-bf16 (the icx compiler just crashes due to a bug). However, looking at the generated assembly on godbolt.org, it does not seem that AMX instructions are used at all.

Alex_Y_Intel
Moderator

Your issue has been escalated to our engineers. We'll work on it internally and reply with an update when we have a solution.

Viet_H_Intel
Moderator

Hi, 

Did you encounter a compiler error or a runtime error?

The code compiles successfully with oneAPI 2024.1.0:

 

$ icx -march=sapphirerapids -O3 -mamx-bf16 t3.c -c -V
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240308
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

$

 

crocix
Novice

Hi there! I had been using oneAPI 2024.0.2; the bug was indeed fixed in oneAPI 2024.1.0 and the code now compiles fine, which is good.

 

However, this was not my question: the latest icx does not use AMX-BF16 (nor AVX512-BF16) instructions when compiling the above code. You can check with the -S compiler flag that the instructions tdpbf16ps and vdpbf16ps do not appear, indicating that neither AMX-BF16 nor AVX512-BF16 instructions are emitted by the compiler.
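
One quick way to check (using t3.c from your compile line above):

$ icx -march=sapphirerapids -O3 -mamx-bf16 -S t3.c
$ grep -E 'tdpbf16ps|vdpbf16ps' t3.s
$

Neither instruction appears in the output.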

 

It might be me, but I am finding it impossible (with oneAPI 2024.1.0 and the latest gcc and llvm compilers) to use these BF16 instruction sets (AMX-BF16 and AVX512-BF16) without directly calling Intel intrinsics. The problem with the intrinsics is that it is very hard for non-experts to achieve performance competitive with compiler-generated code. If there is an easier way, it would be extremely interesting to know!

 

Thanks a lot!

Viet_H_Intel
Moderator

The compiler should be able to generate AMX-BF16 and AVX512-BF16 instructions with -march=sapphirerapids.

I used your test case posted on the other thread and see vcvtneps2bf16 being generated:

$ cat t2.c
#include <stdlib.h>
#include <stdio.h>

#define M 4

__bf16 bf16test(__bf16* a){
    __bf16 r = a[0]*a[1] - a[1]*a[2];
    return r;
}

int main()
{
    srandom(12345678);
    __bf16 ab[M];
    for (long i = 0; i<M; i++){
        ab[i] = (__bf16) (random()/((double) RAND_MAX));
    }
    __bf16 rb = bf16test(ab);
    printf("Value bf16: %f\n", (double) rb);
    return 0;
}

$ icx -march=sapphirerapids -O3 t2.c -S

$ vi t2.s

bf16test:
        .cfi_startproc
# %bb.0:
        movzwl  (%rdi), %eax
        shll    $16, %eax
        vmovd   %eax, %xmm0
        movzwl  2(%rdi), %eax
        shll    $16, %eax
        vmovd   %eax, %xmm1
        movzwl  4(%rdi), %eax
        shll    $16, %eax
        vmovd   %eax, %xmm2
        vsubss  %xmm2, %xmm0, %xmm0
        vmulss  %xmm1, %xmm0, %xmm0
        vcvtneps2bf16 %xmm0, %xmm0
        vmovw   %xmm0, %eax
        vmovw   %eax, %xmm0
        retq

...

        callq   __truncdfbf2@PLT
        vmovw   %xmm0, %eax
        shll    $16, %eax
        vmovd   %eax, %xmm0
        shll    $16, %ebx
        vmovd   %ebx, %xmm1
        vsubss  %xmm1, %xmm0, %xmm0
        vmulss  12(%rsp), %xmm0, %xmm0  # 4-byte Folded Reload
        vcvtneps2bf16 %xmm0, %xmm0
        vmovw   %xmm0, %eax
        shll    $16, %eax
        vmovd   %eax, %xmm0
        vcvtss2sd %xmm0, %xmm0, %xmm0
        movl    $.L.str, %edi
        movb    $1, %al

 

Unfortunately, oneAPI 2024.1.0 still crashes on t2.c, so you can't see this generated code with it.

I used an internal build for the code generation above.

t2.c should compile with the upcoming oneAPI 2024.* release. I don't have an ETA to share, but I will let you know when it's available.

 

 

crocix
Novice

Thanks for looking into this!

 

You are right, I was not being precise enough. The vcvtneps2bf16 instructions only convert single-precision results back to bf16 (the bf16 inputs are widened to single precision with shifts and moves). The dot products and matrix multiplications themselves are then always done using AVX single-precision instructions (e.g. vmulss) rather than AMX tiles or AVX512-BF16 dot products. What I would like to see are the instructions tdpbf16ps (tile matrix multiplication) and vdpbf16ps (dot product), since these are what make BF16 advantageous and efficient. Instead, the compiler simply widens everything to single precision.
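
For reference, the dot-product step I would like to see is exposed as an AVX512-BF16 intrinsic (a minimal sketch; the function name is mine):

#include <immintrin.h>

/* Each f32 lane of acc gets two bf16 products added:
   acc[i] += a[2i]*b[2i] + a[2i+1]*b[2i+1].
   Compiles to vdpbf16ps with -march=sapphirerapids. */
__m512 bf16_dot_step(__m512 acc, __m512bh a, __m512bh b){
    return _mm512_dpbf16_ps(acc, a, b);
}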

Viet_H_Intel
Moderator

Thanks for the clarification. Let me work with our code generation team and get back to you.

cristobal1
Beginner

Dear all,

Is there any news on this issue? I expect to go through the same story and wanted to anticipate it by asking whether anything has changed since the last post. It would be great to know that OPTION 2 (the Intel MKL CBLAS example) and OPTION 3 (the compiler) make use of AMX instructions.

 

Viet_H_Intel
Moderator

Hi, 

 

I have some input from our developer:

  1. The code has a problem: a tile register holds at most 16 rows x 64 bytes = 1024 bytes, while __bf16 is 2 bytes long, so neither A[1024] nor B[1024] (2048 bytes each) can fit into one tile register;
  2. The compiler doesn't support auto-generation of AMX instructions. We provide 3 methods for easy use:

Thanks,

crocix
Novice

@Viet_H_Intel 

 

Thanks a lot for the reply and please thank the developer as well!

 

I picked the matrix size in the code on purpose: while it is true that it does not fit in one tile register, that matrix multiplication can still be implemented with AMX by splitting the matrices into two blocks and reshaping them accordingly, as sketched below.
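
To illustrate (a sketch only, reusing the tile configuration and permission request from the OPTION 1 example earlier in the thread; the vnni_pack helper and all names here are mine): the 16x64 by 64x16 product from my original code can be done as two tile multiplications accumulated into one C tile, provided B is first repacked into the pair-interleaved layout that tdpbf16ps expects.

#include <immintrin.h>

/* Repack row-major B (K x N bf16) so that packed row k/2 holds B[k][n]
   at column 2*n + (k % 2), as tdpbf16ps requires. */
static void vnni_pack(const __bf16 *B, __bf16 *Bp, int K, int N){
    for (int k = 0; k < K; k++)
        for (int n = 0; n < N; n++)
            Bp[(k / 2) * (2 * N) + 2 * n + (k % 2)] = B[k * N + n];
}

/* C (16x16 f32) = A (16x64 bf16) * B (64x16 bf16), in two K-blocks of 32. */
static void amx_gemm_16x16x64(const __bf16 A[16][64], const __bf16 B[64][16],
                              float C[16][16]){
    __bf16 Bp[32][32];                 /* 64/2 packed rows of 2*16 bf16 each */
    vnni_pack(&B[0][0], &Bp[0][0], 64, 16);

    _tile_zero(0);                     /* accumulator tile starts at zero */
    for (int kb = 0; kb < 2; kb++){    /* two K-blocks of 32 */
        _tile_loadd(1, &A[0][kb * 32], 64 * sizeof(__bf16));   /* A block */
        _tile_loadd(2, &Bp[kb * 16][0], 32 * sizeof(__bf16));  /* packed B block */
        _tile_dpbf16ps(0, 1, 2);       /* C += Ablock * Bblock */
    }
    _tile_stored(0, &C[0][0], 16 * sizeof(float));
}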

 

It would be nice if compilers sorted this out for us in the future, and/or if such larger matrix multiplications were handled for us in the MKL library: none of the three options goes beyond the maximum tile size, which means this splitting/blocking of larger matrices must be done by hand. Furthermore, the memory layout needed so that AMX can be used for these blocked operations is highly non-trivial and not well documented.

 

Intel seems to be pushing these new accelerators, but until user support improves, uptake will be slow. People shouldn't need to learn Intel intrinsics to be able to use AMX; NVIDIA cuBLAS, for instance, uses tensor cores automatically.

 

Is there any additional documentation for the methods mentioned in point 2? I found the Intel development guide rather obscure in this regard.

 

Thanks a lot!

Viet_H_Intel
Moderator

For option 2: could you please post a new topic at

https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/bd-p/oneapi-math-kernel-library

so that your MKL questions can be addressed by the MKL team?

 

Thanks,

crocix
Novice

Question posted, thanks:

 

https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/AMX-bf16-use-in-MKL-CBLAS/m-p/1628245

 

Again, it would be nice to see compiler and/or higher-level support for this, along with more documentation and multiple examples.

 

Thanks!

crocix
Novice

It seems that the latest oneMKL does support AMX-accelerated GEMMs:

 

https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/AMX-bf16-use-in-MKL-CBLAS/m-p/1628245

 

In my experience this is slightly slower than hand-written AMX intrinsics, but it is so much more convenient.
