I have access to a Sapphire Rapids machine and want to multiply two bfloat16 matrices A and B, computing C = A*B by exploiting the AMX-BF16 extension. I am happy with C being stored in single precision. What is the recommended way of doing this with current Intel software resources?
OPTION 1: Directly using Intel intrinsics by following https://www.intel.com/content/www/us/en/developer/articles/code-sample/advanced-matrix-extensions-intrinsics-functions.html#gs.67r3za . However, the author of that code sample states that it should not be used as a basis for production code and was made for demonstration purposes only.
OPTION 2: Intel MKL cblas_gemm_bf16bf16f32 (a sketch of this call appears after the code below): https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-0/cblas-gemm-bf16bf16f32.html
OPTION 3: Let the compiler handle it. I tried compiling a matrix-matrix multiplication example such as the following:
#include <stdio.h>

int main(){
    /* A and B are 16x64 bf16 matrices; C[i][j] = sum_k A[i][k] * B[j][k],
       i.e. B is stored transposed. C is accumulated in fp32. */
    __bf16 A[1024];
    __bf16 B[1024];
    float C[256] = {0};
    for (int i = 0; i < 1024; i++){
        A[i] = 1.;
        B[i] = 2.;
    }
    for (int k = 0; k < 64; k++){
        for (int i = 0; i < 16; i++){
            for (int j = 0; j < 16; j++){
                /* Widen each bf16 operand to float before multiplying. */
                C[i*16 + j] += ((float) A[i*64 + k]) * ((float) B[j*64 + k]);
            }
        }
    }
    for (int i = 0; i < 16; i++){
        for (int j = 0; j < 16; j++){
            printf("%f ", C[i*16 + j]);
        }
        printf("\n");
    }
    printf("\n");
}
with gcc -march=sapphirerapids -O3 -mamx-bf16 (the icx compiler just crashes due to a bug). However, looking at the generated assembly on godbolt.org, it does not seem that AMX instructions are used at all.
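(For comparison, OPTION 2 boils down to a single library call. A minimal sketch, assuming MKL_BF16 is the 16-bit bfloat16 storage type from oneMKL's mkl_types.h and that simple truncation to bf16 is acceptable for preparing the inputs; link against oneMKL, e.g. with icx -qmkl:)

#include <mkl.h>
#include <stdio.h>
#include <string.h>

/* Truncate an fp32 value to bf16 by keeping its upper 16 bits. */
static MKL_BF16 to_bf16(float x){
    unsigned int u;
    memcpy(&u, &x, sizeof(u));
    return (MKL_BF16)(u >> 16);
}

int main(void){
    enum { M = 16, N = 16, K = 64 };
    MKL_BF16 A[M*K], B[K*N];
    float C[M*N] = {0};
    for (int i = 0; i < M*K; i++) A[i] = to_bf16(1.0f);
    for (int i = 0; i < K*N; i++) B[i] = to_bf16(2.0f);

    /* Row-major C(MxN) = 1.0 * A(MxK) * B(KxN) + 0.0 * C; MKL dispatches
       to BF16-capable kernels internally where the hardware supports them. */
    cblas_gemm_bf16bf16f32(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                           M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    printf("C[0] = %f\n", C[0]);  /* expect 128.0 = 64 * (1 * 2) */
    return 0;
}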
---
Your issue has been escalated to our engineers. We'll work on it internally and reply with an update when we have a solution.
---
Hi,
Did you encounter a compiler error or a runtime error?
The code compiles successfully with oneAPI 2024.1.0:
$ icx -march=sapphirerapids -O3 -mamx-bf16 t3.c -c -V
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240308
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
$
---
Hi there! I had used oneAPI 2024.0.2, but the bug was fixed in oneAPI 2024.1.0 and it now compiles fine, which is good.
However, that is not my question: the latest icx does not use AMX-BF16 (or AVX512-BF16) instructions when compiling the above code. You can check with the -S compiler flag that the instructions tdpbf16ps and vdpbf16ps DO NOT appear, indicating that neither AMX-BF16 nor AVX512-BF16 is used by the compiler.
It might be me, but I am finding it impossible (with oneAPI 2024.1.0 and the latest gcc and llvm compilers) to use these BF16 instruction sets (AMX-BF16 and AVX512-BF16) without calling Intel intrinsics directly. The problem with intrinsics is that it is very hard for non-experts to achieve performance competitive with compiler-generated code. If there were an easier way, it would be extremely interesting to know!
Thanks a lot!
---
The compiler should be able to generate AMX-BF16 and AVX512-BF16 instructions with -march=sapphirerapids.
I used your test case posted in the other thread and see this vcvtneps2bf16:
$ cat t2.c
#include <stdlib.h>
#include <stdio.h>
#define M 4

__bf16 bf16test(__bf16* a){
    __bf16 r = a[0]*a[1] - a[1]*a[2];
    return r;
}

int main()
{
    srandom(12345678);
    __bf16 ab[M];
    for (long i = 0; i < M; i++){
        ab[i] = (__bf16) (random()/((double) RAND_MAX));
    }
    __bf16 rb = bf16test(ab);
    printf("Value bf16: %f\n", (double) rb);
    return 0;
}
$ icx -march=sapphirerapids -O3 t2.c -S
$ vi t2.s
bf16test:                               #
	.cfi_startproc
# %bb.0:
	movzwl	(%rdi), %eax
	shll	$16, %eax
	vmovd	%eax, %xmm0
	movzwl	2(%rdi), %eax
	shll	$16, %eax
	vmovd	%eax, %xmm1
	movzwl	4(%rdi), %eax
	shll	$16, %eax
	vmovd	%eax, %xmm2
	vsubss	%xmm2, %xmm0, %xmm0
	vmulss	%xmm1, %xmm0, %xmm0
	vcvtneps2bf16	%xmm0, %xmm0
	vmovw	%xmm0, %eax
	vmovw	%eax, %xmm0
	retq
	...
	callq	__truncdfbf2@PLT
	vmovw	%xmm0, %eax
	shll	$16, %eax
	vmovd	%eax, %xmm0
	shll	$16, %ebx
	vmovd	%ebx, %xmm1
	vsubss	%xmm1, %xmm0, %xmm0
	vmulss	12(%rsp), %xmm0, %xmm0          # 4-byte Folded Reload
	vcvtneps2bf16	%xmm0, %xmm0
	vmovw	%xmm0, %eax
	shll	$16, %eax
	vmovd	%eax, %xmm0
	vcvtss2sd	%xmm0, %xmm0, %xmm0
	movl	$.L.str, %edi
	movb	$1, %al
Unfortunately, oneAPI 2024.1.0 still crashes on t2.c, so you can't see this code generated yourself. I used an internal build for the code generation above.
t2.c should compile with an upcoming oneAPI 2024.* release. I don't have an ETA to share but will let you know when it's available.
---
Thanks for looking into this!
You are right, I was not being precise enough. The vcvtneps2bf16 instruction is only a conversion (from single precision to bf16). The dot products and matrix multiplications themselves are still done with single-precision AVX instructions (e.g. vmulss) rather than with AMX tiles or AVX512-BF16 dot products. What I would like to see are the instructions tdpbf16ps (matrix multiplication) and vdpbf16ps (dot product), since these are what make BF16 advantageous and efficient. Instead, the compiler simply promotes everything to single precision and back.
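(For illustration: the vdpbf16ps path can be reached today by calling the AVX512-BF16 intrinsics directly. A minimal sketch, assuming a compiler that accepts -march=sapphirerapids:)

#include <immintrin.h>
#include <stdio.h>

int main(void){
    float a32[32], b32[32], out[16];
    for (int i = 0; i < 32; i++){ a32[i] = 1.0f; b32[i] = 2.0f; }

    /* Convert 2 x 16 floats into one 32-element bf16 vector (vcvtne2ps2bf16). */
    __m512bh a = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(a32 + 16),
                                     _mm512_loadu_ps(a32));
    __m512bh b = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(b32 + 16),
                                     _mm512_loadu_ps(b32));

    /* acc[n] = 0 + a[2n]*b[2n] + a[2n+1]*b[2n+1] -- this emits vdpbf16ps. */
    __m512 acc = _mm512_dpbf16_ps(_mm512_setzero_ps(), a, b);

    _mm512_storeu_ps(out, acc);
    printf("out[0] = %f\n", out[0]);  /* expect 4.0 = 1*2 + 1*2 */
    return 0;
}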
---
Thanks for the clarification. Let me work with our code generation team and get back to you.
---
Dear all,
Is there any news on this issue? I expect to run into the same problem and wanted to ask ahead whether anything has changed since the last post. It would be great to know whether OPTION 2 (the Intel MKL CBLAS call) and OPTION 3 (the compiler) make use of AMX instructions.
---
Hi,
I have some input from our developers:
- The code has a problem: a tile register holds at most 16 rows x 64 bytes = 1024 bytes, and since __bf16 is 2 bytes, A[1024] and B[1024] (2048 bytes each) cannot fit into a single tile register;
- The compiler does not support automatic generation of AMX instructions. We provide three methods for easier use (a sketch of the first one follows this list):
  - An assembly-like model that maps directly onto the AMX instructions. The user must configure tile register shapes, specify tile register numbers, etc. each time;
  - A C-like model. The compiler manages tile register shapes and tile register allocation, but the user must handle the matrix layout themselves;
  - oneapi_matrix, a high-level wrapper in which the matrix layout is transparent to the user. For more information, see https://github.com/intel/llvm/blob/1bd076b14ad3858b815207bed9c731f13ce75038/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
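(For reference, here is a minimal sketch of the first, assembly-like model, assuming Linux, where user space must request AMX tile state via arch_prctl, and the _tile_* intrinsics from immintrin.h; the B operand is assumed to be pre-packed in the pair layout that tdpbf16ps expects, and all sizes are chosen to fill exactly one tile:)

#include <immintrin.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_REQ_XCOMP_PERM 0x1023   /* Linux arch_prctl request */
#define XFEATURE_XTILEDATA  18       /* AMX tile-data state component */

/* 64-byte tile configuration for palette 1. */
struct tile_config {
    unsigned char  palette_id;
    unsigned char  start_row;
    unsigned char  reserved[14];
    unsigned short colsb[16];   /* bytes per row of each tile   */
    unsigned char  rows[16];    /* number of rows of each tile  */
};

int main(void){
    /* Request permission to use AMX tile data (Linux-specific). */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)){
        fprintf(stderr, "AMX not available\n");
        return 1;
    }

    struct tile_config cfg;
    memset(&cfg, 0, sizeof(cfg));
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;  /* tmm0: C, 16 x 16 fp32          */
    cfg.rows[1] = 16; cfg.colsb[1] = 64;  /* tmm1: A, 16 x 32 bf16          */
    cfg.rows[2] = 16; cfg.colsb[2] = 64;  /* tmm2: B, 16 x 32 bf16 (packed) */
    _tile_loadconfig(&cfg);

    __bf16 A[16*32], B[16*32];  /* B already in the pair ("VNNI") layout */
    float  C[16*16] = {0};
    for (int i = 0; i < 16*32; i++){ A[i] = 1.f; B[i] = 2.f; }

    _tile_loadd(1, A, 64);     /* load A, 64-byte row stride   */
    _tile_loadd(2, B, 64);     /* load packed B                */
    _tile_loadd(0, C, 64);     /* load the fp32 accumulator    */
    _tile_dpbf16ps(0, 1, 2);   /* C += A * B: one tdpbf16ps    */
    _tile_stored(0, C, 64);    /* store the result             */
    _tile_release();

    printf("C[0] = %f\n", C[0]);  /* expect 64.0 = 32 * (1 * 2) */
    return 0;
}

(This compiles with gcc or icx at -O2 -march=sapphirerapids and runs only on AMX-enabled hardware with a sufficiently recent kernel.)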
Thanks,
---
Thanks a lot for the reply and please thank the developer as well!
I picked the matrix size in the code on purpose: while it is true that it does not fit in a single tile register, that matrix multiplication can still be implemented with AMX by splitting the matrices into two blocks and reshaping them accordingly.
It would be nice if compilers sorted this out for us in the future and/or if such larger matrix multiplications were handled for us in the MKL library: none of the three options goes beyond the maximum tile size, which means this splitting/blocking of larger matrices must be done by hand. Furthermore, the memory layout needed so that AMX can be used for these blocked operations is highly non-trivial and not well documented (a sketch of the required packing follows).
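(To make the layout point concrete, here is a hedged sketch of the re-packing the B operand needs before tdpbf16ps can consume it; pack_b_vnni is an illustrative name, not a library routine:)

#include <stddef.h>

/* Pack a row-major K x N bf16 matrix (K even, K <= 32, N <= 16) into the
   (K/2) x (2N) pair layout used by one AMX tile: packed row k holds the
   interleaved pairs (B[2k][n], B[2k+1][n]) for n = 0..N-1. */
void pack_b_vnni(const __bf16 *B, __bf16 *Bp, size_t K, size_t N){
    for (size_t k = 0; k < K/2; k++){
        for (size_t n = 0; n < N; n++){
            Bp[k*2*N + 2*n]     = B[(2*k)*N     + n];
            Bp[k*2*N + 2*n + 1] = B[(2*k + 1)*N + n];
        }
    }
}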
Intel seems to be pushing these new accelerators, but until user support improves, uptake will be harder. People shouldn't need to learn Intel intrinsics to be able to use AMX; NVIDIA cuBLAS, for instance, uses tensor cores automatically.
Is there any additional documentation for the three models listed in your second point? I found the Intel development guide rather obscure in this regard.
Thanks a lot!
---
For option 2: can you please post a new topic at
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/bd-p/oneapi-math-kernel-library
so that the MKL team can address MKL questions?
Thanks,
---
Question posted, thanks.
Again, it would be nice to see compiler and/or higher-level support for this, and some more documentation with multiple examples.
Thanks!
---
It seems that the latest oneMKL does support AMX-accelerated GEMMs.
In my experience, this is slightly slower than using AMX intrinsics, but it is much more convenient.