Do the MKL CBLAS functions cblas_gemm_bf16bf16f32 or cblas_gemm_bf16bf16f32_compute use AMX-BF16 instructions on supported hardware (e.g., Sapphire Rapids)?
If yes: AMX tiles require a specific memory layout and have a maximum size, so multiplications of large matrices must be blocked if AMX is to be used. If the above CBLAS functions do use AMX, do they also sort out the memory layout for the user automatically?
If no: Are AVX512-BF16 _mm?_dpbf16_ps instructions used instead?
Thanks a lot!
cblas_gemm_bf16bf16f32_compute uses AMX-BF16 instructions on supported hardware. On architectures without native bfloat16 hardware instructions, matrices A and B are upconverted to single precision and SGEMM is called to compute the matrix multiplication (https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-1/cblas-gemm-bf16bf16f32-compute.html).
oneMKL handles the memory layout internally for the functions that support AMX.
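For reference, a minimal (untested) call to the non-packed routine could look like the sketch below. It assumes MKL_BF16 is a plain 16-bit integer holding the upper half of an IEEE float (check mkl_types.h), and to_bf16() is only an illustrative helper, not an MKL function:

/* Minimal sketch: C = A*B with bf16 inputs and fp32 output.
 * Assumes MKL_BF16 is a 16-bit integer type; to_bf16() is illustrative. */
#include <stdio.h>
#include "mkl.h"

static MKL_BF16 to_bf16(float x) {          /* truncate fp32 -> bf16 (no rounding) */
    union { float f; unsigned int u; } v = { x };
    return (MKL_BF16)(v.u >> 16);
}

int main(void) {
    const MKL_INT m = 2, n = 3, k = 4;
    MKL_BF16 A[2 * 4], B[4 * 3];
    float    C[2 * 3] = { 0.0f };

    for (MKL_INT i = 0; i < m * k; ++i) A[i] = to_bf16(1.0f);
    for (MKL_INT i = 0; i < k * n; ++i) B[i] = to_bf16(2.0f);

    /* Plain (non-packed) bf16 GEMM: MKL picks the code path (AMX-BF16,
     * AVX512-BF16, or fp32 fallback) and handles any internal re-layout. */
    cblas_gemm_bf16bf16f32(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                           m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);

    printf("C[0] = %f (expected 8.0)\n", C[0]);
    return 0;
}

Compile and link against oneMKL as usual (for example, with the link line suggested by the MKL link line advisor).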
- Ruqiu
Can you point me to a working example of using this method? The documentation above points to:
cblas_gemm_bf16bf16f32_compute: examples\cblas\source\cblas_gemm_bf16bf16f32_computex.c
But my Linux /opt/intel/oneapi installation does not have this file or the examples directory. oneapi-cli also didn't have this example, or could not easily find it.
I would love to start with a working example. I also want to move on to intrinsics next, but want to start with this. I am willing to store the source matrices in bf16, but if there is a way to convert a float to bf16 on the fly efficiently, I would love to know that too.
Appreciate any help I can get on this. Thanks in advance!
-RB
The examples are compressed in a .tgz file in
oneapi-2024.2.1/intel-oneapi-mkl-2024.2.1/mkl/latest/share/doc/mkl/examples
I think what you are looking for is in the file called examples_core_c.tgz (you can maybe simply use locate to find it). The MKL examples define the bf16 type through unions. You will have to look through the headers of the example files to find the type definition and casting routines.
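In case it saves you some digging, the idea behind those union-based helpers is roughly the following (a sketch with illustrative names, not the actual example code): bf16 keeps only the top 16 bits of an IEEE-754 float, so conversion is essentially a shift, optionally with round-to-nearest-even instead of plain truncation.

#include <stdint.h>

typedef union {
    float    f;
    uint32_t u;
} f32_bits;

static inline uint16_t float_to_bf16(float x) {
    f32_bits v = { x };
    /* round to nearest even before dropping the low 16 bits
       (NaN inputs are not treated specially here) */
    uint32_t rounding = 0x7FFFu + ((v.u >> 16) & 1u);
    return (uint16_t)((v.u + rounding) >> 16);
}

static inline float bf16_to_float(uint16_t b) {
    f32_bits v;
    v.u = (uint32_t)b << 16;   /* low mantissa bits become zero */
    return v.f;
}

For bulk conversion there are also AVX512-BF16 conversion intrinsics (e.g. _mm512_cvtneps_pbh), but the scalar version above is enough to get started.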
Thank you!
One more follow-up question in case you know: what is the difference between cblas_gemm_f16f16f32x.c and cblas_gemm_f16f16f32computex.c?
The difference appears to be that the array "A" is packed first in one case but not in the other.
Why are there two functions, what is the best-practice recommendation, and is one faster than the other?
Thanks.
The packed version (computex) uses AMX instructions (up to 64x speedup compared to AVX512 instructions in double precision). The other one I do not know, but it may be that it uses AVX512-BF16 instructions, which are slower (up to 4x speedup compared to AVX512 instructions in double precision). You'll have to implement it and check timings (timings also depend on matrix sizes). If you search for Intel AMX and AVX512 on Google you'll find more information.
If you want an example on how to use AMX and AVX intrinsics directly, I have some code here:
https://github.com/croci/mpfem-paper-experiments-2024/tree/main
Look into src/local_kernel/. However, it is code for my research and it is not commented. I doubt you'll find it useful; I am posting it just in case.
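For the AVX512-BF16 route mentioned in the original question, a minimal dot-product kernel built on _mm512_dpbf16_ps might look roughly like this (untested sketch; requires a compiler and CPU with AVX512-BF16 support, e.g. -march=sapphirerapids with gcc/clang, and a length n that is a multiple of 32):

#include <immintrin.h>

float dot_bf16(const float *x, const float *y, int n) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 32) {
        /* convert 2x16 floats into one vector of 32 bf16 values */
        __m512bh xb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(x + i + 16),
                                          _mm512_loadu_ps(x + i));
        __m512bh yb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(y + i + 16),
                                          _mm512_loadu_ps(y + i));
        /* accumulate pairwise bf16 products into fp32 lanes */
        acc = _mm512_dpbf16_ps(acc, xb, yb);
    }
    return _mm512_reduce_add_ps(acc);
}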
Thanks for the pointers and suggestions.
Ironically, using cblas_gemm_bf16bf16f32_compute() is slower than cblas_gemm_bf16bf16f32().
I could use some help in checking whether my intuition is right. I am using these routines to multiply a 1x1024 array by a 1024x32768 array, so it is not a traditional matrix multiply. I am wondering if that is causing the slowdown with AMX, since AMX may be better suited to 2D arrays on both sides.
I found it odd that in the example code the routine only packs the A array but not the B array, yet it produces the right result.
The documentation says that if A is packed, then B must be packed (see the description of the B input).
I will try packing B, but I am afraid that will be expensive in itself and take away from the gains, since I would have to pack the large 2D array (1024x32768) and then pass it to the compute function.
Any advice on how best to optimize a 1D x 2D multiplication?
Thanks.
AMX instructions require a very specific memory layout (you can read about it in the Intel manuals), which the MKL documentation calls "packed". You need to pack both A and B, otherwise you will be computing the wrong thing and/or slowing down the routine. The packing is essentially a copy operation with some specific memory movement. I am sure the MKL packing functions are optimized; in my experience packing is never the bottleneck.
In terms of making things faster, AMX is designed for fused matrix multiply-add operations of specific sizes. The optimal size for bf16 would be, if I am not mistaken, for the input matrices to be of size 16x32 (again, check the documentation or Section 2.3 in this article: https://arxiv.org/pdf/2410.12614), so it might be faster if your matrices can be cut up into submatrices of these sizes. B is fine, but A has only one row, which is not divisible by 16. Perhaps copy A 16 times and reshape B?
If you cannot write things so that your operation decomposes into small matrix-matrix multiplies, then you shouldn't be using AMX (and perhaps not even a BLAS 3 routine).
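Roughly, the packed workflow I have in mind looks like the sketch below (untested; it follows the cblas_gemm_bf16bf16f32_pack/_compute pages in the MKL developer reference, uses the sizes from your post, and assumes row-major layout, so double-check the exact signatures and the lda/ldb handling for packed operands against the docs):

#include "mkl.h"

/* C (1 x 32768) = A (1 x 1024) * B (1024 x 32768), all row-major */
void gemv_bf16_packed(const MKL_BF16 *A, const MKL_BF16 *B, float *C) {
    const MKL_INT m = 1, k = 1024, n = 32768;

    /* query packed buffer sizes (in bytes) and allocate aligned storage */
    size_t a_size = cblas_gemm_bf16bf16f32_pack_get_size(CblasAMatrix, m, n, k);
    size_t b_size = cblas_gemm_bf16bf16f32_pack_get_size(CblasBMatrix, m, n, k);
    MKL_BF16 *Ap = (MKL_BF16 *)mkl_malloc(a_size, 64);
    MKL_BF16 *Bp = (MKL_BF16 *)mkl_malloc(b_size, 64);

    /* pack both operands; if B is reused across many calls, pack it once
       up front so the cost amortizes */
    cblas_gemm_bf16bf16f32_pack(CblasRowMajor, CblasAMatrix, CblasNoTrans,
                                m, n, k, A, k, Ap);
    cblas_gemm_bf16bf16f32_pack(CblasRowMajor, CblasBMatrix, CblasNoTrans,
                                m, n, k, B, n, Bp);

    /* CblasPacked tells compute() that the corresponding operand is packed */
    cblas_gemm_bf16bf16f32_compute(CblasRowMajor, CblasPacked, CblasPacked,
                                   m, n, k, 1.0f, Ap, k, Bp, n, 0.0f, C, n);

    mkl_free(Ap);
    mkl_free(Bp);
}

With m = 1 the packing of A is trivial; the real question is whether packing the large B once and reusing it across many calls pays for itself.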
I am sorry, but I will stop replying from now on. I do not work for Intel eheh. You can always ask the Intel staff in the forum. Best of luck!
Kindly note that we are closing this thread. If you would like to continue to receive support for this issue, please create a new thread referencing this topic.
