<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: AMX extensions use in bfloat16 matrix multiplication in Intel® oneAPI DPC++/C++ Compiler</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628228#M4079</link>
    <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have some inputs from our Developer:&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;The code has a problem: a tile register holds at most 16 rows x 64 bytes = 1024 bytes, while __bf16 is 2 bytes long, so A/B&lt;SPAN class="error"&gt;[1024]&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(2048 bytes) cannot fit into one tile register;&lt;/LI&gt;
&lt;LI&gt;The compiler doesn't support automatic generation of AMX instructions. We provide three methods for ease of use:
&lt;UL&gt;
&lt;LI&gt;Assembly-like model: a direct mapping to AMX instructions. The user needs to configure tile register shapes, specify tile register numbers, etc. each time;&lt;/LI&gt;
&lt;LI&gt;C-like model: the compiler manages tile register shapes and tile register allocation, but the user needs to handle the matrix layout themselves;&lt;/LI&gt;
&lt;LI&gt;oneapi_matrix: a high-level wrapper in which the matrix layout is transparent to the user. For more information, see&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="external-link" href="https://github.com/intel/llvm/blob/1bd076b14ad3858b815207bed9c731f13ce75038/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc" target="_blank" rel="nofollow noopener"&gt;https://github.com/intel/llvm/blob/1bd076b14ad3858b815207bed9c731f13ce75038/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Thanks,&lt;/P&gt;</description>
    <pubDate>Tue, 03 Sep 2024 13:01:54 GMT</pubDate>
    <dc:creator>Viet_H_Intel</dc:creator>
    <dc:date>2024-09-03T13:01:54Z</dc:date>
    <item>
      <title>AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1583186#M3559</link>
      <description>&lt;P&gt;I have access to a Sapphire Rapids machine and I want to multiply two bfloat16 matrices A and B and compute C = A*B by exploiting the AMX-BF16 extensions. I am happy with C being stored in single precision. What is the recommended way of doing this with current Intel software resources?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;OPTION 1: Directly using Intel intrinsics by following&amp;nbsp;&lt;A href="https://www.intel.com/content/www/us/en/developer/articles/code-sample/advanced-matrix-extensions-intrinsics-functions.html#gs.67r3za" target="_blank" rel="noopener"&gt;https://www.intel.com/content/www/us/en/developer/articles/code-sample/advanced-matrix-extensions-intrinsics-functions.html#gs.67r3za&amp;nbsp;&lt;/A&gt;. However, the author of this code example states that it should not be used as a basis for production code and that it was made for demonstration purposes only.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;OPTION 2: Intel MKL cblas_gemm_bf16bf16f32:&amp;nbsp;&lt;A href="https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-0/cblas-gemm-bf16bf16f32.html" target="_blank" rel="noopener"&gt;https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-0/cblas-gemm-bf16bf16f32.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;OPTION 3: Let the compiler handle it. I tried compiling a matrix-matrix multiplication example such as the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

int main(){

   __bf16 A[1024];
   __bf16 B[1024];
   float C[256]={};

   for(int i=0; i&amp;lt;1024; i++){
       A[i] = 1.;
       B[i] = 2.;
   }

   for (int k=0; k &amp;lt; 64; k++){
       for (int i=0; i &amp;lt; 16; i++){
           for (int j=0; j &amp;lt; 16; j++){
               C[i*16 + j] += ((float) A[i*64 + k])*((float) B[j*64 + k]);
           }
       }
   }

   for (int i = 0; i &amp;lt; 16; i++){
     for (int j = 0; j &amp;lt; 16; j++){
         printf("%f ", (float) C[i*16 + j]);
     }
     printf("\n");
   }
   printf("\n");
}&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;with gcc -march=sapphirerapids -O3 -mamx-bf16 (the icx compiler simply crashes due to a bug). However, looking at the generated assembly code on&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://godbolt.org/" target="_blank" rel="noopener"&gt;godbolt.org&lt;/A&gt;,&amp;nbsp;it does not appear that AMX instructions are used at all.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Mar 2024 16:34:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1583186#M3559</guid>
      <dc:creator>crocix</dc:creator>
      <dc:date>2024-03-25T16:34:30Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586071#M3582</link>
      <description>&lt;P&gt;Your issue has been escalated to our engineers. We'll work on it internally and reply with an update when we have a solution.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Apr 2024 00:24:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586071#M3582</guid>
      <dc:creator>Alex_Y_Intel</dc:creator>
      <dc:date>2024-04-04T00:24:55Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586102#M3588</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Did you encounter a compiler error or a runtime error?&lt;/P&gt;
&lt;P&gt;The code compiled successfully with oneAPI 2024.1.0.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;$ icx -march=sapphirerapids -O3 -mamx-bf16 t3.c -c -V&lt;BR /&gt;Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240308&lt;BR /&gt;Copyright (C) 1985-2024 Intel Corporation. All rights reserved.&lt;/P&gt;
&lt;P&gt;$&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Apr 2024 01:49:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586102#M3588</guid>
      <dc:creator>Viet_H_Intel</dc:creator>
      <dc:date>2024-04-04T01:49:32Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586258#M3592</link>
      <description>&lt;P&gt;Hi there! I had used oneAPI 2024.0.2, but the bug was fixed in oneAPI 2024.1.0 and it now compiles fine, which is good.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, this is not my question: the latest icx does not use AMX-BF16 (nor AVX512-BF16) assembly instructions&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;when compiling the above code. You can check with the -S compiler flag that the instructions&amp;nbsp;&lt;SPAN&gt;tdpbf16ps and vdpbf16ps do NOT come up, indicating that neither AMX-BF16 nor AVX512-BF16 instructions are used by the compiler.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It might be me, but I am finding it impossible (with oneAPI 2024.1.0 and the latest gcc and llvm compilers) to use these BF16 instruction sets (AMX-BF16 and AVX512-BF16) without directly calling Intel intrinsics. The problem with the intrinsics is that it is very hard for non-experts to achieve performance that is competitive with compiler-generated code. If there were an easier way, it would be extremely interesting to know!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks a lot!&lt;/P&gt;</description>
      <pubDate>Thu, 04 Apr 2024 12:01:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586258#M3592</guid>
      <dc:creator>crocix</dc:creator>
      <dc:date>2024-04-04T12:01:08Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586287#M3593</link>
      <description>&lt;P&gt;The compiler should be able to generate&amp;nbsp;&lt;SPAN&gt;AMX-BF16 and AVX512-BF16 instructions with&amp;nbsp;-march=sapphirerapids.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I used your test case posted in the other thread and see this:&amp;nbsp;vcvtneps2bf16&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;$ cat t2.c&lt;BR /&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;BR /&gt;#include &amp;lt;stdio.h&amp;gt;&lt;BR /&gt;#define M 4&lt;BR /&gt;__bf16 bf16test(__bf16* a){&lt;BR /&gt;__bf16 r = a[0]*a[1] - a[1]*a[2];&lt;BR /&gt;return r;&lt;BR /&gt;}&lt;BR /&gt;int main()&lt;BR /&gt;{&lt;BR /&gt;srandom(12345678);&lt;BR /&gt;__bf16 ab[M];&lt;BR /&gt;for (long i = 0; i&amp;lt;M; i++){&lt;BR /&gt;ab[i] = (__bf16) (random()/((double) RAND_MAX));&lt;BR /&gt;}&lt;BR /&gt;__bf16 rb = bf16test(ab);&lt;BR /&gt;printf("Value bf16: %f\n", (double) rb);&lt;BR /&gt;return 0;&lt;BR /&gt;}&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;$icx -march=sapphirerapids -O3 t2.c -S&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;$vi t2.s&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt; 6 bf16test: #&lt;BR /&gt;7 .cfi_startproc&lt;BR /&gt;8 # %bb.0:&lt;BR /&gt;9 movzwl (%rdi), %eax&lt;BR /&gt;10 shll $16, %eax&lt;BR /&gt;11 vmovd %eax, %xmm0&lt;BR /&gt;12 movzwl 2(%rdi), %eax&lt;BR /&gt;13 shll $16, %eax&lt;BR /&gt;14 vmovd %eax, %xmm1&lt;BR /&gt;15 movzwl 4(%rdi), %eax&lt;BR /&gt;16 shll $16, %eax&lt;BR /&gt;17 vmovd %eax, %xmm2&lt;BR /&gt;18 vsubss %xmm2, %xmm0, %xmm0&lt;BR /&gt;19 vmulss %xmm1, %xmm0, %xmm0&lt;BR /&gt;20 &lt;STRONG&gt;vcvtneps2bf16&lt;/STRONG&gt; %xmm0, %xmm0&lt;BR /&gt;21 vmovw %xmm0, %eax&lt;BR /&gt;22 vmovw %eax, %xmm0&lt;BR /&gt;23 retq&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt; 79 callq __truncdfbf2@PLT&lt;BR /&gt;80 vmovw %xmm0, %eax&lt;BR /&gt;81 shll $16, %eax&lt;BR /&gt;82 vmovd %eax, %xmm0&lt;BR /&gt;83 shll $16, %ebx&lt;BR /&gt;84 vmovd %ebx, %xmm1&lt;BR /&gt;85 vsubss %xmm1, %xmm0, %xmm0&lt;BR /&gt;86 vmulss 12(%rsp), %xmm0, %xmm0 # 4-byte Folded Reload&lt;BR /&gt;87 &lt;STRONG&gt;vcvtneps2bf16&lt;/STRONG&gt; %xmm0, %xmm0&lt;BR /&gt;88 vmovw %xmm0, %eax&lt;BR /&gt;89 shll $16, %eax&lt;BR /&gt;90 vmovd %eax, %xmm0&lt;BR /&gt;91 vcvtss2sd %xmm0, %xmm0, %xmm0&lt;BR /&gt;92 movl $.L.str, %edi&lt;BR /&gt;93 movb $1, %al&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Unfortunately, oneAPI 2024.1.0 still crashes on t2.c, so you can't see this generated code yourself.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I used an internal build for the above code generation. &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;t2.c should compile with the upcoming oneAPI 2024.* release. I don't have an ETA to share but will let you know when it's available.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Apr 2024 12:30:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586287#M3593</guid>
      <dc:creator>Viet_H_Intel</dc:creator>
      <dc:date>2024-04-04T12:30:54Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586290#M3595</link>
      <description>&lt;P&gt;Thanks for looking into this!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You are right, I was not being precise enough. The&amp;nbsp;&lt;STRONG&gt;vcvtneps2bf16&lt;/STRONG&gt; instructions only convert between single precision and bf16. Dot products and matrix multiplications are then always done using AVX single-precision instructions (e.g. &lt;SPAN&gt;&lt;STRONG&gt;vmulss&lt;/STRONG&gt;) rather than AMX tiles or AVX512-BF16 dot products&lt;/SPAN&gt;. What I would like to see are the instructions&amp;nbsp;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;&lt;STRONG&gt;tdpbf16ps&lt;/STRONG&gt; (matrix multiplication) or &lt;STRONG&gt;vdpbf16ps&lt;/STRONG&gt; (dot product), since these are what make using BF16 advantageous and efficient. Instead, the compiler simply casts everything to single precision.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Apr 2024 13:00:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586290#M3595</guid>
      <dc:creator>crocix</dc:creator>
      <dc:date>2024-04-04T13:00:39Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586296#M3597</link>
      <description>&lt;P&gt;Thanks for the clarification. Let me work with our code generation team and get back to you.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Apr 2024 13:53:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1586296#M3597</guid>
      <dc:creator>Viet_H_Intel</dc:creator>
      <dc:date>2024-04-04T13:53:48Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628140#M4077</link>
      <description>&lt;P&gt;Dear all,&lt;/P&gt;&lt;P&gt;Is there any news on this issue? I expect to go through the same process and wanted to ask ahead of time whether anything has changed since the last post. It would be great to know whether OPTION 2 (the Intel MKL CBLAS example) and OPTION 3 (the compiler) make use of AMX instructions.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 03 Sep 2024 04:01:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628140#M4077</guid>
      <dc:creator>cristobal1</dc:creator>
      <dc:date>2024-09-03T04:01:13Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628228#M4079</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have some inputs from our Developer:&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;The code has a problem: a tile register holds at most 16 rows x 64 bytes = 1024 bytes, while __bf16 is 2 bytes long, so A/B&lt;SPAN class="error"&gt;[1024]&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(2048 bytes) cannot fit into one tile register;&lt;/LI&gt;
&lt;LI&gt;The compiler doesn't support automatic generation of AMX instructions. We provide three methods for ease of use:
&lt;UL&gt;
&lt;LI&gt;Assembly-like model: a direct mapping to AMX instructions. The user needs to configure tile register shapes, specify tile register numbers, etc. each time;&lt;/LI&gt;
&lt;LI&gt;C-like model: the compiler manages tile register shapes and tile register allocation, but the user needs to handle the matrix layout themselves;&lt;/LI&gt;
&lt;LI&gt;oneapi_matrix: a high-level wrapper in which the matrix layout is transparent to the user. For more information, see&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="external-link" href="https://github.com/intel/llvm/blob/1bd076b14ad3858b815207bed9c731f13ce75038/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc" target="_blank" rel="nofollow noopener"&gt;https://github.com/intel/llvm/blob/1bd076b14ad3858b815207bed9c731f13ce75038/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
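&lt;P&gt;The tile-size arithmetic in point 1 can be checked with a few lines (a plain Python back-of-the-envelope sketch, not Intel tile-configuration code; all names are illustrative):&lt;/P&gt;

```python
# Illustrative sketch: why a 16x64 __bf16 matrix cannot fit in a single
# AMX tile register, and how many tiles are needed instead.
TILE_MAX_ROWS = 16
TILE_MAX_ROW_BYTES = 64
TILE_MAX_BYTES = TILE_MAX_ROWS * TILE_MAX_ROW_BYTES  # 1024 bytes per tile
BF16_SIZE = 2  # bytes per __bf16 element

rows = 16
cols = 64
matrix_bytes = rows * cols * BF16_SIZE  # 2048 bytes, twice one tile

# Each tile row holds 64 bytes, i.e. 32 bf16 elements, so a 16x64 bf16
# matrix must be split into two 16x32 blocks, one tile each.
elements_per_tile_row = TILE_MAX_ROW_BYTES // BF16_SIZE  # 32
tiles_needed = matrix_bytes // TILE_MAX_BYTES

print(matrix_bytes, tiles_needed)  # prints "2048 2"
```

&lt;P&gt;Splitting the 64-element dimension into two 16x32 blocks yields exactly two full tiles, which is the blocking the original poster alludes to for larger matrices.&lt;/P&gt;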
&lt;P&gt;Thanks,&lt;/P&gt;</description>
      <pubDate>Tue, 03 Sep 2024 13:01:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628228#M4079</guid>
      <dc:creator>Viet_H_Intel</dc:creator>
      <dc:date>2024-09-03T13:01:54Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628232#M4080</link>
      <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/41883"&gt;@Viet_H_Intel&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks a lot for the reply, and please thank the developer as well!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I picked the matrix size in the code on purpose: while it is true that it does not fit in a tile register, that matrix multiplication can still be implemented with AMX by splitting the matrices into two blocks and reshaping them accordingly.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It would be nice if compilers sorted this out for us in the future, and/or if such larger matrix multiplications were handled for us in the MKL library: none of the three options goes beyond the maximum tile size, which means this splitting/blocking of larger matrices must be done by hand. Furthermore, the memory layout needed so that AMX can be used for these blocked operations is highly non-trivial and not well documented.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Intel seems to be pushing these new accelerators, but until user support improves, uptake will be harder. People shouldn't need to learn Intel intrinsics to be able to use AMX; NVIDIA cuBLAS uses tensor cores automatically, for instance.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there any additional documentation for the options in point 2? I found the Intel development guide rather obscure in this regard.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks a lot!&lt;/P&gt;</description>
      <pubDate>Tue, 03 Sep 2024 13:25:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628232#M4080</guid>
      <dc:creator>crocix</dc:creator>
      <dc:date>2024-09-03T13:25:06Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628236#M4081</link>
      <description>&lt;P&gt;For option 2: can you please post a new topic at&lt;/P&gt;
&lt;P&gt;&lt;A href="https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/bd-p/oneapi-math-kernel-library" target="_blank"&gt;https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/bd-p/oneapi-math-kernel-library&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;so that MKL questions can be addressed by the MKL team?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;</description>
      <pubDate>Tue, 03 Sep 2024 13:31:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628236#M4081</guid>
      <dc:creator>Viet_H_Intel</dc:creator>
      <dc:date>2024-09-03T13:31:37Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628248#M4082</link>
      <description>&lt;P&gt;Question posted, thanks:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/AMX-bf16-use-in-MKL-CBLAS/m-p/1628245" target="_blank"&gt;https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/AMX-bf16-use-in-MKL-CBLAS/m-p/1628245&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Again, it would be nice to see compiler and/or higher-level support for this and some more documentation with multiple examples.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 03 Sep 2024 14:00:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1628248#M4082</guid>
      <dc:creator>crocix</dc:creator>
      <dc:date>2024-09-03T14:00:00Z</dc:date>
    </item>
    <item>
      <title>Re: AMX extensions use in bfloat16 matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1633066#M4102</link>
      <description>&lt;P&gt;It seems like the latest oneMKL does support AMX-accelerated GEMMs:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/AMX-bf16-use-in-MKL-CBLAS/m-p/1628245" target="_blank"&gt;https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/AMX-bf16-use-in-MKL-CBLAS/m-p/1628245&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In my experience, this is slightly slower than using AMX intrinsics, but it is so much more convenient.&lt;/P&gt;</description>
      <pubDate>Tue, 24 Sep 2024 08:40:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/AMX-extensions-use-in-bfloat16-matrix-multiplication/m-p/1633066#M4102</guid>
      <dc:creator>crocix</dc:creator>
      <dc:date>2024-09-24T08:40:21Z</dc:date>
    </item>
  </channel>
</rss>

