<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeon) in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1688040#M37119</link>
    <description>&lt;P&gt;GEMM shapes:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;small-oc-big-ic&lt;UL&gt;&lt;LI&gt;mb1ic7168oc704&lt;/LI&gt;&lt;LI&gt;mb4ic7168oc704&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;small-ic-big-oc&lt;UL&gt;&lt;LI&gt;mb1ic352oc7168&lt;/LI&gt;&lt;LI&gt;mb4ic352oc7168&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Dtypes: INT8&lt;/P&gt;&lt;P&gt;Host:&lt;/P&gt;&lt;P&gt;GNR-AP with MCR (128 cores per socket, SNC3 enabled, 43-43-42 cores per sub-NUMA domain, MCR: 8800 MT/s, 1400 GB/s across 2 sockets)&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Note: only one sub-NUMA domain is used to run the GEMMs above, i.e., only 43 cores and&amp;nbsp;~233&amp;nbsp;GB/s of memory bandwidth.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Problem description:&lt;/P&gt;&lt;P&gt;We cannot reach the maximum memory bandwidth (expected with MCR: 1400 GB/s across 2 sockets, ~233&amp;nbsp;GB/s per sub-NUMA domain) when running small-ic-big-oc or small-oc-big-ic GEMMs with a small mb.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For&amp;nbsp;small-oc-big-ic, we already parallelize over block_mb-block_ic-block_oc, with a block size of 32 in each dimension.&lt;/P&gt;&lt;P&gt;For&amp;nbsp;small-ic-big-oc,&amp;nbsp;we already parallelize over block_mb-block_oc, with a block size of 32 in each dimension.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Weights are INT8, and we use INT8 intrinsics for the computation.&lt;/P&gt;&lt;P&gt;Activations are BF16 and are quantized per row to INT8 at runtime.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With the optimizations above we reach only half of the available memory bandwidth (half of&amp;nbsp;~233&amp;nbsp;GB/s). With mb1 or mb4 these GEMMs should be memory-bound, so we expect them to use the full&amp;nbsp;~233&amp;nbsp;GB/s.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Are there any further optimizations we should consider? Thanks.&lt;/P&gt;</description>
    <pubDate>Wed, 07 May 2025 05:22:17 GMT</pubDate>
    <dc:creator>JN-G-0</dc:creator>
    <dc:date>2025-05-07T05:22:17Z</dc:date>
    <item>
      <title>How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeon)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1688040#M37119</link>
      <description>&lt;P&gt;GEMM shapes:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;small-oc-big-ic&lt;UL&gt;&lt;LI&gt;mb1ic7168oc704&lt;/LI&gt;&lt;LI&gt;mb4ic7168oc704&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;small-ic-big-oc&lt;UL&gt;&lt;LI&gt;mb1ic352oc7168&lt;/LI&gt;&lt;LI&gt;mb4ic352oc7168&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Dtypes: INT8&lt;/P&gt;&lt;P&gt;Host:&lt;/P&gt;&lt;P&gt;GNR-AP with MCR (128 cores per socket, SNC3 enabled, 43-43-42 cores per sub-NUMA domain, MCR: 8800 MT/s, 1400 GB/s across 2 sockets)&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Note: only one sub-NUMA domain is used to run the GEMMs above, i.e., only 43 cores and&amp;nbsp;~233&amp;nbsp;GB/s of memory bandwidth.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Problem description:&lt;/P&gt;&lt;P&gt;We cannot reach the maximum memory bandwidth (expected with MCR: 1400 GB/s across 2 sockets, ~233&amp;nbsp;GB/s per sub-NUMA domain) when running small-ic-big-oc or small-oc-big-ic GEMMs with a small mb.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For&amp;nbsp;small-oc-big-ic, we already parallelize over block_mb-block_ic-block_oc, with a block size of 32 in each dimension.&lt;/P&gt;&lt;P&gt;For&amp;nbsp;small-ic-big-oc,&amp;nbsp;we already parallelize over block_mb-block_oc, with a block size of 32 in each dimension.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Weights are INT8, and we use INT8 intrinsics for the computation.&lt;/P&gt;&lt;P&gt;Activations are BF16 and are quantized per row to INT8 at runtime.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With the optimizations above we reach only half of the available memory bandwidth (half of&amp;nbsp;~233&amp;nbsp;GB/s). With mb1 or mb4 these GEMMs should be memory-bound, so we expect them to use the full&amp;nbsp;~233&amp;nbsp;GB/s.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Are there any further optimizations we should consider? Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 05:22:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1688040#M37119</guid>
      <dc:creator>JN-G-0</dc:creator>
      <dc:date>2025-05-07T05:22:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeo</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1690347#M37158</link>
      <description>&lt;P&gt;oneMKL does not yet support INT8 on CPU through its C interface. For more on the oneMKL C interface, see the&amp;nbsp;&lt;A href="https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2025-1/overview.html" target="_blank"&gt;Developer Reference for Intel® oneAPI Math Kernel Library - C&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Are you referring to the oneMKL SYCL interface? Also, have you tried the Intel oneDNN library for INT8 GEMM?&amp;nbsp;&lt;A href="https://github.com/uxlfoundation/oneDNN/blob/main/src/cpu/gemm/s8x8s32/ref_gemm_s8x8s32.cpp" target="_blank"&gt;oneDNN/src/cpu/gemm/s8x8s32/ref_gemm_s8x8s32.cpp at main · uxlfoundation/oneDNN&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 07:20:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1690347#M37158</guid>
      <dc:creator>Ruqiu_C_Intel</dc:creator>
      <dc:date>2025-05-16T07:20:42Z</dc:date>
    </item>
  </channel>
</rss>

