<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeo in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1692080#M8546</link>
    <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/427762"&gt;@novahayes12&lt;/a&gt;&amp;nbsp;&amp;nbsp; Spoken like a true chatgpt answer.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Doc (not an Intel employee or contractor)&lt;BR /&gt;[W10 is this generation's XP]&lt;/P&gt;</description>
    <pubDate>Fri, 23 May 2025 19:23:07 GMT</pubDate>
    <dc:creator>AlHill</dc:creator>
    <dc:date>2025-05-23T19:23:07Z</dc:date>
    <item>
      <title>How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeon)</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1688019#M8540</link>
      <description>&lt;P&gt;GEMM shapes:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;small-oc-big-ic&lt;UL&gt;&lt;LI&gt;mb1ic7168oc704&lt;/LI&gt;&lt;LI&gt;mb4ic7168oc704&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;small-ic-big-oc&lt;UL&gt;&lt;LI&gt;mb1ic352oc7168&lt;/LI&gt;&lt;LI&gt;mb4ic352oc7168&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Dtypes: INT8&lt;/P&gt;&lt;P&gt;Host:&lt;/P&gt;&lt;P&gt;GNR-AP with MCR (128 cores per socket, SNC3 on, 43-43-42 cores, MCR: 8800 MT/s, 1400 GB/s across 2 sockets)&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Note: only one sub-NUMA domain is used to run the above GEMMs, i.e., only 43 cores and ~233&amp;nbsp;GB/s of memory bandwidth&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Problem description:&lt;/P&gt;&lt;P&gt;We cannot reach the maximum memory bandwidth (expected with MCR: 1400 GB/s across 2 sockets, ~233&amp;nbsp;GB/s per sub-NUMA domain) when running small-ic-big-oc or small-oc-big-ic GEMMs when mb is also small.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For small-oc-big-ic, we already parallelize over block_mb-block_ic-block_oc; all block sizes are 32.&lt;/P&gt;&lt;P&gt;For small-ic-big-oc, we already parallelize over block_mb-block_oc; all block sizes are 32.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Weights are int8, and we use int8 intrinsics for computation.&lt;/P&gt;&lt;P&gt;Activations are bf16 and are quantized per row to int8 at runtime.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With the above optimizations we get only half the memory bandwidth (half of ~233&amp;nbsp;GB/s). With mb1 or mb4 these GEMMs should be memory bound, so we expect them to use the full ~233&amp;nbsp;GB/s.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Are there any further optimizations to consider? Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 03:08:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1688019#M8540</guid>
      <dc:creator>JN-G-0</dc:creator>
      <dc:date>2025-05-07T03:08:34Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeo</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1692079#M8545</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Thanks for sharing the detailed info about your GEMM optimization challenge on Xeon with INT8 data. Given that you’re getting only about half the expected memory bandwidth (~115 GB/s instead of ~233 GB/s), here are some thoughts:&lt;/P&gt;&lt;H3&gt;Key Optimization Areas to Consider:&lt;/H3&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Memory Access and Bandwidth:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Ensure memory is allocated and accessed in a NUMA-aware fashion to avoid cross-node penalties.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Use aggressive prefetching to hide latency and improve cache utilization.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Consider reorganizing data layouts to improve spatial locality.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Blocking and Parallelization:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Revisit your block sizes. While 32 is a common choice, tuning it for your specific cache sizes and workload might yield gains.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Evaluate thread balancing, especially since minibatch sizes are small (mb=1 or 4), which can impact parallel efficiency.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Compute Kernel Efficiency:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Make sure your INT8 intrinsics use AVX512 VNNI or equivalent instructions to maximize throughput.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Check if fused multiply-add (FMA) instructions are properly leveraged.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Quantization Overheads:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;If runtime per-row quantization to INT8 is costly or causing non-uniform memory access, consider pre-quantizing or coarser-granularity quantization.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Profiling:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Tools like Intel VTune can reveal whether you are memory bound or compute bound, and show cache usage patterns.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 23 May 2025 19:13:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1692079#M8545</guid>
      <dc:creator>novahayes12</dc:creator>
      <dc:date>2025-05-23T19:13:25Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeo</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1692080#M8546</link>
      <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/427762"&gt;@novahayes12&lt;/a&gt;&amp;nbsp;&amp;nbsp; Spoken like a true chatgpt answer.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Doc (not an Intel employee or contractor)&lt;BR /&gt;[W10 is this generation's XP]&lt;/P&gt;</description>
      <pubDate>Fri, 23 May 2025 19:23:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1692080#M8546</guid>
      <dc:creator>AlHill</dc:creator>
      <dc:date>2025-05-23T19:23:07Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeo</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1692183#M8548</link>
      <description>&lt;P&gt;Thanks for the detailed info.&lt;/P&gt;&lt;P&gt;With small mb and a small ic or oc, limited memory-level parallelism or underutilized vector units could be the bottleneck. Make sure memory is NUMA-local, thread pinning is correct, and AVX-512 or AMX (if supported) is fully used. Also consider fusing GEMMs or tuning block sizes for better cache and bandwidth usage.&lt;/P&gt;&lt;P&gt;Best regards,&lt;BR /&gt;Nova Hayes&lt;/P&gt;</description>
      <pubDate>Sat, 24 May 2025 19:00:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/How-to-optimize-GEMMs-with-small-ic-big-oc-or-small-oc-big-ic/m-p/1692183#M8548</guid>
      <dc:creator>novahayes12</dc:creator>
      <dc:date>2025-05-24T19:00:44Z</dc:date>
    </item>
  </channel>
</rss>

