<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic cblas_dgemm_batch slower than sequential cblas_dgemm loop for small variable-size GEMMs in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dgemm-batch-slower-than-sequential-cblas-dgemm-loop-for/m-p/1727602#M37461</link>
    <description>&lt;P&gt;I have a workload that requires many small DGEMM operations, and I'm trying to determine whether &lt;EM&gt;cblas_dgemm_batch&lt;/EM&gt; can improve performance over a simple loop of &lt;EM&gt;cblas_dgemm&lt;/EM&gt; calls.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Workload:&lt;BR /&gt;&lt;/STRONG&gt;I process ~400,000 batches. Each batch contains multiple small DGEMM operations that I execute either sequentially (&lt;EM&gt;cblas_dgemm&lt;/EM&gt;) or via &lt;EM&gt;cblas_dgemm_batch&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;Batch sizes (operations per batch):&lt;BR /&gt;- Median: 13 ops&lt;BR /&gt;- Mean: 49 ops&lt;BR /&gt;- Range: 1 to 14,543 ops&lt;BR /&gt;- Most batches are small: 25% have ≤5 ops, 42% have ≤20 ops&lt;BR /&gt;&lt;BR /&gt;GEMM dimensions are small and vary within each batch:&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;Min&lt;/TD&gt;&lt;TD width="20%"&gt;Max&lt;/TD&gt;&lt;TD width="20%"&gt;Median&lt;/TD&gt;&lt;TD width="20%"&gt;Mean&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;m&lt;/TD&gt;&lt;TD width="20%"&gt;1&lt;/TD&gt;&lt;TD width="20%"&gt;1419&lt;/TD&gt;&lt;TD width="20%"&gt;161&lt;/TD&gt;&lt;TD width="20%"&gt;268&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;n&lt;/TD&gt;&lt;TD width="20%"&gt;1&lt;/TD&gt;&lt;TD width="20%"&gt;80&lt;/TD&gt;&lt;TD width="20%"&gt;2&lt;/TD&gt;&lt;TD width="20%"&gt;7&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;k&lt;/TD&gt;&lt;TD width="20%"&gt;1&lt;/TD&gt;&lt;TD width="20%"&gt;80&lt;/TD&gt;&lt;TD width="20%"&gt;1&lt;/TD&gt;&lt;TD width="20%"&gt;3&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The comparison:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;// Option 1: Sequential loop
for (int i = 0; i &amp;lt; N; i++) {
  cblas_dgemm(CblasColMajor, transa[i], transb[i], m[i], n[i], k[i],
              alpha[i], A[i], lda[i], B[i], ldb[i], beta[i], C[i], ldc[i]);
}&lt;/LI-CODE&gt;&lt;LI-CODE lang="cpp"&gt;  // Option 2: Batch call (N groups, 1 operation per group)
  std::vector&amp;lt;MKL_INT&amp;gt; group_size(N, 1);

  cblas_dgemm_batch(CblasColMajor,
                    transa.data(), transb.data(),
                    m.data(), n.data(), k.data(),
                    alpha.data(),
                    A_ptrs.data(), lda.data(),
                    B_ptrs.data(), ldb.data(),
                    beta.data(),
                    C_ptrs.data(), ldc.data(),
                    N, group_size.data());&lt;/LI-CODE&gt;&lt;P&gt;Since each GEMM has different dimensions, I use N groups with 1 operation each.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Environment:&lt;BR /&gt;&lt;/STRONG&gt;- CPU: Intel Xeon Platinum 8470&lt;BR /&gt;- MKL: 2023.2.0&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Results (total GEMM time across all ~400K batches):&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;Threads&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Sequential&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Batch&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Note&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;1&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;72.5 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;93.6 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Batch is 29% slower&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;2&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;70.4 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;106.1 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Batch is 51% slower&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;4&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;179 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;120 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Both much slower than 1 thread&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;8&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;266 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;195 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Both much slower than 1 thread&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Questions:&lt;BR /&gt;&lt;/STRONG&gt;1)&amp;nbsp;At 1-2 threads, why is cblas_dgemm_batch slower than a sequential loop?&lt;BR /&gt;2)&amp;nbsp;Why does performance degrade so much with more threads?&lt;BR /&gt;3)&amp;nbsp;For this type of workload (many small GEMMs with variable dimensions), is there a recommended approach?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 24 Nov 2025 15:36:20 GMT</pubDate>
    <dc:creator>miglia</dc:creator>
    <dc:date>2025-11-24T15:36:20Z</dc:date>
    <item>
      <title>cblas_dgemm_batch slower than sequential cblas_dgemm loop for small variable-size GEMMs</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dgemm-batch-slower-than-sequential-cblas-dgemm-loop-for/m-p/1727602#M37461</link>
      <description>&lt;P&gt;I have a workload that requires many small DGEMM operations, and I'm trying to determine whether &lt;EM&gt;cblas_dgemm_batch&lt;/EM&gt; can improve performance over a simple loop of &lt;EM&gt;cblas_dgemm&lt;/EM&gt; calls.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Workload:&lt;BR /&gt;&lt;/STRONG&gt;I process ~400,000 batches. Each batch contains multiple small DGEMM operations that I execute either sequentially (&lt;EM&gt;cblas_dgemm&lt;/EM&gt;) or via &lt;EM&gt;cblas_dgemm_batch&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;Batch sizes (operations per batch):&lt;BR /&gt;- Median: 13 ops&lt;BR /&gt;- Mean: 49 ops&lt;BR /&gt;- Range: 1 to 14,543 ops&lt;BR /&gt;- Most batches are small: 25% have ≤5 ops, 42% have ≤20 ops&lt;BR /&gt;&lt;BR /&gt;GEMM dimensions are small and vary within each batch:&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;Min&lt;/TD&gt;&lt;TD width="20%"&gt;Max&lt;/TD&gt;&lt;TD width="20%"&gt;Median&lt;/TD&gt;&lt;TD width="20%"&gt;Mean&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;m&lt;/TD&gt;&lt;TD width="20%"&gt;1&lt;/TD&gt;&lt;TD width="20%"&gt;1419&lt;/TD&gt;&lt;TD width="20%"&gt;161&lt;/TD&gt;&lt;TD width="20%"&gt;268&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;n&lt;/TD&gt;&lt;TD width="20%"&gt;1&lt;/TD&gt;&lt;TD width="20%"&gt;80&lt;/TD&gt;&lt;TD width="20%"&gt;2&lt;/TD&gt;&lt;TD width="20%"&gt;7&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;k&lt;/TD&gt;&lt;TD width="20%"&gt;1&lt;/TD&gt;&lt;TD width="20%"&gt;80&lt;/TD&gt;&lt;TD width="20%"&gt;1&lt;/TD&gt;&lt;TD width="20%"&gt;3&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The comparison:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;// Option 1: Sequential loop
for (int i = 0; i &amp;lt; N; i++) {
  cblas_dgemm(CblasColMajor, transa[i], transb[i], m[i], n[i], k[i],
              alpha[i], A[i], lda[i], B[i], ldb[i], beta[i], C[i], ldc[i]);
}&lt;/LI-CODE&gt;&lt;LI-CODE lang="cpp"&gt;  // Option 2: Batch call (N groups, 1 operation per group)
  std::vector&amp;lt;MKL_INT&amp;gt; group_size(N, 1);

  cblas_dgemm_batch(CblasColMajor,
                    transa.data(), transb.data(),
                    m.data(), n.data(), k.data(),
                    alpha.data(),
                    A_ptrs.data(), lda.data(),
                    B_ptrs.data(), ldb.data(),
                    beta.data(),
                    C_ptrs.data(), ldc.data(),
                    N, group_size.data());&lt;/LI-CODE&gt;&lt;P&gt;Since each GEMM has different dimensions, I use N groups with 1 operation each.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Environment:&lt;BR /&gt;&lt;/STRONG&gt;- CPU: Intel Xeon Platinum 8470&lt;BR /&gt;- MKL: 2023.2.0&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Results (total GEMM time across all ~400K batches):&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;Threads&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Sequential&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Batch&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Note&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;1&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;72.5 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;93.6 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Batch is 29% slower&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;2&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;70.4 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;106.1 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Batch is 51% slower&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;4&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;179 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;120 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Both much slower than 1 thread&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="25px"&gt;8&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;266 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;195 s&lt;/TD&gt;&lt;TD width="25%" height="25px"&gt;Both much slower than 1 thread&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Questions:&lt;BR /&gt;&lt;/STRONG&gt;1)&amp;nbsp;At 1-2 threads, why is cblas_dgemm_batch slower than a sequential loop?&lt;BR /&gt;2)&amp;nbsp;Why does performance degrade so much with more threads?&lt;BR /&gt;3)&amp;nbsp;For this type of workload (many small GEMMs with variable dimensions), is there a recommended approach?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Nov 2025 15:36:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dgemm-batch-slower-than-sequential-cblas-dgemm-loop-for/m-p/1727602#M37461</guid>
      <dc:creator>miglia</dc:creator>
      <dc:date>2025-11-24T15:36:20Z</dc:date>
    </item>
  </channel>
</rss>

