<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: sparse::optimize_trsm makes the trsm function slower on GPU in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1680531#M37039</link>
    <description>&lt;P&gt;Hi Jakub,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The oneMKL 2025.1 release is now available. Did you get a chance to verify the improvement?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;Fengrui&lt;/P&gt;</description>
    <pubDate>Fri, 04 Apr 2025 23:18:02 GMT</pubDate>
    <dc:creator>Fengrui</dc:creator>
    <dc:date>2025-04-04T23:18:02Z</dc:date>
    <item>
      <title>sparse::optimize_trsm makes the trsm function slower on GPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1643719#M36650</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am using the sparse::trsm function and I discovered that sparse::optimize_trsm horribly slows down the sparse::trsm function, instead of speeding it up as the name suggests. sparse::trsm is about 40x slower when I call&amp;nbsp;sparse::optimize_trsm first than when I don't.&lt;/P&gt;&lt;P&gt;I created example code and matrices that demonstrate it; see the attachment. There are restrictions on file uploads, so I have to use a OneDrive link:&amp;nbsp;&lt;A href="https://vsb-my.sharepoint.com/:u:/g/personal/hom0056_vsb_cz/ER6pS0eGWmBMphw5D1XK7cYBnIYr37QX7Nvz8VLwB5IReQ?e=OeO8Wq" target="_blank" rel="noopener"&gt;onemkl_sparse_trs_optimize.zip&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Compile with `make` and run with `make run`.&lt;/P&gt;&lt;P&gt;I think the code does not need much explanation - it just loads the triangular matrices from files and performs the trsv and trsm functions with an optional optimize call. I test both lower and upper triangular matrices. The matrices are actual matrices I exported from the app where I use them.&lt;/P&gt;&lt;P&gt;I measure the time as an average of 3 runs, after one extra warmup run. I use the 2025.0.0 version of the Intel toolkit, and a Data Center GPU Max 1550 GPU on a Tiber devcloud instance.&lt;/P&gt;&lt;P&gt;This is the output of the example code that I observe (in each row of the output - L is lower triangular, U is upper triangular; trsV and trsM are the kernels; 0 means no optimize, 1 means I use the optimize function; the first time is the optimize time, the second time is the actual trsv/trsm function time):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="none"&gt;Device used:
  Name: Intel(R) Data Center GPU Max 1550
  Platform: Intel(R) oneAPI Unified Runtime over Level-Zero
  Global memory: 65536 MiB

System matrix13.txt, size=2744, nrhs=984, nnz=249065:
    L trsV 0:        0.001 ms         51.836 ms
    L trsV 1:       68.626 ms         15.682 ms
    U trsV 0:        0.000 ms         50.325 ms
    U trsV 1:       91.057 ms          8.652 ms
    L trsM 0:        0.000 ms         50.892 ms
    L trsM 1:       69.654 ms       1943.417 ms
    U trsM 0:        0.000 ms        472.677 ms
    U trsM 1:       92.079 ms        877.005 ms

System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
    L trsV 0:        0.000 ms        123.272 ms
    L trsV 1:      142.267 ms         36.950 ms
    U trsV 0:        0.000 ms        120.813 ms
    U trsV 1:      176.823 ms         19.336 ms
    L trsM 0:        0.000 ms        176.042 ms
    L trsM 1:      145.710 ms       6662.093 ms
    U trsM 0:        0.001 ms       1492.967 ms
    U trsM 1:      172.043 ms       3043.670 ms

System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
    L trsV 0:        0.001 ms        303.964 ms
    L trsV 1:      287.290 ms         81.386 ms
    U trsV 0:        0.000 ms        295.895 ms
    U trsV 1:      340.745 ms         40.465 ms
    L trsM 0:        0.001 ms        629.874 ms
    L trsM 1:      294.028 ms      23408.675 ms
    U trsM 0:        0.000 ms       4215.270 ms
    U trsM 1:      336.799 ms      10028.684 ms

System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
    L trsV 0:        0.000 ms        744.029 ms
    L trsV 1:      612.992 ms        205.010 ms
    U trsV 0:        0.000 ms        729.860 ms
    U trsV 1:      720.854 ms        100.706 ms
    L trsM 0:        0.001 ms       2410.369 ms
    L trsM 1:      632.525 ms      84177.905 ms
    U trsM 0:        0.001 ms      12089.900 ms
    U trsM 1:      722.277 ms      36535.894 ms&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For the trsV function, it is alright. The optimize is not worth calling for only a single trsv call, but it makes the trsv actually faster, so after a few iterations, the total time will be shorter with optimize.&lt;/P&gt;&lt;P&gt;The trsM is completely bad. The optimize makes the actual trsm call horribly slower. This should not happen. I would understand if the optimize was "not worth its cost", speeding up the trsm only a little bit. But making it actually slower is very unexpected.&lt;/P&gt;&lt;P&gt;Am I doing something wrong? Is this expected? Can this please be fixed?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Jakub&lt;/P&gt;</description>
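<!--
The "after a few iterations" remark in the post above can be made concrete with a small break-even calculation. This is only a sketch: the function name is hypothetical, the timings are the matrix13.txt rows copied from the output above, and it assumes the optimize cost is paid once while every later solve costs the same as the measured one.

```python
import math


def break_even_calls(optimize_ms, plain_ms, optimized_ms):
    """Smallest n with optimize_ms + n * optimized_ms < n * plain_ms.

    Returns None when the optimized solve is not faster than the plain one,
    because the one-time optimize cost can then never be recovered.
    """
    saving_per_call = plain_ms - optimized_ms
    if saving_per_call <= 0:
        return None
    # n must be strictly greater than optimize_ms / saving_per_call.
    return math.floor(optimize_ms / saving_per_call) + 1


# (optimize time, solve time without optimize, solve time with optimize), in ms.
# Values copied from the matrix13.txt block of the output above.
rows = {
    "L trsV": (68.626, 51.836, 15.682),
    "U trsV": (91.057, 50.325, 8.652),
    "L trsM": (69.654, 50.892, 1943.417),
    "U trsM": (92.079, 472.677, 877.005),
}

for name, (optimize_ms, plain_ms, optimized_ms) in rows.items():
    print(name, break_even_calls(optimize_ms, plain_ms, optimized_ms))
```

Under these assumptions the trsV rows break even after 2 (L) and 3 (U) calls, while both trsM rows return None, which is the behaviour the post reports as unexpected.
-->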
      <pubDate>Sun, 17 Nov 2024 14:05:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1643719#M36650</guid>
      <dc:creator>JakubH</dc:creator>
      <dc:date>2024-11-17T14:05:50Z</dc:date>
    </item>
    <item>
      <title>Re: sparse::optimize_trsm makes the trsm function slower on GPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1643966#M36652</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/268232"&gt;@JakubH&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for reaching out to us about your issue. We were able to reproduce your timings on our end and your issue is valid.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I also examined your code. I do want to point out one issue in it, although it is unrelated to the problem you are seeing, where trsm() call timings get significantly worse (which we will work on fixing). The way the optimize_xxx() SYCL APIs in the oneMKL sparse BLAS domain work, they can only deliver performance when the &lt;EM&gt;matrix handle&lt;/EM&gt; is&amp;nbsp;&lt;EM&gt;reused&lt;/EM&gt;,&amp;nbsp;not just the input data. This is because internal optimizations for TRSM APIs specific to the sparse matrix data are created and stored in the matrix handle in the optimize_trsm() API call. In your code, you are allocating and setting up the handle, calling optimize_trsm and trsm, and freeing the handle, all inside a `for` loop. That causes the internal optimizations to be repeatedly created and destroyed in the `for` loop as well (meaning optimize_trsm timings will increase). Ideally we want the creation and destruction of the matrix handle (and if possible even the call to optimize_xxx() functions) to be placed&amp;nbsp;&lt;EM&gt;outside&lt;/EM&gt; the `for` loop. That would only cause the `optimize_trsm` timings to drop, however, and&amp;nbsp;as mentioned earlier, this is unrelated to your particular report that `trsm` calls have slowed down significantly instead of speeding up.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We will look into this as soon as possible and report back here once we have a fix. 
Thank you for your patience in the meantime.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Gajanan Choudhary&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Intel(R) oneAPI Math Kernel Library (oneMKL) team&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2024 16:55:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1643966#M36652</guid>
      <dc:creator>Gajanan_Choudhary</dc:creator>
      <dc:date>2024-11-18T16:55:51Z</dc:date>
    </item>
    <item>
      <title>Re: sparse::optimize_trsm makes the trsm function slower on GPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1643970#M36653</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;thanks for the reply and effort into fixing this.&lt;/P&gt;&lt;P&gt;I am aware that the optimization data is inside the matrix handle and that I should reuse the handle, this was just a simple example to demonstrate the slowdown. But thanks for pointing it out.&lt;/P&gt;&lt;P&gt;Jakub&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2024 17:02:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1643970#M36653</guid>
      <dc:creator>JakubH</dc:creator>
      <dc:date>2024-11-18T17:02:49Z</dc:date>
    </item>
    <item>
      <title>Re: sparse::optimize_trsm makes the trsm function slower on GPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1649346#M36744</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/268232"&gt;@JakubH&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Thank you for using oneMKL and reaching out to us about this issue.&amp;nbsp; It happens not because trsM is bad but because trsM without calling optimize_trsm() is actually pretty good.&amp;nbsp; You can see this if you compare the runtime of trsV * nrhs with the runtime of the corresponding default trsM (trsM 0). &lt;LI-EMOJI id="lia_slightly-smiling-face" title=":slightly_smiling_face:"&gt;&lt;/LI-EMOJI&gt;&lt;/P&gt;&lt;P&gt;The issue is fixed internally and the fixed version will be available in the oneMKL 2025.1 release, so trsM 1, at least, won't be worse than the corresponding trsM 0.&lt;/P&gt;&lt;P&gt;Thank you for your report, which helps a lot to improve oneMKL sparse::trsm() functionality!&amp;nbsp; Let us know if you have further questions or comments.&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Seung-hee&lt;/P&gt;</description>
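<!--
The comparison suggested in the reply above (trsV * nrhs against the default trsM) can be worked through with the figures from the first post. This is a rough sketch only: the function name is hypothetical, and it assumes each reported trsV time covers a single right-hand side and that repeated trsV calls would each cost the same.

```python
def repeated_trsv_estimate_ms(trsv_ms, nrhs):
    """Estimated time to solve nrhs right-hand sides with one trsv call each."""
    return trsv_ms * nrhs


# (nrhs, "L trsV 0" solve time, "L trsM 0" solve time), times in ms,
# copied from the 2025.0.0 output in the first post of this thread.
cases = {
    "matrix13.txt": (984, 51.836, 50.892),
    "matrix16.txt": (1450, 123.272, 176.042),
    "matrix20.txt": (2210, 303.964, 629.874),
    "matrix25.txt": (3386, 744.029, 2410.369),
}

for name, (nrhs, trsv_ms, trsm_ms) in cases.items():
    estimate_ms = repeated_trsv_estimate_ms(trsv_ms, nrhs)
    ratio = estimate_ms / trsm_ms
    print(f"{name}: trsV x nrhs = {estimate_ms:.0f} ms, "
          f"trsM 0 = {trsm_ms:.0f} ms, ratio = {ratio:.0f}x")
```

With these inputs the estimated repeated-trsV time is on the order of 1000x the default trsM time for every matrix, which is the sense in which the unoptimized trsM is already "pretty good".
-->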
      <pubDate>Fri, 13 Dec 2024 20:23:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1649346#M36744</guid>
      <dc:creator>shb</dc:creator>
      <dc:date>2024-12-13T20:23:01Z</dc:date>
    </item>
    <item>
      <title>Re: sparse::optimize_trsm makes the trsm function slower on GPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1649790#M36747</link>
      <description>&lt;P&gt;nice, thanks&lt;/P&gt;</description>
      <pubDate>Mon, 16 Dec 2024 09:21:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1649790#M36747</guid>
      <dc:creator>JakubH</dc:creator>
      <dc:date>2024-12-16T09:21:37Z</dc:date>
    </item>
    <item>
      <title>Re: sparse::optimize_trsm makes the trsm function slower on GPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1680531#M37039</link>
      <description>&lt;P&gt;Hi Jakub,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The oneMKL 2025.1 release is now available. Did you get a chance to verify the improvement?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;Fengrui&lt;/P&gt;</description>
      <pubDate>Fri, 04 Apr 2025 23:18:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1680531#M37039</guid>
      <dc:creator>Fengrui</dc:creator>
      <dc:date>2025-04-04T23:18:02Z</dc:date>
    </item>
    <item>
      <title>Re: sparse::optimize_trsm makes the trsm function slower on GPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1680769#M37042</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I don't have access to the Intel GPUs now.&lt;/P&gt;&lt;P&gt;I will report back when I am able to test it.&lt;/P&gt;&lt;P&gt;Jakub&lt;/P&gt;</description>
      <pubDate>Sun, 06 Apr 2025 13:30:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1680769#M37042</guid>
      <dc:creator>JakubH</dc:creator>
      <dc:date>2025-04-06T13:30:23Z</dc:date>
    </item>
    <item>
      <title>Re: sparse::optimize_trsm makes the trsm function slower on GPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1716754#M37330</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/250759"&gt;@Fengrui&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;I am working with Intel GPUs again. I tested the code with Intel toolkit version 2025.2.1, and I can confirm that, for my use case, TRSM no longer slows down after optimize is called.&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is the output I observe on Aurora:&lt;/P&gt;&lt;LI-CODE lang="none"&gt;Device used:
  Name: Intel(R) Data Center GPU Max 1550
  Platform: Intel(R) oneAPI Unified Runtime over Level-Zero
  Global memory: 65536 MiB

System matrix13.txt, size=2744, nrhs=984, nnz=249065:
    L trsV 0:        0.000 ms         52.310 ms
    L trsV 1:       61.868 ms         16.221 ms
    U trsV 0:        0.001 ms         52.682 ms
    U trsV 1:       81.683 ms          8.644 ms
    L trsM 0:        0.001 ms         55.670 ms
    L trsM 1:       62.183 ms         55.382 ms
    U trsM 0:        0.000 ms        503.136 ms
    U trsM 1:       81.908 ms        504.850 ms

System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
    L trsV 0:        0.000 ms        123.744 ms
    L trsV 1:      126.362 ms         38.149 ms
    U trsV 0:        0.000 ms        124.754 ms
    U trsV 1:      153.170 ms         20.393 ms
    L trsM 0:        0.000 ms        195.719 ms
    L trsM 1:      129.190 ms        194.775 ms
    U trsM 0:        0.001 ms       1602.427 ms
    U trsM 1:      153.451 ms       1605.654 ms

System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
    L trsV 0:        0.000 ms        305.731 ms
    L trsV 1:      254.925 ms         85.023 ms
    U trsV 0:        0.001 ms        308.497 ms
    U trsV 1:      314.559 ms         42.708 ms
    L trsM 0:        0.001 ms        712.193 ms
    L trsM 1:      257.358 ms        712.314 ms
    U trsM 0:        0.000 ms       4575.225 ms
    U trsM 1:      307.589 ms       4575.189 ms

System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
    L trsV 0:        0.000 ms        744.812 ms
    L trsV 1:      587.699 ms        206.488 ms
    U trsV 0:        0.000 ms        755.641 ms
    U trsV 1:      664.784 ms        102.924 ms
    L trsM 0:        0.001 ms       2723.421 ms
    L trsM 1:      581.394 ms       2738.595 ms
    U trsM 0:        0.000 ms      13236.210 ms
    U trsM 1:      662.634 ms      13236.147 ms&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
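<!--
To put the two outputs in this thread side by side, the slowdown of trsm after optimize_trsm can be computed as a simple ratio. This is a sketch with a hypothetical helper name; the numbers are the matrix13.txt trsM rows copied from the 2025.0.0 output in the first post and from the 2025.2.1 output above. The two runs were reported on different systems with the same GPU model, so only the ratios within each run are compared.

```python
def slowdown(optimized_ms, plain_ms):
    """How many times slower trsm runs after optimize_trsm (1.0 means unchanged)."""
    return optimized_ms / plain_ms


# (trsM 0 solve time, trsM 1 solve time) in ms for matrix13.txt.
before_fix = {"L": (50.892, 1943.417), "U": (472.677, 877.005)}  # toolkit 2025.0.0
after_fix = {"L": (55.670, 55.382), "U": (503.136, 504.850)}     # toolkit 2025.2.1

for label, table in (("2025.0.0", before_fix), ("2025.2.1", after_fix)):
    for uplo, (plain_ms, optimized_ms) in table.items():
        print(f"{label} {uplo} trsM: {slowdown(optimized_ms, plain_ms):.2f}x")
```

For the lower-triangular case this gives roughly 38x with 2025.0.0, in line with the "about 40x" in the original report, and a ratio within about 1% of 1.0 with 2025.2.1.
-->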
      <pubDate>Sat, 13 Sep 2025 09:14:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/sparse-optimize-trsm-makes-the-trsm-function-slower-on-GPU/m-p/1716754#M37330</guid>
      <dc:creator>JakubH</dc:creator>
      <dc:date>2025-09-13T09:14:54Z</dc:date>
    </item>
  </channel>
</rss>

