
sparse::optimize_trsm makes the trsm function slower on GPU

JakubH — Sun, 17 Nov 2024 14:05:50 GMT

Hello,

I am using the sparse::trsm function and I discovered that sparse::optimize_trsm horribly slows down the sparse::trsm function, instead of speeding it up as the name suggests. sparse::trsm is about 40x slower when I call sparse::optimize_trsm first than when I don't.

I created an example code and matrices that demonstrate it, see the attachment. There are restrictions on file uploads, so I have to use a OneDrive link: onemkl_sparse_trs_optimize.zip

Compile with `make` and run with `make run`.

I think the code does not need much explanation: it just loads the triangular matrices from files and performs the trsv and trsm functions, each with an optional optimize call. I test both lower and upper triangular matrices. The matrices are actual matrices I exported from the app where I use them.

I measure the time as an average of 3 runs, after one extra warmup run. I use version 2025.0.0 of the Intel toolkit and a Data Center GPU Max 1550 on a Tiber devcloud instance.

This is the output I observe from the example code. In each row, L is lower triangular and U is upper triangular; trsV and trsM are the kernels; 0 means no optimize call and 1 means the optimize function is used; the first time is the optimize time and the second time is the time of the actual trsv/trsm call:

    Device used:
      Name:           Intel(R) Data Center GPU Max 1550
      Platform:       Intel(R) oneAPI Unified Runtime over Level-Zero
      Global memory:  65536 MiB

    System matrix13.txt, size=2744, nrhs=984, nnz=249065:
      L trsV 0: 0.001 ms 51.836 ms
      L trsV 1: 68.626 ms 15.682 ms
      U trsV 0: 0.000 ms 50.325 ms
      U trsV 1: 91.057 ms 8.652 ms
      L trsM 0: 0.000 ms 50.892 ms
      L trsM 1: 69.654 ms 1943.417 ms
      U trsM 0: 0.000 ms 472.677 ms
      U trsM 1: 92.079 ms 877.005 ms

    System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
      L trsV 0: 0.000 ms 123.272 ms
      L trsV 1: 142.267 ms 36.950 ms
      U trsV 0: 0.000 ms 120.813 ms
      U trsV 1: 176.823 ms 19.336 ms
      L trsM 0: 0.000 ms 176.042 ms
      L trsM 1: 145.710 ms 6662.093 ms
      U trsM 0: 0.001 ms 1492.967 ms
      U trsM 1: 172.043 ms 3043.670 ms

    System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
      L trsV 0: 0.001 ms 303.964 ms
      L trsV 1: 287.290 ms 81.386 ms
      U trsV 0: 0.000 ms 295.895 ms
      U trsV 1: 340.745 ms 40.465 ms
      L trsM 0: 0.001 ms 629.874 ms
      L trsM 1: 294.028 ms 23408.675 ms
      U trsM 0: 0.000 ms 4215.270 ms
      U trsM 1: 336.799 ms 10028.684 ms

    System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
      L trsV 0: 0.000 ms 744.029 ms
      L trsV 1: 612.992 ms 205.010 ms
      U trsV 0: 0.000 ms 729.860 ms
      U trsV 1: 720.854 ms 100.706 ms
      L trsM 0: 0.001 ms 2410.369 ms
      L trsM 1: 632.525 ms 84177.905 ms
      U trsM 0: 0.001 ms 12089.900 ms
      U trsM 1: 722.277 ms 36535.894 ms

For the trsV function, it is alright. The optimize is not worth calling for only a single trsv call, but it actually makes the trsv faster, so after a few iterations the total time will be shorter with optimize.
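To put a number on "after a few iterations", here is a minimal C++ sketch of the break-even calculation. It assumes the optimize cost is paid once and each later solve takes the optimized time; the function name `break_even_calls` is made up for this illustration and is not part of oneMKL.

```cpp
#include <cassert>
#include <cmath>
#include <optional>

// Smallest number of solves n for which
//     optimize_ms + n * optimized_ms <= n * plain_ms,
// i.e. the point where calling optimize starts to pay off.
// Returns std::nullopt when the optimized solve is not faster at all.
// (Hypothetical helper for illustration; not a oneMKL function.)
std::optional<long> break_even_calls(double optimize_ms,
                                     double plain_ms,
                                     double optimized_ms) {
    assert(optimize_ms >= 0.0 && plain_ms >= 0.0 && optimized_ms >= 0.0);
    const double saving_per_call = plain_ms - optimized_ms;
    if (saving_per_call <= 0.0) {
        return std::nullopt;  // optimize never pays off
    }
    return static_cast<long>(std::ceil(optimize_ms / saving_per_call));
}
```

Fed with the L trsV row of matrix13 (68.626 ms optimize, 51.836 ms plain, 15.682 ms optimized), it gives the number of trsv calls needed to recover the optimize cost; fed with the L trsM row of the same matrix, it returns no value because the optimized call is slower.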

The trsM is completely bad. The optimize makes the actual trsm call horribly slower. This should not happen. I would understand if the optimize was "not worth its cost", speeding up the trsm only a little bit. But making it actually slower is very unexpected.

Am I doing something wrong? Is this expected? Can this please be fixed?

Thanks,

Jakub

Re: sparse::optimize_trsm makes the trsm function slower on GPU

Gajanan_Choudhary — Mon, 18 Nov 2024 16:55:51 GMT

Hi @JakubH,

Thanks for reaching out to us about your issue. We were able to reproduce your timings on our end and your issue is valid.

I also examined your code and want to point out one issue in it, although it is unrelated to the significantly worse trsm() timings you are seeing (which we will work on fixing). The optimize_xxx() SYCL APIs in the oneMKL sparse BLAS domain can only deliver a performance benefit when the matrix handle itself is reused, not just the input data. This is because the optimize_trsm() call creates internal optimizations for the TRSM APIs that are specific to the sparse matrix data and stores them in the matrix handle. In your code, you are allocating and setting up the handle, calling optimize_trsm and trsm, and freeing the handle, all inside a `for` loop. That causes the internal optimizations to be repeatedly created and destroyed in the `for` loop as well (meaning the optimize_trsm timings will increase). Ideally, the creation and destruction of the matrix handle (and, if possible, even the call to the optimize_xxx() functions) should be placed outside the `for` loop. That would only cause the `optimize_trsm` timings to drop, however, and as mentioned earlier, it is unrelated to your particular report that `trsm` calls have slowed down significantly instead of speeding up.

We will look into this as soon as possible and report back here once we have a fix. Thank you for your patience in the meantime.
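The restructuring described above can be sketched in plain C++ with a stand-in handle. Everything here (`MockHandle`, `Counters`, `create_handle`, `optimize`, `solve`) is hypothetical and only counts operations; it makes no oneMKL calls and does not reflect the real API signatures.

```cpp
#include <cstddef>

// Hypothetical stand-in for a sparse matrix handle; NOT the oneMKL type.
struct MockHandle {
    bool optimized = false;
};

// Counts how often each (mock) operation runs.
struct Counters {
    std::size_t handles_created = 0;
    std::size_t optimize_calls = 0;
    std::size_t solves = 0;
};

// Placeholder operations standing in for handle setup, optimize and solve.
MockHandle create_handle(Counters& c) { ++c.handles_created; return {}; }
void optimize(MockHandle& h, Counters& c) { h.optimized = true; ++c.optimize_calls; }
void solve(const MockHandle&, Counters& c) { ++c.solves; }

// Pattern in the posted example: handle setup and optimize on every iteration.
Counters handle_inside_loop(std::size_t iterations) {
    Counters c;
    for (std::size_t i = 0; i < iterations; ++i) {
        MockHandle h = create_handle(c);
        optimize(h, c);  // optimization data rebuilt each time
        solve(h, c);
    }
    return c;
}

// Suggested pattern: set up and optimize once, then reuse the handle.
Counters handle_outside_loop(std::size_t iterations) {
    Counters c;
    MockHandle h = create_handle(c);
    optimize(h, c);      // optimization data built once
    for (std::size_t i = 0; i < iterations; ++i) {
        solve(h, c);
    }
    return c;
}
```

With 4 iterations, the first variant performs 4 optimize calls and the second performs 1, while both perform 4 solves.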

Regards,

Gajanan Choudhary

Intel(R) oneAPI Math Kernel Library (oneMKL) team

Re: sparse::optimize_trsm makes the trsm function slower on GPU

JakubH — Mon, 18 Nov 2024 17:02:49 GMT

Hi,

thanks for the reply and for the effort put into fixing this.

I am aware that the optimization data is inside the matrix handle and that I should reuse the handle; this was just a simple example to demonstrate the slowdown. But thanks for pointing it out.

Jakub

Re: sparse::optimize_trsm makes the trsm function slower on GPU

shb — Fri, 13 Dec 2024 20:23:01 GMT

Hello @JakubH ,

Thank you for using oneMKL and reaching out to us about this issue. It happens not because trsM is bad but because trsM without calling optimize_trsm() is actually pretty good. You can see this if you compare the runtime of trsV * nrhs with the runtime of the corresponding default trsM (trsM 0).

The issue is fixed internally and the fixed version will be available in the oneMKL 2025.1 release, so trsM 1, at least, won't be worse than the corresponding trsM 0.
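The comparison suggested here can be worked out with a few lines of C++. This is a rough sketch that assumes each reported trsV time covers a single right-hand side, which the thread implies but does not state outright; the struct and function names are invented for the illustration.

```cpp
// Result of comparing nrhs separate trsv calls against one trsm call.
// (Hypothetical helper types for illustration only.)
struct Comparison {
    double repeated_trsv_ms;  // trsv_ms * nrhs
    double trsm_ms;           // time of the single trsm call
    double ratio;             // repeated_trsv_ms / trsm_ms
};

// Assumes trsv_ms is the time for ONE right-hand side (an assumption,
// see the text above) and that the calls would run back to back.
Comparison compare_trsv_loop_to_trsm(double trsv_ms, long nrhs, double trsm_ms) {
    const double repeated = trsv_ms * static_cast<double>(nrhs);
    return Comparison{repeated, trsm_ms, repeated / trsm_ms};
}
```

For the lower triangular case of matrix13 without optimize (trsV 51.836 ms, nrhs=984, trsM 50.892 ms), the repeated trsv estimate comes out at roughly 51 seconds against about 51 ms for the single trsm call, which is the sense in which the default trsM is "pretty good".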

Thank you for your report, which helps a lot to improve oneMKL sparse::trsm() functionality! Let us know if you have further questions or comments.

Best,

Seung-hee

Re: sparse::optimize_trsm makes the trsm function slower on GPU

JakubH — Mon, 16 Dec 2024 09:21:37 GMT

nice, thanks

Re: sparse::optimize_trsm makes the trsm function slower on GPU

Fengrui — Fri, 04 Apr 2025 23:18:02 GMT

Hi Jakub,

The oneMKL 2025.1 release is now available. Did you get a chance to verify the improvement?

Thanks,

Fengrui

Re: sparse::optimize_trsm makes the trsm function slower on GPU

JakubH — Sun, 06 Apr 2025 13:30:23 GMT

Hi,

I don't have access to the Intel GPUs now.

I will report back when I am able to test it.

Jakub

Re: sparse::optimize_trsm makes the trsm function slower on GPU

JakubH — Sat, 13 Sep 2025 09:14:54 GMT

Hi @Fengrui ,

I am working with Intel GPUs again. I tested the code with Intel toolkit version 2025.2.1, and I can confirm that, for my use case, the TRSM no longer slows down after optimize is called.

Thanks.

Here is the output I observe on Aurora:

    Device used:
      Name:           Intel(R) Data Center GPU Max 1550
      Platform:       Intel(R) oneAPI Unified Runtime over Level-Zero
      Global memory:  65536 MiB

    System matrix13.txt, size=2744, nrhs=984, nnz=249065:
      L trsV 0: 0.000 ms 52.310 ms
      L trsV 1: 61.868 ms 16.221 ms
      U trsV 0: 0.001 ms 52.682 ms
      U trsV 1: 81.683 ms 8.644 ms
      L trsM 0: 0.001 ms 55.670 ms
      L trsM 1: 62.183 ms 55.382 ms
      U trsM 0: 0.000 ms 503.136 ms
      U trsM 1: 81.908 ms 504.850 ms

    System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
      L trsV 0: 0.000 ms 123.744 ms
      L trsV 1: 126.362 ms 38.149 ms
      U trsV 0: 0.000 ms 124.754 ms
      U trsV 1: 153.170 ms 20.393 ms
      L trsM 0: 0.000 ms 195.719 ms
      L trsM 1: 129.190 ms 194.775 ms
      U trsM 0: 0.001 ms 1602.427 ms
      U trsM 1: 153.451 ms 1605.654 ms

    System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
      L trsV 0: 0.000 ms 305.731 ms
      L trsV 1: 254.925 ms 85.023 ms
      U trsV 0: 0.001 ms 308.497 ms
      U trsV 1: 314.559 ms 42.708 ms
      L trsM 0: 0.001 ms 712.193 ms
      L trsM 1: 257.358 ms 712.314 ms
      U trsM 0: 0.000 ms 4575.225 ms
      U trsM 1: 307.589 ms 4575.189 ms

    System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
      L trsV 0: 0.000 ms 744.812 ms
      L trsV 1: 587.699 ms 206.488 ms
      U trsV 0: 0.000 ms 755.641 ms
      U trsV 1: 664.784 ms 102.924 ms
      L trsM 0: 0.001 ms 2723.421 ms
      L trsM 1: 581.394 ms 2738.595 ms
      U trsM 0: 0.000 ms 13236.210 ms
      U trsM 1: 662.634 ms 13236.147 ms
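For anyone who wants to check the before/after difference numerically, the following C++ sketch computes the worst ratio of optimized to plain trsm time for the lower triangular (L trsM) rows reported in this thread. The timing values are copied from the two outputs (toolkit 2025.0.0 and 2025.2.1); the type and function names are made up for the illustration.

```cpp
#include <algorithm>
#include <vector>

// One pair of measured trsm solve times, in milliseconds.
struct TrsmTiming {
    double plain_ms;      // "trsM 0": no optimize call
    double optimized_ms;  // "trsM 1": after optimize_trsm
};

// L trsM solve times from the 2025.0.0 output (matrix13, 16, 20, 25).
std::vector<TrsmTiming> l_trsm_2025_0() {
    return {{50.892, 1943.417}, {176.042, 6662.093},
            {629.874, 23408.675}, {2410.369, 84177.905}};
}

// L trsM solve times from the 2025.2.1 output (same matrices).
std::vector<TrsmTiming> l_trsm_2025_2_1() {
    return {{55.670, 55.382}, {195.719, 194.775},
            {712.193, 712.314}, {2723.421, 2738.595}};
}

// Largest optimized/plain ratio; values above 1 mean optimize made trsm slower.
double worst_slowdown(const std::vector<TrsmTiming>& timings) {
    double worst = 0.0;
    for (const auto& t : timings) {
        worst = std::max(worst, t.optimized_ms / t.plain_ms);
    }
    return worst;
}
```

On the 2025.0.0 numbers the worst ratio is a little over 38, matching the "about 40x" in the original report; on the 2025.2.1 numbers it stays within about 1% of 1, so the optimized and plain trsm times are effectively the same.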