Solved: sparse::optimize_trsm makes the trsm function slower on GPU

JakubH · ‎11-17-2024

Hello,

I am using the sparse::trsm function and I discovered that calling sparse::optimize_trsm slows down sparse::trsm badly, instead of speeding it up as the name suggests. For lower triangular matrices, sparse::trsm is about 40x slower if I use sparse::optimize_trsm than if I don't.

I created an example code and matrices where I demonstrate it, see the attachment. There are restrictions to file uploads, so I have to use a onedrive link: onemkl_sparse_trs_optimize.zip

Compile with `make` and run with `make run`.

I think the code does not need much explanation: it loads the triangular matrices from files and runs trsv and trsm, each with and without the optimize call. I test both lower and upper triangular matrices. The matrices are actual matrices I exported from the app where I use them.

I measure the time as an average of 3 runs, there is one extra warmup run. I use the 2025.0.0 version of the Intel toolkit, and a Datacenter GPU Max 1550 GPU on a Tiber devcloud instance.
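The measurement scheme described above (one untimed warmup run, then the mean over 3 timed runs) can be sketched as follows. This is a minimal illustration and not the code from the attachment; the name `average_ms` and its parameters are made up for this example. For asynchronous GPU work, the callable passed in would need to block until the kernel has finished, otherwise only the submission is timed.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `fn` `warmup` times untimed, then returns the
// mean wall-clock time in milliseconds over `runs` timed calls.
// `fn` must block until its work is complete.
template <typename Fn>
double average_ms(Fn&& fn, int runs = 3, int warmup = 1) {
    assert(runs > 0 && warmup >= 0);
    for (int i = 0; i < warmup; ++i) {
        fn();  // warmup: excluded from the average
    }
    using clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    for (int i = 0; i < runs; ++i) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms / runs;
}
```

With the defaults, the callable is invoked four times in total: once for warmup and three times for the timed average.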

This is the output of the example code that I observe. In each row, L means lower triangular and U means upper triangular; trsV and trsM name the kernel; 0 means optimize was not called and 1 means it was; the first time is the optimize time and the second time is the time of the actual trsv/trsm call:

Device used:
  Name: Intel(R) Data Center GPU Max 1550
  Platform: Intel(R) oneAPI Unified Runtime over Level-Zero
  Global memory: 65536 MiB

System matrix13.txt, size=2744, nrhs=984, nnz=249065:
    L trsV 0:        0.001 ms         51.836 ms
    L trsV 1:       68.626 ms         15.682 ms
    U trsV 0:        0.000 ms         50.325 ms
    U trsV 1:       91.057 ms          8.652 ms
    L trsM 0:        0.000 ms         50.892 ms
    L trsM 1:       69.654 ms       1943.417 ms
    U trsM 0:        0.000 ms        472.677 ms
    U trsM 1:       92.079 ms        877.005 ms

System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
    L trsV 0:        0.000 ms        123.272 ms
    L trsV 1:      142.267 ms         36.950 ms
    U trsV 0:        0.000 ms        120.813 ms
    U trsV 1:      176.823 ms         19.336 ms
    L trsM 0:        0.000 ms        176.042 ms
    L trsM 1:      145.710 ms       6662.093 ms
    U trsM 0:        0.001 ms       1492.967 ms
    U trsM 1:      172.043 ms       3043.670 ms

System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
    L trsV 0:        0.001 ms        303.964 ms
    L trsV 1:      287.290 ms         81.386 ms
    U trsV 0:        0.000 ms        295.895 ms
    U trsV 1:      340.745 ms         40.465 ms
    L trsM 0:        0.001 ms        629.874 ms
    L trsM 1:      294.028 ms      23408.675 ms
    U trsM 0:        0.000 ms       4215.270 ms
    U trsM 1:      336.799 ms      10028.684 ms

System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
    L trsV 0:        0.000 ms        744.029 ms
    L trsV 1:      612.992 ms        205.010 ms
    U trsV 0:        0.000 ms        729.860 ms
    U trsV 1:      720.854 ms        100.706 ms
    L trsM 0:        0.001 ms       2410.369 ms
    L trsM 1:      632.525 ms      84177.905 ms
    U trsM 0:        0.001 ms      12089.900 ms
    U trsM 1:      722.277 ms      36535.894 ms

For the trsV function, it is alright. The optimize is not worth calling for only a single trsv call, but it makes the trsv actually faster, so after a few iterations, the total time will be shorter with optimize.
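To make "a few iterations" concrete, the break-even point can be estimated from the timings above. The sketch below is only arithmetic on the reported numbers; the function name `break_even_calls` is hypothetical, and it assumes the per-call times stay constant across repeated calls.

```cpp
#include <cassert>
#include <cmath>

// Smallest number of solve calls n such that
//   optimize_ms + n * solve_opt_ms <= n * solve_plain_ms.
// Returns -1 when the optimized solve is not faster, so optimize never pays off.
long break_even_calls(double optimize_ms, double solve_plain_ms, double solve_opt_ms) {
    assert(optimize_ms >= 0.0);
    const double saved_per_call = solve_plain_ms - solve_opt_ms;
    if (saved_per_call <= 0.0) {
        return -1;
    }
    return static_cast<long>(std::ceil(optimize_ms / saved_per_call));
}
```

For matrix13, `break_even_calls(68.626, 51.836, 15.682)` gives 2 for the lower triangular trsV and `break_even_calls(91.057, 50.325, 8.652)` gives 3 for the upper triangular one. For the lower triangular trsM (`69.654, 50.892, 1943.417`) it returns -1, because the optimized call is slower.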

The trsM is completely bad. The optimize makes the actual trsm call horribly slower. This should not happen. I would understand if the optimize was "not worth its cost", speeding up the trsm only a little bit. But making it actually slower is very unexpected.

Am I doing something wrong? Is this expected? Can this please be fixed?

Thanks,

Jakub

Gajanan_Choudhary · ‎11-18-2024

Hi @JakubH,

Thanks for reaching out to us about your issue. We were able to reproduce your timings on our end and your issue is valid.

I also examined your code. I do want to point out one issue in it, although it is unrelated to the problem you are seeing, where trsm() call timings get significantly worse (which we will work on fixing). The oneMKL sparse BLAS optimize_xxx() SYCL APIs can only deliver performance when the matrix handle is reused, not just the input data. This is because the internal optimizations for TRSM APIs, which are specific to the sparse matrix data, are created and stored in the matrix handle during the optimize_trsm() API call. In your code, you are allocating and setting up the handle, calling optimize_trsm and trsm, and freeing the handle, all inside a `for` loop. That causes the internal optimizations to be repeatedly created and destroyed in the `for` loop as well (meaning the total time spent in optimize_trsm will increase). Ideally, the creation and destruction of the matrix handle (and if possible even the call to the optimize_xxx() functions) should be placed outside the `for` loop. That would only cause the `optimize_trsm` timings to drop, however, and as mentioned earlier, this is unrelated to your particular report that `trsm` calls have slowed down significantly instead of speeding up.
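The difference between the two loop structures can be illustrated with a small self-contained mock. Everything here (`MockHandle`, `Counters`, `mock_optimize`, `mock_solve`) is a hypothetical stand-in and not the oneMKL API; it only counts how often the analysis step would run in each pattern.

```cpp
#include <cassert>

// Hypothetical stand-ins, NOT oneMKL types or functions.
struct MockHandle { bool optimized = false; };
struct Counters { int optimize_calls = 0; int solve_calls = 0; };

// Stands in for the expensive analysis step (optimize_trsm in this thread).
void mock_optimize(MockHandle& h, Counters& c) {
    h.optimized = true;
    ++c.optimize_calls;
}

// Stands in for the solve (trsm in this thread); expects an optimized handle.
void mock_solve(const MockHandle& h, Counters& c) {
    assert(h.optimized);
    ++c.solve_calls;
}

// Benchmark pattern: handle created, optimized, and destroyed every iteration.
Counters handle_per_iteration(int iters) {
    Counters c;
    for (int i = 0; i < iters; ++i) {
        MockHandle h;         // new handle each time
        mock_optimize(h, c);  // analysis repeated each time
        mock_solve(h, c);
    }                         // handle and its stored optimizations are lost here
    return c;
}

// Suggested pattern: one handle, one analysis, many solves.
Counters handle_reused(int iters) {
    Counters c;
    MockHandle h;
    mock_optimize(h, c);
    for (int i = 0; i < iters; ++i) {
        mock_solve(h, c);
    }
    return c;
}
```

For the same number of solves, the first pattern pays the analysis cost once per iteration, while the second pays it once in total.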

We will look into this as soon as possible and report back here once we have a fix. Thank you for your patience in the meantime.

Regards,

Gajanan Choudhary

Intel(R) oneAPI Math Kernel Library (oneMKL) team

JakubH · ‎11-18-2024

Hi,

thanks for the reply and for the effort put into fixing this.

I am aware that the optimization data is inside the matrix handle and that I should reuse the handle, this was just a simple example to demonstrate the slowdown. But thanks for pointing it out.

Jakub

shb · ‎12-13-2024

Hello @JakubH ,

Thank you for using oneMKL and reaching out to us about this issue. It happens not because trsM is bad but because trsM without calling optimize_trsm() is actually pretty good. You can see this if you compare the runtime of trsV * nrhs with the runtime of the corresponding default trsM (without optimize).
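As a rough illustration of that comparison, the following sketch multiplies a trsV time by nrhs and relates it to a trsM time. The function names are hypothetical, and it assumes each reported trsV time is for a single right-hand-side vector.

```cpp
#include <cassert>

// Estimated time to solve all right-hand sides with one trsv call per column.
double repeated_trsv_ms(double trsv_ms, int nrhs) {
    assert(nrhs > 0);
    return trsv_ms * nrhs;
}

// How many times faster a single trsm call is than nrhs separate trsv calls.
double trsm_speedup_over_trsv(double trsv_ms, int nrhs, double trsm_ms) {
    assert(trsm_ms > 0.0);
    return repeated_trsv_ms(trsv_ms, nrhs) / trsm_ms;
}
```

For the lower triangular matrix13 case from the original post (trsV 0 at 51.836 ms, nrhs=984, trsM 0 at 50.892 ms), the repeated trsV estimate is about 51,000 ms, so the default trsM comes out roughly a thousand times faster under this assumption.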

The issue is fixed internally and the fixed version will be available in oneMKL 2025.1 release, so trsM 1, at least, won't be worse than the corresponding trsM 0.

Thank you for your report, which helps a lot to improve oneMKL sparse::trsm() functionality! Let us know if you have further questions or comments.

Best,

Seung-hee

JakubH · ‎12-16-2024

nice, thanks

Fengrui · ‎04-04-2025

Hi Jakub,

The oneMKL 2025.1 release is now available. Did you get a chance to verify the improvement?

Thanks,

Fengrui

JakubH · ‎04-06-2025

Hi,

I don't have access to the Intel GPUs now.

I will report back when I am able to test it.

Jakub

JakubH · ‎09-13-2025

Hi @Fengrui ,

I am working with Intel GPUs again. I tested the code with Intel toolkit version 2025.2.1, and I can confirm that, for my use case, trsm no longer slows down after optimize is called.

Thanks.

Here is the output I observe on Aurora:

Device used:
  Name: Intel(R) Data Center GPU Max 1550
  Platform: Intel(R) oneAPI Unified Runtime over Level-Zero
  Global memory: 65536 MiB

System matrix13.txt, size=2744, nrhs=984, nnz=249065:
    L trsV 0:        0.000 ms         52.310 ms
    L trsV 1:       61.868 ms         16.221 ms
    U trsV 0:        0.001 ms         52.682 ms
    U trsV 1:       81.683 ms          8.644 ms
    L trsM 0:        0.001 ms         55.670 ms
    L trsM 1:       62.183 ms         55.382 ms
    U trsM 0:        0.000 ms        503.136 ms
    U trsM 1:       81.908 ms        504.850 ms

System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
    L trsV 0:        0.000 ms        123.744 ms
    L trsV 1:      126.362 ms         38.149 ms
    U trsV 0:        0.000 ms        124.754 ms
    U trsV 1:      153.170 ms         20.393 ms
    L trsM 0:        0.000 ms        195.719 ms
    L trsM 1:      129.190 ms        194.775 ms
    U trsM 0:        0.001 ms       1602.427 ms
    U trsM 1:      153.451 ms       1605.654 ms

System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
    L trsV 0:        0.000 ms        305.731 ms
    L trsV 1:      254.925 ms         85.023 ms
    U trsV 0:        0.001 ms        308.497 ms
    U trsV 1:      314.559 ms         42.708 ms
    L trsM 0:        0.001 ms        712.193 ms
    L trsM 1:      257.358 ms        712.314 ms
    U trsM 0:        0.000 ms       4575.225 ms
    U trsM 1:      307.589 ms       4575.189 ms

System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
    L trsV 0:        0.000 ms        744.812 ms
    L trsV 1:      587.699 ms        206.488 ms
    U trsV 0:        0.000 ms        755.641 ms
    U trsV 1:      664.784 ms        102.924 ms
    L trsM 0:        0.001 ms       2723.421 ms
    L trsM 1:      581.394 ms       2738.595 ms
    U trsM 0:        0.000 ms      13236.210 ms
    U trsM 1:      662.634 ms      13236.147 ms