Intel® oneAPI Math Kernel Library

sparse::optimize_trsm makes the trsm function slower on GPU

JakubH

Hello,

I am using the sparse::trsm function and I discovered that calling sparse::optimize_trsm first horribly slows down sparse::trsm, instead of speeding it up as the name suggests. For lower triangular matrices, sparse::trsm is about 40x slower with sparse::optimize_trsm than without it.

I created example code and matrices that demonstrate it; see the attachment. There are restrictions on file uploads, so I have to use a OneDrive link: onemkl_sparse_trs_optimize.zip

 

 

Compile with `make` and run with `make run`.

I think the code does not need much explanation: it just loads the triangular matrices from files and runs the trsv and trsm functions, each optionally preceded by the corresponding optimize call. I test both lower and upper triangular matrices. The matrices are actual matrices I exported from the app where I use them.

I measure the time as an average of 3 runs, preceded by one extra warmup run. I use the 2025.0.0 version of the Intel toolkit, and a Datacenter GPU Max 1550 GPU on a Tiber devcloud instance.
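For readers without the attachment, the measurement scheme described above (one untimed warmup run, then the mean of 3 timed runs) could look roughly like the sketch below. This is not the code from the zip file; `average_ms` and its parameters are names made up for illustration, and `work` is assumed to block until the device has finished (for example by waiting on the SYCL queue), since otherwise only the submission would be timed.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: runs `work` once as an untimed warmup, then `runs`
// more times, and returns the mean wall-clock time of the timed runs in ms.
// Assumption: `work` returns only after the device has finished its job.
inline double average_ms(const std::function<void()>& work, int runs = 3) {
    using clock = std::chrono::steady_clock;
    work();  // warmup run, not included in the average
    double total_ms = 0.0;
    for (int i = 0; i < runs; ++i) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total_ms / runs;
}
```

With this shape, the optimize call and the trsv/trsm call can each be wrapped in their own lambda to get the two time columns separately.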

This is the output of the example code that I observe. In each row, L is lower triangular and U is upper triangular; trsV and trsM name the kernel; 0 means the optimize function is not called and 1 means it is. The first time is the optimize time, the second is the time of the actual trsv/trsm call:

 

 

Device used:
  Name: Intel(R) Data Center GPU Max 1550
  Platform: Intel(R) oneAPI Unified Runtime over Level-Zero
  Global memory: 65536 MiB

System matrix13.txt, size=2744, nrhs=984, nnz=249065:
    L trsV 0:        0.001 ms         51.836 ms
    L trsV 1:       68.626 ms         15.682 ms
    U trsV 0:        0.000 ms         50.325 ms
    U trsV 1:       91.057 ms          8.652 ms
    L trsM 0:        0.000 ms         50.892 ms
    L trsM 1:       69.654 ms       1943.417 ms
    U trsM 0:        0.000 ms        472.677 ms
    U trsM 1:       92.079 ms        877.005 ms

System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
    L trsV 0:        0.000 ms        123.272 ms
    L trsV 1:      142.267 ms         36.950 ms
    U trsV 0:        0.000 ms        120.813 ms
    U trsV 1:      176.823 ms         19.336 ms
    L trsM 0:        0.000 ms        176.042 ms
    L trsM 1:      145.710 ms       6662.093 ms
    U trsM 0:        0.001 ms       1492.967 ms
    U trsM 1:      172.043 ms       3043.670 ms

System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
    L trsV 0:        0.001 ms        303.964 ms
    L trsV 1:      287.290 ms         81.386 ms
    U trsV 0:        0.000 ms        295.895 ms
    U trsV 1:      340.745 ms         40.465 ms
    L trsM 0:        0.001 ms        629.874 ms
    L trsM 1:      294.028 ms      23408.675 ms
    U trsM 0:        0.000 ms       4215.270 ms
    U trsM 1:      336.799 ms      10028.684 ms

System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
    L trsV 0:        0.000 ms        744.029 ms
    L trsV 1:      612.992 ms        205.010 ms
    U trsV 0:        0.000 ms        729.860 ms
    U trsV 1:      720.854 ms        100.706 ms
    L trsM 0:        0.001 ms       2410.369 ms
    L trsM 1:      632.525 ms      84177.905 ms
    U trsM 0:        0.001 ms      12089.900 ms
    U trsM 1:      722.277 ms      36535.894 ms

 

 

For the trsV function, it is alright. The optimize is not worth calling for only a single trsv call, but it actually makes the trsv faster, so after a few iterations the total time will be shorter with optimize.
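The "after a few iterations" point can be made concrete with a small break-even calculation: optimize pays off once optimize_time + n * optimized_solve_time < n * plain_solve_time. The helper below is only an illustrative sketch (the function name is made up), and it assumes the optimize step is done once and every solve takes the time listed in the table.

```cpp
#include <cmath>

// Smallest number of solves n for which
//   optimize_ms + n * solve_opt_ms < n * solve_plain_ms.
// Returns -1 if the optimized solve is not faster, so optimize never pays off.
inline long break_even_calls(double optimize_ms, double solve_plain_ms,
                             double solve_opt_ms) {
    const double saving_per_call = solve_plain_ms - solve_opt_ms;
    if (saving_per_call <= 0.0) return -1;
    return static_cast<long>(std::floor(optimize_ms / saving_per_call)) + 1;
}
```

For matrix13.txt, the L trsV numbers (68.626 ms optimize, 51.836 ms vs 15.682 ms per solve) give a break-even of 2 calls, and the U trsV numbers give 3. Feeding in the L trsM numbers returns -1, because the optimized solve is the slower one.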

The trsM is completely bad. The optimize makes the actual trsm call horribly slower. This should not happen. I would understand if the optimize was "not worth its cost", speeding up the trsm only a little bit. But making it actually slower is very unexpected.

Am I doing something wrong? Is this expected? Can this please be fixed?

Thanks,

Jakub

shb
Employee

Hello @JakubH ,

Thank you for using oneMKL and reaching out to us about this issue. It happens not because trsM is bad, but because trsM without calling optimize_trsm() is actually pretty good. You can see this if you compare the runtime of trsV * nrhs with the runtime of the corresponding default trsM.
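The trsV * nrhs comparison can be worked out directly from the posted table. The snippet below is a minimal sketch of that arithmetic (the function name is made up), assuming that solving nrhs right-hand sides one at a time costs nrhs times the single trsV time.

```cpp
// Ratio between nrhs separate trsv solves and one trsm solve over the same
// right-hand sides. Values above 1 mean the single trsm call is faster.
inline double trsv_loop_over_trsm(double trsv_ms, int nrhs, double trsm_ms) {
    return (trsv_ms * nrhs) / trsm_ms;
}
```

For matrix13.txt (nrhs=984), 984 unoptimized L trsV solves at 51.836 ms each would take about 51 s, roughly 1000 times the 50.892 ms of the single default L trsM call. Even the optimized L trsV (15.682 ms per solve) would need about 8 times as long as the slow optimized L trsM (1943.417 ms).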

The issue is fixed internally and the fixed version will be available in the oneMKL 2025.1 release, so that trsM 1 will at least be no worse than the corresponding trsM 0.

Thank you for your report, which helps a lot to improve oneMKL sparse::trsm() functionality!  Let us know if you have further questions or comments.

Best,

Seung-hee

Gajanan_Choudhary

Hi @JakubH,

 

Thanks for reaching out to us about your issue. We were able to reproduce your timings on our end and your issue is valid.

 

I also examined your code and want to point out one issue in it, although it is unrelated to the problem you are seeing, namely trsm() calls becoming significantly slower (which we will work on fixing). The oneMKL sparse BLAS optimize_xxx() SYCL APIs can only deliver performance when the matrix handle itself is reused, not just the input data. This is because the optimize_trsm() call creates internal optimizations specific to the sparse matrix data and stores them in the matrix handle. In your code, you are allocating and setting up the handle, calling optimize_trsm and trsm, and freeing the handle, all inside a `for` loop. The internal optimizations are therefore created and destroyed again in every iteration of the `for` loop (meaning the total optimize_trsm time increases). Ideally, the creation and destruction of the matrix handle (and, if possible, the call to the optimize_xxx() functions) should be placed outside the `for` loop. That would only reduce the `optimize_trsm` timings, however; as mentioned earlier, it is unrelated to your report that `trsm` calls slow down significantly instead of speeding up.
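The difference between the two loop structures can be illustrated without the library. The sketch below deliberately does not use the oneMKL API: `FakeHandle`, `fake_optimize` and `fake_solve` are hypothetical stand-ins for the matrix handle, optimize_trsm() and trsm(), and the code only counts how often the analysis step runs in each pattern.

```cpp
#include <cstddef>

// Hypothetical stand-in for a sparse matrix handle; NOT the oneMKL type.
struct FakeHandle {
    bool optimized = false;  // pretend analysis data lives in the handle
};

struct Counters {
    std::size_t optimize_calls = 0;
    std::size_t solve_calls = 0;
};

// Stand-in for optimize_trsm(): stores analysis data in the handle.
inline void fake_optimize(FakeHandle& h, Counters& c) {
    h.optimized = true;
    ++c.optimize_calls;
}

// Stand-in for trsm(): would use whatever analysis data the handle holds.
inline void fake_solve(const FakeHandle&, Counters& c) { ++c.solve_calls; }

// Handle created, optimized, used and destroyed in every iteration.
inline Counters handle_per_iteration(std::size_t iterations) {
    Counters c;
    for (std::size_t i = 0; i < iterations; ++i) {
        FakeHandle h;         // analysis data starts empty each time
        fake_optimize(h, c);  // so the analysis is rebuilt each time
        fake_solve(h, c);
    }
    return c;
}

// Handle and optimize hoisted out of the loop; only the solve repeats.
inline Counters handle_reused(std::size_t iterations) {
    Counters c;
    FakeHandle h;
    fake_optimize(h, c);
    for (std::size_t i = 0; i < iterations; ++i) fake_solve(h, c);
    return c;
}
```

Both variants perform the same number of solves, but the first one pays the optimize cost once per iteration while the second pays it once in total.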

 

We will look into this as soon as possible and report back here once we have a fix. Thank you for your patience in the meantime.

 

Regards,

Gajanan Choudhary

Intel(R) oneAPI Math Kernel Library (oneMKL) team

JakubH

Hi,

thanks for the reply and for the effort put into fixing this.

I am aware that the optimization data is stored inside the matrix handle and that I should reuse the handle; this was just a simple example to demonstrate the slowdown. But thanks for pointing it out.

Jakub
