Hello,
I am using the sparse::trsm function and I discovered that sparse::optimize_trsm horribly slows down the sparse::trsm function, instead of speeding it up as the name suggests. The sparse::trsm is about 40x slower if I use sparse::optimize_trsm versus if I don't.
I created example code and matrices that demonstrate it; see the attachment. There are restrictions on file uploads, so I have to use a OneDrive link: onemkl_sparse_trs_optimize.zip
Compile with `make` and run with `make run`.
I think the code does not need much explanation - it just loads the triangular matrices from files, and performs the trsv and trsm function with an optional optimize. I test both lower and upper triangular matrices. The matrices are actual matrices I exported from the app where I use them.
I measure the time as an average of 3 runs, there is one extra warmup run. I use the 2025.0.0 version of the Intel toolkit, and a Datacenter GPU Max 1550 GPU on a Tiber devcloud instance.
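For readers who want to reproduce the measurement scheme described above (one untimed warmup run, then the mean of 3 timed runs), a minimal sketch could look like the following. The helper name `average_ms` is hypothetical and is not taken from the attached code; it also assumes that the callable blocks until the device work has finished (for SYCL, for example by waiting on the queue inside it).

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not from the attached example): runs `work` once
// untimed as a warmup, then returns the mean wall-clock time in milliseconds
// over `timed_runs` further runs.
double average_ms(const std::function<void()>& work, int timed_runs = 3) {
    using clock = std::chrono::steady_clock;
    work();  // warmup run, not timed
    double total_ms = 0.0;
    for (int i = 0; i < timed_runs; ++i) {
        const auto start = clock::now();
        work();  // assumption: this call returns only after the work is complete
        const auto stop = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total_ms / timed_runs;
}
```

With the default argument the callable is invoked four times in total: once for the warmup and three times for the average.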
This is the output of the example code that I observe (in each row of the output - L is lower triangular, U is upper triangular; trsV and trsM kernels; 0 for don't optimize, 1 means I use the optimize function; the first time is the optimize time, the second time is the actual trsv/trsm function time):
Device used:
Name: Intel(R) Data Center GPU Max 1550
Platform: Intel(R) oneAPI Unified Runtime over Level-Zero
Global memory: 65536 MiB
System matrix13.txt, size=2744, nrhs=984, nnz=249065:
L trsV 0: 0.001 ms 51.836 ms
L trsV 1: 68.626 ms 15.682 ms
U trsV 0: 0.000 ms 50.325 ms
U trsV 1: 91.057 ms 8.652 ms
L trsM 0: 0.000 ms 50.892 ms
L trsM 1: 69.654 ms 1943.417 ms
U trsM 0: 0.000 ms 472.677 ms
U trsM 1: 92.079 ms 877.005 ms
System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
L trsV 0: 0.000 ms 123.272 ms
L trsV 1: 142.267 ms 36.950 ms
U trsV 0: 0.000 ms 120.813 ms
U trsV 1: 176.823 ms 19.336 ms
L trsM 0: 0.000 ms 176.042 ms
L trsM 1: 145.710 ms 6662.093 ms
U trsM 0: 0.001 ms 1492.967 ms
U trsM 1: 172.043 ms 3043.670 ms
System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
L trsV 0: 0.001 ms 303.964 ms
L trsV 1: 287.290 ms 81.386 ms
U trsV 0: 0.000 ms 295.895 ms
U trsV 1: 340.745 ms 40.465 ms
L trsM 0: 0.001 ms 629.874 ms
L trsM 1: 294.028 ms 23408.675 ms
U trsM 0: 0.000 ms 4215.270 ms
U trsM 1: 336.799 ms 10028.684 ms
System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
L trsV 0: 0.000 ms 744.029 ms
L trsV 1: 612.992 ms 205.010 ms
U trsV 0: 0.000 ms 729.860 ms
U trsV 1: 720.854 ms 100.706 ms
L trsM 0: 0.001 ms 2410.369 ms
L trsM 1: 632.525 ms 84177.905 ms
U trsM 0: 0.001 ms 12089.900 ms
U trsM 1: 722.277 ms 36535.894 ms
For the trsV function, the behavior is fine. The optimize call is not worth it for a single trsv call, but it does make the trsv itself faster, so after a few iterations the total time is shorter with optimize.
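To put a number on "a few iterations", the break-even point can be estimated from the timings above. This is only a sketch: the function name `break_even_calls` is hypothetical, and it assumes that the optimize cost is paid once and that the per-call solve times stay constant.

```cpp
#include <cmath>

// Hypothetical helper: smallest number of solve calls n for which
//   optimize_ms + n * solve_opt_ms <= n * solve_plain_ms.
// Returns -1 when the optimized solve is not faster, in which case the
// optimize call never pays off.
long break_even_calls(double optimize_ms, double solve_plain_ms, double solve_opt_ms) {
    const double saved_per_call = solve_plain_ms - solve_opt_ms;
    if (saved_per_call <= 0.0) return -1;
    return static_cast<long>(std::ceil(optimize_ms / saved_per_call));
}
```

Plugging in the matrix13.txt numbers: `break_even_calls(68.626, 51.836, 15.682)` gives 2 for L trsV and `break_even_calls(91.057, 50.325, 8.652)` gives 3 for U trsV, while the L trsM row (`69.654, 50.892, 1943.417`) gives -1 because the optimized call is slower.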
The trsM is completely bad. The optimize makes the actual trsm call horribly slower. This should not happen. I would understand if the optimize was "not worth its cost", speeding up the trsm only a little bit. But making it actually slower is very unexpected.
Am I doing something wrong? Is this expected? Can this please be fixed?
Thanks,
Jakub
Hi @JakubH,
Thanks for reaching out to us about your issue. We were able to reproduce your timings on our end and your issue is valid.
I also examined your code. I do want to point out one issue in it, although it is unrelated to the problem you are seeing with the trsm() call timings getting significantly worse (which we will work on fixing). The optimize_xxx() SYCL APIs in the oneMKL sparse BLAS domain can only deliver a performance benefit when the matrix handle is reused, not just the input data. This is because the internal optimizations for the TRSM APIs, which are specific to the sparse matrix data, are created and stored in the matrix handle during the optimize_trsm() call. In your code, you are allocating and setting up the handle, calling optimize_trsm and trsm, and freeing the handle, all inside a `for` loop. That causes the internal optimizations to be repeatedly created and destroyed in the `for` loop as well (meaning the optimize_trsm cost is paid on every iteration). Ideally, the creation and destruction of the matrix handle (and, if possible, the call to the optimize_xxx() functions too) should be placed outside the `for` loop. That would only reduce the `optimize_trsm` timings, however, and as mentioned earlier, it is unrelated to your particular report that `trsm` calls slow down significantly instead of speeding up.
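The difference between the two loop structures can be sketched with mock stand-ins. Nothing below is the oneMKL API: `MockHandle`, `create_handle`, `optimize`, and `solve` are hypothetical placeholders for setting up the matrix handle, calling optimize_trsm, and calling trsm, and the counters only show how often each step runs.

```cpp
// Mock stand-ins only: these are NOT oneMKL types or calls.
struct MockHandle {
    bool optimized = false;
};

struct Counters {
    int handles_created = 0;
    int optimize_calls = 0;
    int solve_calls = 0;
};

MockHandle create_handle(Counters& c) { ++c.handles_created; return {}; }
void optimize(MockHandle& h, Counters& c) { ++c.optimize_calls; h.optimized = true; }
void solve(const MockHandle&, Counters& c) { ++c.solve_calls; }

// Pattern in the posted example: handle setup, optimize, and solve all
// happen inside the loop, so the optimization data is rebuilt every time.
Counters handle_per_iteration(int iterations) {
    Counters c;
    for (int i = 0; i < iterations; ++i) {
        MockHandle h = create_handle(c);
        optimize(h, c);
        solve(h, c);
    }
    return c;
}

// Suggested pattern: create and optimize the handle once, then reuse it.
Counters handle_reused(int iterations) {
    Counters c;
    MockHandle h = create_handle(c);
    optimize(h, c);
    for (int i = 0; i < iterations; ++i) solve(h, c);
    return c;
}
```

For four iterations, the first variant creates and optimizes four handles, while the second creates and optimizes one and still performs four solves.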
We will look into this as soon as possible and report back here once we have a fix. Thank you for your patience in the meantime.
Regards,
Gajanan Choudhary
Intel(R) oneAPI Math Kernel Library (oneMKL) team
Hi,
Thanks for the reply and for the effort to fix this.
I am aware that the optimization data is stored inside the matrix handle and that I should reuse the handle; this was just a simple example to demonstrate the slowdown. But thanks for pointing it out.
Jakub