Hello,
Is it possible to manually manage all the memory that oneMKL needs?
Here (https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2024-2/onemkl-initialization-on-gpu.html), it says that one of the initialization phases is memory allocation/freeing.
However, I would like to manage all the GPU memory myself, so that oneMKL would not perform any allocations on its own. The linked page points to some support functions and to redefining memory functions, but (1) those seem to be CPU-side only, and (2) they do not really cut it for me.
Instead, I would like to query the buffer size a particular oneMKL function will need, allocate the memory myself, and then pass it to the oneMKL function as a workspace buffer. I am able to respect rules about alignment and about not touching the buffer between calls.
I am trying to utilize GPU memory to the maximum, but I feel I am not fully able to, since I have to leave some memory free for the oneMKL functions to allocate, and I don't know how much. I have not run into out-of-memory problems yet, but I have no guarantee that it will never happen.
oneapi::mkl::blas and oneapi::mkl::sparse are my main concerns.
The CUDA and roc* (AMD) libraries do have this style of memory management (*_bufferSize functions and a workspace buffer parameter in cuSPARSE; a buffer-size stage and buffer parameters in rocSPARSE; rocblas_start_device_memory_size_query and rocblas_set_workspace in rocBLAS).
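To illustrate, here is a minimal sketch of that pattern using cuSPARSE's generic SpMV API (CUDA 11.2+; the matrix contents are just a toy example and error checks are omitted):

```cpp
// Sketch of the cuSPARSE workspace pattern for y = alpha*A*x + beta*y:
// 1) query the buffer size, 2) allocate it yourself, 3) pass it in.
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cstdio>

int main() {
    // Tiny 2x2 CSR matrix [[1, 2], [0, 3]], x = [1, 1].
    int   hRowPtr[] = {0, 2, 3};
    int   hColInd[] = {0, 1, 1};
    float hVals[]   = {1.f, 2.f, 3.f};
    float hX[] = {1.f, 1.f}, hY[] = {0.f, 0.f};

    int *dRowPtr, *dColInd; float *dVals, *dX, *dY;
    cudaMalloc(&dRowPtr, sizeof(hRowPtr)); cudaMalloc(&dColInd, sizeof(hColInd));
    cudaMalloc(&dVals, sizeof(hVals)); cudaMalloc(&dX, sizeof(hX)); cudaMalloc(&dY, sizeof(hY));
    cudaMemcpy(dRowPtr, hRowPtr, sizeof(hRowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(dColInd, hColInd, sizeof(hColInd), cudaMemcpyHostToDevice);
    cudaMemcpy(dVals, hVals, sizeof(hVals), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, hX, sizeof(hX), cudaMemcpyHostToDevice);
    cudaMemcpy(dY, hY, sizeof(hY), cudaMemcpyHostToDevice);

    cusparseHandle_t handle; cusparseCreate(&handle);
    cusparseSpMatDescr_t A; cusparseDnVecDescr_t x, y;
    cusparseCreateCsr(&A, 2, 2, 3, dRowPtr, dColInd, dVals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&x, 2, dX, CUDA_R_32F);
    cusparseCreateDnVec(&y, 2, dY, CUDA_R_32F);

    float alpha = 1.f, beta = 0.f;
    // Step 1: the library reports how much workspace it needs.
    size_t bufferSize = 0;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, A, x, &beta, y, CUDA_R_32F,
                            CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize);
    // Step 2: the caller allocates it (cudaMalloc here, but it could
    // just as well come from an application-managed pool).
    void *dBuffer = nullptr;
    cudaMalloc(&dBuffer, bufferSize);
    // Step 3: the workspace is passed in; the library allocates nothing.
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, A, x, &beta, y, CUDA_R_32F,
                 CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);

    cudaMemcpy(hY, dY, sizeof(hY), cudaMemcpyDeviceToHost);
    printf("y = [%g, %g]\n", hY[0], hY[1]);  // expected: y = [3, 3]

    cusparseDestroySpMat(A); cusparseDestroyDnVec(x); cusparseDestroyDnVec(y);
    cusparseDestroy(handle);
    cudaFree(dRowPtr); cudaFree(dColInd); cudaFree(dVals);
    cudaFree(dX); cudaFree(dY); cudaFree(dBuffer);
    return 0;
}
```

The key part is step 2: because the size is known up front, the workspace can come from any allocator the application controls.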
For Intel GPUs, I would like to achieve something similar with the oneMKL BLAS and sparse domains.
I did not find any more information about this in the docs. Is it possible? Did I miss something?
If it is not possible, let this be a feature request.
Thanks,
Jakub
Hello Jakub,
Thank you for posting in the forum! There has been discussion within the oneMKL team about this GPU memory-management feature. The current plan is to wait for the SYCL runtime to implement memory pooling features.
Thanks,
Fengrui
Hi @Jakub_H,
In addition to @Fengrui's reply, here are some more details applicable to the Intel oneMKL Sparse BLAS domain on the SYCL API side:
As of the oneMKL 2024.2.x release(s), what you want is only (partially) possible with the sparse * sparse = sparse matrix multiplication API, oneapi::mkl::sparse::matmat, and, in the next release, also with an upcoming sparse + sparse = sparse matrix addition API. Unfortunately, we do not have this feature of user-provided temporary workspaces in our base level-2/3 APIs (the gemv/trmv/symv/trsv/gemm APIs). For these and some other APIs, when required, oneMKL internally allocates, maintains, and frees the temporary workspace(s). Any such temporary workspace lives as long as the sparse::matrix_handle_t does and is freed in the call to oneapi::mkl::sparse::release_matrix_handle. These internal optimizations are reused across different sparse BLAS API calls. For example, if oneMKL decides to generate an internal transpose of the matrix, it may be reused in subsequent calls to speed up computations. Of course, that can be detrimental to certain use cases (particularly if you use a matrix handle just once and then free it immediately).
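To make the current model concrete, here is a minimal sketch of the handle lifetime with the USM-based sparse gemv API (device pointers assumed already allocated and filled; error handling omitted):

```cpp
// Sketch: today, any temporary workspace oneMKL creates is owned by the
// library and tied to the matrix handle's lifetime.
#include <oneapi/mkl.hpp>
#include <sycl/sycl.hpp>

namespace sparse = oneapi::mkl::sparse;

// Computes y = A * x for a CSR matrix A with nrows rows and ncols columns.
void spmv_today(sycl::queue &q, std::int32_t nrows, std::int32_t ncols,
                std::int32_t *row_ptr, std::int32_t *col_ind, float *vals,
                float *x, float *y) {
    sparse::matrix_handle_t A = nullptr;
    sparse::init_matrix_handle(&A);
    auto ev_set = sparse::set_csr_data(q, A, nrows, ncols,
                                       oneapi::mkl::index_base::zero,
                                       row_ptr, col_ind, vals);
    // Optional analysis step: the library may allocate internal structures
    // here, with no way for the caller to provide that memory.
    auto ev_opt = sparse::optimize_gemv(q, oneapi::mkl::transpose::nontrans,
                                        A, {ev_set});
    auto ev_gemv = sparse::gemv(q, oneapi::mkl::transpose::nontrans,
                                1.0f, A, x, 0.0f, y, {ev_opt});
    // Releasing the handle is what frees any library-owned temporaries.
    sparse::release_matrix_handle(q, &A, {ev_gemv});
    q.wait();
}
```

If the handle is reused for many gemv calls, those one-time internal allocations amortize well; the use-once-then-free case is where they hurt, as noted above.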
We recognize a growing population of users who desire full control over memory versus simplicity in the API, and we are working out how to balance the two. The first step we already implemented towards that (a departure from the Inspector-Executor Sparse BLAS C/Fortran APIs) was having users always maintain ownership of the sparse matrix arrays, instead of the library allocating and owning those arrays in some cases, such as matmat. The second step (not yet fully realized) will be working out how to manage temporary workspaces for reuse across APIs, with clear lifetimes when they are provided. Ideally, we want user-provided workspaces to be optional rather than compulsory, as they are in some other libraries, but that will take some work.
We do have a plan on our roadmap to introduce APIs with user-provided temporary workspaces, but it will take us some time to get there. You can check out the upcoming version of the oneAPI Specification, e.g., spmv, for an idea of what the APIs may look like some day (though not necessarily exactly the same).
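As a rough illustration only: based on the draft specification, the flow with a user-provided workspace could look something like the sketch below. All names and signatures here are taken from the in-progress spec and may well change before (or after) they are implemented:

```cpp
// Illustrative sketch of the draft oneAPI spec sparse spmv flow with a
// user-provided workspace; not a shipping oneMKL API, subject to change.
#include <oneapi/mkl.hpp>
#include <sycl/sycl.hpp>

namespace sparse = oneapi::mkl::sparse;

// Assumes the A, x, y handles were created beforehand (e.g. via the draft
// spec's init_csr_matrix and init_dense_vector).
void spmv_draft_spec(sycl::queue &q, sparse::matrix_handle_t A,
                     sparse::dense_vector_handle_t x,
                     sparse::dense_vector_handle_t y) {
    float alpha = 1.0f, beta = 0.0f;
    sparse::matrix_view A_view;          // defaults to a general matrix
    sparse::spmv_descr_t descr = nullptr;
    sparse::init_spmv_descr(q, &descr);

    // Step 1: query the temporary workspace size.
    std::size_t temp_buffer_size = 0;
    sparse::spmv_buffer_size(q, oneapi::mkl::transpose::nontrans, &alpha,
                             A_view, A, x, &beta, y,
                             sparse::spmv_alg::default_alg, descr,
                             temp_buffer_size);

    // Step 2: the caller allocates the workspace, e.g. from its own pool.
    void *workspace = sycl::malloc_device(temp_buffer_size, q);

    // Step 3: hand the workspace to the optimize step, then execute.
    auto ev_opt = sparse::spmv_optimize(q, oneapi::mkl::transpose::nontrans,
                                        &alpha, A_view, A, x, &beta, y,
                                        sparse::spmv_alg::default_alg, descr,
                                        workspace);
    sparse::spmv(q, oneapi::mkl::transpose::nontrans, &alpha, A_view, A, x,
                 &beta, y, sparse::spmv_alg::default_alg, descr,
                 {ev_opt}).wait();

    sparse::release_spmv_descr(q, descr);
    sycl::free(workspace, q);
}
```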
Hope that helps,
Gajanan
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Thanks for the detailed answer. So it currently works the way I thought it does.
It also looks like things are moving in the right direction from my point of view. I would appreciate having the option, not the obligation, to manually manage memory, and I understand that finding the right balance is not easy.
I'm looking forward to what the future will bring; it looks promising. Thanks,
Jakub
