Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Memory leak, mkl::dft on Intel GPU, only on Windows?

NickKMPQ
Beginner

Hi, I've been running into a memory leak when doing DFTs with the oneAPI SYCL version of MKL's DFT. I've pasted a simplified program below that reproduces it. If I run it with a GPU/default selector, e.g. on the Xe graphics of my laptop under Windows, it is occupying 4 GB by the time it finishes after a few seconds. Running on cpu_selector_v, this doesn't happen, and if I run the same code under Linux, there is no apparent leak on the GPU either.

The memory use builds up over repeated calls to oneapi::mkl::dft::compute_forward().

Does anyone else see this behavior? Is there something I'm not doing right here?

This is with oneAPI 2023.2.

 

#include <sycl/sycl.hpp>
#include <oneapi/mkl/dfti.hpp>
#include <complex>
#include <cstdlib> // for system()

int main() {
    const int Ntime = 1024;
    const int Nfreq = Ntime / 2 + 1;
    const int NrepetitionsFFTs = 1048576 / 2;

    // create queue (default selector in my case will be Xe graphics GPU)
    sycl::queue q{ sycl::default_selector_v, sycl::property::queue::in_order() };

    // allocate device memory and zero the input
    float* deviceA = reinterpret_cast<float*>(
        sycl::malloc_device(
            Ntime * sizeof(float),
            q.get_device(),
            q.get_context()));
    std::complex<float>* deviceB = reinterpret_cast<std::complex<float>*>(
        sycl::malloc_device(
            Nfreq * sizeof(std::complex<float>),
            q.get_device(),
            q.get_context()));
    q.memset(deviceA, 0, Ntime * sizeof(float)).wait();

    // create and commit a single-precision, real-domain, out-of-place DFT descriptor
    auto fftDescriptorForward =
        oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE,
                                     oneapi::mkl::dft::domain::REAL>(Ntime);
    fftDescriptorForward.set_value(
        oneapi::mkl::dft::config_param::PLACEMENT,
        DFTI_CONFIG_VALUE::DFTI_NOT_INPLACE);
    fftDescriptorForward.commit(q);

    // repeat the FFT NrepetitionsFFTs times
    for (int j = 0; j < NrepetitionsFFTs; j++) {
        oneapi::mkl::dft::compute_forward(fftDescriptorForward, deviceA, deviceB);
    }
    q.wait();

    sycl::free(deviceA, q);
    sycl::free(deviceB, q);

    // keep the console open so memory use can be inspected before exit
#ifdef __linux__
    system("read");
#else
    system("pause");
#endif
    return 0;
}
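One way to narrow this down (a sketch on my part, not a confirmed diagnosis; the thread does not establish the root cause) is to synchronize periodically inside the loop, so the number of in-flight submissions stays bounded:

```cpp
// Hypothetical diagnostic variant of the benchmark loop above: waiting on
// the queue every 1024 iterations bounds the number of outstanding
// commands. If memory use stays flat with this change, the build-up is
// likely per-submission bookkeeping held by the runtime/driver rather
// than a leak inside the DFT computation itself.
for (int j = 0; j < NrepetitionsFFTs; j++) {
    oneapi::mkl::dft::compute_forward(fftDescriptorForward, deviceA, deviceB);
    if ((j % 1024) == 1023)
        q.wait();
}
```

If the growth persists even with the periodic waits, that would point more strongly at an allocation that is never released, rather than at queued-up work.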

 

0 Kudos
8 Replies
VarshaS_Intel
Moderator

Hi,

 

Thanks for posting in Intel Communities.

 

Thanks for providing the details. When we tried the sample reproducer code you provided, we saw that the memory utilization was normal, but the runtime was longer on Windows than on Linux.

Please find below the details of the machine where we are running the code:

CPU: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz 3.0 [2023.15.3.0.20_160000]

GPU: Intel(R) OpenCL HD Graphics, Intel(R) UHD Graphics 620 3.0 [31.0.101.2111]

 

>>it will be occupying 4 GB once it finishes after a few seconds. 

Could you please elaborate on your issue and provide us with all the observations from your side on GPU to investigate more from our end? 

Besides the memory leak, could you please let us know if you observed any crashes while running the sample code on GPU?

 

Thanks & Regards,

Varsha

 

NickKMPQ
Beginner

Thanks for the response. I have tried the same compiled binary on two laptops now, and interestingly it only happens on the newer one:

 

Exhibits memory leak:

12th Gen Intel(R) Core(TM) i7-1270P

Intel(R) Iris(R) Xe Graphics

 

Does not exhibit leak:

Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz
Intel(R) Iris(R) Graphics 550

 

I attached screenshots of Task Manager while the test runs on the two machines. You can see the continuous build-up of memory use only on the Xe Graphics model, and if I boot that one into Linux and compile/run the code there, there is no leak. It seems to be specific to Windows and this GPU.

 

I tried rolling back the graphics driver, but this didn't make a difference. I'm currently using the newest driver, 31.0.101.4577.

 

If I increase the number of repetitions such that it fills the memory, the program crashes with the message "Abort was called at 268 line in file:"
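To put numbers on the growth beyond watching Task Manager, one option is to log the process memory counters around the loop. This is a Windows-only sketch using the standard PSAPI call GetProcessMemoryInfo; the sampling points and the helper name are illustrative, not part of the original reproducer, and if the driver holds the shared GPU memory outside this process, it may not appear in these counters at all:

```cpp
#include <windows.h>
#include <psapi.h>
#include <cstdio>

// Print this process's working set and pagefile (commit) usage in MB.
// Calling this before the FFT loop, every few thousand iterations, and
// after the final q.wait() would show whether the growth tracks the
// number of submissions.
void printProcessMemory(const char* label) {
    PROCESS_MEMORY_COUNTERS pmc{};
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
        std::printf("%s: working set %zu MB, pagefile %zu MB\n",
                    label,
                    static_cast<size_t>(pmc.WorkingSetSize) / (1024 * 1024),
                    static_cast<size_t>(pmc.PagefileUsage) / (1024 * 1024));
    }
}
```

Depending on the SDK configuration this may need linking against Psapi.lib.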

VarshaS_Intel
Moderator

Hi,

 

We are working on your issue. We will get back to you soon.

Also, could you please let us know how much shared GPU memory is in use while the utilization is 0% on the Intel(R) Iris(R) Xe Graphics?

 

Thanks & Regards,

Varsha

 

NickKMPQ
Beginner

Hello,

thanks for looking into it. On this system, the baseline with everything closed is 0.3/7.8 GB. Is that what you are looking for?

Best,

Nick

VarshaS_Intel
Moderator

Hi,


Thanks for your reply.


>>is that what you are looking for?

Yes, this is what I am looking for.


We are working on your issue, we will get back to you soon.


Thanks & Regards,

Varsha


VarshaS_Intel
Moderator

Hi,


Apologies for the delay in my response, and thanks for your patience.


When we tried on the Intel Iris Xe Graphics, we did not observe any high shared-memory utilization. At our end, we are using driver version 31.0.101.4826.


Could you please upgrade your drivers, rerun the code, and let us know if you are still seeing the high shared-memory utilization?


Thanks & Regards,

Varsha


VarshaS_Intel
Moderator

Hi,


We have not heard back from you. Could you please provide us with an update on the issue?


Thanks & Regards,

Varsha


VarshaS_Intel
Moderator

Hi,


We have not heard back from you. Could you please try updating your driver and let us know if the issue persists?


Thanks & Regards,

Varsha

