
sycl::malloc_device takes 20 seconds to return when it fails on GPU

JakubH

Hello,

 

In SYCL (DPC++), when `sycl::malloc_device` fails because there is not enough device memory available, the call takes about 20 seconds to return.

I created an example program (see below) that makes two allocations. If the second allocation pushes the total past the device's memory capacity, it returns a nullptr, but only after 20 seconds (see the sample outputs below).

I am OK with it returning nullptr; that is reasonable when there is not enough memory. **But why does it take so long?**

It is almost exactly 20 seconds every time. The amount of free, occupied, or total memory makes no difference, using a PVC1100 or a PVC1550 makes no difference, and using different Intel toolkit versions (2024.0.2 through 2025.0.0) makes no difference. Even whether the device reports `aspect::ext_intel_free_memory` makes no difference.

 

I was testing this with a PVC1550 on a Tiber devcloud instance, and with a PVC1100 on the training nodes in the Tiber devcloud.

 

What is going on? I would appreciate any help with this.

 

Jakub

 

--------

 

compile with: `icpx -fsycl -qopenmp source.cpp -o program.x`

usage: `./program.x <GiB_first_alloc> <GiB_second_alloc>`

example: `./program.x 60 4`

 

example source code:

```

#include <cstdio>
#include <cstdlib>
#include <omp.h>
#include <sycl/sycl.hpp>

int main(int argc, char ** argv)
{
    if(argc <= 2)
    {
        fprintf(stderr, "not enough arguments\n");
        return 1;
    }
    size_t size1GiB = atoi(argv[1]);
    size_t size2GiB = atoi(argv[2]);

    size_t size1 = size1GiB << 30;
    size_t size2 = size2GiB << 30;

    void * ptr1 = nullptr;
    void * ptr2 = nullptr;

    sycl::device dev(sycl::gpu_selector_v);
    printf("Device:\n");
    printf("  Name: %s\n", dev.get_info<sycl::info::device::name>().c_str());
    printf("  Platform: %s\n", dev.get_info<sycl::info::device::platform>().get_info<sycl::info::platform::name>().c_str());
    printf("  Global memory: %lu MiB\n", dev.get_info<sycl::info::device::global_mem_size>() >> 20);
    printf("  Free memory aspect: %s\n", dev.has(sycl::aspect::ext_intel_free_memory) ? "YES" : "NO");
    sycl::context ctx(dev);
    printf("\n");

    printf("Allocating %zu MiB\n", size1 >> 20);
    double start1 = omp_get_wtime();
    ptr1 = sycl::malloc_device(size1, dev, ctx);
    double stop1 = omp_get_wtime();
    printf("  ptr1: %p\n", ptr1);
    printf("  time: %10.3f ms\n", (stop1 - start1) * 1000.0);
    printf("\n");
   
    printf("Allocating %zu MiB\n", size2 >> 20);
    double start2 = omp_get_wtime();
    ptr2 = sycl::malloc_device(size2, dev, ctx);
    double stop2 = omp_get_wtime();
    printf("  ptr2: %p\n", ptr2);
    printf("  time: %10.3f ms\n", (stop2 - start2) * 1000.0);
    printf("\n");

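    // Either pointer may be nullptr if its allocation failed, so free only valid ones.
    if(ptr1 != nullptr) sycl::free(ptr1, ctx);
    if(ptr2 != nullptr) sycl::free(ptr2, ctx);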

    printf("The end\n");

    return 0;
}

```

 

Output from a PVC1550 on a Tiber devcloud instance with Intel toolkit 2025.0.0, using FLAT mode for the 2-stack GPU:

```

$ ./program.x 60 5
Device:
Name: Intel(R) Data Center GPU Max 1550
Platform: Intel(R) oneAPI Unified Runtime over Level-Zero
Global memory: 65536 MiB
Free memory aspect: NO

Allocating 61440 MiB
ptr1: 0xff00000000200000
time: 21.875 ms

Allocating 5120 MiB
ptr2: (nil)
time: 20050.431 ms

The end

```

(Using COMPOSITE mode and running `./program.x 120 10`, the second allocation still takes 20 seconds.)

 

Output from a PVC1100 on the Tiber training nodes through JupyterLab, with Intel toolkit 2024.2.1:

```

$ ./program.x 45 5
Device:
Name: Intel(R) Data Center GPU Max 1100
Platform: Intel(R) Level-Zero
Global memory: 49152 MiB
Free memory aspect: YES

Allocating 46080 MiB
ptr1: 0xff00e00000200000
time: 0.219 ms

Allocating 5120 MiB
ptr2: (nil)
time: 20042.670 ms

The end

```

 
