Intel® oneAPI DPC++/C++ Compiler

sycl::malloc_host fails after 1005 allocations

JakubH
Novice

Hello,

I stumbled upon a problem with host allocations in SYCL. See the following example code: it simply calls sycl::malloc_host repeatedly in a loop and reports the first allocation that fails as well as the total number of failures.

 

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <sycl/sycl.hpp>

int main(int argc, char ** argv)
{
    // default allocation size and chunk count, optionally overridden from the command line
    size_t chunk_size = 100000;
    size_t num_chunks = 10000;
    if(argc > 1) chunk_size = atoi(argv[1]);
    if(argc > 2) num_chunks = atoi(argv[2]);

    printf("Allocation size is %zu B\n", chunk_size);

    sycl::device d(sycl::gpu_selector_v);
    sycl::queue q(d);

    std::vector<void*> ptrs(num_chunks);

    size_t total_fails = 0;

    // allocate num_chunks USM host buffers of chunk_size bytes each
    for(size_t i = 0; i < num_chunks; i++)
    {
        ptrs[i] = sycl::malloc_host(chunk_size, q);
        if(ptrs[i] == nullptr)
        {
            if(total_fails == 0)
            {
                printf("First failed alloc at iteration %zu\n", i);
                size_t allocced = chunk_size * i;
                printf("Total allocated so far: %zu B = %zu KiB = %zu MiB\n", allocced, allocced >> 10, allocced >> 20);
            }
            total_fails++;
        }
    }

    printf("Total failed allocations: %zu\n", total_fails);
    
    // free all chunks; entries that failed to allocate hold nullptr
    for(size_t i = 0; i < num_chunks; i++)
    {
        sycl::free(ptrs[i], q);
    }

    return 0;
}

 

Compile simply with `icpx -fsycl source.cpp -o program.x`

Running it yields the following output:

 

$ ./program.x
Allocation size is 100000 B
First failed alloc at iteration 1005
Total allocated so far: 100500000 B = 98144 KiB = 95 MiB
Total failed allocations: 8995

 

As you can see, it failed after 1005 allocations.

Increasing the allocation size from 100k to 1M:

 

$ ./program.x 1000000
Allocation size is 1000000 B
First failed alloc at iteration 1005
Total allocated so far: 1005000000 B = 981445 KiB = 958 MiB
Total failed allocations: 8995

 

you can see that it again fails after 1005 allocations. The same happens with a 10M allocation size.

Decreasing the allocation size increases the number of successful allocations, but the iteration of the first failure is still a multiple of 1005:

 

$ ./program.x 30000
Allocation size is 30000 B
First failed alloc at iteration 2010
Total allocated so far: 60300000 B = 58886 KiB = 57 MiB
Total failed allocations: 7990
$ ./program.x 15000
Allocation size is 15000 B
First failed alloc at iteration 4020
Total allocated so far: 60300000 B = 58886 KiB = 57 MiB
Total failed allocations: 5980
$ ./program.x 10000
Allocation size is 10000 B
First failed alloc at iteration 5025
Total allocated so far: 50250000 B = 49072 KiB = 47 MiB
Total failed allocations: 4975
$ ./program.x 2000 100000
Allocation size is 2000 B
First failed alloc at iteration 32160
Total allocated so far: 64320000 B = 62812 KiB = 61 MiB
Total failed allocations: 67840
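
In every case the first failure lands on a multiple of 1005: 2010 = 2 × 1005, 4020 = 4 × 1005, 5025 = 5 × 1005, 32160 = 32 × 1005.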

 

I guess this is some "implementation detail" I cannot learn more about. Of course I cannot expect to perform an unlimited number of allocations; there is always some limit. But this limit is not enough for me, and CUDA and HIP allow far more host allocations than this.
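
For comparison, here is a minimal sketch of the equivalent pinned-host allocation loop in CUDA (illustrative only, not the exact code I used when comparing against CUDA/HIP); cudaMallocHost plays the role of sycl::malloc_host:

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

int main(int argc, char ** argv)
{
    size_t chunk_size = 100000;
    size_t num_chunks = 10000;
    if(argc > 1) chunk_size = atoi(argv[1]);
    if(argc > 2) num_chunks = atoi(argv[2]);

    std::vector<void*> ptrs(num_chunks, nullptr);
    size_t total_fails = 0;

    for(size_t i = 0; i < num_chunks; i++)
    {
        // pinned (page-locked) host allocation, the CUDA analogue of sycl::malloc_host
        if(cudaMallocHost(&ptrs[i], chunk_size) != cudaSuccess)
        {
            if(total_fails == 0) printf("First failed alloc at iteration %zu\n", i);
            total_fails++;
        }
    }

    printf("Total failed allocations: %zu\n", total_fails);

    for(size_t i = 0; i < num_chunks; i++)
        if(ptrs[i]) cudaFreeHost(ptrs[i]);

    return 0;
}

Compiled with something like `nvcc source.cu -o program_cuda.x`.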

I am testing this on a private PVC1550 instance on the Tiber devcloud, Level Zero backend, driver 1.3.27191. I also tried a PVC1100 instance with the same driver; the results are identical. I tried Intel oneAPI toolkit versions 2024.0.2, 2024.2.0, 2024.2.1, and 2025.0.0, and all show the same behavior.

What is interesting is that when I launch a job in the training/learning section (JupyterLab) of the Tiber devcloud, there are no issues at all:

 

$ ./program.x
Total failed allocations: 0

 

The only differences I see between the private PVC1100 instance and the training JupyterLab are the Intel toolkit version (2024.0.2.20231213 vs. 2024.2.1.20240711), the GPU driver version (1.3.27191 vs. 1.3.28202), and the ext_intel_free_memory aspect on the GPUs (the private instance does not report it, the JupyterLab training instance does).

I tried several toolkit versions on the PVC1550 instance and there was no difference. I would not expect a different GPU driver to affect host allocations, and the same goes for the free-memory aspect.
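
For reference, this is roughly how I compare the two environments from within SYCL itself (a small sketch; the free_memory query comes from the sycl_ext_intel_device_info extension and is only valid when the device reports the aspect):

#include <cstdio>
#include <sycl/sycl.hpp>

int main()
{
    sycl::device d(sycl::gpu_selector_v);

    // driver version as reported by the backend
    printf("Driver: %s\n", d.get_info<sycl::info::device::driver_version>().c_str());

    // free memory is only queryable if the device reports the aspect
    if(d.has(sycl::aspect::ext_intel_free_memory))
    {
        auto free_mem = d.get_info<sycl::ext::intel::info::device::free_memory>();
        printf("ext_intel_free_memory: %llu B\n", (unsigned long long)free_mem);
    }
    else
    {
        printf("Device does not report aspect ext_intel_free_memory\n");
    }

    return 0;
}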

What is going on here?

Is this expected? Is anyone else able to reproduce it?

I would appreciate any help.

Thanks,

Jakub

1 Reply
JakubH
Novice

Forgot to mention: the example code uses a GPU sycl::device. If I use the CPU as the device instead, there are no issues.
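
That is, the only change with respect to the code above is the device selector:

    sycl::device d(sycl::cpu_selector_v); // instead of sycl::gpu_selector_v
    sycl::queue q(d);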
