oneAPI runtime causes segmentation fault when using joint_matrix calls

Vaibhav_S_Intel · ‎01-28-2025

I have an Intel NUC (NUC12SNKi72) with an onboard A770 GPU.

CPU - ADL i7 12700H

RAM 64GB

GPU A770

I have installed the oneAPI 2025 Base Toolkit under WSL Ubuntu on my Windows machine. It detects my A770 GPU, as can be seen in the sycl-ls output:

[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Graphics [0x5690] OpenCL 3.0 NEO [23.17.26241.33]

[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Graphics [0x46a6] OpenCL 3.0 NEO [23.17.26241.33]

When I execute any GPU kernel that calls the joint_matrix* APIs, it causes a segfault. This is sample code for XMX using joint_matrix_fill:

#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/matrix/matrix.hpp>
#include <iostream>

using namespace sycl::ext::oneapi::experimental::matrix;

constexpr size_t TM = 8; // Tile dimensions
constexpr size_t TN = 8;

void test_joint_matrix_fill(sycl::queue &q, size_t SG_SZ) {
std::cout << "Testing joint_matrix_fill with subgroup size: " << SG_SZ << "\n";

try {
sycl::buffer<float, 2> bufC(sycl::range<2>(TM, TN)); // Buffer for storing results

if (SG_SZ == 8) {
q.submit([&](sycl::handler &h) {
auto accC = bufC.get_access<sycl::access::mode::write>(h);

h.parallel_for(
sycl::nd_range<2>({1, 8}, {1, 8}),
[=](sycl::nd_item<2> it) [[intel::reqd_sub_group_size(8)]] {
sycl::sub_group sg = it.get_sub_group();

joint_matrix<sycl::sub_group, float, use::accumulator, TM, TN> sub_acc;

// Step 1: Initialize joint_matrix with a constant value
joint_matrix_fill(sg, sub_acc, 1.0f);

// Step 2: Store the joint_matrix result back to global memory
joint_matrix_store(sg, sub_acc, accC.get_pointer(), TN, layout::row_major);
});
}).wait();
} else if (SG_SZ == 16) {
q.submit([&](sycl::handler &h) {
auto accC = bufC.get_access<sycl::access::mode::write>(h);

h.parallel_for(
sycl::nd_range<2>({1, 16}, {1, 16}),
[=](sycl::nd_item<2> it) [[intel::reqd_sub_group_size(16)]] {
sycl::sub_group sg = it.get_sub_group();

joint_matrix<sycl::sub_group, float, use::accumulator, TM, TN> sub_acc;

// Step 1: Initialize joint_matrix with a constant value
joint_matrix_fill(sg, sub_acc, 1.0f);

// Step 2: Store the joint_matrix result back to global memory
joint_matrix_store(sg, sub_acc, accC.get_pointer(), TN, layout::row_major);
});
}).wait();
} else if (SG_SZ == 32) {
q.submit([&](sycl::handler &h) {
auto accC = bufC.get_access<sycl::access::mode::write>(h);

h.parallel_for(
sycl::nd_range<2>({1, 32}, {1, 32}),
[=](sycl::nd_item<2> it) [[intel::reqd_sub_group_size(32)]] {
sycl::sub_group sg = it.get_sub_group();

joint_matrix<sycl::sub_group, float, use::accumulator, TM, TN> sub_acc;

// Step 1: Initialize joint_matrix with a constant value
joint_matrix_fill(sg, sub_acc, 1.0f);

// Step 2: Store the joint_matrix result back to global memory
joint_matrix_store(sg, sub_acc, accC.get_pointer(), TN, layout::row_major);
});
}).wait();
} else {
std::cerr << "Unsupported subgroup size: " << SG_SZ << "\n";
return;
}

// Retrieve and print results
auto hostC = bufC.get_access<sycl::access::mode::read>();
std::cout << "Resultant matrix C:\n";
for (size_t i = 0; i < TM; i++) {
for (size_t j = 0; j < TN; j++) {
std::cout << hostC[i][j] << " ";
}
std::cout << "\n";
}
} catch (sycl::exception const &e) {
std::cerr << "SYCL exception caught: " << e.what() << "\n";
}
}

int main() {
try {
// Initialize the SYCL queue
sycl::queue q{sycl::default_selector_v}; // SYCL 2020 selector; the default_selector class is deprecated
std::cout << "Running on device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

// Query supported subgroup sizes
auto subgroup_sizes = q.get_device().get_info<sycl::info::device::sub_group_sizes>();
std::cout << "Supported subgroup sizes: ";
for (const auto &size : subgroup_sizes) {
std::cout << size << " ";
}
std::cout << "\n";

// Test with each supported subgroup size
for (const auto &SG_SZ : subgroup_sizes) {
test_joint_matrix_fill(q, SG_SZ);
}

// Optional: Force CPU execution to isolate GPU-specific issues
std::cout << "Testing on CPU...\n";
sycl::queue cpu_queue{sycl::cpu_selector_v}; // SYCL 2020 selector; the cpu_selector class is deprecated
test_joint_matrix_fill(cpu_queue, 8); // Default subgroup size for CPU

} catch (sycl::exception const &e) {
std::cerr << "SYCL exception caught during initialization: " << e.what() << "\n";
return 1;
}

return 0;
}

Compile this code with icpx -fsycl <file> -o <output>

Relevant frames from the GDB backtrace:

#35 0x0000000000404888 in test_joint_matrix_fill (q=..., SG_SZ=8) at xmx.cpp:17

#34 0x0000000000404d61 in sycl::_V1::queue::submit<test_joint_matrix_fill(...)>::submit(...) at sycl/queue.hpp:359

#8 0x00007fffd9e10973 in ?? () from /lib/x86_64-linux-gnu/libigc.so.1

#14 0x00007fffe4edfc0f in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so

#17 0x00007ffff4edf47c in urProgramBuild () from /home/vaibhav/intel/oneapi/compiler/2025.0/lib/libur_adapter_opencl.so.0

libigc.so.1: Intel Graphics Compiler (IGC), which compiles kernels for Intel GPUs.
libigdrcl.so: Intel GPU runtime library responsible for managing GPU tasks.
urProgramBuild: Part of the Unified Runtime (UR) that manages kernel program building.

The crash occurs during kernel compilation or execution by the GPU runtime.

I need your team's support to check this issue with the oneAPI runtime on Linux on a NUC with an A770. I tried this on another NUC with an A770, and the issue is reproducible there as well.