Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

Failed to use mdspan in SYCL kernel code

wdx04
Beginner
1,085 Views

Hi,

 

I'm trying to use the C++23 mdspan class with USM in SYCL kernel code.

I use the mdspan reference implementation by kokkos:

kokkos/mdspan: Reference implementation of mdspan targeting C++23 (github.com)

Here is my test code, which simply adds two 2D float matrices:

#include <experimental/mdspan>
#include <CL/sycl.hpp>

using namespace sycl;
using namespace std::experimental;

void matrix_add(queue& q, const mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>>& mat1,
    const mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>>& mat2, 
    const mdspan<float, extents<int, std::dynamic_extent, std::dynamic_extent>>& mat3)
{
    q.submit([&](handler& cgh) {
        cgh.parallel_for(range<2>(mat1.extent(0), mat1.extent(1)), [=](id<2> idx) {
            int row = idx[0];
            int col = idx[1];
            float f1 = mat1(row, col);
            float f2 = mat2(row, col);
            mat3(row, col) = f1 + f2;
            });
        });
}

int main(int argc, char* argv[])
{
    queue q(gpu_selector_v);
    float* f1 = (float*)sycl::malloc_shared(sizeof(float) * 9, q);
    float* f2 = (float*)sycl::malloc_shared(sizeof(float) * 9, q);
    float* f3 = (float*)sycl::malloc_shared(sizeof(float) * 9, q);
    for (int i = 0; i < 9; i++)
    {
        f1[i] = float(i);
        f2[i] = float(i + 1);
    }
    mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>> f1m(f1, std::array<int, 2>{3, 3});
    mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>> f2m(f2, std::array<int, 2>{3, 3});
    mdspan<float, extents<int, std::dynamic_extent, std::dynamic_extent>> f3m(f3, std::array<int, 2>{3, 3});
    matrix_add(q, f1m, f2m, f3m);
    return 0;
}

When compiling the code with the latest DPC++ 2023.2.1 compiler, I got different errors with different C++ language standard settings:

With /std:c++17 or /std:c++20: 

1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(686,5): : error : static assertion failed due to requirement '!KernelHasName || sizeof (KernelFunc) == KernelInfo<sycl::detail::RoundedRangeKernel<sycl::item<2, true>, 2, (lambda at testdpc3.cpp:12:68)>>::getKernelSize()': Unexpected kernel lambda size. This can be caused by an external host compiler producing a lambda with an unexpected layout. This is a limitation of the compiler.In many cases the difference is related to capturing constexpr variables. In such cases removing constexpr specifier aligns the captures between the host compiler and the device compiler.
1>In case of MSVC, passing -fsycl-host-compiler-options='/std:c++latest' might also help.
1>    static_assert(
1>    ^
1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(1031,7): note: in instantiation of function template specialization 'sycl::handler::StoreLambda<sycl::detail::RoundedRangeKernel<sycl::item<2>, 2, (lambda at testdpc3.cpp:12:68)>, sycl::detail::RoundedRangeKernel<sycl::item<2>, 2, (lambda at testdpc3.cpp:12:68)>, 2, sycl::item<2>>' requested here
1>      StoreLambda<KName, decltype(Wrapper), Dims, TransformedArgType>(
1>      ^
1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(2072,5): note: in instantiation of function template specialization 'sycl::handler::parallel_for_lambda_impl<sycl::detail::auto_name, (lambda at testdpc3.cpp:12:68), 2, sycl::ext::oneapi::experimental::properties<std::tuple<>>>' requested here
1>    parallel_for_lambda_impl<KernelName, KernelType, 2, PropertiesT>(
1>    ^
1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(2112,5): note: in instantiation of function template specialization 'sycl::handler::parallel_for<sycl::detail::auto_name, (lambda at testdpc3.cpp:12:68), sycl::ext::oneapi::experimental::properties<std::tuple<>>>' requested here
1>    parallel_for<KernelName>(
1>    ^
1>testdpc3.cpp(12,13): note: in instantiation of function template specialization 'sycl::handler::parallel_for<sycl::detail::auto_name, 2, (lambda at testdpc3.cpp:12:68)>' requested here
1>        cgh.parallel_for(range<2>(mat1.extent(0), mat1.extent(1)), [=](id<2> idx) {
1>            ^
1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(687,24): note: expression evaluates to 'false || false'
1>        !KernelHasName || sizeof(KernelFunc) == KI::getKernelSize(),
1>        ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With /std:c++latest:

1>testdpc3.cpp(15,24): : error : type 'const mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>>' does not provide a call operator
1>            float f1 = mat1(row, col);
1>                       ^~~~
1>testdpc3.cpp(16,24): : error : type 'const mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>>' does not provide a call operator
1>            float f2 = mat2(row, col);
1>                       ^~~~
1>testdpc3.cpp(17,13): : error : type 'const mdspan<float, extents<int, std::dynamic_extent, std::dynamic_extent>>' does not provide a call operator
1>            mat3(row, col) = f1 + f2;

A simple mdspan program that does not use SYCL compiles in C++17 or C++20 mode, but fails to compile in C++latest mode with the same errors.

Please help me with this.

Thanks.

5 Replies
SeshaP_Intel
Moderator
1,031 Views

Hi,


Thank you for posting in Intel Communities.


We were able to reproduce your issue. We are working on it internally. 

We will get back to you soon.


Thanks and Regards,

Pendyala Sesha Srinivas


SeshaP_Intel
Moderator
875 Views

Hi,


Thanks for your patience.


The implementation details of mdspan differ depending on the C++ standard version. For C++20 and earlier, mdspan provides element access through the call operator "()"; the bracket operator "[]" is only implemented for the latest standard, i.e. c++23/c++2b/c++latest. On Windows, after changing the code accordingly, the "call operator" error goes away, but the same "assertion" error is thrown instead.


The error on Windows occurs because the size of the kernel object differs between host and device: it is 72 on the device but 48 on the host. This in turn is caused by the _compressed_pair template defined in mdspan/include/experimental/_p0009_bits/compressed_pair.hpp.


On the device, the template selected is the one with the [[no_unique_address]] data members; on the host, three more versions are defined. Commenting out the three other templates seems to get rid of the errors.

We cannot have different layouts on the host and the device; that would cause run-time errors. The assertion in the header file is there to prevent that.


We need the kernel on both host and device because the lambda object passed to parallel_for is converted into an OpenCL kernel on the device.

On the host side, we describe the data layout of the lambda object so the runtime library can copy its contents to the device.

On the device side, we invoke this kernel function. The host side therefore needs to see the same layout as the device.


This error will not appear on Linux, because the object models on Windows and Linux are different. On Windows, we see a difference in size and issue an error early. If the size matches but the layout differs, we currently cannot detect it, but it would lead to run-time errors.


Could you please confirm whether we can close this thread from our end?


Thanks and Regards,

Pendyala Sesha Srinivas


0 Kudos
wdx04
Beginner
857 Views

Hi,

 

You can close this thread now.

 

Thanks.

 

SeshaP_Intel
Moderator
851 Views

Hi,


Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks and Regards,

Pendyala Sesha Srinivas


crtrott
Beginner
665 Views

Hi,

 

As the author of mdspan (and the contributor of mdspan to LLVM 17), I think this is not resolved and points to a much bigger underlying problem in the compiler. If our mdspan implementation uses a different implementation in device and host code, then that means that fundamental macros are defined differently on Windows for host and device, for example whether no_unique_address is available. If I understand the problem correctly, wouldn't any code that does this:

template<class ... T>
struct Foo {
  #if ((__has_cpp_attribute(no_unique_address) >= 201803L))
  [[no_unique_address]] Bar<T...> a;
  #else
  Bar<T...> a;
  #endif
  double b;
};
  

 not work if captured in a lambda for parallel execution?

 

If my understanding is correct, then this is a serious compiler issue, not a resolved one ...
