Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

Failed to use mdspan in SYCL kernel code

wdx04
Beginner
1,085 Views

Hi,

 

I'm trying to use the C++23 mdspan class with USM in SYCL kernel code.

I use the mdspan reference implementation by kokkos:

kokkos/mdspan: Reference implementation of mdspan targeting C++23 (github.com)

Here is my test code, which simply adds two 2D float matrices:

#include <experimental/mdspan>
#include <CL/sycl.hpp>

using namespace sycl;
using namespace std::experimental;

void matrix_add(queue& q, const mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>>& mat1,
    const mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>>& mat2, 
    const mdspan<float, extents<int, std::dynamic_extent, std::dynamic_extent>>& mat3)
{
    q.submit([&](handler& cgh) {
        cgh.parallel_for(range<2>(mat1.extent(0), mat1.extent(1)), [=](id<2> idx) {
            int row = idx[0];
            int col = idx[1];
            float f1 = mat1(row, col);
            float f2 = mat2(row, col);
            mat3(row, col) = f1 + f2;
            });
        });
}

int main(int argc, char* argv[])
{
    queue q(gpu_selector_v);
    float* f1 = (float*)sycl::malloc_shared(sizeof(float) * 9, q);
    float* f2 = (float*)sycl::malloc_shared(sizeof(float) * 9, q);
    float* f3 = (float*)sycl::malloc_shared(sizeof(float) * 9, q);
    for (int i = 0; i < 9; i++)
    {
        f1[i] = float(i);
        f2[i] = float(i + 1);
    }
    mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>> f1m(f1, std::array<int, 2>{3, 3});
    mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>> f2m(f2, std::array<int, 2>{3, 3});
    mdspan<float, extents<int, std::dynamic_extent, std::dynamic_extent>> f3m(f3, std::array<int, 2>{3, 3});
    matrix_add(q, f1m, f2m, f3m);
    return 0;
}

When compiling the code with the latest DPC++ 2023.2.1 compiler, I got different errors with different C++ language standard settings:

With /std:c++17 or /std:c++20: 

1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(686,5): : error : static assertion failed due to requirement '!KernelHasName || sizeof (KernelFunc) == KernelInfo<sycl::detail::RoundedRangeKernel<sycl::item<2, true>, 2, (lambda at testdpc3.cpp:12:68)>>::getKernelSize()': Unexpected kernel lambda size. This can be caused by an external host compiler producing a lambda with an unexpected layout. This is a limitation of the compiler.In many cases the difference is related to capturing constexpr variables. In such cases removing constexpr specifier aligns the captures between the host compiler and the device compiler.
1>In case of MSVC, passing -fsycl-host-compiler-options='/std:c++latest' might also help.
1>    static_assert(
1>    ^
1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(1031,7): note: in instantiation of function template specialization 'sycl::handler::StoreLambda<sycl::detail::RoundedRangeKernel<sycl::item<2>, 2, (lambda at testdpc3.cpp:12:68)>, sycl::detail::RoundedRangeKernel<sycl::item<2>, 2, (lambda at testdpc3.cpp:12:68)>, 2, sycl::item<2>>' requested here
1>      StoreLambda<KName, decltype(Wrapper), Dims, TransformedArgType>(
1>      ^
1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(2072,5): note: in instantiation of function template specialization 'sycl::handler::parallel_for_lambda_impl<sycl::detail::auto_name, (lambda at testdpc3.cpp:12:68), 2, sycl::ext::oneapi::experimental::properties<std::tuple<>>>' requested here
1>    parallel_for_lambda_impl<KernelName, KernelType, 2, PropertiesT>(
1>    ^
1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(2112,5): note: in instantiation of function template specialization 'sycl::handler::parallel_for<sycl::detail::auto_name, (lambda at testdpc3.cpp:12:68), sycl::ext::oneapi::experimental::properties<std::tuple<>>>' requested here
1>    parallel_for<KernelName>(
1>    ^
1>testdpc3.cpp(12,13): note: in instantiation of function template specialization 'sycl::handler::parallel_for<sycl::detail::auto_name, 2, (lambda at testdpc3.cpp:12:68)>' requested here
1>        cgh.parallel_for(range<2>(mat1.extent(0), mat1.extent(1)), [=](id<2> idx) {
1>            ^
1>C:\PROGRA~2\Intel\oneAPI\compiler\20232~1.1\windows\bin-llvm\..\include\sycl/handler.hpp(687,24): note: expression evaluates to 'false || false'
1>        !KernelHasName || sizeof(KernelFunc) == KI::getKernelSize(),
1>        ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With /std:c++latest:

1>testdpc3.cpp(15,24): : error : type 'const mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>>' does not provide a call operator
1>            float f1 = mat1(row, col);
1>                       ^~~~
1>testdpc3.cpp(16,24): : error : type 'const mdspan<const float, extents<int, std::dynamic_extent, std::dynamic_extent>>' does not provide a call operator
1>            float f2 = mat2(row, col);
1>                       ^~~~
1>testdpc3.cpp(17,13): : error : type 'const mdspan<float, extents<int, std::dynamic_extent, std::dynamic_extent>>' does not provide a call operator
1>            mat3(row, col) = f1 + f2;

A simple mdspan program that does not use SYCL compiles in C++17 or C++20 mode, but fails to compile in C++latest mode with the same errors.

Please help me with this.

Thanks.

5 Replies
SeshaP_Intel
Moderator
1,031 Views

Hi,


Thank you for posting in Intel Communities.


We were able to reproduce your issue. We are working on it internally. 

We will get back to you soon.


Thanks and Regards,

Pendyala Sesha Srinivas


SeshaP_Intel
Moderator
875 Views

Hi,


Thanks for your patience.


The implementation details of mdspan differ depending on the C++ standard version. For C++20 and earlier, mdspan provides element access through the call operator "()"; the bracket operator "[]" is only implemented for the latest standard, i.e. c++23/c++2b/c++latest. On Windows, after changing the code accordingly, the "call operator" error goes away, but the same "assertion" error is thrown instead.


The error on Windows occurs because the size of the kernel object differs between host and device: it is 72 on the device but 48 on the host. This in turn is caused by the _compressed_pair template defined in mdspan/include/experimental/_p0009_bits/compressed_pair.hpp.


On the device, the template selected is the one with the [[no_unique_address]] data members; on the host, three more versions are defined. Commenting out the three other templates seems to get rid of the errors.

We cannot have different layouts on the host and the device; that would cause run-time errors. The assertion in the header file is there to prevent that.


We need the kernel on both host and device because the lambda object passed to parallel_for is converted into an OpenCL kernel on the device.

On the host side, we describe the data layout of the lambda object so the runtime library can copy its contents to the device.

On the device side, we invoke this kernel function. The host side therefore needs to see the same layout as the device.


This error will not appear on Linux, because the object models on Windows and Linux are different. On Windows, we see a difference in size and issue an error early. If the size matches but the layout differs, we currently cannot detect it, but it would lead to run-time errors.


Could you please confirm whether we can close this thread from our end?


Thanks and Regards,

Pendyala Sesha Srinivas


0 Kudos
wdx04
Beginner
857 Views

Hi,

 

You can close this thread now.

 

Thanks.

 

SeshaP_Intel
Moderator
851 Views

Hi,


Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks and Regards,

Pendyala Sesha Srinivas


crtrott
Beginner
665 Views

Hi,

 

As the author of mdspan (and the contributor of mdspan to LLVM 17), I think this is not resolved and points to a much bigger underlying problem in the compiler. If our mdspan implementation uses a different implementation in device and host code, then that means that fundamental macros are defined differently on Windows for host and device, for example whether no_unique_address is available. If I understand the problem correctly, wouldn't any code that does this:

template<class ... T>
struct Foo {
  #if ((__has_cpp_attribute(no_unique_address) >= 201803L))
  [[no_unique_address]] Bar<T...> a;
  #else
  Bar<T...> a;
  #endif
  double b;
};
  

 not work if captured in a lambda for parallel execution?

 

If my understanding is correct, then this is a serious compiler issue, not a resolved one ...
