Solved: Intel oneAPI DPCT can't convert from CUDA 1-Channel texture to DPCT 4-Channel image_wrapper

ManuelCostanzo22 · ‎09-14-2021

Hi everyone !!

I have the following problem:

I have a CUDA code that uses texture, for example:

texture<unsigned char, 2, cudaReadModeElementType> text;
unsigned char a = tex2D(text, cx + lx, cy);

So when I use DPCT, I have this output:

/*
DPCT1059:12: SYCL only supports 4-channel image format. Adjust the code.
*/
dpct::image_wrapper<unsigned char, 2> text;

SO, I changed the declaration with this:

dpct::image_wrapper<sycl::uchar4, 2> text_wrp;

And after launch kernel:
dpct::image_accessor_ext<sycl::uchar4, 2> text

My problem is that I don't know how to create an equivalent when reading. I have this:

sycl::uchar4 a = text.read(cx + lx, cy);

But I don't know how to get the same unsigned char from CUDA in my DPC code. Do I have to modify the indexes on the read ? I'm really lost.

Thank you so much in advance !!

Alina_S_Intel · ‎11-08-2021

Thank you for your patience while we were working on this question. We found a bug in DPC++ Compatibility Tool and it took more time to investigate it and find a workaround than we expected.

The issue is with the following line:

--------------------------------------------------------------------------textureD.attach(d, width, height, pitch, channel);

--------------------------------------------------------------------------

You need to replace it with:

--------------------------------------------------------------------------

textureD.attach(d, width * sizeof(int), height, pitch, channel);

--------------------------------------------------------------------------

We expect that the fix for this issue will be included in the next release.

Since I'm not able to attach files, I have to post the code here:

--------------------------------------------------------------------------

#include <CL/sycl.hpp>

#include <dpct/dpct.hpp>

#include <stdio.h>

dpct::image_wrapper<cl::sycl::uint4, 2> textureD;

void kernel2(int *d, int *dOutput, dpct::image_accessor_ext<cl::sycl::uint4, 2> acc, int width, int height){

for(int row = 0; row < height; row++) {

for(int col = 0; col < width; col++){

cl::sycl::uint4 tex_data;

tex_data = acc.read(col, row) ;

dOutput[row * width + col] = tex_data[0] ;

}

int main(){

dpct::device_ext &dev_ct1 = dpct::get_current_device();

sycl::queue &q_ct1 = dev_ct1.default_queue();

int *h;

int width = 10;

int height = 10;

int size = width * height;

h = (int *)sycl::malloc_host(size* sizeof(int), q_ct1);

int i = 0;

for(int row = 0; row < height; row++) {

for(int col = 0; col < width; col++){

h[row * width + col] = i;

i++;

}

int *d;

size_t pitch;

d = (int *)dpct::dpct_malloc(pitch, width * sizeof(int), height);

dpct::dpct_memcpy(d, pitch, h, width * sizeof(int), width * sizeof(int),

height, dpct::host_to_device);

dpct::image_channel channel = dpct::image_channel::create<int>();

//textureD.attach(d, width, height, pitch, channel);

textureD.attach(d, width * sizeof(int), height, pitch, channel); // <- this line contains a workaround for DPCT bug. Fix for this issues is expected to be fixed in the next release so please, try to remove '* sizeof(int)' when the release will be available.

int *hOutput;

hOutput = (int *) sycl::malloc_host(size* sizeof(int), q_ct1);

int *dOutput;

dOutput = (int *)sycl::malloc_device(size* sizeof(int), q_ct1);

q_ct1.submit([&](sycl::handler &cgh) {

auto textureD_acc = textureD.get_access(cgh);

auto textureD_smpl = textureD.get_sampler();

auto out = sycl::stream(1024, 768, cgh);

cgh.single_task<class dpct_single_kernel>([=] {

kernel2(d, dOutput,

dpct::image_accessor_ext<cl::sycl::uint4, 2>(textureD_smpl, textureD_acc),

width, height);

});

}).wait();

q_ct1.memcpy(hOutput, dOutput, size * sizeof(int)).wait();

for(int row = 0; row < height; row++){

for(int col = 0; col < width; col++){

printf("%d ", h[row * width + col]);

}

printf("\n");

}

printf("\n");

for(int row = 0; row < height; row++){

for(int col = 0; col < width; col++){

printf("%d ", hOutput[row * width + col]);

}

printf("\n");

}

getchar();

sycl::free(h, q_ct1);

sycl::free(d, q_ct1);

sycl::free(dOutput, q_ct1);

sycl::free(hOutput, q_ct1);

return 0;

}

--------------------------------------------------------------------------

We will no longer respond to this thread.

If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Thanks,

Alina

View solution in original post

NoorjahanSk_Intel · ‎09-15-2021

Hi,

Thanks for reaching out to us.

Could you please provide us a sample reproducer of both CUDA, DPCPP so that we can work on it from our end.

And also please do let us know your environment details

OS & version

compiler version.

Thanks & Regards

Noorjahan.

ManuelCostanzo22 · ‎09-15-2021

Hi @NoorjahanSk_Intel Thank you so much for reply.

Here I attach two files: the original CUDA code and the original DPCT output. Could you help me to finish the migration ?

Thank you again !

ManuelCostanzo22 · ‎09-15-2021

@NoorjahanSk_Intel Here I have a code that compiles, but I get a wrong output (is not the same as CUDA output).

NoorjahanSk_Intel · ‎09-20-2021

Hi,

We are working on it. We will get back to you soon.

Thanks & Regards

Noorjahan

ManuelCostanzo22 · ‎09-20-2021

Hi !

Thank you so much ! I will be waiting .

Alina_S_Intel · ‎09-24-2021

Thanks for your patience!

Does the solution in DPC++ Developer Guide work for you?

https://software.intel.com/content/www/us/en/develop/documentation/intel-dpcpp-compatibility-tool-user-guide/top/diagnostics-reference/dpct1059.html

The difference is that DevGuide recommends using single_task instead of parallel ND range.

ManuelCostanzo22 · ‎09-24-2021

Hi @Alina_S_Intel !

No, but maybe I'm doing something wrong. I added an example with the native CUDA code and the oneAPI code migrated. Can you modify the oneAPI code to get the same result as CUDA, please ?

Thank you so much.

ManuelCostanzo22 · ‎09-24-2021

@Alina_S_Intel Here you have the example attached again ! And if you like, here you have de CUDA code that works. I want to migrate this code.

ManuelCostanzo22 · ‎09-28-2021

@NoorjahanSk_Intel @Alina_S_Intel Hi, sorry, any update on this? Thank you.

Alina_S_Intel · ‎10-01-2021

Thank you for your patience. Unfortunately, I am not able to help you with the full migration because the CUDA code is incorrect.

1) The CUDA code gives 9 starting the second row:

0 1 2 3 4 5 6 7 8 9

9 9 9 9 9 9 9 9 9 9

while initial array is :

0 1 2 3 4 5 6 7 8 9

10 11 12 13 14 15 16 17 18 19

20 21 22 23 24 25 26 27 28 29

30 31 32 33 34 35 36 37 38 39

40 41 42 43 44 45 46 47 48 49

50 51 52 53 54 55 56 57 58 59

60 61 62 63 64 65 66 67 68 69

70 71 72 73 74 75 76 77 78 79

80 81 82 83 84 85 86 87 88 89

90 91 92 93 94 95 96 97 98 99

It should return the same array.

2) Speaking about migrated code. You need to change :

sycl::malloc_host<int><int>(size, q_ct1);

to

(int *)sycl::malloc_host(size* sizeof(int), q_ct1);

for each SYCL malloc function. It helps you to avoid the following errors:

------------------------------------------------------------------------------------

$ dpcpp test.dp.cpp

test.dp.cpp:56:39: error: expected '(' for function-style cast or type construction

h = sycl::malloc_host<int><int>(size, q_ct1);

~~~^

test.dp.cpp:56:41: warning: expression result unused [-Wunused-value]

h = sycl::malloc_host<int><int>(size, q_ct1);

^~~~

test.dp.cpp:84:20: error: no matching function for call to 'dpct_malloc'

d = (int *)dpct::dpct_malloc<int>(pitch, width * sizeof(int), height);

^~~~~~~~~~~~~~~~~~~~~~

/nfs/pdx/disks/cts2/tools/oneapi/2021.3.0/dpcpp-ct/2021.3.0/include/dpct/memory.hpp:576:21: note: candidate function template not viable: requires at most 2 arguments, but 3 were provided

static inline void *dpct_malloc(T num_bytes,

^

test.dp.cpp:106:45: error: expected '(' for function-style cast or type construction

hOutput = sycl::malloc_host<int><int>(size, q_ct1);

~~~^

test.dp.cpp:106:47: warning: expression result unused [-Wunused-value]

hOutput = sycl::malloc_host<int><int>(size, q_ct1);

^~~~

test.dp.cpp:110:47: error: expected '(' for function-style cast or type construction

dOutput = sycl::malloc_device<int><int>(size, q_ct1);

~~~^

test.dp.cpp:110:49: warning: expression result unused [-Wunused-value]

dOutput = sycl::malloc_device<int><int>(size, q_ct1);

------------------------------------------------------------------------------------

3) When you are done with errors, you can to go back to DPCT1059:2.

If you try to run the application 'as is', you'll see the following error:

------------------------------------------

./a.out

die: piMemImageCreate: unsupported image format layout

libc++abi: terminating

Aborted (core dumped)

------------------------------------------

It happens because our default LevelZero backend does not support this feature yet. Please, use SYCL_BE=PI_OPENCL (deprecated variable) or SYCL_DEVICE_FILTER=opencl:gpu (the replacement for SYCL_BE and SYCL_DEVICE_TYPE) to run your application.

ManuelCostanzo22 · ‎10-01-2021

Thank you !

Yes, I know that the CUDA code returns all 9, but I expected to get the same result in oneAPI, even if the CUDA code doesn't make sense. Was an example code.

So I can't do anything to solve the 4-Channel problem? We cant modify the 1-Channel to transform it to 4-Channel and getting the same CUDA result (with the same 9)?

Alina_S_Intel · ‎11-08-2021

Thank you for your patience while we were working on this question. We found a bug in DPC++ Compatibility Tool and it took more time to investigate it and find a workaround than we expected.

The issue is with the following line:

--------------------------------------------------------------------------textureD.attach(d, width, height, pitch, channel);

--------------------------------------------------------------------------

You need to replace it with:

--------------------------------------------------------------------------

textureD.attach(d, width * sizeof(int), height, pitch, channel);

--------------------------------------------------------------------------

We expect that the fix for this issue will be included in the next release.

Since I'm not able to attach files, I have to post the code here:

--------------------------------------------------------------------------

#include <CL/sycl.hpp>

#include <dpct/dpct.hpp>

#include <stdio.h>

dpct::image_wrapper<cl::sycl::uint4, 2> textureD;

void kernel2(int *d, int *dOutput, dpct::image_accessor_ext<cl::sycl::uint4, 2> acc, int width, int height){

for(int row = 0; row < height; row++) {

for(int col = 0; col < width; col++){

cl::sycl::uint4 tex_data;

tex_data = acc.read(col, row) ;

dOutput[row * width + col] = tex_data[0] ;

}

int main(){

dpct::device_ext &dev_ct1 = dpct::get_current_device();

sycl::queue &q_ct1 = dev_ct1.default_queue();

int *h;

int width = 10;

int height = 10;

int size = width * height;

h = (int *)sycl::malloc_host(size* sizeof(int), q_ct1);

int i = 0;

for(int row = 0; row < height; row++) {

for(int col = 0; col < width; col++){

h[row * width + col] = i;

i++;

}

int *d;

size_t pitch;

d = (int *)dpct::dpct_malloc(pitch, width * sizeof(int), height);

dpct::dpct_memcpy(d, pitch, h, width * sizeof(int), width * sizeof(int),

height, dpct::host_to_device);

dpct::image_channel channel = dpct::image_channel::create<int>();

//textureD.attach(d, width, height, pitch, channel);

textureD.attach(d, width * sizeof(int), height, pitch, channel); // <- this line contains a workaround for DPCT bug. Fix for this issues is expected to be fixed in the next release so please, try to remove '* sizeof(int)' when the release will be available.

int *hOutput;

hOutput = (int *) sycl::malloc_host(size* sizeof(int), q_ct1);

int *dOutput;

dOutput = (int *)sycl::malloc_device(size* sizeof(int), q_ct1);

q_ct1.submit([&](sycl::handler &cgh) {

auto textureD_acc = textureD.get_access(cgh);

auto textureD_smpl = textureD.get_sampler();

auto out = sycl::stream(1024, 768, cgh);

cgh.single_task<class dpct_single_kernel>([=] {

kernel2(d, dOutput,

dpct::image_accessor_ext<cl::sycl::uint4, 2>(textureD_smpl, textureD_acc),

width, height);

});

}).wait();

q_ct1.memcpy(hOutput, dOutput, size * sizeof(int)).wait();

for(int row = 0; row < height; row++){

for(int col = 0; col < width; col++){

printf("%d ", h[row * width + col]);

}

printf("\n");

}

printf("\n");

for(int row = 0; row < height; row++){

for(int col = 0; col < width; col++){

printf("%d ", hOutput[row * width + col]);

}

printf("\n");

}

getchar();

sycl::free(h, q_ct1);

sycl::free(d, q_ct1);

sycl::free(dOutput, q_ct1);

sycl::free(hOutput, q_ct1);

return 0;

}

--------------------------------------------------------------------------

We will no longer respond to this thread.

If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Thanks,

Alina

ManuelCostanzo22 · ‎11-08-2021

@Alina_S_Intel Hi, thank you for reply. Sorry, but you are using

dpct::image_channel channel = dpct::image_channel::create<int>();

But 1-Channel doesn't work for me, I have the runtime error:

die: piMemImageCreate: unsupported image format layout

libc++abi: terminating
Aborted

However, if I change int to int4, it doesn't work either

Alina_S_Intel · ‎11-08-2021

3) When you are done with errors, you can to go back to DPCT1059:2.

If you try to run the application 'as is', you'll see the following error:

------------------------------------------

./a.out

die: piMemImageCreate: unsupported image format layout

libc++abi: terminating

Aborted (core dumped)

------------------------------------------

It happens because our default LevelZero backend does not support this feature yet. Please, use SYCL_BE=PI_OPENCL (deprecated variable) or SYCL_DEVICE_FILTER=opencl:gpu (the replacement for SYCL_BE and SYCL_DEVICE_TYPE) to run your application.

ManuelCostanzo22 · ‎11-08-2021

Thank you so much! You are helping me a lot.

I've just created a new topic with a working CUDA code that I want to migrate, but I can't. Maybe you find the problem there. Thank you so much again.

ManuelCostanzo22 · ‎11-08-2021

Sorry, using SYCL_DEVICE_FILTER=opencl:gpu I get CL_DEVICE_NOT_FOUND. I'm using a NVIDIA GeForce RTX 2070 gpu. Do you know why ?