Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel ICX Compiler , Intel® DPC++ Compatibility Tool, and GDB*

dpct does not translate cudaArray and texture objects

Andreas_Gravgaard_A_
2,109 Views

I can't really find any documentation on the specifics of dpct, so I'm sorry if this is an uninformed question:

I tried the "Intel® DPC++ Compatibility Tool" (dpct) with a CUDA project i have.
following the instructions in the Get Started guide:

dpct -p=compile_commands.json --in-root=../ --out-root=dpct-output

It seems to work well with most things. However the projects uses cudaArrays and textures, which were not translated into SYCL objects. I think the equivalent OpenCL object is the image classes, and it looks like SYCL also has an equivalent class, but I'm very new to SYCL.

An example from the output of dpct:

cudaArray *ctData; // for CT resampling
texture<float,3,cudaReadModeElementType> ctData_tex;

These objects live in global scope, if that makes a difference. (CT as in computed tomography, so it's a 3D image).

Can I do anything to make dpct also translate these objects? Or is the SYCL image class not sufficiently equivalent?

(PS. dpct outputs an awful lot of redundant static_cast<> and sometimes inserts the entire operator overload function, "...::operator*()", instead of just keeping the * for multiplication (making the code look like the output from cppinsights.io). This is easy for me to fix with some Vim tricks, but consider beautifying the output a bit)

Best regards

Andreas

0 Kudos
6 Replies
Pradeep_G_Intel
Employee
2,109 Views

Hi Andreas,

We have also observed the same issue. We will inform this to the concerned team. 

Pradeep.       

0 Kudos
JenniferJ
Moderator
2,109 Views

Thanks for trying out the tool. A jira has filed to track this issue. 

As for the sycl class that is mostly close to what you want, it is the image class. you can refer to the sycl spec for more info about this class at https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf 

It'll be nice to know how it goes for you. 

Thanks,

Jennifer 

0 Kudos
Andreas_Gravgaard_A_
2,109 Views

Thanks Jennifer

I did a bit more work on it today, and got further by simply replacing the cuda textures with cl::sycl::image.
However, the tool had already replaced the cuda unbind texture calls with dpct::dpct_detach_image, which would be fitting if I had replaced the textures with the dpct::dpct_image class - maybe this class better reflects the cuda texture class?
(For now i replaced the detach calls with cl::sycl::free, which may not be the best way, but doesn't produce a compiler error, so it's good enough for getting to the next problem)

The next errors was marked with DPCT1003 1007 and 1004 - related to the textures.
And some related to cuSPARSE and cuRAND, but I guess those would require an equivalent external library compatible with sycl ( like https://github.com/clMathLibraries ), or maybe MKL can cover that?

I'll also try and rewrite those parts manually, but it may take a while before I get the time.

Best regards
Andreas

0 Kudos
JenniferJ
Moderator
2,109 Views

Hello Andreas,

How are you doing? Trying to follow up on how you're doing with regarding the errors (DPCT1003 1007 and 1004 - related to the textures) and cuSPARSE and cuRAND. You're right that MKL does provide those SPARSE & RAND APIs.

Can you provide some details/code snippets about those DPCT errors? 

Thanks,

Jennifer  

0 Kudos
Andreas_Gravgaard_A_
2,109 Views

Hi again,
1003: Was just error-handling (AFAIR). I think DPCT performed as expected.
1007: make_cudaExtent, cudaMalloc3DArray, make_cudaPitchedPtr, cudaMemcpy3D
1004: cudaMalloc3DArray, cudaBindTextureToArray

I got a bit further, but gave up when I noticed that lambdas like this had been generated:

dpct::get_default_queue_wait().submit(
                [&](cl::sycl::handler &cgh) {
                  extern dpct::device_memory<int, 0> nactive;
                  auto nactive_acc_ct1 = nactive.get_access(cgh);
                  extern dpct::device_memory<cl::sycl::float4, 1> x;
                  auto x_acc_ct1 = x.get_access(cgh);
                  extern dpct::device_memory<cl::sycl::float4, 1> vx;
                  auto vx_acc_ct1 = vx.get_access(cgh);
                  extern dpct::device_memory<int, 1> ix;
                  auto ix_acc_ct1 = ix.get_access(cgh);
                  extern dpct::device_memory<int, 0> energybins;
                  auto energybins_acc_ct1 = energybins.get_access(cgh);
                  extern dpct::device_memory<float, 0> energybins_min;
                  auto energybins_min_acc_ct1 = energybins_min.get_access(cgh);
                  extern dpct::device_memory<float, 0> energybins_size;
                  auto energybins_size_acc_ct1 = energybins_size.get_access(cgh);
                  extern dpct::device_memory<bool, 0> ifScoreDose;
                  auto ifScoreDose_acc_ct1 = ifScoreDose.get_access(cgh);
                  extern dpct::device_memory<bool, 0> ifScoreEnergy;
                  auto ifScoreEnergy_acc_ct1 = ifScoreEnergy.get_access(cgh);
                  extern dpct::device_memory<bool, 0> ifScoreLET;
                  auto ifScoreLET_acc_ct1 = ifScoreLET.get_access(cgh);
                  extern dpct::device_memory<bool, 0> ifScoreSpectrum;
                  auto ifScoreSpectrum_acc_ct1 = ifScoreSpectrum.get_access(cgh);
                  extern dpct::device_memory<bool, 0> ifScoreStop;
                  auto ifScoreStop_acc_ct1 = ifScoreStop.get_access(cgh);
                  extern dpct::device_memory<cl::sycl::float3, 0> scoreShifts;
                  auto scoreShifts_acc_ct1 = scoreShifts.get_access(cgh);
                  extern dpct::device_memory<cl::sycl::float3, 0> scoreVoxSize;
                  auto scoreVoxSize_acc_ct1 = scoreVoxSize.get_access(cgh);
                  extern dpct::device_memory<float, 0> scoreVoxVolume;
                  auto scoreVoxVolume_acc_ct1 = scoreVoxVolume.get_access(cgh);
                  extern dpct::device_memory<cl::sycl::int3, 0> scoreVoxN;
                  auto scoreVoxN_acc_ct1 = scoreVoxN.get_access(cgh);
                  extern dpct::device_memory<float, 0> scoreRotX;
                  auto scoreRotX_acc_ct1 = scoreRotX.get_access(cgh);
                  extern dpct::device_memory<bool, 0> ifDoseToWater;
                  auto ifDoseToWater_acc_ct1 = ifDoseToWater.get_access(cgh);
                  extern dpct::device_memory<float, 0> epabs;
                  auto epabs_acc_ct1 = epabs.get_access(cgh);
                  extern dpct::device_memory<float, 0> eemin;
                  auto eemin_acc_ct1 = eemin.get_access(cgh);
                  extern dpct::device_memory<float, 0> stepmax0;
                  auto stepmax0_acc_ct1 = stepmax0.get_access(cgh);
                  extern dpct::device_memory<float, 0> maxEdecay;
                  auto maxEdecay_acc_ct1 = maxEdecay.get_access(cgh);
                  extern dpct::device_memory<float, 0> maxAllowedLET;
                  auto maxAllowedLET_acc_ct1 = maxAllowedLET.get_access(cgh);
                  extern dpct::device_memory<curandState, 1> cuseed;
                  auto cuseed_acc_ct1 = cuseed.get_access(cgh);
                  extern dpct::device_memory<float, 0> dx;
                  auto dx_acc_ct1 = dx.get_access(cgh);
                  extern dpct::device_memory<float, 0> dy;
                  auto dy_acc_ct1 = dy.get_access(cgh);
                  extern dpct::device_memory<float, 0> dz;
                  auto dz_acc_ct1 = dz.get_access(cgh);
                  extern dpct::device_memory<float, 0> idx;
                  auto idx_acc_ct1 = idx.get_access(cgh);
                  extern dpct::device_memory<float, 0> idy;
                  auto idy_acc_ct1 = idy.get_access(cgh);
                  extern dpct::device_memory<float, 0> idz;
                  auto idz_acc_ct1 = idz.get_access(cgh);
                  extern dpct::device_memory<int, 0> Unxvox;
                  auto Unxvox_acc_ct1 = Unxvox.get_access(cgh);
                  extern dpct::device_memory<int, 0> Unyvox;
                  auto Unyvox_acc_ct1 = Unyvox.get_access(cgh);
                  extern dpct::device_memory<int, 0> Unzvox;
                  auto Unzvox_acc_ct1 = Unzvox.get_access(cgh);
                  extern dpct::device_memory<size_t, 0> scoreTotalVoxN;
                  auto scoreTotalVoxN_acc_ct1 = scoreTotalVoxN.get_access(cgh);
                  extern dpct::device_memory<float, 0> idestpr;
                  auto idestpr_acc_ct1 = idestpr.get_access(cgh);
                  extern dpct::device_memory<float, 0> estpr0;
                  auto estpr0_acc_ct1 = estpr0.get_access(cgh);
                  extern dpct::device_memory<int, 0> npstk;
                  auto npstk_acc_ct1 = npstk.get_access(cgh);
                  extern dpct::device_memory<cl::sycl::float4, 1> psx;
                  auto psx_acc_ct1 = psx.get_access(cgh);
                  extern dpct::device_memory<cl::sycl::float4, 1> psvx;
                  auto psvx_acc_ct1 = psvx.get_access(cgh);
                  extern dpct::device_memory<int, 1> psix;
                  auto psix_acc_ct1 = psix.get_access(cgh);
                  extern dpct::device_memory<float, 0> iderstpw;
                  auto iderstpw_acc_ct1 = iderstpw.get_access(cgh);
                  extern dpct::device_memory<float, 0> erstpw0;
                  auto erstpw0_acc_ct1 = erstpw0.get_access(cgh);
                  extern dpct::device_memory<float, 0> ideimfp;
                  auto ideimfp_acc_ct1 = ideimfp.get_access(cgh);
                  extern dpct::device_memory<float, 0> eimfp0;
                  auto eimfp0_acc_ct1 = eimfp0.get_access(cgh);
                  auto edens_tex_acc = edens_tex.get_access(cgh);
                  auto dens_tex_acc = dens_tex.get_access(cgh);
                  auto matid_tex_acc = matid_tex.get_access(cgh);
                  auto stpr_tex_acc = stpr_tex.get_access(cgh);
                  auto rstpw_tex_acc = rstpw_tex.get_access(cgh);
                  auto bstpw_tex_acc = bstpw_tex.get_access(cgh);
                  auto imfpIon_tex_acc = imfpIon_tex.get_access(cgh);
                  auto imfpPpe_tex_acc = imfpPpe_tex.get_access(cgh);
                  auto imfpPoe_tex_acc = imfpPoe_tex.get_access(cgh);
                  auto imfpPoi_tex_acc = imfpPoi_tex.get_access(cgh);
                  cgh.parallel_for<dpct_kernel_name<class proton_f8ff81>>(
                    cl::sycl::nd_range<3>((cl::sycl::range<3>(nblocks, 1, 1) * cl::sycl::range<3>(THRD_BLCK_PROTON, 1, 1)), cl::sycl::range<3>(THRD_BLCK_PROTON, 1, 1)),
                    [=](cl::sycl::nd_item<3> item_ct1) {
                      proton(dose_scorer_batched_array_ct0, energy_scorer_batched_array_ct1, LET_scorer_batched_array_ct2, LET_scorer_energy_weight_ct3, spectrum_scorer_batched_array_ct4, stop_scorer_batched_array_ct5, item_ct1, nactive_acc_ct1, x_acc_ct1, vx_acc_ct1, ix_acc_ct1, energybins_acc_ct1, energybins_min_acc_ct1, energybins_size_acc_ct1, ifScoreDose_acc_ct1, ifScoreEnergy_acc_ct1, ifScoreLET_acc_ct1, ifScoreSpectrum_acc_ct1, ifScoreStop_acc_ct1, scoreShifts_acc_ct1, scoreVoxSize_acc_ct1, scoreVoxVolume_acc_ct1, scoreVoxN_acc_ct1, scoreRotX_acc_ct1, ifDoseToWater_acc_ct1, epabs_acc_ct1, eemin_acc_ct1, stepmax0_acc_ct1, maxEdecay_acc_ct1, maxAllowedLET_acc_ct1, cuseed_acc_ct1, dx_acc_ct1, dy_acc_ct1, dz_acc_ct1, idx_acc_ct1, idy_acc_ct1, idz_acc_ct1, Unxvox_acc_ct1, Unyvox_acc_ct1, Unzvox_acc_ct1, scoreTotalVoxN_acc_ct1, idestpr_acc_ct1, estpr0_acc_ct1, npstk_acc_ct1, psx_acc_ct1, psvx_acc_ct1, psix_acc_ct1, iderstpw_acc_ct1, erstpw0_acc_ct1, ideimfp_acc_ct1, eimfp0_acc_ct1, edens_tex_acc, dens_tex_acc, matid_tex_acc, stpr_tex_acc, rstpw_tex_acc, bstpw_tex_acc, imfpIon_tex_acc, imfpPpe_tex_acc, imfpPoe_tex_acc, imfpPoi_tex_acc);
                    });
                });

from this:

proton<<<nblocks, THRD_BLCK_PROTON>>>(dose_scorer_batched_array, energy_scorer_batched_array, LET_scorer_batched_array, LET_scorer_energy_weight, spectrum_scorer_batched_array, stop_scorer_batched_array);

I therefore think all these problems may come from the excessive use of global variables in the CUDA code. So maybe I'll have to refactor the CUDA code into something of a more functional style first. But the difficulty of that refactoring will push this task to the back of my backlog.

I also found it a bit hard to find what I needed in the MKL library by documentation (and even find the documentation), in particular did the MKL-SYCL sparse BLAS seem very limited. With some grepping in the MKL include directory I got a bit further, but without really being sure of what I was doing. Maybe because I'm not particularly familiar with BLAS.

Thanks for the help so far.
I may try again when you release a new version.

/Andreas

0 Kudos
JenniferJ
Moderator
2,109 Views

Is it possible to send a testcase? 

Thanks again!

Jennifer 

0 Kudos
Reply