
How to port some CUDA APIs

Hi,

I have some CUDA code that I am trying to port. It uses CUDA driver API functions such as cuInit, cuDeviceGet, cuDeviceGetAttribute, cuCtxCreate, cuCtxEnablePeerAccess, cuModuleLoadData, cuModuleGetFunction, cuFuncSetCacheConfig, cuLaunchKernel, cuCtxSynchronize, cuMemAlloc, cuMemGetInfo, cuArray3DCreate, cuTexObjectCreate, and cuTexObjectDestroy.

Since SYCL code is compiled directly into the binary, I understand that I do not need the cuModuleXxx functions, as I do not need to load anything dynamically. For cuLaunchKernel, I understand that I need to call .submit() on a sycl::queue, or use dpct::get_default_queue_wait(), to start kernel execution. For cuDeviceGet, I can use dpct::dev_mgr::instance().get_device(0). For cuDeviceGetAttribute, I can use dpct::dev_mgr::instance().get_device(0).get_device_info(dpct::device_info). For cuMemAlloc, I can use sycl::malloc_device/sycl::free. Is that correct?
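To make the mapping concrete, here is a rough sketch of what I believe the plain-SYCL equivalents look like (without the dpct helpers; the selector, sizes, and kernel body are just placeholders):

```cpp
#include <CL/sycl.hpp>

int main() {
  // cuDeviceGet / cuCtxCreate: selecting a device and creating a queue on it.
  sycl::queue q{sycl::gpu_selector{}};

  // cuDeviceGetAttribute: query device properties via info descriptors.
  auto name = q.get_device().get_info<sycl::info::device::name>();
  auto cus  = q.get_device().get_info<sycl::info::device::max_compute_units>();

  // cuMemAlloc / cuMemFree: USM device allocation.
  int *d_buf = sycl::malloc_device<int>(1024, q);

  // cuLaunchKernel: submit a kernel to the queue.
  q.submit([&](sycl::handler &h) {
    h.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) {
      d_buf[i] = static_cast<int>(i[0]);
    });
  });

  // cuCtxSynchronize: wait for all work submitted to this queue to finish.
  q.wait();

  sycl::free(d_buf, q);
  return 0;
}
```

(Compiling and running this of course requires a DPC++/SYCL toolchain.)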

Since dpct failed to convert these functions, I would appreciate it if you could give me some pointers on what I can use instead. Any documentation on how to create textures would be helpful too.

Regards,
Gagan

6 Replies
Moderator

Hi,


Yes, that's correct. The DPCT/DPC++ functions that you have mentioned are roughly equivalent to their CUDA counterparts.


Coming to texture memory: DPCT partially supports texture memory API calls. Internally, texture memory is mapped to a SYCL image (via dpct::image).


It would be really helpful if you could send a small CUDA source file containing the CUDA APIs that fail to migrate.



Thanks,

Rahul



Here are some code excerpts:

 

    /* Check if the device has P2P access to any other device in the system. */
    for (int peer_num = 0; peer_num < count && !info.has_peer_memory; peer_num++) {
      if (num != peer_num) {
        int can_access = 0;
        cuDeviceCanAccessPeer(&can_access, num, peer_num);
        info.has_peer_memory = (can_access != 0);
      }
    }

    int pci_location[3] = {0, 0, 0};
    cuDeviceGetAttribute(&pci_location[0], CU_DEVICE_ATTRIBUTE_PCI_DOMAIN_ID, num);
    cuDeviceGetAttribute(&pci_location[1], CU_DEVICE_ATTRIBUTE_PCI_BUS_ID, num);
    cuDeviceGetAttribute(&pci_location[2], CU_DEVICE_ATTRIBUTE_PCI_DEVICE_ID, num);
    result = cuCtxCreate(&cuContext, ctx_flags, cuDevice);

  // Ensure array access over the link is possible as well (for 3D textures)
  cuda_assert(cuDeviceGetP2PAttribute(&can_access,
                                      CU_DEVICE_P2P_ATTRIBUTE_ARRAY_ACCESS_ACCESS_SUPPORTED,
                                      cuDevice,
                                      peer_device_cuda->cuDevice));
  if (can_access == 0) {
    return false;
  }

    int result = cuCtxEnablePeerAccess(peer_device_cuda->cuContext, 0);

    CUmodule cuModule;
    result = cuModuleLoadData(&cuModule, cubin_data.c_str());

    cuModuleGetFunction(
        &functions.adaptive_stopping, cuModule, "kernel_cuda_adaptive_stopping");
    cuFuncSetCacheConfig(functions.adaptive_stopping, CU_FUNC_CACHE_PREFER_L1);

 

The code given above is from the Blender repository, files intern/cycles/device/device_cuda.cpp and intern/cycles/device/cuda/device_cuda_impl.cpp respectively.

I understand that I don't need to explicitly load cubin files and compile them at run time, since SYCL code is already compiled. But is there a way to specify the cache configuration, as done by cuFuncSetCacheConfig?

And are there functions equivalent to cuDeviceGetP2PAttribute and cuCtxEnablePeerAccess?

Regards,
Gagan

 

Moderator

Hi,


Could you please try with the latest beta09 release and let me know if there is any change?



Thanks,

Rahul



Hi Rahul,

I'm trying to convert the Cycles library from Blender. I tried beta09 (on Ubuntu 18.04). I'm using the same code, the same folder structure, and the same compile_commands.json file that I used with beta08 to convert the project.

dpct stops converting at random points (no progress messages on the terminal even after waiting 30+ minutes), and I have seen this at least 7-8 times since yesterday. There is no output in the output folder, not even a log file, so I have no idea what happened.

There are a couple of CUDA files in this library: cycles/kernel/kernels/cuda/filter.cu and cycles/kernel/kernels/cuda/kernel.cu. dpct stops responding when it reaches them. I had to remove the first CUDA file to get to the second, and then remove the second too for it to proceed.

A few times it failed with SIGABRT. I have attached the log files (.diags.log and the conversion log file). The error message on the terminal is:

...
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/render/denoising.cpp
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/kernel/kernels/cpu/filter.cpp
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/util/util_debug.cpp
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/blender/blender_geometry.cpp
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/opencolorio/ocio_impl_glsl.cc
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/kernel/kernels/cpu/kernel_split_avx2.cpp
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_create

dpct error: meet signal:SIGABRT Intel(R) DPC++ Compatibility Tool trys to write analysis reports and terminates...

The error message is not really useful.

I can't comment on the quality of the converted code, as I have not been able to get it to convert successfully.

The command used for conversion:

dpct --cuda-include-path=/usr/local/cuda/include -p=compile_commands.json --out-root=dpctx --in-root=. --output-file=conv_errs.txt

Regards,
Gagan

 


Hi Rahul,

I tried converting the code with beta09 but I see no change. All the functions related to CUDA setup are left as-is in the converted code.

So, are there any parallel functions for: cuDeviceCanAccessPeer, cuDeviceGetP2PAttribute, cuCtxEnablePeerAccess, cuOccupancyMaxPotentialBlockSize, cuCtxSynchronize?

I also see a function used in the CUDA code to get total-memory and free-memory info. I understand that I can use cl::sycl::info::device::max_mem_alloc_size, cl::sycl::info::device::global_mem_size, or cl::sycl::info::device::local_mem_size to get some information about memory, but is there a way to get free-memory info after launching a kernel, to see how much memory it consumed? The CUDA function is: cuMemGetInfo(&free_before, &total);

Any documentation link would be helpful too.

Regards,
Gagan

PS: The converted code is attached. There is not much conversion taking place for these files. It isn't CUDA kernel code per se but CUDA environment setup code, so maybe dpct wasn't expected to do much.

Moderator

Hi,

 

cuCtxSynchronize(): CUDA contexts roughly map to SYCL contexts. This particular API blocks the host until the previously submitted tasks on the device have completed. In my opinion, queue.wait() would be a good equivalent, since every queue has a particular context associated with it.
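As a small sketch of that correspondence (the function and data names here are placeholders, not dpct output):

```cpp
#include <CL/sycl.hpp>

void run(sycl::queue &q, float *data, size_t n) {
  // cuLaunchKernel(...): work is enqueued asynchronously.
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] *= 2.0f; });

  // cuCtxSynchronize(): block the host until everything submitted to q
  // has completed. If you use several queues, wait on each of them.
  q.wait();
}
```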

 

cuMemGetInfo(): The SYCL/DPC++ runtime automatically takes care of memory management in the buffer/accessor model, i.e. data copy to/from the device is completely abstracted. I'm not quite sure whether SYCL/DPC++ provides such functionality.
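The static sizes you mentioned can be queried like this (a sketch; note these report fixed capacities, not the amount currently free, so there is no direct analogue of cuMemGetInfo's "free" value here):

```cpp
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  sycl::device d{sycl::default_selector{}};

  // Total global memory on the device (like the "total" from cuMemGetInfo).
  auto total = d.get_info<sycl::info::device::global_mem_size>();
  // Largest single allocation the runtime will accept.
  auto max_alloc = d.get_info<sycl::info::device::max_mem_alloc_size>();
  // Per-work-group local (shared) memory.
  auto local = d.get_info<sycl::info::device::local_mem_size>();

  std::cout << total << " " << max_alloc << " " << local << "\n";
  return 0;
}
```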

 

You may refer to the link below for the SYCL 1.2.1 specification:

https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf

 

For P2P data access, you may rely on USM for the data movement.
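As a sketch of that idea, assuming two GPUs are exposed on the same platform so they can share a context (whether the copy actually goes peer-to-peer over the link is up to the runtime):

```cpp
#include <CL/sycl.hpp>

int main() {
  auto devs = sycl::platform{sycl::gpu_selector{}}.get_devices();
  if (devs.size() < 2) return 0;  // this sketch needs two devices

  // One context spanning both devices, one queue per device.
  sycl::context ctx{devs};
  sycl::queue q0{ctx, devs[0]};
  sycl::queue q1{ctx, devs[1]};

  constexpr size_t n = 1024;
  int *a = sycl::malloc_device<int>(n, devs[0], ctx);
  int *b = sycl::malloc_device<int>(n, devs[1], ctx);

  // Device-to-device copy through USM; this replaces the explicit
  // cuCtxEnablePeerAccess + peer copy pattern from the CUDA code.
  q0.memcpy(b, a, n * sizeof(int)).wait();

  sycl::free(a, ctx);
  sycl::free(b, ctx);
  return 0;
}
```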

 

Please note that I have escalated this query to the DPCT SMEs. They will get back to you shortly.

 

Thanks,

Rahul

 
