Re:How to port some cuda APIs

Shukla__Gagandeep · ‎09-15-2020

Hi,

I have some cuda code that I am trying to port. It is using cuda functions like cuInit, cuDeviceGet, cuDeviceGetAttribute, cuCtxCreate, cuCtxEnablePeerAccess, cuModuleLoadData, cuModuleGetFunction, cuFuncSetCacheConfig, cuLaunchKernel, cuCtxSynchronize, cuMemAlloc, cuMemGetInfo, cuArray3DCreate, cuTexObjectCreate, cuTexObjectDestroy.

Since sycl code gets directly compiled into binary, I understand that I do not need cuModuleXxx functions as I do not need to load anything dynamically. For cuLaunchKernel, I understand that I need to call .submit on sycl::queue or dpct::get_default_queue_wait() to start the kernel execution. For cuDeviceGet, I can use dpct::dev_mgr::instance().get_device(0). For cuDeviceGetAttributes, I can use dpct::dev_mgr::instance().get_device(0).get_device_info(dpct::device_info). For cuMemAlloc, I can use sycl::malloc_device/sycl::free. Is that correct?

Since dpct failed to convert these functions, I would appreciate if you could help me with some pointers on what I can use instead. And any documentation on how to create textures would be helpful too.

Regards,
Gagan

RahulV_intel · ‎09-16-2020

Hi,

Yes, that's correct. The DPCT/DPC++ functions that you have mentioned are roughly equivalent to their CUDA counterpart.

Coming to texture memory, DPCT partially supports texture memory API calls. Internally, texture memory gets mapped to SYCL image (via dpct::image).

It would be really helpful if you could send a small CUDA source file containing the CUDA APIs that fail to migrate.

Thanks,

Rahul

Shukla__Gagandeep · ‎09-22-2020

Here are some code excerpts:

    /* Check if the device has P2P access to any other device in the system. */
    for (int peer_num = 0; peer_num < count && !info.has_peer_memory; peer_num++) {
      if (num != peer_num) {
        int can_access = 0;
        cuDeviceCanAccessPeer(&can_access, num, peer_num);
        info.has_peer_memory = (can_access != 0);
      }
    }

    int pci_location[3] = {0, 0, 0};
    cuDeviceGetAttribute(&pci_location[0], CU_DEVICE_ATTRIBUTE_PCI_DOMAIN_ID, num);
    cuDeviceGetAttribute(&pci_location[1], CU_DEVICE_ATTRIBUTE_PCI_BUS_ID, num);
    cuDeviceGetAttribute(&pci_location[2], CU_DEVICE_ATTRIBUTE_PCI_DEVICE_ID, num);

    result = cuCtxCreate(&cuContext, ctx_flags, cuDevice);

  // Ensure array access over the link is possible as well (for 3D textures)
  cuda_assert(cuDeviceGetP2PAttribute(&can_access,
                                      CU_DEVICE_P2P_ATTRIBUTE_ARRAY_ACCESS_ACCESS_SUPPORTED,
                                      cuDevice,
                                      peer_device_cuda->cuDevice));
  if (can_access == 0) {
    return false;
  }

    int result = cuCtxEnablePeerAccess(peer_device_cuda->cuContext, 0);

    CUmodule cuModule
    result = cuModuleLoadData(&cuModule, cubin_data.c_str());

    cuModuleGetFunction(
      &functions.adaptive_stopping, cuModule, "kernel_cuda_adaptive_stopping")
    cuFuncSetCacheConfig(functions.adaptive_stopping, CU_FUNC_CACHE_PREFER_L1)

Code given above is from blender repo files - intern/cycles/device/device_cuda.cpp and intern/cycles/device/cuda/device_cuda_impl.cpp respectively.

I understand that I dont need to explicitly load cubin files and compile them at run time as sycl code is already compiled. But is there a way to specify cache cofiguration as done by cuFuncSetCacheConfig

And is there some function like cuDeviceGetP2PAttribute and cuCtxEnablePeerAccess ?

Regards,
Gagan

RahulV_intel · ‎09-24-2020

Hi,

Could you please try out with the latest beta09 release and let me know if there is any change?

Thanks,

Rahul

Shukla__Gagandeep · ‎09-25-2020

Hi Rahul,

I'm trying to convert cycles library from blender. I tried beta09 (on ubuntu 18.04). I'm using same code, same folder structure and same compile_commands.json file that I used with beta08 to convert the project.

dpct stops converting randomly (no progress messages on terminal even after waiting for 30+ minutes) and I have seen it atleast 7-8 times since yesterday. There is no output in output folder, not even log file so I have no idea what happened.

There are a couple of cuda files in this library cycles/kernel/kernels/cuda/filter.cu and cycles/kernel/kernels/cuda/kernel.cu . dpct stops responding when it reaches them. I had to remove first cuda file to get to second and then remove second too for it to proceed.

Few times it failed with SIGABRT. I have attached the log files (.diags.log and conversion log file). Error message on terminal is:

...
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/render/denoising.cpp
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/kernel/kernels/cpu/filter.cpp
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/util/util_debug.cpp
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/blender/blender_geometry.cpp
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/opencolorio/ocio_impl_glsl.cc
Processing: /home/kuljeet/Downloads/repos/Blender09/intern/cycles/kernel/kernels/cpu/kernel_split_avx2.cpp
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_create

dpct error: meet signal:SIGABRT Intel(R) DPC++ Compatibility Tool trys to write analysis reports and terminates...

Error message is not really much useful.

Can't comment on converted code quality as I have not been able to get it to convert successfully.

Command used for conversion:

dpct --cuda-include-path=/usr/local/cuda/include -p=compile_commands.json --out-root=dpctx --in-root=. --output-file=conv_errs.txt

Regards,
Gagan

Shukla__Gagandeep · ‎09-27-2020

Hi Rahul,

Tried converting the code with beta09 but I see no change. All the functions related to cuda setup are let as it is in the converted code.

So is there any parallel functions for: cuDeviceCanAccessPeer, cuDeviceGetP2PAttribute, cuCtxEnablePeerAccess, cuOccupancyMaxPotentialBlockSize, cuCtxSynchronize

I also see function used in cuda code to get total memory and free memory info. I understand that I can use cl::sycl::info::device::max_mem_alloc_size or cl::sycl::info::device::global_mem_size/ cl::sycl::info::device::local_mem_size to get some information about memory but is there a way to get free memory info after launching a kernel to see how much memory it consumed. Cuda function is: cuMemGetInfo(&free_before, &total);

Any documentation link would be helpful too.

Regards,
Gagan

PS: Converted code is attached. There is not much conversion taking place for these files. It isn't cuda code per se but cuda env setup code so may be dpct wasn't expected to much.

RahulV_intel · ‎09-28-2020

Hi,

cuCtxSynchronize(): CUDA contexts roughly map to SYCL contexts. This particular API essentially blocks the device until the previous tasks gets completed. In my opinion, queue.wait() function would be a good equivalent to this, since every queue has a particular context associated with it.

cuMemGetInfo(): SYCL/DPC++ runtime automatically takes care of the memory management in case of a buffer/accessor model i.e data copy to/from the device is completely abstracted. Please refer to the SYCL specs 1.2.1.

For other CUDA APIs, DPCT subject matter experts(SME) will get in touch with you shortly.

Thanks,

Rahul

Varsha_M_Intel · ‎10-05-2020

We are currently working on migration of some CUDA Driver APIs.

Regarding your question on free memory, we do not have any API to the information at runtime.

We are working on this as well.

RahulV_intel · ‎06-13-2021

Hi Gagan,

Could you try migrating with the latest oneAPI (2021.2) version and let us know if it works?

Thanks,

Rahul

RahulV_intel · ‎06-20-2021

Hi Gagan,

Do you have any updates on this?

Thanks,

Rahul

RahulV_intel · ‎06-27-2021

Hi,

I have not heard back from you; we won’t be monitoring this thread. If you need further assistance, please post a new thread.

Regards,

Rahul