Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

How to use cudaMalloc'ed buffers for GPU-aware MPI communication

kvoronin
Beginner

Hello!

 

I am trying to get Intel MPI to work on Nvidia GPUs. Specifically, I need to be able to call MPI primitives (say, MPI_Reduce) with device buffers (allocated via cudaMalloc). I read https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-13/gpu-buffers-support.html, and it seems that I should be able to pass device buffers to Intel MPI from within a #pragma omp target data region. Is that correct?

 

If passing cudaMalloc'ed buffers directly is not possible, but the data already lives in cudaMalloc'ed buffers, is there a zero-copy way to pass those device buffers (possibly transformed) to an MPI call?

 

Software:

oneAPI (Base toolkit + HPC toolkit): 2024.2.0

Also, I've manually installed Codeplay's plugin for Nvidia GPUs, and `sycl-ls` reports:

[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8480CL OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA H100 80GB HBM3 9.0 [CUDA 12.4]
[cuda:gpu][cuda:1] NVIDIA CUDA BACKEND, NVIDIA H100 80GB HBM3 9.0 [CUDA 12.4]

 

Hardware:

CPU: x86 (SPR)

GPUs: Nvidia A100 / H100

 

Should the following snippet work? If not, what is the semantically equivalent correct code?

  float* x_d = NULL;
  cuda_error = cudaMalloc(&x_d, 1 * sizeof(float));

  float* res_d = NULL;
  cuda_error = cudaMalloc(&res_d, 1 * sizeof(float));

  #pragma omp target is_device_ptr(x_d, res_d)
  {
      mpi_error = MPI_Reduce(x_d, res_d, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
      if (mpi_error) { printf("Error: mpi_error @3 = %d at rank = %d\n", mpi_error, rank); fflush(0); }
  }

The code is compiled with

mpiicpx -g -O0 -I${CUDA_PATH}/include ${test}.cpp -L${CUDA_PATH}/lib64 -lcudart -o ${test}.out

and run with

export LD_LIBRARY_PATH=${CUDA_PATH}/lib64:${LD_LIBRARY_PATH}
export I_MPI_DEBUG=120
export I_MPI_OFFLOAD_MODE=cuda
mpirun -n 2 ./${test}.out

For me, the code segfaults on both ranks.

 

Thanks,
Kirill

1 Solution
TobiasK
Moderator

@kvoronin sorry, Intel's OpenMP offload only works with Intel GPUs; that's why zero devices are found in your log.

7 Replies
TobiasK
Moderator

@kvoronin Can you please also set I_MPI_OFFLOAD=1?

kvoronin
Beginner

Hi @TobiasK

 

Thanks for the answer and the helpful tip!

 

EDIT: what was described below was caused by a messed-up environment; I've deleted the contents.

 

Thanks,
Kirill

 

kvoronin
Beginner

Hi @TobiasK,

 

Thanks for the tip!

 

With I_MPI_OFFLOAD=1, the code works with cudaMalloc'ed buffers and no pragmas. However, with #pragma omp target is_device_ptr(x_d, res_d) it still segfaults.
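
For reference, here is a minimal sketch of the pragma-free variant that works. The MPI_Init/MPI_Finalize scaffolding, the initialization memcpy, and the result check are my additions for completeness, and the file name in the comment is just a placeholder:

// Minimal sketch of the working pattern: cudaMalloc'ed device pointers are
// passed straight to MPI_Reduce, with no OpenMP target region around the call.
// Build: mpiicpx -I${CUDA_PATH}/include reduce_gpu.cpp -L${CUDA_PATH}/lib64 -lcudart -o reduce_gpu.out
// Run:   I_MPI_OFFLOAD=1 mpirun -n 2 ./reduce_gpu.out
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // One float per rank, set to 1.0f on the device.
    float one = 1.0f;
    float* x_d = NULL;
    float* res_d = NULL;
    cudaMalloc((void**)&x_d, sizeof(float));
    cudaMalloc((void**)&res_d, sizeof(float));
    cudaMemcpy(x_d, &one, sizeof(float), cudaMemcpyHostToDevice);

    // Device pointers go directly into the MPI call.
    int mpi_error = MPI_Reduce(x_d, res_d, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (mpi_error != MPI_SUCCESS) {
        printf("Error: MPI_Reduce returned %d at rank = %d\n", mpi_error, rank);
    }

    if (rank == 0) {
        float res = 0.0f;
        cudaMemcpy(&res, res_d, sizeof(float), cudaMemcpyDeviceToHost);
        printf("res = %.3f (expected %d)\n", res, size);
    }

    cudaFree(x_d);
    cudaFree(res_d);
    MPI_Finalize();
    return 0;
}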

 

Any further advice?

 

Also, another question:

What would be the (functional/performance) difference between a plain MPI_Reduce call on cudaMalloc'ed buffers and enclosing the MPI_Reduce call in a #pragma omp target region?

 

I don't necessarily want to use pragmas; any performant way that works would be fine.

 

Full log of the segfaulting run (with I_MPI_OFFLOAD=1):

omptarget --> Init offload library!
OMPT --> Entering connectLibrary (libomp)
OMPT --> OMPT: Trying to load library libiomp5.so
omptarget --> Init offload library!
OMPT --> Entering connectLibrary (libomp)
OMPT --> OMPT: Trying to get address of connection routine ompt_libomp_connect
OMPT --> OMPT: Trying to load library libiomp5.so
OMPT --> OMPT: Library connection handle = 0x7f38f01003c0
OMPT --> OMPT: Trying to get address of connection routine ompt_libomp_connect
OMPT --> OMPT: Library connection handle = 0x7fb09f9003c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
OMPT --> Exiting connectLibrary (libomp)
omptarget --> Loading RTLs...
OMPT --> Exiting connectLibrary (libomp)
omptarget --> Loading RTLs...
omptarget --> Attempting to load library 'libomptarget.rtl.level0.so'...
omptarget --> Attempting to load library 'libomptarget.rtl.level0.so'...
omptarget --> Unable to load library 'libomptarget.rtl.level0.so': libze_loader.so.1: cannot open shared object file: No such file or directory!
omptarget --> Attempting to load library 'libomptarget.rtl.opencl.so'...
omptarget --> Unable to load library 'libomptarget.rtl.level0.so': libze_loader.so.1: cannot open shared object file: No such file or directory!
omptarget --> Attempting to load library 'libomptarget.rtl.opencl.so'...
omptarget --> Successfully loaded library 'libomptarget.rtl.opencl.so'!
omptarget --> Successfully loaded library 'libomptarget.rtl.opencl.so'!
Target OPENCL RTL --> Init OpenCL plugin!
Target OPENCL RTL --> Target device type is set to GPU
Target OPENCL RTL --> OMPT: Entering connectLibrary (libomptarget)
OMPT --> OMPT: Trying to load library libomptarget.so
Target OPENCL RTL --> Init OpenCL plugin!
Target OPENCL RTL --> Target device type is set to GPU
OMPT --> Target OPENCL RTL --> OMPT: Entering connectLibrary (libomptarget)
OMPT --> OMPT: Trying to load library libomptarget.so
OMPT: Trying to get address of connection routine ompt_libomptarget_connect
OMPT --> OMPT: Library connection handle = 0x7f38efc4d4c0
OMPT --> Enter ompt_libomptarget_connect
OMPT --> Leave ompt_libomptarget_connect
Target OPENCL RTL --> OMPT: Exiting connectLibrary (libomptarget)
Target OPENCL RTL --> Start initializing OpenCL
OMPT --> OMPT: Trying to get address of connection routine ompt_libomptarget_connect
OMPT --> OMPT: Library connection handle = 0x7fb09f44d4c0
OMPT --> Enter ompt_libomptarget_connect
OMPT --> Leave ompt_libomptarget_connect
Target OPENCL RTL --> OMPT: Exiting connectLibrary (libomptarget)
Target OPENCL RTL --> Start initializing OpenCL
Target OPENCL RTL --> WARNING: No OpenCL devices found.
omptarget --> No devices supported in this RTL
omptarget --> Attempting to load library 'libomptarget.rtl.x86_64.so'...
Target OPENCL RTL --> WARNING: No OpenCL devices found.
omptarget --> No devices supported in this RTL
omptarget --> Attempting to load library 'libomptarget.rtl.x86_64.so'...
omptarget --> Unable to load library 'libomptarget.rtl.x86_64.so': libffi.so: cannot open shared object file: No such file or directory!
omptarget --> RTLs loaded!
omptarget --> No RTL found for image 0x00000000004024c0!
omptarget --> Done registering entries!
omptarget --> Unable to load library 'libomptarget.rtl.x86_64.so': libffi.so: cannot open shared object file: No such file or directory!
omptarget --> RTLs loaded!
omptarget --> No RTL found for image 0x00000000004024c0!
omptarget --> Done registering entries!
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fb09f8eef00 0x00007fb09f8ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f38f00eef00 0x00007f38f00ee7c0
rank = 1 is here (size = 2)
rank = 0 is here (size = 2)
res = 1.000
Success: correct result res = 1.000
Device Number: 0
Device name: NVIDIA H100 80GB HBM3
Device Number: 1
Device name: NVIDIA H100 80GB HBM3
rank = 0; cuda_error before MPI_Reduce with GPU buffers = 0
rank = 1; cuda_error before MPI_Reduce with GPU buffers = 0
omptarget --> Call to omp_get_num_devices returning 0
omptarget --> Entering target region for device 0 with entry point 0x0000000000402350
omptarget --> Call to omp_get_num_devices returning 0
omptarget --> Entering target region for device 0 with entry point 0x0000000000402350
omptarget --> Call to omp_get_num_devices returning 0
omptarget --> omp_get_num_devices() == 0 but offload is manadatory
omptarget --> Call to omp_get_num_devices returning 0
omptarget --> omp_get_num_devices() == 0 but offload is manadatory
omptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
omptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
omptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
omptarget error: No images found compatible with the installed hardware. omptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
omptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
omptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
omptarget error: No images found compatible with the installed hardware. Found 0 image(s): ()
mpi_reduce_offload_example.cpp:14:119: omptarget fatal error 1: failure of target construct while offloading is mandatory
Found 0 image(s): ()
mpi_reduce_offload_example.cpp:14:119: omptarget fatal error 1: failure of target construct while offloading is mandatory

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 1621890 RUNNING AT <node name>
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 1621891 RUNNING AT <node name>
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

 

Thanks,
Kirill

TobiasK
Moderator

@kvoronin sorry, Intel's OpenMP offload only works with Intel GPUs; that's why zero devices are found in your log.

kvoronin
Beginner

Hi @TobiasK 

 

Thanks for the reply! I'd like to note that the docs are pretty ambiguous about this and about what is supposed to work.

 

One more question: in terms of performance, how would GPU-aware OpenMPI compare to Intel MPI when passing cudaMalloc'ed device buffers around? Are there significant differences, such as extra copies, in one implementation or the other?

 

Or, if you don't know about OpenMPI: how does the GPU-to-GPU transfer work with an Intel CPU + Nvidia GPUs (from the MPI library's side; I understand that hardware details like NVLink or PCIe matter here)?

 

Thanks,
Kirill

TobiasK
Moderator

@kvoronin for the time being, Intel MPI support for Intel GPUs is more mature. At the moment, we only support passing CUDA buffers to the library. We are working on GDR copy and more advanced features, which will be included in a future version.

The examples shown are still valid. However, just as you have to make sure that SYCL code runs on your device, you also have to make sure that the OpenMP code runs on your device. For SYCL that means adding the plugins; for OpenMP you have to compile with the vendor's toolchain, since we do not support GPUs other than Intel GPUs in our OpenMP runtime.
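
For reference, the supported CUDA-buffer path in this thread boils down to the following build-and-run recipe (a sketch assembled from the commands posted above, with I_MPI_OFFLOAD=1 added; ${CUDA_PATH} and ${test} are placeholders from the original post):

export LD_LIBRARY_PATH=${CUDA_PATH}/lib64:${LD_LIBRARY_PATH}
# Enable GPU buffer support in Intel MPI; this is what made the pragma-free
# MPI_Reduce on cudaMalloc'ed buffers work earlier in the thread.
export I_MPI_OFFLOAD=1
mpiicpx -g -O0 -I${CUDA_PATH}/include ${test}.cpp -L${CUDA_PATH}/lib64 -lcudart -o ${test}.out
mpirun -n 2 ./${test}.out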

kvoronin
Beginner

Thanks a lot, @TobiasK, for the answers!

 

I think this is all I wanted to ask here. I am experiencing an issue with large cudaMalloc'ed buffers passed to Intel MPI, but I'll open a separate ticket for it if I manage to reproduce it outside the application.

 

This thread can be closed.

 

Thanks,
Kirill

 
