Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Segfault in MPI_Send using SYCL USM

Sidarth
Novice

Hello,

I am trying out an example program that sends and receives SYCL USM data between two MPI ranks attached to the same device, and the code segfaults at the MPI_Send call.

Here is the program:

#include <assert.h>
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char *argv[]) {

  /* -------------------------------------------------------------------------------------------
     MPI Initialization.
  --------------------------------------------------------------------------------------------*/

  MPI_Init(&argc, &argv);

  int size;
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (size != 2) {
    if (rank == 0) {
      printf("This program requires exactly 2 MPI ranks, but you are "
             "attempting to use %d! Exiting...\n",
             size);
    }
    MPI_Finalize();
    exit(0);
  }

  /* -------------------------------------------------------------------------------------------
      SYCL Initialization, which internally sets the CUDA device.
  --------------------------------------------------------------------------------------------*/

  sycl::queue q{};

  int tag = 0;
  const int nelem = 20;
  const size_t nsize = nelem * sizeof(int);
  std::vector<int> data(nelem, -1);

  /* -------------------------------------------------------------------------------------------
   Create SYCL USM in each rank.
  --------------------------------------------------------------------------------------------*/

  int *devp = sycl::malloc_device<int>(nelem, q);

  /* -------------------------------------------------------------------------------------------
   Perform the send/receive.
  --------------------------------------------------------------------------------------------*/

  if (rank == 0) {
    // Copy the data to the rank 0 device and wait for the memory copy to
    // complete.
    q.memcpy(devp, &data[0], nsize).wait();

    // Operate on the Rank 0 data.
    auto pf = [&](sycl::handler &h) {
      auto kern = [=](sycl::id<1> id) { devp[id] *= 2; };
      h.parallel_for(sycl::range<1>{nelem}, kern);
    };

    q.submit(pf).wait();

    // Send the data from rank 0 to rank 1.
    MPI_Send(devp, nsize, MPI_BYTE, 1, tag, MPI_COMM_WORLD);
    printf("Sent %d elements from %d to 1\n", nelem, rank);
  } else {
    assert(rank == 1);

    MPI_Status status;
    // Receive the data sent from rank 0.
    MPI_Recv(devp, nsize, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &status);
    printf("received status==%d\n", status.MPI_ERROR);

    // Copy the data back to the host and wait for the memory copy to complete.
    q.memcpy(&data[0], devp, nsize).wait();

    sycl::free(devp, q);

    // Check the values.
    for (int i = 0; i < nelem; ++i)
      assert(data[i] == -2);
  }
  MPI_Finalize();
  return 0;
}

Here is the compile command:

mpiicpx -fsycl -fsycl-targets=spir64 -o n1_codeplay_sample_usm_orig n1_codeplay_sample_usm_orig.cpp
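
As a side check (not part of the program above, and just a sketch), a small program like the one below prints which device the default-constructed sycl::queue selects on each rank; it should build with the same mpiicpx command:

#include <cstdio>
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Report which device the default selector chose on this rank.
  sycl::queue q{};
  std::printf("rank %d selected device: %s\n", rank,
              q.get_device().get_info<sycl::info::device::name>().c_str());

  MPI_Finalize();
  return 0;
}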


I tried it with both oneAPI versions 2023.2.1 and 2024.1.0. The CPU is Intel Sapphire Rapids and the GPU is an Intel(R) Data Center GPU Max 1100.

Could someone please point me to what I am missing here?

Thanks,
Sidarth Narayanan

TobiasK
Moderator

@Sidarth 
Please always provide the output of a run with I_MPI_DEBUG=10, your OS, and, when GPUs are involved, the GPU driver version.
I assume you were able to run a non-MPI USM example and verified that it works correctly?

You also have to enable GPU support by setting

I_MPI_OFFLOAD=1
Did you do that?
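
For example (a sketch, using the binary name from the compile command above), the launch would look something like:

I_MPI_OFFLOAD=1 I_MPI_DEBUG=10 mpirun -np 2 ./n1_codeplay_sample_usm_orig

or you can export I_MPI_OFFLOAD=1 in the environment before calling mpirun.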

Can you run the following?
mpirun -np 2 IMB-MPI1-GPU

Sidarth
Novice

Thank you for the response. Setting I_MPI_OFFLOAD=1 solved the issue with the sample code I had.

OS: Rocky 9
GPU: Intel(R) Data Center GPU Max 1100
GPU driver version: 

snarayanan@intel-eagle:builds$ lspci -k | grep -EA3 'VGA|3D|Display'
02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
        DeviceName: ASPEED AST2600
        Subsystem: ASPEED Technology, Inc. ASPEED Graphics Family
        Kernel driver in use: ast
--
38:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
        Subsystem: NVIDIA Corporation Device 145f
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
--
da:00.0 Display controller: Intel Corporation Ponte Vecchio XT (1 Tile) [Data Center GPU Max 1100] (rev 2f)
        Subsystem: Intel Corporation Device 0000
        Kernel driver in use: i915
        Kernel modules: i915


But the command "mpirun -np 2 IMB-MPI1-GPU" results in a segfault:

snarayanan@intel-eagle:builds$ I_MPI_DEBUG=10 mpirun -np 2 IMB-MPI1-GPU
[0] MPI startup(): ===== GPU topology on intel-eagle.converge.global =====
[0] MPI startup(): NUMA nodes : 2
[0] MPI startup(): GPUs       : 1
[0] MPI startup(): Tiles      : 1
[0] MPI startup(): NUMA Id      GPU Id Tiles  Ranks on this NUMA
[0] MPI startup(): 0                         0
[0] MPI startup(): 1            0     (0)    1
[0] MPI startup(): ===== GPU pinning on intel-eagle.converge.global =====
[0] MPI startup(): Rank Pin tile
[0] MPI startup(): 0    {0}
[0] MPI startup(): 1    {0}
[0] MPI startup(): Intel(R) MPI Library, Version 2021.12  Build 20240213 (id: 4f55822)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPIDI_GPU_device_command_channel_init(): Device initiated communications: disabled
[0] MPI startup(): libfabric loaded: libfabric.so.1 
[0] MPI startup(): libfabric version: 1.18.1-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: tcp
[0] MPI startup(): shm segment size (1211 MB per rank) * (2 local ranks) = 2423 MB total
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.12/opt/mpi/etc/tuning_spr_shm-ofi_tcp_10_x1.dat" not found
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.12/opt/mpi/etc/tuning_spr_shm-ofi_tcp_10.dat" not found
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.12/opt/mpi/etc/tuning_spr_shm-ofi_tcp.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.12/opt/mpi/etc/tuning_spr_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 19 (TAG_UB value: 524287) 
[0] MPI startup(): source bits available: 20 (Maximal number of rank: 1048575) 
[0] MPI startup(): Number of NICs:  1 
[0] MPI startup(): ===== NIC pinning on intel-eagle.converge.global =====
[0] MPI startup(): Rank    Pin nic
[0] MPI startup(): 0       eno0
[0] MPI startup(): 1       eno0
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank    Pid      Node name                    Pin cpu
[0] MPI startup(): 0       1500036  intel-eagle.converge.global  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,
                                                   30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
[0] MPI startup(): 1       1500037  intel-eagle.converge.global  {56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82
                                                   ,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,
                                                   107,108,109,110,111}
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.12
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_OFFLOAD=1
[0] MPI startup(): I_MPI_INFO_GPU_ID_LOCAL_MAP=0,0
[0] MPI startup(): I_MPI_INFO_GPU_TILE_ID_LOCAL_MAP=0,0
[0] MPI startup(): I_MPI_DEBUG=10
#----------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2021.7, MPI-1 part (GPU)
#----------------------------------------------------------------
# Date                  : Fri Apr 26 10:07:28 2024
# Machine               : x86_64
# System                : Linux
# Release               : 5.14.0-362.18.1.el9_3.0.1.x86_64
# Version               : #1 SMP PREEMPT_DYNAMIC Sun Feb 11 13:49:23 UTC 2024
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was:

# IMB-MPI1-GPU

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_local
# Reduce_scatter
# Reduce_scatter_block
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.44         0.00

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 1500036 RUNNING AT intel-eagle.converge.global
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 1500037 RUNNING AT intel-eagle.converge.global
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

 
Am I missing some config settings here?

TobiasK
Moderator

No, that should work.
Can you please post the output of sycl-ls?

Sidarth
Novice

Sure, here is the sycl-ls output:

snarayanan@intel-eagle:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Genuine Intel(R) CPU 0000%@ OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO  [23.43.27642.40]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.27642]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-PCIE-40GB 8.0 [CUDA 12.4]
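
(Side note: since both the Intel GPU's Level Zero/OpenCL backends and a CUDA device show up here, the default SYCL device selector could in principle pick either one; if that ever matters, the run can be restricted to the Intel GPU with something like ONEAPI_DEVICE_SELECTOR=level_zero:gpu, depending on the oneAPI version in use.)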