Hello,
I am trying out an example program that sends and receives USM data between two ranks connected to the same device, and the code segfaults at the MPI_Send call. Here is the program:
#include <assert.h>
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>

int main(int argc, char *argv[]) {
  /* -----------------------------------------------------------------------------------------
     MPI Initialization.
  ------------------------------------------------------------------------------------------*/
  MPI_Init(&argc, &argv);

  int size;
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (size != 2) {
    if (rank == 0) {
      printf("This program requires exactly 2 MPI ranks, but you are "
             "attempting to use %d! Exiting...\n",
             size);
    }
    MPI_Finalize();
    exit(0);
  }

  /* -----------------------------------------------------------------------------------------
     SYCL Initialization, which internally selects the device.
  ------------------------------------------------------------------------------------------*/
  sycl::queue q{};

  int tag = 0;
  const int nelem = 20;
  const size_t nsize = nelem * sizeof(int);
  std::vector<int> data(nelem, -1);

  /* -----------------------------------------------------------------------------------------
     Create SYCL USM in each rank.
  ------------------------------------------------------------------------------------------*/
  int *devp = sycl::malloc_device<int>(nelem, q);

  /* -----------------------------------------------------------------------------------------
     Perform the send/receive.
  ------------------------------------------------------------------------------------------*/
  if (rank == 0) {
    // Copy the data to the rank 0 device and wait for the memory copy to complete.
    q.memcpy(devp, &data[0], nsize).wait();

    // Operate on the rank 0 data.
    auto pf = [&](sycl::handler &h) {
      auto kern = [=](sycl::id<1> id) { devp[id] *= 2; };
      h.parallel_for(sycl::range<1>{nelem}, kern);
    };
    q.submit(pf).wait();

    // Send the data from rank 0 to rank 1.
    MPI_Send(devp, nsize, MPI_BYTE, 1, tag, MPI_COMM_WORLD);
    printf("Sent %d elements from %d to 1\n", nelem, rank);
  } else {
    assert(rank == 1);
    MPI_Status status;
    // Receive the data sent from rank 0.
    MPI_Recv(devp, nsize, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &status);
    printf("received status==%d\n", status.MPI_ERROR);

    // Copy the data back to the host and wait for the memory copy to complete.
    q.memcpy(&data[0], devp, nsize).wait();
    sycl::free(devp, q);

    // Check the values.
    for (int i = 0; i < nelem; ++i)
      assert(data[i] == -2);
  }

  MPI_Finalize();
  return 0;
}
Here is the compile command:
mpiicpx -fsycl -fsycl-targets=spir64 -o n1_codeplay_sample_usm_orig n1_codeplay_sample_usm_orig.cpp
I tried both oneAPI versions 2023.2.1 and 2024.1.0. The CPU is Intel Sapphire Rapids and the GPU is an Intel(R) Data Center GPU Max 1100.
Could someone please point me to what I am missing here?
Thanks,
Sidarth Narayanan
@Sidarth
Please always provide the output of I_MPI_DEBUG=10, your OS, and, for GPU runs, the GPU driver version.
I assume that you were able to run some non-MPI USM example and verified it works correctly?
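For example, one quick check (just a hedged diagnostic sketch, not part of your reproducer) is to print which device each rank's queue actually selected, right after constructing it:
// Diagnostic sketch: print the device picked by the default-constructed queue.
// Add right after `sycl::queue q{};` (requires <iostream>).
std::cout << "Rank " << rank << " selected device: "
          << q.get_device().get_info<sycl::info::device::name>()
          << std::endl;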
You also have to enable GPU support by setting
I_MPI_OFFLOAD=1
Did you do that?
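For example (a hedged sketch using the binary name from your compile line; adjust the launcher and flags to your setup):
I_MPI_OFFLOAD=1 mpirun -n 2 ./n1_codeplay_sample_usm_orig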
Can you also run the following?
mpirun -np 2 IMB-MPI1-GPU
Thank you for the response. Setting I_MPI_OFFLOAD=1 solved the issue with the sample code I had.
OS: Rocky 9
GPU: Intel(R) Data Center GPU Max 1100
GPU driver version:
snarayanan@intel-eagle:builds$ lspci -k | grep -EA3 'VGA|3D|Display'
02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
DeviceName: ASPEED AST2600
Subsystem: ASPEED Technology, Inc. ASPEED Graphics Family
Kernel driver in use: ast
--
38:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
Subsystem: NVIDIA Corporation Device 145f
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
--
da:00.0 Display controller: Intel Corporation Ponte Vecchio XT (1 Tile) [Data Center GPU Max 1100] (rev 2f)
Subsystem: Intel Corporation Device 0000
Kernel driver in use: i915
Kernel modules: i915
But the command "mpirun -np 2 IMB-MPI1-GPU" results in a segfault:
snarayanan@intel-eagle:builds$ I_MPI_DEBUG=10 mpirun -np 2 IMB-MPI1-GPU
[0] MPI startup(): ===== GPU topology on intel-eagle.converge.global =====
[0] MPI startup(): NUMA nodes : 2
[0] MPI startup(): GPUs : 1
[0] MPI startup(): Tiles : 1
[0] MPI startup(): NUMA Id GPU Id Tiles Ranks on this NUMA
[0] MPI startup(): 0 0
[0] MPI startup(): 1 0 (0) 1
[0] MPI startup(): ===== GPU pinning on intel-eagle.converge.global =====
[0] MPI startup(): Rank Pin tile
[0] MPI startup(): 0 {0}
[0] MPI startup(): 1 {0}
[0] MPI startup(): Intel(R) MPI Library, Version 2021.12 Build 20240213 (id: 4f55822)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPIDI_GPU_device_command_channel_init(): Device initiated communications: disabled
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.1-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: tcp
[0] MPI startup(): shm segment size (1211 MB per rank) * (2 local ranks) = 2423 MB total
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.12/opt/mpi/etc/tuning_spr_shm-ofi_tcp_10_x1.dat" not found
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.12/opt/mpi/etc/tuning_spr_shm-ofi_tcp_10.dat" not found
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.12/opt/mpi/etc/tuning_spr_shm-ofi_tcp.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.12/opt/mpi/etc/tuning_spr_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 19 (TAG_UB value: 524287)
[0] MPI startup(): source bits available: 20 (Maximal number of rank: 1048575)
[0] MPI startup(): Number of NICs: 1
[0] MPI startup(): ===== NIC pinning on intel-eagle.converge.global =====
[0] MPI startup(): Rank Pin nic
[0] MPI startup(): 0 eno0
[0] MPI startup(): 1 eno0
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 1500036 intel-eagle.converge.global {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,
30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
[0] MPI startup(): 1 1500037 intel-eagle.converge.global {56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82
,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,
107,108,109,110,111}
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.12
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_OFFLOAD=1
[0] MPI startup(): I_MPI_INFO_GPU_ID_LOCAL_MAP=0,0
[0] MPI startup(): I_MPI_INFO_GPU_TILE_ID_LOCAL_MAP=0,0
[0] MPI startup(): I_MPI_DEBUG=10
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.7, MPI-1 part (GPU)
#----------------------------------------------------------------
# Date : Fri Apr 26 10:07:28 2024
# Machine : x86_64
# System : Linux
# Release : 5.14.0-362.18.1.el9_3.0.1.x86_64
# Version : #1 SMP PREEMPT_DYNAMIC Sun Feb 11 13:49:23 UTC 2024
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# IMB-MPI1-GPU
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_local
# Reduce_scatter
# Reduce_scatter_block
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.44 0.00
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 1500036 RUNNING AT intel-eagle.converge.global
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 1500037 RUNNING AT intel-eagle.converge.global
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
Am I missing some config settings here?
No, that should work.
Can you please post the output of:
sycl-ls
Sure, here is the sycl-ls output:
snarayanan@intel-eagle:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Genuine Intel(R) CPU 0000%@ OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.43.27642.40]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.27642]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-PCIE-40GB 8.0 [CUDA 12.4]