Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Can multiple MPI ranks queue work to one Xe GPU (i7-1185G7)?

TedW
Beginner

I'm testing incrementally with a C routine called by 4 MPI worker ranks (called from Fortran code), each of which offloads computational tasks using queues, and it eventually fails with a segfault during host and GPU selection.  These routines are not yet optimal for the application; they are a test of functionality.  They are called a great many times, succeed for a while, then eventually segfault.

A functionally similar C routine that loads a great deal more work through a queue works perfectly when called once by one thread from one process.

I see advice that one GPU or sub-device should be dedicated to each MPI rank/process, advice which will certainly be followed in the end, but I seem to be unable to partition this Xe GPU (a development laptop, i7-1185G7) in the ways suggested.

Is there a way to use this GPU with multiple MPI ranks/processes for testing and development of the code right now, on the little machine I have in hand, or is that futile?  Or is there something I'm likely missing?

Many thanks!

1 Solution
Gregg_S_Intel
Employee

Don't try to create multiple devices.  Submit all the work to default device 0.

 

Your device code references host pointers.

 

sh_g0_1[jk] = (p0l_1[jk+16] - p0l_1[jk]) / (rho0_1[jk] * dx);


Allocate device memory "malloc_device<float>(size, q)" for these arrays, and q.memcpy them to the device.

View solution in original post

9 Replies
RabiyaSK_Intel
Employee

Hi,


Thanks for posting in Intel Communities.


>>>I'm testing incrementally with a C routine called by 4 MPI worker ranks (called from Fortran code) which each offload computational tasks using queues, and it eventually fails with segfault in host and gpu selection

Could you please confirm if you are offloading computational tasks with CUDA or SYCL?


Could you also provide the details below, so that we can reproduce your issue at our end:

1. The sample reproducer along with steps to reproduce

2. CPU, OS and hardware details

3. Intel oneAPI HPC Toolkit and Intel MPI version


Thanks & Regards,

Shaik Rabiya


TedW
Beginner

Hi Shaik,

To highlight: my question is about development on a small Xe, and whether partitioning or emulation can make it suitable for that.  Here are the details:

1. The sample reproducer along with steps to reproduce

Attached, with the 3 files in the working dir:  ./compile_mpi_gpu (renamed below with a ".txt" extension to allow attaching)

+ stdout/err from a failure

2. CPU, OS and hardware details

Model name: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz, 4 cores (8 threads), Dell XPS 13 laptop, 96-EU Xe

Ubuntu 20.04.4 LTS

3. Intel oneAPI HPC Toolkit and Intel MPI version

ted@ted-XPS-13-9310:~/srend/gpu/code$ icx -v
Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)

ted@ted-XPS-13-9310:~/srend/gpu/code$ ifx -v
ifx version 2023.1.0

ted@ted-XPS-13-9310:~/srend/gpu/code$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.9 Build 20230307 (id: d82b3071db)

 

I've tried several high-level ways to partition, like:

CreateMultipleRootDevices=4 NEOReadDebugKeys=1 SYCL_DEVICE_FILTER=level_zero sycl-ls

Yet partitioning does not take effect; as reported, I see:

ted@ted-XPS-13-9310:~/srend/gpu/code$ mpirun -n 4 ./mpi_gpu
[0] MPI startup(): ===== GPU pinning on ted-XPS-13-9310 =====
[0] MPI startup(): Rank Pin tile
[0] MPI startup(): 0 {0}
[0] MPI startup(): 1 {0}
[0] MPI startup(): 2 {0}
[0] MPI startup(): 3 {0}
[0] MPI startup(): Intel(R) MPI Library, Version 2021.9 Build 20230307 (id: d82b3071db)

The attached code is stripped down to something small with functionally the same kind of errors.  If this kind of development can't be done with this particular processor (others I work with have one), that would be good to know so we can look for something else.  Similar code with 1 process has been fine for 2 years doing functionally the same thing: q.submit from C/SYCL/DPC++ in a C routine called from Fortran code.

Many thanks!

Ted

TedW
Beginner

An error to correct. I stripped off too much.

Need this before the malloc_shared:

ivecw = 32;

 

But the fault and exit happen on q.submit before that matters.

 

I've attached gpu.cpp with that line corrected.

 

Ted

Gregg_S_Intel
Employee

Don't try to create multiple devices.  Submit all the work to default device 0.

 

Your device code references host pointers.

 

sh_g0_1[jk] = (p0l_1[jk+16] - p0l_1[jk]) / (rho0_1[jk] * dx);


Allocate device memory "malloc_device<float>(size, q)" for these arrays, and q.memcpy them to the device.
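
For illustration, here is a minimal sketch of that pattern, assuming float host arrays of length n and a queue on the default device (the function name, n, and the d_ names are placeholders, not taken from the attached code):

#include <sycl/sycl.hpp>

// Sketch only: d_ prefixes mark device-side copies of the host arrays.
void compute_on_device(const float *p0l_1, const float *rho0_1,
                       float *sh_g0_1, size_t n, float dx) {
  sycl::queue q;  // default device 0

  float *d_p0l  = sycl::malloc_device<float>(n, q);
  float *d_rho0 = sycl::malloc_device<float>(n, q);
  float *d_sh   = sycl::malloc_device<float>(n, q);

  // Copy the host data to the device before launching the kernel
  q.memcpy(d_p0l,  p0l_1,  n * sizeof(float));
  q.memcpy(d_rho0, rho0_1, n * sizeof(float));
  q.wait();

  // The kernel now touches only device pointers
  q.parallel_for(sycl::range<1>{n - 16}, [=](sycl::id<1> jk) {
    d_sh[jk] = (d_p0l[jk + 16] - d_p0l[jk]) / (d_rho0[jk] * dx);
  }).wait();

  // Copy the result back to the host
  q.memcpy(sh_g0_1, d_sh, n * sizeof(float)).wait();

  sycl::free(d_p0l, q);
  sycl::free(d_rho0, q);
  sycl::free(d_sh, q);
}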

TedW
Beginner

MPI works with that change, and with other methods.  This is helpful!  I was confused over what [=] does and does not do on q.submit, and over the extras that ifx and icpx do.  Apparently, an "!$omp allocate allocator(omp_target_shared_mem_alloc)" directive on an allocatable array in the Fortran code carries enough information to a called C routine that the array is already known to be in device-accessible memory when passed by reference, with no copy-in; an ordinary Fortran array passed by reference to C does not get that special treatment.  I had wrongly assumed Fortran arrays were special cases.  Thanks!

RabiyaSK_Intel
Employee

Hi,


Glad to know that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Shaik Rabiya


Gregg_S_Intel
Employee

Device data is associated with a specific queue, which is why the queue is passed to malloc_device and malloc_shared.  The Fortran OpenMP runtime doesn't know about your SYCL queue; OpenMP creates its own queue.

Although we can mix OpenMP Offload with SYCL in a program, we cannot combine them.  That is, OpenMP device data is not available to a SYCL queue and vice-versa.
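
As a minimal sketch of that separation (the function name and size are hypothetical), each runtime allocates and frees its own device memory, and a pointer from one runtime should not be handed to the other:

#include <omp.h>
#include <sycl/sycl.hpp>

// Illustration only: each runtime owns its own device allocations.
void two_runtimes(size_t n) {
  // OpenMP device allocation, managed by the OpenMP runtime
  int dev = omp_get_default_device();
  float *omp_buf = static_cast<float *>(omp_target_alloc(n * sizeof(float), dev));

  // SYCL device allocation, tied to this specific SYCL queue
  sycl::queue q;
  float *sycl_buf = sycl::malloc_device<float>(n, q);

  // Do not pass omp_buf to a SYCL kernel, or sycl_buf to an OpenMP target
  // region; that interaction is unspecified.

  sycl::free(sycl_buf, q);
  omp_target_free(omp_buf, dev);
}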

TedW
Beginner

It appears that the !$omp allocate allocator(omp_target_shared_mem_alloc) directive in Fortran means the array can be passed by reference and then used directly in a C SYCL queue.

I've attached an example which does this with #define CUSM 0.  (The script compile_mpi_gpu.txt needed the .txt extension to allow attaching.)

It works the same way in other code for volume rendering, Fortran to C SYCL code, producing nice images quickly.

This seems to be an impressively wonderful feature.  I was hoping to exploit this one a lot. 

Ted

Gregg_S_Intel
Employee

The official statement is, "The direct interaction between OpenMP and SYCL runtime libraries are not supported at this time. For example, a device memory object created by OpenMP API is not accessible by SYCL code. That is, using the device memory object created by OpenMP in SYCL code results unspecified execution behavior."

https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2023-1/c-c-openmp-and-sycl-composability.htm 

 

The memcpy in gpu.cpp should be q.memcpy.  It works anyway, because this is malloc_shared memory, but q.memcpy would be more efficient.

Using malloc_shared is okay where necessary in a complex code, but in general malloc_device is preferable for performance.
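
A small sketch of the difference, assuming a queue q and n floats of host data in src (the names are placeholders):

#include <sycl/sycl.hpp>
#include <cstring>

void usm_styles(sycl::queue &q, const float *src, size_t n) {
  // malloc_shared: the host can dereference the pointer, so a plain
  // std::memcpy works, though q.memcpy would usually be more efficient.
  float *shared = sycl::malloc_shared<float>(n, q);
  std::memcpy(shared, src, n * sizeof(float));

  // malloc_device: the host must not dereference the pointer; q.memcpy is
  // the way to move data, and kernels generally perform best against it.
  float *device = sycl::malloc_device<float>(n, q);
  q.memcpy(device, src, n * sizeof(float)).wait();

  sycl::free(shared, q);
  sycl::free(device, q);
}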
