Intel® Fortran Compiler

GPU-aware MPI not working with IFX on Intel GPU

caplanr
New Contributor II

Hi,

I am testing the code POT3D (github.com/predsci/pot3d) to see if it can run on an Intel B580 GPU.  POT3D is a Fortran code that uses "do concurrent" for offload, along with OpenMP target directives for data movement.  I have previously been successful at running a similar code (HipFT) on a B580.
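
For reference, the offload pattern is essentially the following (a minimal self-contained sketch for illustration only, not the actual POT3D source; the array name and sizes are made up):

program dc_offload_sketch
  implicit none
  integer, parameter :: np = 8, nt = 8
  integer :: j, k
  real :: a(np,nt)
  a = 1.0
  ! Map the array to the device up front; with
  ! -fopenmp-do-concurrent-maptype-modifier=present the loop below
  ! expects 'a' to already be mapped on the device.
  !$omp target enter data map(to: a)
  ! With -fopenmp-target-do-concurrent this loop is offloaded to the GPU.
  do concurrent (k=1:nt, j=1:np)
     a(j,k) = a(j,k) + 1.0
  end do
  !$omp target exit data map(from: a)
  print *, 'sum = ', sum(a)
end program dc_offload_sketch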

I am building using the intel_gpu_psi.conf configuration file that uses mpiifx with:

-O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -fopenmp-do-concurrent-maptype-modifier=present

I am using IFX version:  2025.2.2 20251210 on Ubuntu 24.04.3 LTS with kernel 6.14.0-37-generic

The code uses GPU-aware MPI calls and sets the pointers to the device versions.  An example of this is:

!$omp target data use_device_addr(a)
call MPI_Isend (a(:,:,np-1),lbuf,ntype_real,iproc_pp,tag, comm_all,reqs(1),ierr)
call MPI_Isend (a(:,:, 2),lbuf,ntype_real,iproc_pm,tag, comm_all,reqs(2),ierr)
call MPI_Irecv (a(:,:, 1),lbuf,ntype_real,iproc_pm,tag, comm_all,reqs(3),ierr)
call MPI_Irecv (a(:,:,np),lbuf,ntype_real,iproc_pp,tag, comm_all,reqs(4),ierr)
call MPI_Waitall (4,reqs,MPI_STATUSES_IGNORE,ierr)
!$omp end target data

The code compiles fine, but when I try to run it, I get:

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 366896 RUNNING AT 
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

If I try to activate GPU-aware MPI with 

export I_MPI_OFFLOAD=1

the code just hangs.  If I Ctrl-C it, I get:

forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
libc.so.6 0000763A39845330 Unknown Unknown Unknown
libc.so.6 0000763A3990E80B __sched_yield Unknown Unknown
libze_intel_gpu.s 0000763A35386FB6 Unknown Unknown Unknown
libze_intel_gpu.s 0000763A34F9C927 Unknown Unknown Unknown
libomptarget.so 0000763A3C68B526 Unknown Unknown Unknown
libomptarget.so 0000763A3C6B60BC Unknown Unknown Unknown
libomptarget.so 0000763A3C512BB8 Unknown Unknown Unknown
libomptarget.so 0000763A3C51A465 Unknown Unknown Unknown
libomptarget.so 0000763A3C51E4AB Unknown Unknown Unknown
libomptarget.so 0000763A3C4D5FA7 Unknown Unknown Unknown
libomptarget.so 0000763A3C4EF0F1 Unknown Unknown Unknown
libomptarget.so 0000763A3C4DC9A1 __tgt_target_kern Unknown Unknown
pot3d 0000000000435551 Unknown Unknown Unknown
pot3d 0000000000434685 Unknown Unknown Unknown
pot3d 0000000000430497 Unknown Unknown Unknown
pot3d 00000000004155D6 Unknown Unknown Unknown
pot3d 000000000040D71D Unknown Unknown Unknown
libc.so.6 0000763A3982A1CA Unknown Unknown Unknown
libc.so.6 0000763A3982A28B __libc_start_main Unknown Unknown
pot3d 000000000040D635 Unknown Unknown Unknown

One issue I can think of is that I use MPI calls with CPU arrays as well as with GPU arrays, with all the GPU MPI calls using use_device_addr.  Could it be that the I_MPI_OFFLOAD environment variable is "all or nothing", so that either my CPU or my GPU MPI calls end up being wrong?
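
To illustrate the mix (these lines are made up for illustration, not the actual calls in the code): some MPI calls operate on plain host arrays, while the halo exchanges pass device addresses:

! Host-side reduction on a plain CPU array (illustrative only; 'hsum' is made up):
call MPI_Allreduce (MPI_IN_PLACE, hsum, 1, ntype_real, MPI_SUM, comm_all, ierr)

! GPU-aware exchange passing the device address of a mapped array:
!$omp target data use_device_addr(a)
call MPI_Isend (a(:,:,2), lbuf, ntype_real, iproc_pm, tag, comm_all, reqs(1), ierr)
!$omp end target data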

Note also that if I swap in the source file from the subfolder src/no_gpu_mpi/, which manually copies the GPU data back and forth around the MPI calls, then the code runs correctly (but is slower than it should be due to the manual transfers).
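
(That fallback does roughly the following around each exchange; this is a sketch of the idea, not the exact src/no_gpu_mpi/ source:)

! Stage the array through the host around the MPI calls
! instead of passing device addresses.
!$omp target update from(a)      ! copy device -> host
call MPI_Isend (a(:,:,np-1), lbuf, ntype_real, iproc_pp, tag, comm_all, reqs(1), ierr)
call MPI_Irecv (a(:,:,np),   lbuf, ntype_real, iproc_pp, tag, comm_all, reqs(2), ierr)
call MPI_Waitall (2, reqs, MPI_STATUSES_IGNORE, ierr)
!$omp target update to(a)        ! copy host -> device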

This points to an issue with passing the GPU arrays to the MPI calls.

Also note that even though I am only running on 1 GPU, the MPI calls are still exercised because the periodic domain seam uses MPI, as do some other routines.

Thanks!

 - Ron Caplan

3 Replies
caplanr
New Contributor II

Hi,

It has been a while since this post.  The issue remains.

Any idea on how to proceed?

 - Ron

caplanr
New Contributor II

Hi,

Here is some more information on reproducing this problem.

The code can be obtained at:

github.com/predsci/pot3d

Install and activate the Intel HPC SDK 2025.2.2 (2025.3 has a bug; that is a separate forum post).

You then need an hdf5 library compiled with the Intel compiler (a version before 2.0.0; version 1.14.3 is known to work).

To build for the Intel GPU, modify the file "conf/intel_gpu_psi.conf" to point to your installation of hdf5.  Then, run:

./build.sh conf/intel_gpu_psi.conf

You can then go to the "examples/potential_field_source_surface" folder and run the code with:

mpiexec -np 1 ../../bin/pot3d

For me, the run begins and then seg faults with:

### COMMENT from POT3D:
### Starting PCG solve.

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 321062 RUNNING AT matana
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

 - Ron

caplanr
New Contributor II

Hi,

Update: with the following ENV variables, the code runs correctly:

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export I_MPI_OFFLOAD=1
export I_MPI_OFFLOAD_SYMMETRIC=1
export I_MPI_OFFLOAD_TOPOLIB=none
export I_MPI_OFFLOAD_DOMAIN_SIZE=1
#export LIBOMPTARGET_DEVICES=SUBDEVICE

(The last one is needed for Max 1550 GPUs, but on my single B580 it makes the run not work, so it stays commented out.)

This works on the 2025.2 compiler and the new 2026.0 compiler.

 - Ron
