Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Newest version of IFX seg faults on run that used to work

caplanr
New Contributor II

Hi,

 

I recently updated my Intel driver and OneAPI HPC SDK.

I am running on an Intel Arc B580 GPU.

My system has Ubuntu 22.04 with kernel: 6.13.12-zabbly+

My IFX version is:  ifx (IFX) 2025.2.1 20250806

My MPI is /opt/intel/oneapi/mpi/2021.16

 

I am running the code HipFT which can be obtained here:  github.com/predsci/hipft

I am using the latest commit.

 

The compiler flags I am using are:

-O3 -xHost -fp-model precise -fopenmp-target-do-concurrent -fiopenmp

-fopenmp-targets=spir64 -fopenmp-do-concurrent-maptype-modifier=present

 

The code uses Fortran's "do concurrent" to offload computation to the GPU, and uses OpenMP Target directives to manually manage the CPU-GPU data transfers.  See https://ieeexplore.ieee.org/document/10820592 for details.
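For readers unfamiliar with the pattern, here is a minimal stand-alone sketch (illustrative names only, not HipFT code) of combining "do concurrent" offload with OpenMP target data directives:

```fortran
! Minimal sketch of the HipFT-style pattern: OpenMP directives manage
! device residency, do concurrent carries the compute.
program dc_offload
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 1024
  real(real64), allocatable :: f(:)
  integer :: i

  allocate (f(n))
  f = 1.0_real64
!$omp target enter data map(to:f)
  ! With -fopenmp-target-do-concurrent this loop is offloaded to the GPU.
  do concurrent (i=1:n)
    f(i) = 2.0_real64*f(i)
  end do
!$omp target update from(f)
!$omp target exit data map(delete:f)
  print *, f(n)
  deallocate (f)
end program dc_offload
```

Compiled with flags like those above, the loop body runs on the device while the map/update directives control when data actually moves.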

 

The code compiles fine.

 

The code runs correctly for the testsuite runs included in the git repo (these are small runs).

 

However, for the example run in "/examples/flux_transport_1rot_flowAa_diff_r8" the code seg faults with:

 

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
libc.so.6          00007F2C41042520  Unknown            Unknown  Unknown
hipft              000000000042821D  Unknown            Unknown  Unknown
hipft              000000000041E8F4  Unknown            Unknown  Unknown
hipft              000000000040FAFE  Unknown            Unknown  Unknown
hipft              000000000040E8BA  Unknown            Unknown  Unknown
hipft              000000000040E72D  Unknown            Unknown  Unknown
libc.so.6          00007F2C41029D90  Unknown            Unknown  Unknown
libc.so.6          00007F2C41029E40  __libc_start_main  Unknown  Unknown
hipft              000000000040E645  Unknown            Unknown  Unknown

 

This run uses more VRAM than the testsuite runs, but otherwise is running the same algorithms.

 

The run used to work as shown in the paper linked above (it also works on NVIDIA GPUs with the same amount of VRAM).

 

To reproduce this, compile the HDF5 library with the IFX compiler and then set the paths to the library in the build configuration file (see conf/intel_gpu_psi.conf for an example).

Then, build the code with "build.sh <CONFFILE>"; the example can then be run from the examples/flux_transport_1rot_flowAa_diff_r8 directory with "mpiexec -np 1 ../bin/hipft".

 

Thanks!

 

 - Ron

 

 

1 Solution
caplanr
New Contributor II

Hi,

 

That sounds like a reasonable explanation.   

I am not sure what changed between the previous and current compiler+driver versions to alter this behavior.

 

On another note, I can now run my codes!

 

Thanks to help from Intel folks over e-mail, it turned out the fix was setting the stack limit to unlimited with "ulimit -s unlimited".
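For anyone hitting the same thing, the workaround is just (assuming a bash-like shell; the change only affects the current shell session):

```shell
ulimit -s            # show the current soft stack limit (often 8192 kB)
ulimit -s unlimited  # remove the limit for this shell session
```

Then launch the run (e.g. "mpiexec -np 1 ../bin/hipft") from the same shell.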

 

I tried the code+run on Stampede3, which has Intel Max 1550 GPUs, and it works fine on both the 2025.2 and the older 2025.0 compilers without needing to modify the stack.

 

So it looks like it is either an issue with the drivers+compiler specifically for the Arc GPUs, or some system issue that only occurred after updating to the newest compiler+driver.

 

I plan to eventually update that system's OS from Ubuntu 22.04 with a zabbly kernel to Ubuntu 24.04 with a standard kernel, and will give it a try there without modifying the stack limit.

 

In the meantime, I can run the code again, so this will work for now (although it runs ~15% slower than before).

 

I will mark this as "accepted solution" for now and post an update whenever I get around to upgrading the system.

 

 - Ron

18 Replies
Ron_Green
Moderator

Could you compile and link with the options "-g -traceback"? This should provide a stack traceback down to a source line; the exact source line causing the fault may be informative.

And since this is a new failure, can you provide the previous compiler and Intel MPI versions?

And since this is offload code, set the environment variable

LIBOMPTARGET_DEBUG=2

or "=1"

This will tell us whether the problem is on the device rather than on the CPU.
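For reference, the suggested setting as a shell command (value as given in this thread):

```shell
# OpenMP offload runtime debug output: 2 = verbose, 1 = less detail
export LIBOMPTARGET_DEBUG=2
```

Set it in the shell you launch the run from; the runtime prints its plugin and device-discovery steps to stderr.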

 

caplanr
New Contributor II

Hi,

 
I get the following:
 
It seems it cannot find the device?
 
On the smaller runs, where it gets the correct answer and does not seg fault, it also prints these messages.
 
 - Ron
 
mpiexec -np 1 hipft
omptarget --> Init offload library!
OMPT --> Entering connectLibrary
OMPT --> OMPT: Trying to load library libiomp5.so
OMPT --> OMPT: Trying to get address of connection routine ompt_libomp_connect
OMPT --> OMPT: Library connection handle = 0x7fe9a6116740
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fe9a6103f00 0x00007fe9a6103800
OMPT --> Exiting connectLibrary
omptarget --> Loading RTLs...
omptarget --> Adding all nextgen plugins
omptarget --> Adding nextgen 'level_zero' plugin
omptarget --> Adding nextgen 'host' plugin
omptarget --> RTLs loaded!
TARGET LEVEL_ZERO RTL --> Level0 NG plugin initialization
TARGET LEVEL_ZERO RTL --> ONEAPI_DEVICE_SELECTOR specified 0 root devices
TARGET LEVEL_ZERO RTL -->   (Accept/Discard [T/F] DeviceID[.SubID[.CCSID]]) -2(all), -1(ignore)
TARGET LEVEL_ZERO RTL --> Looking for Level0 devices...
TARGET LEVEL_ZERO RTL --> ZE_CALLER: zeInit ( ZE_INIT_FLAG_GPU_ONLY )
TARGET LEVEL_ZERO RTL --> ZE_CALLEE: zeInit (
TARGET LEVEL_ZERO RTL -->     flags = 1
TARGET LEVEL_ZERO RTL --> )
TARGET LEVEL_ZERO RTL --> Trying to load libze_loader.so
TARGET LEVEL_ZERO RTL --> Unable to load library 'libze_loader.so': libze_loader.so: cannot open shared object file: No such file or directory!
TARGET LEVEL_ZERO RTL --> Error: findDevices:zeInit failed with error code 2147483646, ZE_RESULT_ERROR_UNKNOWN
omptarget --> Registered plugin LEVEL_ZERO with 0 visible device(s)
omptarget --> Skipping plugin LEVEL_ZERO with no visible devices
PluginInterface --> Failure to check validity of image 0x326bceb0: Only executable ELF files are supportedomptarget --> No RTL found for image 0x0000000000563ea0!
omptarget --> Done registering entries!
 
 
       _    _ _       ______ _______
      | |  | (_)     |  ____|__   __|
      | |__| |_ _ __ | |__     | |
      |  __  | | '_ \|  __|    | |
      | |  | | | |_) | |       | |
      |_|  |_|_| .__/|_|       |_|
               | |
               |_|
 
      Version: 1.19.3 of 09/02/2025
 
  ****** HipFT: High Performance Flux Transport.
 
         Authors:  Ronald M. Caplan
                   Miko M. Stulajter
                   Jon A. Linker
                   Zoran Mikic
 
         Predictive Science Inc.
         www.predsci.com
         San Diego, California, USA 92121
 
 
 Number of MPI ranks total:                1
 Number of MPI ranks per node:             1
 
 Run started at:
 
 15 September 2025   4:10:20.277 PM      
 --> Reading input file...
 --> Initializing realization parameters...
 --> Setting up output directories...
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Default TARGET OFFLOAD policy is now disabled (no devices were found)
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 9 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 9 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
 --> Loading initial condition...
omptarget --> Entering data begin region for device 0 with 54 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering target region for device 0 with entry point 0x000000000050d094
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering target region for device 0 with entry point 0x000000000050fd6a
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data update region for device 0 with 1 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data end region for device 0 with 2 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source            
libc.so.6          00007FE9A1042520  Unknown               Unknown  Unknown
hipft              00000000004278DD  write_map                2869  hipft.f90
hipft              000000000041E11A  load_initial_cond        2185  hipft.f90
hipft              000000000040FA7B  setup                    1221  hipft.f90
hipft              000000000040E8BD  hipft                     869  hipft.f90
hipft              000000000040E72D  Unknown               Unknown  Unknown
libc.so.6          00007FE9A1029D90  Unknown               Unknown  Unknown
libc.so.6          00007FE9A1029E40  __libc_start_main     Unknown  Unknown
hipft              000000000040E645  Unknown               Unknown  Unknown
omptarget --> Unloading target library!
omptarget --> No RTLs in use support the image 0x0000000000563ea0!
omptarget --> Done unregistering images!
omptarget --> Translation table for descriptor 0x00000000326ba860 cannot be found, probably it has been already removed.
omptarget --> Done unregistering library!
omptarget --> Deinit offload library!
omptarget --> Clearing Interop Table
omptarget --> Clearing Async Pending Table
omptarget --> Unloading RTLs...
omptarget --> Clearing Interop Table
omptarget --> Clearing Async Pending Table
TARGET LEVEL_ZERO RTL --> Deinit Level0 plugin!
omptarget --> RTLs unloaded!
caplanr
New Contributor II

Hi,

 

My replies seem to not be going through; the copied code output apparently trips some HTML filter.

 

To summarize, after a long rabbit hole, the flags and environment variable you provided showed that the device was not being detected by the OpenMP runtime.

I do not know how the small tests worked - maybe they defaulted to the CPU?

 

Anyway, I purged everything and reinstalled the driver and oneAPI.

 

I can now see the device not only in clinfo, but also in sycl-ls (which also used to seg fault):

 

sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Graphics [0xe20b] 20.1.0 [1.6.31294+20]
[opencl:cpu][opencl:0] Intel(R) OpenCL, AMD Ryzen 7 9700X 8-Core Processor OpenCL 3.0 (Build 0) [2025.20.8.0.06_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Graphics [0xe20b] OpenCL 3.0 NEO [24.39.31294]

 

lspci | grep VGA
0f:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Arc B580]
19:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Granite Ridge [Radeon Graphics] (rev c5)

 

When I run the code now, I see:

 

TARGET LEVEL_ZERO RTL --> Found a GPU device, Name = Intel(R) Graphics [0xe20b]
TARGET LEVEL_ZERO RTL --> Found 1 root devices, 1 total devices.
TARGET LEVEL_ZERO RTL --> List of devices (DeviceID[.SubID[.CCSID]])
TARGET LEVEL_ZERO RTL --> -- 0
TARGET LEVEL_ZERO RTL --> Root Device Information
TARGET LEVEL_ZERO RTL --> Device 0
TARGET LEVEL_ZERO RTL --> -- Name                         : Intel(R) Graphics [0xe20b]
TARGET LEVEL_ZERO RTL --> -- PCI ID                       : 0xe20b

 

So it looks like the device is detected, but the code still seg faults (although differently):

 

TARGET LEVEL_ZERO RTL --> ZE_CALLER: zeMemFree ( Context, Info.Base )
TARGET LEVEL_ZERO RTL --> ZE_CALLEE: zeMemFree (
TARGET LEVEL_ZERO RTL -->     hContext = 0x0000000009a0ce30
TARGET LEVEL_ZERO RTL -->     ptr = 0xffffd556af800000
TARGET LEVEL_ZERO RTL --> )
TARGET LEVEL_ZERO RTL --> Deleted device memory 0xffffd556af800000 (Base: 0xffffd556af800000, Size: 33587200)
omptarget --> Notifying about an unmapping: HstPtr=0x00007f1122fe7600
omptarget --> Removing map entry with HstPtrBegin=0x00007ffd43038ae0, TgtPtrBegin=0xffffd556aa3f0d80, Size=120, Name=write_map_$FTMP
omptarget --> Deleting tgt data 0xffffd556aa3f0d80 of size 120 by freeing allocation starting at 0xffffd556aa3f0d80
PluginInterface --> MemoryManagerTy::free: target memory 0xffffd556aa3f0d80.
PluginInterface --> Cannot find its node. Delete it on device directly.
TARGET LEVEL_ZERO RTL --> ZE_CALLER: zeMemGetAllocProperties ( getContext(DeviceId), Ptr, &properties, nullptr )
TARGET LEVEL_ZERO RTL --> ZE_CALLEE: zeMemGetAllocProperties (
TARGET LEVEL_ZERO RTL -->     hContext = 0x0000000009a0ce30
TARGET LEVEL_ZERO RTL -->     ptr = 0xffffd556aa3f0d80
TARGET LEVEL_ZERO RTL -->     pMemAllocProperties = 0x00007ffd43037e20
TARGET LEVEL_ZERO RTL -->     phDevice = 0x0000000000000000
TARGET LEVEL_ZERO RTL --> )
omptarget --> Notifying about an unmapping: HstPtr=0x00007ffd43038ae0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 5777 RUNNING AT matana
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

 

In the routine write_map() there is a sequence of:

      allocate (ftmp(ntm,npm,nr))
!$omp target enter data map(alloc:ftmp)
!$omp target update from(ftmp)
!$omp target exit data map(delete:ftmp)
      deallocate (ftmp)

 

I do not see anything wrong with this, and it was working before (and on NVIDIA).
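One way to narrow this down (a sketch, not HipFT code - the array name and the sizes are borrowed from this run) is to run the same sequence stand-alone with allocation status checks and progress prints, so the faulting step becomes visible:

```fortran
! Stand-alone version of the write_map() sequence with status checks,
! to show which step (host allocate, device alloc, update, or delete)
! actually faults.
program write_map_check
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: ntm = 512, npm = 1024, nr = 8
  real(real64), allocatable :: ftmp(:,:,:)
  integer :: ierr
  character(200) :: emsg

  allocate (ftmp(ntm,npm,nr), stat=ierr, errmsg=emsg)
  if (ierr /= 0) then
    print *, 'host allocate failed: ', trim(emsg)
    stop 1
  end if
  print *, 'host allocate ok'
!$omp target enter data map(alloc:ftmp)
  print *, 'device alloc ok'
!$omp target update from(ftmp)
  print *, 'update from device ok'
!$omp target exit data map(delete:ftmp)
  print *, 'device delete ok'
  deallocate (ftmp)
end program write_map_check
```

Built with the same offload flags, the last line printed before the crash would point at the failing directive.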

 

Perhaps the seg fault is in code past this routine? 

Or maybe it is still a driver issue?

 

 - Ron

 

caplanr
New Contributor II

Running that exposed conflicts between the Intel drivers from Ubuntu 22.04 and what oneAPI needed for GPU detection.

I am not sure how the smaller tests worked.

 

I went down a rabbit hole with drivers etc., even having to make my own symlink for /lib/x86_64-linux-gnu/libze_loader.so.

 

I can get clinfo to see the card, and an OpenCL benchmark runs.

 

But when I try to run my code I get:

 

flux_transport_1rot_flowAa_diff_r8 $ mpiexec -np 1 hipft
omptarget --> Init offload library!
OMPT --> Entering connectLibrary
OMPT --> OMPT: Trying to load library libiomp5.so
OMPT --> OMPT: Trying to get address of connection routine ompt_libomp_connect
OMPT --> OMPT: Library connection handle = 0x7f8490716740
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f8490703f00 0x00007f8490703800
OMPT --> Exiting connectLibrary
omptarget --> Loading RTLs...
omptarget --> Adding all nextgen plugins
omptarget --> Adding nextgen 'level_zero' plugin
omptarget --> Adding nextgen 'host' plugin
omptarget --> RTLs loaded!
TARGET LEVEL_ZERO RTL --> Level0 NG plugin initialization
TARGET LEVEL_ZERO RTL --> ONEAPI_DEVICE_SELECTOR specified 0 root devices
TARGET LEVEL_ZERO RTL --> (Accept/Discard [T/F] DeviceID[.SubID[.CCSID]]) -2(all), -1(ignore)
TARGET LEVEL_ZERO RTL --> Looking for Level0 devices...
TARGET LEVEL_ZERO RTL --> ZE_CALLER: zeInit ( ZE_INIT_FLAG_GPU_ONLY )
TARGET LEVEL_ZERO RTL --> ZE_CALLEE: zeInit (
TARGET LEVEL_ZERO RTL --> flags = 1
TARGET LEVEL_ZERO RTL --> )
TARGET LEVEL_ZERO RTL --> Trying to load libze_loader.so
TARGET LEVEL_ZERO RTL --> Implementing zeInit with dlsym(zeInit) -> 0x7f848b85dc20
TARGET LEVEL_ZERO RTL --> Implementing zeDriverGet with dlsym(zeDriverGet) -> 0x7f848b85dd20
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGet with dlsym(zeDeviceGet) -> 0x7f848b85e140
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetSubDevices with dlsym(zeDeviceGetSubDevices) -> 0x7f848b85e200
TARGET LEVEL_ZERO RTL --> Implementing zeModuleCreate with dlsym(zeModuleCreate) -> 0x7f848b860560
TARGET LEVEL_ZERO RTL --> Implementing zeModuleGetProperties with dlsym(zeModuleGetProperties) -> 0x7f848b860860
TARGET LEVEL_ZERO RTL --> Implementing zeModuleBuildLogDestroy with dlsym(zeModuleBuildLogDestroy) -> 0x7f848b860680
TARGET LEVEL_ZERO RTL --> Implementing zeModuleBuildLogGetString with dlsym(zeModuleBuildLogGetString) -> 0x7f848b8606e0
TARGET LEVEL_ZERO RTL --> Implementing zeModuleGetKernelNames with dlsym(zeModuleGetKernelNames) -> 0x7f848b860800
TARGET LEVEL_ZERO RTL --> Implementing zeModuleDestroy with dlsym(zeModuleDestroy) -> 0x7f848b8605c0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendBarrier with dlsym(zeCommandListAppendBarrier) -> 0x7f848b85ef80
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendLaunchKernel with dlsym(zeCommandListAppendLaunchKernel) -> 0x7f848b860da0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendLaunchCooperativeKernel with dlsym(zeCommandListAppendLaunchCooperativeKernel) -> 0x7f848b860e00
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemoryCopy with dlsym(zeCommandListAppendMemoryCopy) -> 0x7f848b85f0a0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemoryCopyRegion with dlsym(zeCommandListAppendMemoryCopyRegion) -> 0x7f848b85f170
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemoryFill with dlsym(zeCommandListAppendMemoryFill) -> 0x7f848b85f100
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemoryPrefetch with dlsym(zeCommandListAppendMemoryPrefetch) -> 0x7f848b85f410
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemAdvise with dlsym(zeCommandListAppendMemAdvise) -> 0x7f848b85f470
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListClose with dlsym(zeCommandListClose) -> 0x7f848b85ec20
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListCreate with dlsym(zeCommandListCreate) -> 0x7f848b85eb00
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListCreateImmediate with dlsym(zeCommandListCreateImmediate) -> 0x7f848b85eb60
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListDestroy with dlsym(zeCommandListDestroy) -> 0x7f848b85ebc0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListReset with dlsym(zeCommandListReset) -> 0x7f848b85ec80
TARGET LEVEL_ZERO RTL --> Implementing zeCommandQueueCreate with dlsym(zeCommandQueueCreate) -> 0x7f848b85e8c0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandQueueDestroy with dlsym(zeCommandQueueDestroy) -> 0x7f848b85e920
TARGET LEVEL_ZERO RTL --> Implementing zeCommandQueueExecuteCommandLists with dlsym(zeCommandQueueExecuteCommandLists) -> 0x7f848b85e980
TARGET LEVEL_ZERO RTL --> Implementing zeCommandQueueSynchronize with dlsym(zeCommandQueueSynchronize) -> 0x7f848b85e9e0
TARGET LEVEL_ZERO RTL --> Implementing zeContextCreate with dlsym(zeContextCreate) -> 0x7f848b85e740
TARGET LEVEL_ZERO RTL --> Implementing zeContextDestroy with dlsym(zeContextDestroy) -> 0x7f848b85e800
TARGET LEVEL_ZERO RTL --> Implementing zeContextMakeMemoryResident with dlsym(zeContextMakeMemoryResident) -> 0x7f848b860f30
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceCanAccessPeer with dlsym(zeDeviceCanAccessPeer) -> 0x7f848b85e620
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetProperties with dlsym(zeDeviceGetProperties) -> 0x7f848b85e260
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetCommandQueueGroupProperties with dlsym(zeDeviceGetCommandQueueGroupProperties) -> 0x7f848b85e380
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetComputeProperties with dlsym(zeDeviceGetComputeProperties) -> 0x7f848b85e2c0
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetMemoryProperties with dlsym(zeDeviceGetMemoryProperties) -> 0x7f848b85e3e0
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetCacheProperties with dlsym(zeDeviceGetCacheProperties) -> 0x7f848b85e4a0
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetGlobalTimestamps with dlsym(zeDeviceGetGlobalTimestamps) -> 0x7f848b85e6e0
TARGET LEVEL_ZERO RTL --> Implementing zeDriverGetApiVersion with dlsym(zeDriverGetApiVersion) -> 0x7f848b85df00
TARGET LEVEL_ZERO RTL --> Implementing zeDriverGetExtensionFunctionAddress with dlsym(zeDriverGetExtensionFunctionAddress) -> 0x7f848b85e080
TARGET LEVEL_ZERO RTL --> Implementing zeDriverGetExtensionProperties with dlsym(zeDriverGetExtensionProperties) -> 0x7f848b85e020
TARGET LEVEL_ZERO RTL --> Implementing zeEventCreate with dlsym(zeEventCreate) -> 0x7f848b85f590
TARGET LEVEL_ZERO RTL --> Implementing zeEventDestroy with dlsym(zeEventDestroy) -> 0x7f848b85f5f0
TARGET LEVEL_ZERO RTL --> Implementing zeEventHostReset with dlsym(zeEventHostReset) -> 0x7f848b85fa10
TARGET LEVEL_ZERO RTL --> Implementing zeEventHostSynchronize with dlsym(zeEventHostSynchronize) -> 0x7f848b85f8f0
TARGET LEVEL_ZERO RTL --> Implementing zeEventPoolCreate with dlsym(zeEventPoolCreate) -> 0x7f848b85f4d0
TARGET LEVEL_ZERO RTL --> Implementing zeEventPoolDestroy with dlsym(zeEventPoolDestroy) -> 0x7f848b85f530
TARGET LEVEL_ZERO RTL --> Implementing zeEventQueryKernelTimestamp with dlsym(zeEventQueryKernelTimestamp) -> 0x7f848b85fa70
TARGET LEVEL_ZERO RTL --> Implementing zeFenceCreate with dlsym(zeFenceCreate) -> 0x7f848b85fd20
TARGET LEVEL_ZERO RTL --> Implementing zeFenceDestroy with dlsym(zeFenceDestroy) -> 0x7f848b85fd80
TARGET LEVEL_ZERO RTL --> Implementing zeFenceHostSynchronize with dlsym(zeFenceHostSynchronize) -> 0x7f848b85fde0
TARGET LEVEL_ZERO RTL --> Implementing zeKernelCreate with dlsym(zeKernelCreate) -> 0x7f848b8608c0
TARGET LEVEL_ZERO RTL --> Implementing zeKernelDestroy with dlsym(zeKernelDestroy) -> 0x7f848b860920
TARGET LEVEL_ZERO RTL --> Implementing zeKernelGetName with dlsym(zeKernelGetName) -> 0x7f848b860d40
TARGET LEVEL_ZERO RTL --> Implementing zeKernelGetProperties with dlsym(zeKernelGetProperties) -> 0x7f848b860ce0
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSetArgumentValue with dlsym(zeKernelSetArgumentValue) -> 0x7f848b860b00
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSetGroupSize with dlsym(zeKernelSetGroupSize) -> 0x7f848b8609e0
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSetIndirectAccess with dlsym(zeKernelSetIndirectAccess) -> 0x7f848b860b60
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSuggestGroupSize with dlsym(zeKernelSuggestGroupSize) -> 0x7f848b860a40
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSuggestMaxCooperativeGroupCount with dlsym(zeKernelSuggestMaxCooperativeGroupCount) -> 0x7f848b860aa0
TARGET LEVEL_ZERO RTL --> Implementing zeMemAllocDevice with dlsym(zeMemAllocDevice) -> 0x7f848b860080
TARGET LEVEL_ZERO RTL --> Implementing zeMemAllocHost with dlsym(zeMemAllocHost) -> 0x7f848b8600e0
TARGET LEVEL_ZERO RTL --> Implementing zeMemAllocShared with dlsym(zeMemAllocShared) -> 0x7f848b860020
TARGET LEVEL_ZERO RTL --> Implementing zeMemFree with dlsym(zeMemFree) -> 0x7f848b860140
TARGET LEVEL_ZERO RTL --> Implementing zeMemGetAddressRange with dlsym(zeMemGetAddressRange) -> 0x7f848b860200
TARGET LEVEL_ZERO RTL --> Implementing zeMemGetAllocProperties with dlsym(zeMemGetAllocProperties) -> 0x7f848b8601a0
TARGET LEVEL_ZERO RTL --> Implementing zeModuleDynamicLink with dlsym(zeModuleDynamicLink) -> 0x7f848b860620
TARGET LEVEL_ZERO RTL --> Implementing zeModuleGetGlobalPointer with dlsym(zeModuleGetGlobalPointer) -> 0x7f848b8607a0
TARGET LEVEL_ZERO RTL --> Implementing zesDeviceEnumMemoryModules with dlsym(zesDeviceEnumMemoryModules) -> 0x7f848b88cea0
TARGET LEVEL_ZERO RTL --> Implementing zesMemoryGetState with dlsym(zesMemoryGetState) -> 0x7f848b88cf60
TARGET LEVEL_ZERO RTL --> Error: findDevices:zeInit failed with error code 2013265921, ZE_RESULT_ERROR_UNINITIALIZED
omptarget --> Registered plugin LEVEL_ZERO with 0 visible device(s)
omptarget --> Skipping plugin LEVEL_ZERO with no visible devices
PluginInterface --> Failure to check validity of image 0x28821eb0: Only executable ELF files are supportedomptarget --> No RTL found for image 0x0000000000563ea0!
omptarget --> Done registering entries!


       _    _ _       ______ _______
      | |  | (_)     |  ____|__   __|
      | |__| |_ _ __ | |__     | |
      |  __  | | '_ \|  __|    | |
      | |  | | | |_) | |       | |
      |_|  |_|_| .__/|_|       |_|
               | |
               |_|

Version: 1.19.3 of 09/02/2025

****** HipFT: High Performance Flux Transport.

Authors: Ronald M. Caplan
Miko M. Stulajter
Jon A. Linker
Zoran Mikic

Predictive Science Inc.
www.predsci.com
San Diego, California, USA 92121


Number of MPI ranks total: 1
Number of MPI ranks per node: 1

Run started at:

15 September 2025 4:38:04.283 PM
--> Reading input file...
--> Initializing realization parameters...
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Default TARGET OFFLOAD policy is now disabled (no devices were found)
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 9 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 9 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
--> Setting up output directories...
--> Loading initial condition...
omptarget --> Entering data begin region for device 0 with 54 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering target region for device 0 with entry point 0x000000000050d094
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering target region for device 0 with entry point 0x000000000050fd6a
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data update region for device 0 with 1 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data end region for device 0 with 2 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 3439053 RUNNING AT matana
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

 

 

Is there a guide for Ubuntu 22.04 to wipe my Intel drivers and oneAPI clean and start from scratch?

 

 

 - Ron

Ron_Green
Moderator

I always suspect the allocate. How is ftmp declared? Is it:

real ( REAL64 ), allocatable :: ftmp(:,:,:) 

And what are the values of ntm, npm, and nr?

 

I am trying to gauge the size of the allocation request.

caplanr
New Contributor II

Hi,

 

It is declared as:

real(r_typ), dimension(:,:,:), allocatable :: ftmp

where

integer, parameter :: r_typ = REAL64

 

The sizes for this example run are:  (512,1024,8)  (32MB)
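As a sanity check of that figure (a throwaway snippet, not from the code):

```fortran
! 512*1024*8 REAL64 elements at 8 bytes each:
program sizechk
  use iso_fortran_env, only: real64, int64
  implicit none
  integer(int64) :: bytes
  bytes = 512_int64 * 1024_int64 * 8_int64 * (storage_size(1.0_real64) / 8)
  print *, bytes   ! 33554432 bytes = 32 MiB
end program sizechk
```

About 32 MiB - nowhere near the B580's 12 GB, so the allocation size alone should not be the problem.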

 

 - Ron

Ron_Green
Moderator

I looked up the B580 and found it has 12 GB of GDDR6:

https://www.intel.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications.html

 

I don't have a B580.  I have an older Integrated Graphics 630.  

Before running, I set

export LIBOMPTARGET_PLUGIN_PROFILE=T

Download the example 'vecadd.f90'. I tried to set up an example that may mimic what you described.

Try scaling up the array ftmp. Mine is 10x10x10. Try your sizes and see if it still works.

 

$ export LIBOMPTARGET_PLUGIN_PROFILE=T

$ ifx -what -V -O2 -xhost -qopenmp -fopenmp-targets=spir64 -I./ vecadd.f90 
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.2.0 Build 20250605
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.

 Intel(R) Fortran 25.0-1485
 Intel(R) Fortran 25.0-1485
 Intel(R) Fortran 25.0-1485
GNU ld (GNU Binutils for Ubuntu) 2.38

$ ./a.out
 ntm           10
 npm           10
 nr           10
 calling vecadd
 Host side allocation success 
 enter data map allocation complete
 ftmp initialized on target
 ftmp updated on target
 done with vecadd
 answer to it all    42.0000000000000     
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) UHD Graphics 630, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_80_4b0d0412_veclib_mp_init__l26
Kernel 1                  : __omp_offloading_80_4b0d0412_veclib_mp_vecadd__l45
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)                      
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :      90.42     90.42     90.42     90.42      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       0.11      0.01      0.00      0.05      0.00      0.00      0.00      0.00     16.00
DataRead (Device to Host) :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      2.00
DataWrite (Host to Device):       0.12      0.01      0.00      0.12      0.01      0.01      0.01      0.01      9.00
Kernel 0                  :       0.16      0.16      0.16      0.16      0.02      0.02      0.02      0.02      1.00
Kernel 1                  :       0.25      0.25      0.25      0.25      0.08      0.08      0.08      0.08      1.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :       2.31      2.31      2.31      2.31      0.00      0.00      0.00      0.00      1.00
======================================================================================================================

 

and the code (attached as well)

module veclib
  use omp_lib
  use ISO_FORTRAN_ENV
contains
subroutine init(ftmp, ntm, npm, nr)
   implicit none
   real (REAL64), allocatable :: ftmp( :,:,: )
   integer, intent(in) :: ntm, npm, nr
   !...locals
   integer :: i,j,k, allocate_status
   character(200) :: error_message
 
  !...host side allocation 
  allocate(ftmp(ntm,npm,nr),stat=allocate_status, errmsg=error_message)
  if (allocate_status > 0 ) then
    write(*,*) "Host side allocation failed ", error_message
  else
    write(*,*) "Host side allocation success "
  end if

  !...device allocation
  !$omp target enter data map(alloc:ftmp) 
  write(*,*) "enter data map allocation complete"

  !...initialize ftmp on device
  !$omp target teams distribute parallel do map(to: ntm, npm, nr ) map(present, from: ftmp)
  do k=1,nr
    do j=1,npm
      do i=1,ntm    
        ftmp(i,j,k) = 21.0_REAL64
      end do
    end do
  end do
  !$omp end target teams distribute parallel do
  write(*,*) "ftmp initialized on target" 
end subroutine init

!dir$ attributes noinline :: vecadd
subroutine vecadd(ftmp, ntm, npm, nr)
   real (REAL64), allocatable :: ftmp( :,:,: )
   integer, intent(in) :: ntm, npm, nr
   integer i,j,k

   call init(ftmp,ntm,npm,nr)
   !$omp target teams distribute parallel do map(to: ntm, npm, nr ) map(present, from: ftmp)
   do k=1,nr
    do j=1,npm
      do i=1,ntm
        ftmp(i,j,k) = ftmp(i,j,k) + 21.0_REAL64
      end do
    end do
   end do
  !$omp end target teams distribute parallel do
  write(*,*) "ftmp updated on target"

end subroutine vecadd
end module veclib

program  vectest
  use ISO_FORTRAN_ENV
  use omp_lib
  use veclib
  implicit none
  integer :: ntm=10
  integer :: npm=10
  integer :: nr=10
  real (REAL64), allocatable :: ftmp( :,:,: )

  print*, "ntm ", ntm
  print*, "npm ", npm
  print*, "nr ", nr
  print*, "calling vecadd"
  call vecadd(ftmp,ntm,npm,nr)
  print*, "done with vecadd"

  !$omp target update from(ftmp)
  !$omp target exit data map(from:ftmp) 

  write(*,*) "answer to it all ", ftmp(ntm,npm,nr) 
end program vectest
caplanr
New Contributor II

That works fine (even at the higher resolution):

 

MATANA_GPU_INTEL: ~/intel_test $ ./a.out 
 ntm           10
 npm           10
 nr           10
 calling vecadd
 Host side allocation success 
 enter data map allocation complete
 ftmp initialized on target
 ftmp updated on target
 done with vecadd
 answer to it all    42.0000000000000     
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) Graphics [0xe20b], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_10303_d65020_veclib_mp_init__l26
Kernel 1                  : __omp_offloading_10303_d65020_veclib_mp_vecadd__l45
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)                      
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :      34.27     34.27     34.27     34.27      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       5.21      0.33      0.00      1.74      0.00      0.00      0.00      0.00     16.00
DataRead (Device to Host) :       0.46      0.23      0.11      0.35      0.01      0.01      0.01      0.01      2.00
DataWrite (Host to Device):       4.63      0.51      0.20      2.28      0.11      0.01      0.01      0.02      9.00
Kernel 0                  :       1.50      1.50      1.50      1.50      0.05      0.05      0.05      0.05      1.00
Kernel 1                  :       0.53      0.53      0.53      0.53      0.11      0.11      0.11      0.11      1.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :       3.29      3.29      3.29      3.29      0.00      0.00      0.00      0.00      1.00
======================================================================================================================
MATANA_GPU_INTEL: ~/intel_test $ vim test.f90 
MATANA_GPU_INTEL: ~/intel_test $ ifx -what -V -O2 -xhost -qopenmp -fopenmp-targets=spir64 -I./ test.f90 
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.2.1 Build 20250806
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.

 Intel(R) Fortran 25.0-1485
 Intel(R) Fortran 25.0-1485
 Intel(R) Fortran 25.0-1485
GNU ld (GNU Binutils for Ubuntu) 2.38
MATANA_GPU_INTEL: ~/intel_test $ ./a.out 
 ntm          512
 npm         1024
 nr            8
 calling vecadd
 Host side allocation success 
 enter data map allocation complete
 ftmp initialized on target
 ftmp updated on target
 done with vecadd
 answer to it all    42.0000000000000     
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) Graphics [0xe20b], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_10303_d65021_veclib_mp_init__l26
Kernel 1                  : __omp_offloading_10303_d65021_veclib_mp_vecadd__l45
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)                      
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :      33.92     33.92     33.92     33.92      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       6.62      0.41      0.00      2.16      0.00      0.00      0.00      0.00     16.00
DataRead (Device to Host) :      84.66     42.33     40.86     43.80     80.01     40.00     40.00     40.00      2.00
DataWrite (Host to Device):       2.02      0.22      0.07      0.40      0.10      0.01      0.00      0.01      9.00
Kernel 0                  :      54.93     54.93     54.93     54.93     53.32     53.32     53.32     53.32      1.00
Kernel 1                  :      68.53     68.53     68.53     68.53     68.27     68.27     68.27     68.27      1.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :       3.30      3.30      3.30      3.30      0.00      0.00      0.00      0.00      1.00
======================================================================================================================

 

caplanr
New Contributor II

Hi,

 

I can reproduce the issue with the following code:

 

MATANA_GPU_INTEL: ~/intel_test $ cat test.f90 
module veclib
  use omp_lib
  use ISO_FORTRAN_ENV
contains
subroutine init(ftmp, ntm, npm, nr)
   implicit none
   real (REAL64), allocatable :: ftmp( :,:,: )
   integer, intent(in) :: ntm, npm, nr
   !...locals
   integer :: i,j,k, allocate_status
   character(200) :: error_message
 
  !...host side allocation 
  allocate(ftmp(ntm,npm,nr),stat=allocate_status, errmsg=error_message)
  if (allocate_status > 0 ) then
    write(*,*) "Host side allocation failed ", error_message
  else
    write(*,*) "Host side allocation success "
  end if

  !...device allocation
  !$omp target enter data map(alloc:ftmp) 
  write(*,*) "enter data map allocation complete"

  !...initialize ftmp on device
  do concurrent (k=1:nr,j=1:npm,i=1:ntm)
    ftmp(i,j,k) = 21.0_REAL64
  end do
  write(*,*) "ftmp initialized on target" 
end subroutine init

!dir$ attributes noinline :: vecadd
subroutine vecadd(ftmp, ntm, npm, nr)
   real (REAL64), allocatable :: ftmp( :,:,: )
   integer, intent(in) :: ntm, npm, nr
   integer i,j,k

   call init(ftmp,ntm,npm,nr)
   do concurrent (k=1:nr,j=1:npm,i=1:ntm)
     ftmp(i,j,k) = ftmp(i,j,k) + 21.0_REAL64
   end do
  write(*,*) "ftmp updated on target"

end subroutine vecadd
end module veclib

program  vectest
  use ISO_FORTRAN_ENV
  use omp_lib
  use veclib
  implicit none
  integer :: ntm=512
  integer :: npm=1024
  integer :: nr=8
  real (REAL64), allocatable :: ftmp( :,:,: )

  print*, "ntm ", ntm
  print*, "npm ", npm
  print*, "nr ", nr
  print*, "calling vecadd"
  call vecadd(ftmp,ntm,npm,nr)
  print*, "done with vecadd"

  !$omp target update from(ftmp)
  !$omp target exit data map(delete:ftmp) 
  deallocate (ftmp)

  write(*,*) "answer to it all ", ftmp(ntm,npm,nr) 
end program vectest

 

I compile with

ifx -what -V -O3 -xhost -fp-model precise -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -fopenmp-do-concurrent-maptype-modifier=present  -I./ test.f90

 

The seg fault happens with the deallocate(ftmp).

If that is removed, the code works.

The deallocate should be fine, since it executes on the CPU; the "!$omp target exit data map(delete:ftmp)" should only deallocate the copy on the device.

 

I believe this is where the problem is.

 - Ron

 

 

Ron_Green
Moderator

And wrap this with your mpiexec -np 1 ./a.out 

 

I am not skilled enough to read the debug output.  If this simple example works, then I'll escalate your debug output from the failed run and see if a driver developer can explain what happened. 

caplanr
New Contributor II

Sorry - I put the deallocate before the array was used in the print.

 

Moving it to the end makes the test code still work.

 

So I am still at a loss for my code.

 

 

 

caplanr
New Contributor II

(the test code works with mpif90 -f90=ifx and mpiexec -np 1 as well BTW)

Ron_Green
Moderator

Harald and I were investigating this. Oh, I see you also found what I found!

 

We are finding inconsistent behaviors that we cannot explain.  What I did was to change my example.  I changed the target exit map from:

!$omp target exit data map(from:ftmp)
to what you use also which is:
!$omp target exit data map(delete:ftmp)

Now I also get seg faults on this map delete, but only under certain circumstances that I cannot replicate or explain.  Sometimes -O0 and -O1 will work while -O2 fails; sometimes -O2 fails only after a rebuild.  And this is on an old integrated-graphics GPU.  

So for sure there is some bug here.  The drivers are very different from BMG.  I will work with our OMP Offload person on the Fortran front-end team to see if we are creating bad calls to the map exit delete. 

 

We will continue to isolate this, but we are now reproducing what you were seeing. 

Ron_Green
Moderator

It was reported to me offline that unlimiting the stack solved the problem.  @caplanr can you confirm?  
After unlimiting the stack, this example works consistently for me. 

I have an outstanding question as to why map(delete:ftmp) would need stack space.  I would think it should be a simple call to free the memory on the device without any need for stack.  I will ask the development team about this.
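For anyone else reading along, the workaround can be sketched as a small run script. This is only an illustration of the steps discussed above; the mpiexec line is left commented out as a placeholder for the actual launch command:

```shell
# Raise the soft stack limit for this shell and its children before launching
# the offload binary. On some systems the hard limit prevents "unlimited",
# so fall back gracefully instead of aborting.
ulimit -s unlimited 2>/dev/null || echo "could not raise soft stack limit"
# Show the stack limit now in effect.
ulimit -s
# Then launch the reproducer as before (placeholder path):
# mpiexec -np 1 ./a.out
```

Note the limit change only applies to the current shell and its children, so it must be set in the same shell (or job script) that launches the run.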

caplanr
New Contributor II

Hi,

 

Yes, it worked, but it was slow.

 

Also, when I pressed CTRL+C to exit the run, it seg faulted.

 

Others at Intel are looking into it - I will keep you up to date.

 

 - Ron

jimdempseyatthecove
Honored Contributor III

Ron,

I do not have a system that can duplicate (run) this example. What comes to mind is heap corruption within the GPU, where the error is not symptomatic until the delete. An unfounded suspicion I have is array temporaries:

 

Array = ArrayExpressionUsingAtemporary ! and/or a reallocLHS

 

From the earlier post, the error crept in when the model's memory requirements got larger.

 

In the particular case of reallocLHS, the named identities for ftmp on both sides (CPU and GPU) would need to be updated. Again, this is a supposition, as I do not know the peculiarities of how the ftmp identities between the CPU and GPU are maintained. If the identities are maintained by the base address of the allocations, then a reallocLHS inside the GPU would not update the identity on the CPU (nor would it be appropriate to do so). But then, when the CPU issues a delete, it would not have a valid address to use as a handle within the GPU.

 

Jim Dempsey

caplanr
New Contributor II

Hi,

 

That sounds like a reasonable explanation.   

I am not sure what changed between previous and current versions of the compiler+driver to alter the behavior.

 

On another note, I can now run my codes!

 

Thanks to help from Intel folks over e-mail, it turned out to require setting the stack limit to unlimited with "ulimit -s unlimited".

 

I tried the code+run on Stampede3, which has Intel MAX 1550 GPUs, and it works fine with both the 2025.2 and the older 2025.0 compilers without needing to modify the stack.

 

So it looks like it is either an issue with the drivers+compiler specifically for the Arc GPUs, or some system issue that only appeared after updating to the newest compiler+driver.

 

I plan to eventually update that system's OS from Ubuntu 22.04 with a zabbly kernel to Ubuntu 24.04 with a standard kernel, and will give it a try there without modifying the stack limit.

 

In the meantime, I can run the code again, so this will work for now (although it runs ~15% slower than before).

 

I will mark this as "accepted solution" for now and post an update whenever I get around to upgrading the system.

 

 - Ron

caplanr
New Contributor II

Additional note:

 

This stack issue can also be avoided by using the "-heap-arrays" flag instead of setting the stack to unlimited.

This is what I usually do with IFX, but I had forgotten the flag, which led to the problem.
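For reference, the rebuild is the same compile line reported earlier in this thread with -heap-arrays appended. This is only a sketch; the guard around ifx is an addition so the snippet degrades gracefully on machines where the compiler is not installed:

```shell
# Same flags as the original report, plus -heap-arrays so automatic and
# temporary arrays are placed on the heap instead of the stack.
if command -v ifx >/dev/null 2>&1; then
  ifx -O3 -xhost -fp-model precise -fopenmp-target-do-concurrent -fiopenmp \
      -fopenmp-targets=spir64 -fopenmp-do-concurrent-maptype-modifier=present \
      -heap-arrays -I./ test.f90
else
  echo "ifx not on PATH; the flags above are shown for illustration"
fi
```

With -heap-arrays in effect, the run should no longer depend on "ulimit -s unlimited" for the temporaries involved here.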

 

 - Ron
