Hi,
I recently updated my Intel driver and OneAPI HPC SDK.
I am running on an Intel Arc B580 GPU.
My system has Ubuntu 22.04 with kernel: 6.13.12-zabbly+
My IFX version is: ifx (IFX) 2025.2.1 20250806
My MPI is /opt/intel/oneapi/mpi/2021.16
I am running the code HipFT which can be obtained here: github.com/predsci/hipft
I am using the latest commit.
The compiler flags I am using are:
-O3 -xHost -fp-model precise -fopenmp-target-do-concurrent -fiopenmp
-fopenmp-targets=spir64 -fopenmp-do-concurrent-maptype-modifier=present
The code uses Fortran's "do concurrent" to offload computation to the GPU, and uses OpenMP Target directives to manually manage the CPU-GPU data transfers. See https://ieeexplore.ieee.org/document/10820592 for details.
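Roughly, the pattern looks like this (a simplified, self-contained sketch for illustration only, not the actual HipFT source; the array name and sizes are made up):
program offload_pattern_sketch
  use ISO_FORTRAN_ENV, only: r_typ => REAL64
  implicit none
  integer, parameter :: ntm = 512, npm = 1024, nr = 8
  integer :: i, j, k
  real(r_typ), allocatable :: f(:,:,:)
  allocate (f(ntm,npm,nr))
  f = 0.0_r_typ
  !$omp target enter data map(to:f)          ! manual host-to-device allocation and copy
  do concurrent (k=1:nr, j=1:npm, i=1:ntm)   ! offloaded via -fopenmp-target-do-concurrent
    f(i,j,k) = f(i,j,k) + 1.0_r_typ
  end do
  !$omp target update from(f)                ! copy the result back to the host
  !$omp target exit data map(delete:f)       ! free the device copy
  write(*,*) 'f(1,1,1) = ', f(1,1,1)
  deallocate (f)
end program offload_pattern_sketch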
The HipFT code compiles fine.
The code runs correctly for the testsuite runs included in the git repo (these are small runs).
However, for the example run in "/examples/flux_transport_1rot_flowAa_diff_r8" the code seg faults with:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 00007F2C41042520 Unknown Unknown Unknown
hipft 000000000042821D Unknown Unknown Unknown
hipft 000000000041E8F4 Unknown Unknown Unknown
hipft 000000000040FAFE Unknown Unknown Unknown
hipft 000000000040E8BA Unknown Unknown Unknown
hipft 000000000040E72D Unknown Unknown Unknown
libc.so.6 00007F2C41029D90 Unknown Unknown Unknown
libc.so.6 00007F2C41029E40 __libc_start_main Unknown Unknown
hipft 000000000040E645 Unknown Unknown Unknown
This run uses more VRAM than the testsuite runs, but otherwise runs the same algorithms.
The run used to work as shown in the paper linked above (it also works on NVIDIA GPUs with the same amount of VRAM).
To reproduce this, compile the HDF5 library with the IFX compiler and then set the paths to the library in the build configuration file (see conf/intel_gpu_psi.conf for an example).
Then build the code with "build.sh <CONFFILE>"; the code can then be run in the examples/flux_transport_1rot_flowAa_diff_r8 directory with "mpiexec -np 1 ../bin/hipft".
Thanks!
- Ron
Could you compile and link with the options "-g -traceback"? This should provide a stack traceback to a source line. There may be helpful information in the exact source line causing the fault.
And since this is a new failure, can you provide the previous compiler and Intel MPI versions?
And since this is offload code, set the environment variable
LIBOMPTARGET_DEBUG=2
or "=1".
This will give us some information on whether the problem is on the device rather than on the CPU.
Hi,
omptarget --> Init offload library!
OMPT --> Entering connectLibrary
OMPT --> OMPT: Trying to load library libiomp5.so
OMPT --> OMPT: Trying to get address of connection routine ompt_libomp_connect
OMPT --> OMPT: Library connection handle = 0x7fe9a6116740
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007fe9a6103f00 0x00007fe9a6103800
OMPT --> Exiting connectLibrary
omptarget --> Loading RTLs...
omptarget --> Adding all nextgen plugins
omptarget --> Adding nextgen 'level_zero' plugin
omptarget --> Adding nextgen 'host' plugin
omptarget --> RTLs loaded!
TARGET LEVEL_ZERO RTL --> Level0 NG plugin initialization
TARGET LEVEL_ZERO RTL --> ONEAPI_DEVICE_SELECTOR specified 0 root devices
TARGET LEVEL_ZERO RTL --> (Accept/Discard [T/F] DeviceID[.SubID[.CCSID]]) -2(all), -1(ignore)
TARGET LEVEL_ZERO RTL --> Looking for Level0 devices...
TARGET LEVEL_ZERO RTL --> ZE_CALLER: zeInit ( ZE_INIT_FLAG_GPU_ONLY )
TARGET LEVEL_ZERO RTL --> ZE_CALLEE: zeInit (
TARGET LEVEL_ZERO RTL --> flags = 1
TARGET LEVEL_ZERO RTL --> )
TARGET LEVEL_ZERO RTL --> Trying to load libze_loader.so
TARGET LEVEL_ZERO RTL --> Unable to load library 'libze_loader.so': libze_loader.so: cannot open shared object file: No such file or directory!
TARGET LEVEL_ZERO RTL --> Error: findDevices:zeInit failed with error code 2147483646, ZE_RESULT_ERROR_UNKNOWN
omptarget --> Registered plugin LEVEL_ZERO with 0 visible device(s)
omptarget --> Skipping plugin LEVEL_ZERO with no visible devices
PluginInterface --> Failure to check validity of image 0x326bceb0: Only executable ELF files are supported
omptarget --> No RTL found for image 0x0000000000563ea0!
omptarget --> Done registering entries!
_ _ _ ______ _______
| | | (_) | ____|__ __|
| |__| |_ _ __ | |__ | |
| __ | | '_ \\| __| | |
| | | | | |_) | | | |
|_| |_|_| .__/|_| |_|
| |
|_|
Version: 1.19.3 of 09/02/2025
****** HipFT: High Performance Flux Transport.
Authors: Ronald M. Caplan
Miko M. Stulajter
Jon A. Linker
Zoran Mikic
Predictive Science Inc.
www.predsci.com
San Diego, California, USA 92121
Number of MPI ranks total: 1
Number of MPI ranks per node: 1
Run started at:
15 September 2025 4:10:20.277 PM
--> Reading input file...
--> Initializing realization parameters...
--> Setting up output directories...
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Default TARGET OFFLOAD policy is now disabled (no devices were found)
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 9 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 9 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
--> Loading initial condition...
omptarget --> Entering data begin region for device 0 with 54 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering target region for device 0 with entry point 0x000000000050d094
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering target region for device 0 with entry point 0x000000000050fd6a
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data update region for device 0 with 1 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data end region for device 0 with 2 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 00007FE9A1042520 Unknown Unknown Unknown
hipft 00000000004278DD write_map 2869 hipft.f90
hipft 000000000041E11A load_initial_cond 2185 hipft.f90
hipft 000000000040FA7B setup 1221 hipft.f90
hipft 000000000040E8BD hipft 869 hipft.f90
hipft 000000000040E72D Unknown Unknown Unknown
libc.so.6 00007FE9A1029D90 Unknown Unknown Unknown
libc.so.6 00007FE9A1029E40 __libc_start_main Unknown Unknown
hipft 000000000040E645 Unknown Unknown Unknown
omptarget --> Unloading target library!
omptarget --> No RTLs in use support the image 0x0000000000563ea0!
omptarget --> Done unregistering images!
omptarget --> Translation table for descriptor 0x00000000326ba860 cannot be found, probably it has been already removed.
omptarget --> Done unregistering library!
omptarget --> Deinit offload library!
omptarget --> Clearing Interop Table
omptarget --> Clearing Async Pending Table
omptarget --> Unloading RTLs...
omptarget --> Clearing Interop Table
omptarget --> Clearing Async Pending Table
TARGET LEVEL_ZERO RTL --> Deinit Level0 plugin!
omptarget --> RTLs unloaded!
Hi,
My replies seem to not be going through; the code copy-paste appears to be tripping some HTML filter.
To summarize, after a long rabbit hole, the flags and environment variable you provided showed that the device was not being detected by the OpenMP runtime.
I do not know how the small tests worked; maybe they defaulted to the CPU?
Anyway, I purged everything and re-installed the driver and oneAPI.
I can now see the device not only in clinfo, but also in sycl-ls (which also used to seg fault):
sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Graphics [0xe20b] 20.1.0 [1.6.31294+20]
[opencl:cpu][opencl:0] Intel(R) OpenCL, AMD Ryzen 7 9700X 8-Core Processor OpenCL 3.0 (Build 0) [2025.20.8.0.06_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Graphics [0xe20b] OpenCL 3.0 NEO [24.39.31294]
lspci | grep VGA
0f:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Arc B580]
19:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Granite Ridge [Radeon Graphics] (rev c5)
When I run the code now, I see:
TARGET LEVEL_ZERO RTL --> Found a GPU device, Name = Intel(R) Graphics [0xe20b]
TARGET LEVEL_ZERO RTL --> Found 1 root devices, 1 total devices.
TARGET LEVEL_ZERO RTL --> List of devices (DeviceID[.SubID[.CCSID]])
TARGET LEVEL_ZERO RTL --> -- 0
TARGET LEVEL_ZERO RTL --> Root Device Information
TARGET LEVEL_ZERO RTL --> Device 0
TARGET LEVEL_ZERO RTL --> -- Name : Intel(R) Graphics [0xe20b]
TARGET LEVEL_ZERO RTL --> -- PCI ID : 0xe20b
So it looks like the device is detected, but the code still seg faults (although differently):
TARGET LEVEL_ZERO RTL --> ZE_CALLER: zeMemFree ( Context, Info.Base )
TARGET LEVEL_ZERO RTL --> ZE_CALLEE: zeMemFree (
TARGET LEVEL_ZERO RTL --> hContext = 0x0000000009a0ce30
TARGET LEVEL_ZERO RTL --> ptr = 0xffffd556af800000
TARGET LEVEL_ZERO RTL --> )
TARGET LEVEL_ZERO RTL --> Deleted device memory 0xffffd556af800000 (Base: 0xffffd556af800000, Size: 33587200)
omptarget --> Notifying about an unmapping: HstPtr=0x00007f1122fe7600
omptarget --> Removing map entry with HstPtrBegin=0x00007ffd43038ae0, TgtPtrBegin=0xffffd556aa3f0d80, Size=120, Name=write_map_$FTMP
omptarget --> Deleting tgt data 0xffffd556aa3f0d80 of size 120 by freeing allocation starting at 0xffffd556aa3f0d80
PluginInterface --> MemoryManagerTy::free: target memory 0xffffd556aa3f0d80.
PluginInterface --> Cannot find its node. Delete it on device directly.
TARGET LEVEL_ZERO RTL --> ZE_CALLER: zeMemGetAllocProperties ( getContext(DeviceId), Ptr, &properties, nullptr )
TARGET LEVEL_ZERO RTL --> ZE_CALLEE: zeMemGetAllocProperties (
TARGET LEVEL_ZERO RTL --> hContext = 0x0000000009a0ce30
TARGET LEVEL_ZERO RTL --> ptr = 0xffffd556aa3f0d80
TARGET LEVEL_ZERO RTL --> pMemAllocProperties = 0x00007ffd43037e20
TARGET LEVEL_ZERO RTL --> phDevice = 0x0000000000000000
TARGET LEVEL_ZERO RTL --> )
omptarget --> Notifying about an unmapping: HstPtr=0x00007ffd43038ae0
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 5777 RUNNING AT matana
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
In the routine write_map() there is a sequence of:
allocate (ftmp(ntm,npm,nr))
!$omp target enter data map(alloc:ftmp)
!$omp target update from(ftmp)
!$omp target exit data map(delete:ftmp)
deallocate (ftmp)
I do not see anything wrong with this, and it was working before (and on NVIDIA).
Perhaps the seg fault is in code past this routine?
Or maybe it is still a driver issue?
- Ron
Running that exposed issues with conflicting Intel drivers from Ubuntu 22.04 versus what oneAPI needed for GPU detection.
I am not sure how the smaller tests worked.
I went down a rabbit hole with drivers etc., even having to make my own symlink for /lib/x86_64-linux-gnu/libze_loader.so.
I can get clinfo to see the card and an OpenCL benchmark to run.
But when I try to run my code I get:
flux_transport_1rot_flowAa_diff_r8 $ mpiexec -np 1 hipft
omptarget --> Init offload library!
OMPT --> Entering connectLibrary
OMPT --> OMPT: Trying to load library libiomp5.so
OMPT --> OMPT: Trying to get address of connection routine ompt_libomp_connect
OMPT --> OMPT: Library connection handle = 0x7f8490716740
omptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007f8490703f00 0x00007f8490703800
OMPT --> Exiting connectLibrary
omptarget --> Loading RTLs...
omptarget --> Adding all nextgen plugins
omptarget --> Adding nextgen 'level_zero' plugin
omptarget --> Adding nextgen 'host' plugin
omptarget --> RTLs loaded!
TARGET LEVEL_ZERO RTL --> Level0 NG plugin initialization
TARGET LEVEL_ZERO RTL --> ONEAPI_DEVICE_SELECTOR specified 0 root devices
TARGET LEVEL_ZERO RTL --> (Accept/Discard [T/F] DeviceID[.SubID[.CCSID]]) -2(all), -1(ignore)
TARGET LEVEL_ZERO RTL --> Looking for Level0 devices...
TARGET LEVEL_ZERO RTL --> ZE_CALLER: zeInit ( ZE_INIT_FLAG_GPU_ONLY )
TARGET LEVEL_ZERO RTL --> ZE_CALLEE: zeInit (
TARGET LEVEL_ZERO RTL --> flags = 1
TARGET LEVEL_ZERO RTL --> )
TARGET LEVEL_ZERO RTL --> Trying to load libze_loader.so
TARGET LEVEL_ZERO RTL --> Implementing zeInit with dlsym(zeInit) -> 0x7f848b85dc20
TARGET LEVEL_ZERO RTL --> Implementing zeDriverGet with dlsym(zeDriverGet) -> 0x7f848b85dd20
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGet with dlsym(zeDeviceGet) -> 0x7f848b85e140
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetSubDevices with dlsym(zeDeviceGetSubDevices) -> 0x7f848b85e200
TARGET LEVEL_ZERO RTL --> Implementing zeModuleCreate with dlsym(zeModuleCreate) -> 0x7f848b860560
TARGET LEVEL_ZERO RTL --> Implementing zeModuleGetProperties with dlsym(zeModuleGetProperties) -> 0x7f848b860860
TARGET LEVEL_ZERO RTL --> Implementing zeModuleBuildLogDestroy with dlsym(zeModuleBuildLogDestroy) -> 0x7f848b860680
TARGET LEVEL_ZERO RTL --> Implementing zeModuleBuildLogGetString with dlsym(zeModuleBuildLogGetString) -> 0x7f848b8606e0
TARGET LEVEL_ZERO RTL --> Implementing zeModuleGetKernelNames with dlsym(zeModuleGetKernelNames) -> 0x7f848b860800
TARGET LEVEL_ZERO RTL --> Implementing zeModuleDestroy with dlsym(zeModuleDestroy) -> 0x7f848b8605c0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendBarrier with dlsym(zeCommandListAppendBarrier) -> 0x7f848b85ef80
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendLaunchKernel with dlsym(zeCommandListAppendLaunchKernel) -> 0x7f848b860da0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendLaunchCooperativeKernel with dlsym(zeCommandListAppendLaunchCooperativeKernel) -> 0x7f848b860e00
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemoryCopy with dlsym(zeCommandListAppendMemoryCopy) -> 0x7f848b85f0a0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemoryCopyRegion with dlsym(zeCommandListAppendMemoryCopyRegion) -> 0x7f848b85f170
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemoryFill with dlsym(zeCommandListAppendMemoryFill) -> 0x7f848b85f100
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemoryPrefetch with dlsym(zeCommandListAppendMemoryPrefetch) -> 0x7f848b85f410
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListAppendMemAdvise with dlsym(zeCommandListAppendMemAdvise) -> 0x7f848b85f470
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListClose with dlsym(zeCommandListClose) -> 0x7f848b85ec20
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListCreate with dlsym(zeCommandListCreate) -> 0x7f848b85eb00
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListCreateImmediate with dlsym(zeCommandListCreateImmediate) -> 0x7f848b85eb60
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListDestroy with dlsym(zeCommandListDestroy) -> 0x7f848b85ebc0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandListReset with dlsym(zeCommandListReset) -> 0x7f848b85ec80
TARGET LEVEL_ZERO RTL --> Implementing zeCommandQueueCreate with dlsym(zeCommandQueueCreate) -> 0x7f848b85e8c0
TARGET LEVEL_ZERO RTL --> Implementing zeCommandQueueDestroy with dlsym(zeCommandQueueDestroy) -> 0x7f848b85e920
TARGET LEVEL_ZERO RTL --> Implementing zeCommandQueueExecuteCommandLists with dlsym(zeCommandQueueExecuteCommandLists) -> 0x7f848b85e980
TARGET LEVEL_ZERO RTL --> Implementing zeCommandQueueSynchronize with dlsym(zeCommandQueueSynchronize) -> 0x7f848b85e9e0
TARGET LEVEL_ZERO RTL --> Implementing zeContextCreate with dlsym(zeContextCreate) -> 0x7f848b85e740
TARGET LEVEL_ZERO RTL --> Implementing zeContextDestroy with dlsym(zeContextDestroy) -> 0x7f848b85e800
TARGET LEVEL_ZERO RTL --> Implementing zeContextMakeMemoryResident with dlsym(zeContextMakeMemoryResident) -> 0x7f848b860f30
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceCanAccessPeer with dlsym(zeDeviceCanAccessPeer) -> 0x7f848b85e620
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetProperties with dlsym(zeDeviceGetProperties) -> 0x7f848b85e260
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetCommandQueueGroupProperties with dlsym(zeDeviceGetCommandQueueGroupProperties) -> 0x7f848b85e380
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetComputeProperties with dlsym(zeDeviceGetComputeProperties) -> 0x7f848b85e2c0
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetMemoryProperties with dlsym(zeDeviceGetMemoryProperties) -> 0x7f848b85e3e0
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetCacheProperties with dlsym(zeDeviceGetCacheProperties) -> 0x7f848b85e4a0
TARGET LEVEL_ZERO RTL --> Implementing zeDeviceGetGlobalTimestamps with dlsym(zeDeviceGetGlobalTimestamps) -> 0x7f848b85e6e0
TARGET LEVEL_ZERO RTL --> Implementing zeDriverGetApiVersion with dlsym(zeDriverGetApiVersion) -> 0x7f848b85df00
TARGET LEVEL_ZERO RTL --> Implementing zeDriverGetExtensionFunctionAddress with dlsym(zeDriverGetExtensionFunctionAddress) -> 0x7f848b85e080
TARGET LEVEL_ZERO RTL --> Implementing zeDriverGetExtensionProperties with dlsym(zeDriverGetExtensionProperties) -> 0x7f848b85e020
TARGET LEVEL_ZERO RTL --> Implementing zeEventCreate with dlsym(zeEventCreate) -> 0x7f848b85f590
TARGET LEVEL_ZERO RTL --> Implementing zeEventDestroy with dlsym(zeEventDestroy) -> 0x7f848b85f5f0
TARGET LEVEL_ZERO RTL --> Implementing zeEventHostReset with dlsym(zeEventHostReset) -> 0x7f848b85fa10
TARGET LEVEL_ZERO RTL --> Implementing zeEventHostSynchronize with dlsym(zeEventHostSynchronize) -> 0x7f848b85f8f0
TARGET LEVEL_ZERO RTL --> Implementing zeEventPoolCreate with dlsym(zeEventPoolCreate) -> 0x7f848b85f4d0
TARGET LEVEL_ZERO RTL --> Implementing zeEventPoolDestroy with dlsym(zeEventPoolDestroy) -> 0x7f848b85f530
TARGET LEVEL_ZERO RTL --> Implementing zeEventQueryKernelTimestamp with dlsym(zeEventQueryKernelTimestamp) -> 0x7f848b85fa70
TARGET LEVEL_ZERO RTL --> Implementing zeFenceCreate with dlsym(zeFenceCreate) -> 0x7f848b85fd20
TARGET LEVEL_ZERO RTL --> Implementing zeFenceDestroy with dlsym(zeFenceDestroy) -> 0x7f848b85fd80
TARGET LEVEL_ZERO RTL --> Implementing zeFenceHostSynchronize with dlsym(zeFenceHostSynchronize) -> 0x7f848b85fde0
TARGET LEVEL_ZERO RTL --> Implementing zeKernelCreate with dlsym(zeKernelCreate) -> 0x7f848b8608c0
TARGET LEVEL_ZERO RTL --> Implementing zeKernelDestroy with dlsym(zeKernelDestroy) -> 0x7f848b860920
TARGET LEVEL_ZERO RTL --> Implementing zeKernelGetName with dlsym(zeKernelGetName) -> 0x7f848b860d40
TARGET LEVEL_ZERO RTL --> Implementing zeKernelGetProperties with dlsym(zeKernelGetProperties) -> 0x7f848b860ce0
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSetArgumentValue with dlsym(zeKernelSetArgumentValue) -> 0x7f848b860b00
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSetGroupSize with dlsym(zeKernelSetGroupSize) -> 0x7f848b8609e0
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSetIndirectAccess with dlsym(zeKernelSetIndirectAccess) -> 0x7f848b860b60
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSuggestGroupSize with dlsym(zeKernelSuggestGroupSize) -> 0x7f848b860a40
TARGET LEVEL_ZERO RTL --> Implementing zeKernelSuggestMaxCooperativeGroupCount with dlsym(zeKernelSuggestMaxCooperativeGroupCount) -> 0x7f848b860aa0
TARGET LEVEL_ZERO RTL --> Implementing zeMemAllocDevice with dlsym(zeMemAllocDevice) -> 0x7f848b860080
TARGET LEVEL_ZERO RTL --> Implementing zeMemAllocHost with dlsym(zeMemAllocHost) -> 0x7f848b8600e0
TARGET LEVEL_ZERO RTL --> Implementing zeMemAllocShared with dlsym(zeMemAllocShared) -> 0x7f848b860020
TARGET LEVEL_ZERO RTL --> Implementing zeMemFree with dlsym(zeMemFree) -> 0x7f848b860140
TARGET LEVEL_ZERO RTL --> Implementing zeMemGetAddressRange with dlsym(zeMemGetAddressRange) -> 0x7f848b860200
TARGET LEVEL_ZERO RTL --> Implementing zeMemGetAllocProperties with dlsym(zeMemGetAllocProperties) -> 0x7f848b8601a0
TARGET LEVEL_ZERO RTL --> Implementing zeModuleDynamicLink with dlsym(zeModuleDynamicLink) -> 0x7f848b860620
TARGET LEVEL_ZERO RTL --> Implementing zeModuleGetGlobalPointer with dlsym(zeModuleGetGlobalPointer) -> 0x7f848b8607a0
TARGET LEVEL_ZERO RTL --> Implementing zesDeviceEnumMemoryModules with dlsym(zesDeviceEnumMemoryModules) -> 0x7f848b88cea0
TARGET LEVEL_ZERO RTL --> Implementing zesMemoryGetState with dlsym(zesMemoryGetState) -> 0x7f848b88cf60
TARGET LEVEL_ZERO RTL --> Error: findDevices:zeInit failed with error code 2013265921, ZE_RESULT_ERROR_UNINITIALIZED
omptarget --> Registered plugin LEVEL_ZERO with 0 visible device(s)
omptarget --> Skipping plugin LEVEL_ZERO with no visible devices
PluginInterface --> Failure to check validity of image 0x28821eb0: Only executable ELF files are supported
omptarget --> No RTL found for image 0x0000000000563ea0!
omptarget --> Done registering entries!
_ _ _ ______ _______
| | | (_) | ____|__ __|
| |__| |_ _ __ | |__ | |
| __ | | '_ \\| __| | |
| | | | | |_) | | | |
|_| |_|_| .__/|_| |_|
| |
|_|
Version: 1.19.3 of 09/02/2025
****** HipFT: High Performance Flux Transport.
Authors: Ronald M. Caplan
Miko M. Stulajter
Jon A. Linker
Zoran Mikic
Predictive Science Inc.
www.predsci.com
San Diego, California, USA 92121
Number of MPI ranks total: 1
Number of MPI ranks per node: 1
Run started at:
15 September 2025 4:38:04.283 PM
--> Reading input file...
--> Initializing realization parameters...
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Default TARGET OFFLOAD policy is now disabled (no devices were found)
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 9 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 9 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
--> Setting up output directories...
--> Loading initial condition...
omptarget --> Entering data begin region for device 0 with 54 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering target region for device 0 with entry point 0x000000000050d094
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data begin region for device 0 with 3 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering target region for device 0 with entry point 0x000000000050fd6a
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data update region for device 0 with 1 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
omptarget --> Entering data end region for device 0 with 2 mappings
omptarget --> Offload is disabled
omptarget --> Not offloading to device 0
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 3439053 RUNNING AT matana
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
Is there a guide to completely wipe my Intel drivers and oneAPI and then reinstall both from scratch on Ubuntu 22.04?
- Ron
I always suspect the allocate. How is ftmp declared? Is it:
real ( REAL64 ), allocatable :: ftmp(:,:,:)
And what are the values of ntm, npm, and nr? I am trying to gauge the size of the allocation request.
Hi,
It is declared as:
real(r_typ), dimension(:,:,:), allocatable :: ftmp
where
integer, parameter :: r_typ = REAL64
The sizes for this example run are (512,1024,8), i.e. 512 x 1024 x 8 REAL64 elements, about 32 MB.
- Ron
I looked up the B580 and found it has 12 GB of GDDR6, correct?
I don't have a B580; I have an older integrated UHD Graphics 630.
Before running, I set
export LIBOMPTARGET_PLUGIN_PROFILE=T
Download the attached example 'vecadd.f90'; I tried to set up an example that may mimic what you described.
Try scaling up the array ftmp; mine is 10x10x10. Try your sizes and see if this works.
$ !export
export LIBOMPTARGET_PLUGIN_PROFILE=T
$ ifx -what -V -O2 -xhost -qopenmp -fopenmp-targets=spir64 -I./ vecadd.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.2.0 Build 20250605
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.
Intel(R) Fortran 25.0-1485
Intel(R) Fortran 25.0-1485
Intel(R) Fortran 25.0-1485
GNU ld (GNU Binutils for Ubuntu) 2.38
$ ./a.out
ntm 10
npm 10
nr 10
calling vecadd
Host side allocation success
enter data map allocation complete
ftmp initialized on target
ftmp updated on target
done with vecadd
answer to it all 42.0000000000000
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) UHD Graphics 630, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0 : __omp_offloading_80_4b0d0412_veclib_mp_init__l26
Kernel 1 : __omp_offloading_80_4b0d0412_veclib_mp_vecadd__l45
----------------------------------------------------------------------------------------------------------------------
: Host Time (msec) Device Time (msec)
Name : Total Average Min Max Total Average Min Max Count
----------------------------------------------------------------------------------------------------------------------
Compiling : 90.42 90.42 90.42 90.42 0.00 0.00 0.00 0.00 1.00
DataAlloc : 0.11 0.01 0.00 0.05 0.00 0.00 0.00 0.00 16.00
DataRead (Device to Host) : 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00
DataWrite (Host to Device): 0.12 0.01 0.00 0.12 0.01 0.01 0.01 0.01 9.00
Kernel 0 : 0.16 0.16 0.16 0.16 0.02 0.02 0.02 0.02 1.00
Kernel 1 : 0.25 0.25 0.25 0.25 0.08 0.08 0.08 0.08 1.00
Linking : 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
OffloadEntriesInit : 2.31 2.31 2.31 2.31 0.00 0.00 0.00 0.00 1.00
======================================================================================================================
and the code (attached as well)
module veclib
  use omp_lib
  use ISO_FORTRAN_ENV
contains
  subroutine init(ftmp, ntm, npm, nr)
    implicit none
    real (REAL64), allocatable :: ftmp( :,:,: )
    integer, intent(in) :: ntm, npm, nr
    !...locals
    integer :: i,j,k, allocate_status
    character(200) :: error_message
    !...host side allocation
    allocate(ftmp(ntm,npm,nr),stat=allocate_status, errmsg=error_message)
    if (allocate_status > 0 ) then
      write(*,*) "Host side allocation failed ", error_message
    else
      write(*,*) "Host side allocation success "
    end if
    !...device allocation
    !$omp target enter data map(alloc:ftmp)
    write(*,*) "enter data map allocation complete"
    !...initialize ftmp on device
    !$omp target teams distribute parallel do map(to: ntm, npm, nr ) map(present, from: ftmp)
    do k=1,nr
      do j=1,npm
        do i=1,ntm
          ftmp(i,j,k) = 21.0_REAL64
        end do
      end do
    end do
    !$omp end target teams distribute parallel do
    write(*,*) "ftmp initialized on target"
  end subroutine init

  !dir$ attributes noinline :: vecadd
  subroutine vecadd(ftmp, ntm, npm, nr)
    real (REAL64), allocatable :: ftmp( :,:,: )
    integer, intent(in) :: ntm, npm, nr
    integer i,j,k
    call init(ftmp,ntm,npm,nr)
    !$omp target teams distribute parallel do map(to: ntm, npm, nr ) map(present, from: ftmp)
    do k=1,nr
      do j=1,npm
        do i=1,ntm
          ftmp(i,j,k) = ftmp(i,j,k) + 21.0_REAL64
        end do
      end do
    end do
    !$omp end target teams distribute parallel do
    write(*,*) "ftmp updated on target"
  end subroutine vecadd
end module veclib

program vectest
  use ISO_FORTRAN_ENV
  use omp_lib
  use veclib
  implicit none
  integer :: ntm=10
  integer :: npm=10
  integer :: nr=10
  real (REAL64), allocatable :: ftmp( :,:,: )
  print*, "ntm ", ntm
  print*, "npm ", npm
  print*, "nr ", nr
  print*, "calling vecadd"
  call vecadd(ftmp,ntm,npm,nr)
  print*, "done with vecadd"
  !$omp target update from(ftmp)
  !$omp target exit data map(from:ftmp)
  write(*,*) "answer to it all ", ftmp(ntm,npm,nr)
end program vectest
That works fine (even at the higher resolution):
MATANA_GPU_INTEL: ~/intel_test $ ./a.out
ntm 10
npm 10
nr 10
calling vecadd
Host side allocation success
enter data map allocation complete
ftmp initialized on target
ftmp updated on target
done with vecadd
answer to it all 42.0000000000000
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) Graphics [0xe20b], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0 : __omp_offloading_10303_d65020_veclib_mp_init__l26
Kernel 1 : __omp_offloading_10303_d65020_veclib_mp_vecadd__l45
----------------------------------------------------------------------------------------------------------------------
: Host Time (msec) Device Time (msec)
Name : Total Average Min Max Total Average Min Max Count
----------------------------------------------------------------------------------------------------------------------
Compiling : 34.27 34.27 34.27 34.27 0.00 0.00 0.00 0.00 1.00
DataAlloc : 5.21 0.33 0.00 1.74 0.00 0.00 0.00 0.00 16.00
DataRead (Device to Host) : 0.46 0.23 0.11 0.35 0.01 0.01 0.01 0.01 2.00
DataWrite (Host to Device): 4.63 0.51 0.20 2.28 0.11 0.01 0.01 0.02 9.00
Kernel 0 : 1.50 1.50 1.50 1.50 0.05 0.05 0.05 0.05 1.00
Kernel 1 : 0.53 0.53 0.53 0.53 0.11 0.11 0.11 0.11 1.00
Linking : 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
OffloadEntriesInit : 3.29 3.29 3.29 3.29 0.00 0.00 0.00 0.00 1.00
======================================================================================================================
MATANA_GPU_INTEL: ~/intel_test $ vim test.f90
MATANA_GPU_INTEL: ~/intel_test $ ifx -what -V -O2 -xhost -qopenmp -fopenmp-targets=spir64 -I./ test.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.2.1 Build 20250806
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.
Intel(R) Fortran 25.0-1485
Intel(R) Fortran 25.0-1485
Intel(R) Fortran 25.0-1485
GNU ld (GNU Binutils for Ubuntu) 2.38
MATANA_GPU_INTEL: ~/intel_test $ ./a.out
ntm 512
npm 1024
nr 8
calling vecadd
Host side allocation success
enter data map allocation complete
ftmp initialized on target
ftmp updated on target
done with vecadd
answer to it all 42.0000000000000
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) Graphics [0xe20b], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0 : __omp_offloading_10303_d65021_veclib_mp_init__l26
Kernel 1 : __omp_offloading_10303_d65021_veclib_mp_vecadd__l45
----------------------------------------------------------------------------------------------------------------------
: Host Time (msec) Device Time (msec)
Name : Total Average Min Max Total Average Min Max Count
----------------------------------------------------------------------------------------------------------------------
Compiling : 33.92 33.92 33.92 33.92 0.00 0.00 0.00 0.00 1.00
DataAlloc : 6.62 0.41 0.00 2.16 0.00 0.00 0.00 0.00 16.00
DataRead (Device to Host) : 84.66 42.33 40.86 43.80 80.01 40.00 40.00 40.00 2.00
DataWrite (Host to Device): 2.02 0.22 0.07 0.40 0.10 0.01 0.00 0.01 9.00
Kernel 0 : 54.93 54.93 54.93 54.93 53.32 53.32 53.32 53.32 1.00
Kernel 1 : 68.53 68.53 68.53 68.53 68.27 68.27 68.27 68.27 1.00
Linking : 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
OffloadEntriesInit : 3.30 3.30 3.30 3.30 0.00 0.00 0.00 0.00 1.00
=======================================================================================================
Hi,
I can reproduce the issue with the following code:
MATANA_GPU_INTEL: ~/intel_test $ cat test.f90
module veclib
  use omp_lib
  use ISO_FORTRAN_ENV
contains
  subroutine init(ftmp, ntm, npm, nr)
    implicit none
    real (REAL64), allocatable :: ftmp( :,:,: )
    integer, intent(in) :: ntm, npm, nr
    !...locals
    integer :: i,j,k, allocate_status
    character(200) :: error_message
    !...host side allocation
    allocate(ftmp(ntm,npm,nr),stat=allocate_status, errmsg=error_message)
    if (allocate_status > 0 ) then
      write(*,*) "Host side allocation failed ", error_message
    else
      write(*,*) "Host side allocation success "
    end if
    !...device allocation
    !$omp target enter data map(alloc:ftmp)
    write(*,*) "enter data map allocation complete"
    !...initialize ftmp on device
    do concurrent (k=1:nr,j=1:npm,i=1:ntm)
      ftmp(i,j,k) = 21.0_REAL64
    end do
    write(*,*) "ftmp initialized on target"
  end subroutine init

  !dir$ attributes noinline :: vecadd
  subroutine vecadd(ftmp, ntm, npm, nr)
    real (REAL64), allocatable :: ftmp( :,:,: )
    integer, intent(in) :: ntm, npm, nr
    integer i,j,k
    call init(ftmp,ntm,npm,nr)
    do concurrent (k=1:nr,j=1:npm,i=1:ntm)
      ftmp(i,j,k) = ftmp(i,j,k) + 21.0_REAL64
    end do
    write(*,*) "ftmp updated on target"
  end subroutine vecadd
end module veclib

program vectest
  use ISO_FORTRAN_ENV
  use omp_lib
  use veclib
  implicit none
  integer :: ntm=512
  integer :: npm=1024
  integer :: nr=8
  real (REAL64), allocatable :: ftmp( :,:,: )
  print*, "ntm ", ntm
  print*, "npm ", npm
  print*, "nr ", nr
  print*, "calling vecadd"
  call vecadd(ftmp,ntm,npm,nr)
  print*, "done with vecadd"
  !$omp target update from(ftmp)
  !$omp target exit data map(delete:ftmp)
  deallocate (ftmp)
  write(*,*) "answer to it all ", ftmp(ntm,npm,nr)
end program vectest
I compile with
ifx -what -V -O3 -xhost -fp-model precise -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -fopenmp-do-concurrent-maptype-modifier=present -I./ test.f90
The seg fault happens with the deallocate(ftmp).
If that is removed, the code works.
The deallocate should be fine, since it happens on the CPU; the "!$omp target exit data map(delete:ftmp)" should only deallocate the copy on the device.
I believe this is where the problem is.
- Ron
And wrap this with your "mpiexec -np 1 ./a.out".
I am not skilled at reading the debug output. If this simple example works, then I'll escalate your debug output from the failed run and see if a driver developer can explain what happened.
Sorry - I put the deallocate before the array was used in the print.
Moving it to the end makes the test code work.
So I am still at a loss about what is wrong in my code.
(The test code works with "mpif90 -f90=ifx" and "mpiexec -np 1" as well, BTW.)
Harald and I were investigating this. Oh, I see you also found what I found!
We are finding inconsistent behaviors that we cannot explain. What I did was change my example: I changed the target exit map from:
!$omp target exit data map(from:ftmp)
to what you also use, which is:
!$omp target exit data map(delete:ftmp)
Now I get seg faults on this map delete as well, but only under certain circumstances that I cannot replicate or explain. Sometimes -O0 and -O1 work and -O2 fails; sometimes even -O2 fails after a rebuild. And this is on an old integrated graphics GPU.
So for sure there is some bug here. The drivers are very different from BMG (Battlemage). I will work with our OpenMP offload person in the Fortran front-end team to see if we are creating bad calls for the map exit delete.
We will continue to isolate this, but we are now reproducing what you were seeing.
It was reported to me offline that unlimiting the stack solved the problem. @caplanr, can you confirm?
After unlimiting the stack, this example works consistently for me.
I have an outstanding question as to why map(delete:ftmp) would need stack space. I would think it should be a simple call to free the memory on the device without any need for stack. I will ask the development team about this.
Hi,
Yes, it worked, but it was slow.
Also, when I pressed Ctrl+C to exit the run, it seg faulted.
Others at Intel are looking into it - I will keep you up to date.
- Ron
Ron,
I do not have a system that can duplicate (run) this example. What comes to mind is heap corruption within the GPU, where the error is not symptomatic until the delete. An unfounded suspicion I have is array temporaries:
Array = ArrayExpressionUsingAtemporary ! and/or a reallocLHS
From an earlier post, the error crept in when the model memory requirements got larger.
In the particular case of reallocLHS, the named identities for ftmp on both sides (CPU and GPU) would need to be updated. Again, this is a supposition, as I do not know the peculiarities of how the ftmp identities between the CPU and GPU are maintained. If the identities are keyed to the base address of the allocations, then a reallocLHS on the GPU side would not update the identity on the CPU (as it would not be appropriate to do so). But then, when the CPU issues a delete, it would NOT have a valid address to use as a handle within the GPU.
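To illustrate, here is a minimal, hypothetical sketch (not HipFT code; the helper function is made up) of how a reallocation on assignment could change ftmp's base address after the device mapping was created:
program realloc_lhs_sketch
  use ISO_FORTRAN_ENV, only: REAL64
  implicit none
  real(REAL64), allocatable :: ftmp(:,:,:)
  allocate (ftmp(512,1024,8))
  ftmp = 0.0_REAL64
  !$omp target enter data map(alloc:ftmp)  ! device copy keyed to ftmp's current base address
  ftmp = smaller_result()                  ! shapes differ, so reallocLHS frees and reallocates
                                           ! ftmp on the host, changing its base address
  !$omp target exit data map(delete:ftmp)  ! lookup by the (new) host address may no longer match
  deallocate (ftmp)
contains
  function smaller_result() result(r)
    real(REAL64), allocatable :: r(:,:,:)
    allocate (r(256,512,8))
    r = 21.0_REAL64
  end function smaller_result
end program realloc_lhs_sketch
Again, this is purely a supposition as to the mechanism.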
Jim Dempsey
Hi,
That sounds like a reasonable explanation.
I am not sure what would have changed from previous versions to current versions of compiler+driver to change the behavior.
On another note, I can now run my codes!
Thanks to help from Intel folks over e-mail, it turned out it required setting the stack limit to unlimited with "ulimit -s unlimited"
I tried the code+run on Stampede3, which has Intel Max 1550 GPUs, and it works fine with both the 2025.2 and the older 2025.0 compilers without needing to modify the stack.
So it looks like it is either an issue with the drivers+compiler specifically for the Arc GPUs, or some system issue that only occurred after updating to the newest compiler+driver.
I plan to eventually update that system's OS from Ubuntu 22.04 with a zabbly kernel to Ubuntu 24.04 with a standard kernel and will give it a try there without modifying the stack limit.
In the meantime, I can run the code again, so this will work for now (although it runs ~15% slower than before).
I will mark this as "accepted solution" for now and post an update whenever I get around to upgrading the system.
- Ron
Additional note:
This stack issue can also be avoided by using the "-heap-arrays" flag instead of setting the stack to unlimited.
This is what I usually do with IFX, but I had forgotten the flag here, which led to the problem.
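For reference, here is a hypothetical sketch (not HipFT code) of the kind of array temporary that -heap-arrays affects: passing an array expression to an assumed-shape dummy makes the compiler materialize a temporary, which it would otherwise typically place on the stack:
program temp_array_sketch
  use ISO_FORTRAN_ENV, only: REAL64
  implicit none
  real(REAL64), allocatable :: f(:,:,:)
  allocate (f(512,1024,8))
  f = 1.0_REAL64
  call takes_array(2.0_REAL64*f)   ! the expression "2.0*f" is evaluated into a temporary array
contains
  subroutine takes_array(a)
    real(REAL64), intent(in) :: a(:,:,:)
    write(*,*) 'sum = ', sum(a)
  end subroutine takes_array
end program temp_array_sketch
Whether a temporary like this (or the write_map_$FTMP seen in the debug log) is exactly what consumed the stack in my run, I am not sure.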
- Ron