Re: Problem running a simple program using OpenMP and GPU

Arjen_Markus · ‎01-19-2026

I have been trying to use OpenMP with offloading to a GPU. The program is quite simple, but I run into a problem that I cannot diagnose. It runs fine if the size of the matrix is 128x128 (n = 128 in the program). If I use a larger value the result is a crash:

--- failure if n > 128 ---
...>ifx diffu_gpu.f90 -Qopenmp -Qopenmp-targets=spir64
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.0.0 Build 20241008
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.44.35217.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:diffu_gpu.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
C:\Users\markus\AppData\Local\Temp\17608731760846.obj
C:\Users\markus\AppData\Local\Temp\17608414llc.o
-defaultlib:omptarget.lib

...>diffu_gpu
 Start time loop ...
omptarget error: Executing target region abort target.
omptarget error: Run with
omptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
omptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
omptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory

I have attached the program. I have also tried to use GPU teams, but I am afraid I simply do not understand how to use the directives. In any case, it made no difference.
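
For readers unfamiliar with the directives, the following is a minimal sketch of how a diffusion-type update could be offloaded with a combined "target teams" construct. It is not the attached diffu_gpu.f90: the program name, the values of n, nsteps and coef, the array names and the five-point stencil are all assumptions made for illustration, and nothing here is claimed to explain or cure the crash above.

```fortran
program diffu_target_sketch
    ! Hypothetical example, NOT the attached program.
    implicit none
    integer, parameter :: n = 256        ! assumed grid size
    integer, parameter :: nsteps = 100   ! assumed number of time steps
    real, parameter    :: coef = 0.1     ! assumed diffusion number
    real, allocatable  :: u(:,:), unew(:,:)
    integer :: i, j, step

    allocate(u(n,n), unew(n,n))
    u = 0.0
    u(n/2, n/2) = 1.0                    ! single point source
    unew = u

    ! Keep both arrays on the device for the whole time loop;
    ! only u is copied back to the host at the end.
    !$omp target data map(tofrom: u) map(to: unew)
    do step = 1, nsteps
        ! Spread the interior points over teams and threads on the device
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                unew(i,j) = u(i,j) + coef * ( u(i-1,j) + u(i+1,j) &
                                            + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) )
            end do
        end do
        !$omp end target teams distribute parallel do

        ! Copy the new interior values back into u, still on the device
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                u(i,j) = unew(i,j)
            end do
        end do
        !$omp end target teams distribute parallel do
    end do
    !$omp end target data

    print *, 'Sum of u after time loop: ', sum(u)
end program diffu_target_sketch
```

The enclosing "target data" region is there so that the arrays are transferred once rather than at every time step. The sketch can be built with the same command line as shown in the log above.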

JohnNichols · ‎01-19-2026

This works, I think, on Windows 11 with VS 2022. I had to add the set threads call and turn on OpenMP in the properties page.

For a size of 2280 it runs in 25 seconds with OpenMP on and 66 seconds with it off; the CPU time is about the same, as you would expect. With six threads it is not much faster, about 23 seconds. Diminishing returns for more threads, as I understand it.

For 1180 it runs in 7 seconds with the 4 threads I have.

I have no idea about the GPU bit. I thought we needed CUDA for that, as I have an NVIDIA card.

But Jim is the expert.

JohnNichols · ‎01-19-2026

With one thread the CPU time is a little less but the clock time is 3 times as long, and I remember something from Jim about the efficiency decreasing as the number of threads increases. But for IFX you need to set an environment variable or call num_threads; I prefer the Fortran way, it is easier. So for a Core i7 Dell, 1000 takes 7 seconds on 4 threads and 2000 takes 23, so your 10,000 will take about 11 minutes.

Thanks, I have not done this before.
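
For reference, the thread count can be set either from the environment or from within the program. The following is a minimal sketch, not the code used for the timings above; it assumes the standard OpenMP routine omp_set_num_threads is the "set threads call" meant here, and the program name and the choice of four threads are made up for illustration.

```fortran
program set_threads_sketch
    ! Hypothetical example, not the program that was timed.
    use omp_lib
    implicit none
    integer :: nthreads

    ! Request four threads for subsequent parallel regions
    ! (the runtime may still provide fewer).
    call omp_set_num_threads(4)

    !$omp parallel
    !$omp single
    nthreads = omp_get_num_threads()   ! team size of this parallel region
    !$omp end single
    !$omp end parallel

    print *, 'Threads in use: ', nthreads
end program set_threads_sketch
```

The environment-variable route is the standard OMP_NUM_THREADS variable, for example "set OMP_NUM_THREADS=4" in a Windows command prompt before starting the program.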

Arjen_Markus · ‎01-22-2026

Thanks for these experiments. Meanwhile, I got a suggestion from Damian Rouson (as a follow-up of his presentation "Please, No More Loops (Than Necessary): New Patterns in Fortran 2023" yesterday) to use a DO CONCURRENT loop instead. This works and I can see that the GPU is very busy with my program. The advantage is clearly that you do not need all these OpenMP directives, but I am currently a bit puzzled about controlling the data transfer. Anyway, the fact that this version of the program does run is a big step forward :).
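
To illustrate the idea for readers of this thread, a DO CONCURRENT version of a diffusion-type update could look roughly like the sketch below. It is not the actual program: the program name, the values of n, nsteps and coef, the array names and the five-point stencil are assumptions made for illustration.

```fortran
program diffu_do_concurrent_sketch
    ! Hypothetical example, NOT the program discussed in this thread.
    implicit none
    integer, parameter :: n = 256        ! assumed grid size
    integer, parameter :: nsteps = 100   ! assumed number of time steps
    real, parameter    :: coef = 0.1     ! assumed diffusion number
    real, allocatable  :: u(:,:), unew(:,:)
    integer :: i, j, step

    allocate(u(n,n), unew(n,n))
    u = 0.0
    u(n/2, n/2) = 1.0                    ! single point source
    unew = u

    do step = 1, nsteps
        ! Each iteration writes only its own element of unew and reads u,
        ! so the iterations are independent, as DO CONCURRENT requires.
        do concurrent (j = 2:n-1, i = 2:n-1)
            unew(i,j) = u(i,j) + coef * ( u(i-1,j) + u(i+1,j) &
                                        + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) )
        end do

        ! Copy the new interior values back into u
        do concurrent (j = 2:n-1, i = 2:n-1)
            u(i,j) = unew(i,j)
        end do
    end do

    print *, 'Sum of u after time loop: ', sum(u)
end program diffu_do_concurrent_sketch
```

No directives appear in the source; whether the loops actually run on the GPU depends on the compiler options. As far as the ifx documentation goes, this is requested with /Qopenmp-target-do-concurrent (-fopenmp-target-do-concurrent on Linux) in addition to the OpenMP target options, but it is worth checking the documentation for the compiler version in use.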