Intel® Fortran Compiler

Problem running a simple program using OpenMP and GPU

Arjen_Markus
Honored Contributor II

I have been trying to use OpenMP with offloading to a GPU. The program is quite simple, but I run into a problem that I cannot diagnose. It runs fine if the size of the matrix is 128x128 (n = 128 in the program). If I use a larger value the result is a crash:

--- failure if n > 128 ---
...>ifx diffu_gpu.f90 -Qopenmp -Qopenmp-targets=spir64
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.0.0 Build 20241008
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.44.35217.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:diffu_gpu.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
C:\Users\markus\AppData\Local\Temp\17608731760846.obj
C:\Users\markus\AppData\Local\Temp\17608414llc.o
-defaultlib:omptarget.lib

...>diffu_gpu
 Start time loop ...
omptarget error: Executing target region abort target.
omptarget error: Run with
omptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
omptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
omptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory

I have attached the program. I have also tried to use GPU teams, but I am afraid I simply do not understand how to use the directives. In any case, it made no difference.
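
For readers without the attachment, here is a minimal sketch of what an offloaded diffusion step with GPU teams might look like. It is not the attached diffu_gpu.f90: the program name, the array names u and unew, the values of n, nsteps and coeff, and the five-point stencil are all assumptions made for illustration. The arrays are allocatable and the target data region keeps them on the device for the whole time loop.

```fortran
! Hypothetical sketch, NOT the attached diffu_gpu.f90.
! All names and parameter values below are assumptions.
program diffu_offload_sketch
    implicit none
    integer, parameter :: n = 256        ! assumed grid size
    integer, parameter :: nsteps = 100   ! assumed number of time steps
    real, parameter    :: coeff = 0.1    ! assumed value of D*dt/dx**2
    real, allocatable  :: u(:,:), unew(:,:)
    integer :: i, j, step

    allocate (u(n,n), unew(n,n))
    u = 0.0
    u(n/2,n/2) = 1.0                     ! point source in the middle
    unew = u

    write (*,*) 'Start time loop ...'

    ! Copy both arrays to the device once; copy u back at the end
    !$omp target data map(tofrom: u) map(to: unew)
    do step = 1, nsteps
        ! Interior update, distributed over teams and threads
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                unew(i,j) = u(i,j) + coeff * &
                    (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) &
                     - 4.0 * u(i,j))
            end do
        end do

        ! Copy the new interior values back into u, still on the device
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                u(i,j) = unew(i,j)
            end do
        end do
    end do
    !$omp end target data

    write (*,*) 'Sum of u: ', sum(u)
end program diffu_offload_sketch
```

This can be built with the same command as shown above (ifx with -Qopenmp -Qopenmp-targets=spir64). Whether it avoids the failure for n > 128 has not been tested; it only illustrates where the directives go.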

JohnNichols
Valued Contributor III

This works, I think, on Windows 11 with VS 2022. I had to add the call that sets the number of threads and turn on OpenMP in the project properties page.

For a size of 2280 it runs in 25 seconds with OpenMP on and 66 seconds with it off; the CPU time is about the same, as you would expect. With six threads it is not much faster, about 23 seconds. As I understand it, adding more threads gives diminishing returns.

For a size of 1180 it runs in 7 seconds with the 4 threads I have.

I have no idea about the GPU part. I thought we needed CUDA for that, as I have an NVIDIA card.

But Jim is the expert.  

JohnNichols
Valued Contributor III

Screenshot 2026-01-19 213948.png

With one thread the CPU time is a little less, but the wall-clock time is 3 times as long, and I remember something from Jim about efficiency decreasing as the number of threads increases. For IFX you need to either set an environment variable (OMP_NUM_THREADS) or call omp_set_num_threads; I prefer the Fortran way, it is easier. On a Dell with a Core i7, a size of 1000 takes 7 seconds on 4 threads and 2000 takes 23 seconds, so your 10,000 will take about 11 minutes.

Thanks, I have not done this before.

Screenshot 2026-01-19 214335.png
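
As an illustration of setting the thread count from within the program, a minimal sketch follows. The program name and the thread count of 4 are assumptions; omp_set_num_threads and omp_get_num_threads are standard routines from the omp_lib module. The alternative is to leave the source alone and set the OMP_NUM_THREADS environment variable before starting the program.

```fortran
! Hypothetical sketch: request a thread count from the program itself.
program threads_sketch
    use omp_lib
    implicit none
    integer :: nthreads

    ! Ask for 4 threads in subsequent parallel regions
    ! (4 is an assumption, matching the machine described above)
    call omp_set_num_threads(4)

    nthreads = 1
    !$omp parallel
    !$omp single
    ! One thread records how many threads the region actually has
    nthreads = omp_get_num_threads()
    !$omp end single
    !$omp end parallel

    write (*,*) 'Threads in parallel region: ', nthreads
end program threads_sketch
```

The runtime may provide fewer threads than requested, so printing the actual count is a cheap check that OpenMP is switched on in the build.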

Arjen_Markus
Honored Contributor II

Thanks for these experiments. Meanwhile, I got a suggestion from Damian Rouson (as a follow-up to his presentation "Please, No More Loops (Than Necessary): New Patterns in Fortran 2023" yesterday) to use a DO CONCURRENT loop instead. This works, and I can see that the GPU is very busy with my program. The clear advantage is that you do not need all these OpenMP directives, but I am currently a bit puzzled about how to control the data transfer. Anyway, the fact that this version of the program does run is a big step forward :).
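
A minimal sketch of what such a DO CONCURRENT version might look like, for readers following along. It is not the actual program from this thread: the names u and unew, the parameter values and the stencil are assumptions. The thread does not say which compiler options were used; the ifx documentation lists an option for offloading DO CONCURRENT (-fopenmp-target-do-concurrent, /Qopenmp-target-do-concurrent on Windows), but treat that as something to check for your compiler version. The target data directive around the time loop is one possible way to control the data transfer, included here as an untested assumption; without OpenMP it is simply a comment.

```fortran
! Hypothetical sketch of a DO CONCURRENT diffusion step.
! All names and parameter values below are assumptions.
program diffu_do_concurrent_sketch
    implicit none
    integer, parameter :: n = 256        ! assumed grid size
    integer, parameter :: nsteps = 100   ! assumed number of time steps
    real, parameter    :: coeff = 0.1    ! assumed value of D*dt/dx**2
    real, allocatable  :: u(:,:), unew(:,:)
    integer :: i, j, step

    allocate (u(n,n), unew(n,n))
    u = 0.0
    u(n/2,n/2) = 1.0                     ! point source in the middle
    unew = u

    ! Assumption: keep the arrays on the device across time steps.
    ! Ignored as a plain comment when OpenMP is not enabled.
    !$omp target data map(tofrom: u) map(to: unew)
    do step = 1, nsteps
        ! Iterations are independent: each one writes its own unew(i,j)
        do concurrent (j = 2:n-1, i = 2:n-1)
            unew(i,j) = u(i,j) + coeff * &
                (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) &
                 - 4.0 * u(i,j))
        end do

        ! Copy the new interior values back into u
        do concurrent (j = 2:n-1, i = 2:n-1)
            u(i,j) = unew(i,j)
        end do
    end do
    !$omp end target data

    write (*,*) 'Sum of u: ', sum(u)
end program diffu_do_concurrent_sketch
```

Setting LIBOMPTARGET_DEBUG=1, as suggested in the error output earlier in the thread, may help to see when data is actually moved between host and device.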
