Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Problem running a simple program using OpenMP and GPU

Arjen_Markus
Honored Contributor II

I have been trying to use OpenMP with offloading to a GPU. The program is quite simple, but I run into a problem that I cannot diagnose. It runs fine if the size of the matrix is 128x128 (n = 128 in the program). If I use a larger value the result is a crash:

--- failure if n > 128 ---
...>ifx diffu_gpu.f90 -Qopenmp -Qopenmp-targets=spir64
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.0.0 Build 20241008
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.44.35217.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:diffu_gpu.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
C:\Users\markus\AppData\Local\Temp\17608731760846.obj
C:\Users\markus\AppData\Local\Temp\17608414llc.o
-defaultlib:omptarget.lib

...>diffu_gpu
 Start time loop ...
omptarget error: Executing target region abort target.
omptarget error: Run with
omptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
omptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
omptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory

I have attached the program. I have also tried to use GPU teams, but I am afraid I simply do not understand how to use the directives. In any case, it made no difference.
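
For readers without the attachment, here is a minimal sketch of how a diffusion time loop is commonly offloaded with the combined "target teams distribute parallel do" directive. It is not the attached diffu_gpu.f90 and it does not diagnose the crash; the program name, array names, grid size, step count and coefficient are all assumptions made for illustration.

```fortran
! Hypothetical sketch, NOT the attached diffu_gpu.f90.
! All names and values below are assumptions for illustration.
program diffu_sketch
    implicit none
    integer, parameter :: n = 256        ! assumed grid size
    integer, parameter :: nsteps = 100   ! assumed number of time steps
    real, parameter    :: c = 0.1        ! assumed diffusion number
    real, allocatable  :: u(:,:), unew(:,:)
    integer :: i, j, step

    ! Allocatable arrays live on the heap rather than the stack
    allocate(u(n,n), unew(n,n))
    u = 0.0
    u(n/2, n/2) = 1.0
    unew = u

    ! Keep both arrays on the device for the whole time loop,
    ! so they are not copied back and forth at every step
    !$omp target data map(tofrom: u) map(to: unew)
    do step = 1, nsteps
        ! Five-point stencil on the interior points
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                unew(i,j) = u(i,j) + c * (u(i-1,j) + u(i+1,j) &
                                        + u(i,j-1) + u(i,j+1) - 4.0*u(i,j))
            end do
        end do
        !$omp end target teams distribute parallel do

        ! Copy the new values back into u, still on the device
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                u(i,j) = unew(i,j)
            end do
        end do
        !$omp end target teams distribute parallel do
    end do
    !$omp end target data

    print *, 'Sum of field: ', sum(u)
end program diffu_sketch
```

Built with the same options as above (-Qopenmp -Qopenmp-targets=spir64) the loops should be offloaded; without OpenMP the directives are plain comments and the program runs serially, which gives a reference result to compare against.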

JohnNichols
Valued Contributor III

This works, I think, on Windows 11 with VS 2022. I had to add the set-threads call and turn on OpenMP in the properties page.
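
As a sketch of what a set-threads call can look like (an assumption, since the modified source is not shown in this thread): the standard OpenMP routine for this is omp_set_num_threads from the omp_lib module. The program name and the thread count of 4 are picked for illustration.

```fortran
! Hypothetical example; the thread count of 4 is an assumption.
program set_threads_sketch
    use omp_lib          ! OpenMP runtime routines; requires OpenMP to be enabled
    implicit none
    integer :: nthreads

    ! Request 4 threads for subsequent parallel regions
    call omp_set_num_threads(4)

    nthreads = 0
    !$omp parallel
    !$omp single
    ! One thread records how many threads the region actually got
    nthreads = omp_get_num_threads()
    !$omp end single
    !$omp end parallel

    print *, 'Threads in parallel region: ', nthreads
end program set_threads_sketch
```

The runtime is allowed to provide fewer threads than requested, so printing the actual count is a cheap check that the setting took effect.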

For 2280 it runs in 25 seconds with OpenMP on and 66 seconds with it off; the CPU time is about the same, as you would expect. With six threads it is not much faster, about 23 seconds. Diminishing returns for more threads, as I understand it.

It runs in 7 seconds for 1180, as I have 4 threads.

I have no idea about the GPU bit. I thought we needed CUDA for that, as I have an NVIDIA card.

But Jim is the expert.  

JohnNichols
Valued Contributor III

Screenshot 2026-01-19 213948.png

With one thread the CPU time is a little less but the clock time is 3 times as long, and I remember something from Jim about the efficiency decreasing with increasing threads. But for IFX you need to set an environment variable or call num_threads; I prefer the Fortran way, it is easier. So for a Core i7 Dell, 1000 takes 7 seconds on 4 threads and 2000 takes 23, so your 10,000 will take 11 minutes.

Thanks, I have not done this before.
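
The extrapolation to 10,000 depends on how the run time is assumed to scale with the size. The small sketch below works it out two ways from the timings quoted in this post: assuming the time grows with the square of the size, and fitting a power law through the two measured points. The program and variable names are made up for illustration, and two data points are a thin basis for any fit.

```fortran
! Hypothetical helper: extrapolates run time from the two timings quoted above.
program extrapolate_time
    implicit none
    real, parameter :: n1 = 1000.0, t1 = 7.0    ! size and seconds from the post
    real, parameter :: n2 = 2000.0, t2 = 23.0   ! size and seconds from the post
    real, parameter :: n3 = 10000.0             ! target size
    real :: p, t_quad, t_fit

    ! Assumption 1: time grows as n**2 (work proportional to the number of cells)
    t_quad = t1 * (n3 / n1)**2

    ! Assumption 2: time grows as n**p, with p fitted to the two measurements
    p = log(t2 / t1) / log(n2 / n1)
    t_fit = t2 * (n3 / n2)**p

    print '(a,f8.1,a,f6.1,a)', 'Quadratic scaling: ', t_quad, ' s (', t_quad / 60.0, ' min)'
    print '(a,f6.3)',          'Fitted exponent:   ', p
    print '(a,f8.1,a,f6.1,a)', 'Power-law fit:     ', t_fit, ' s (', t_fit / 60.0, ' min)'
end program extrapolate_time
```

Quadratic scaling from the 1000 case gives 700 seconds, in line with the 11 minutes mentioned, while the fitted exponent comes out below 2 and points to a somewhat shorter time. Either way it is an estimate, not a measurement.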

Screenshot 2026-01-19 214335.png
