Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Problem running a simple program using OpenMP and GPU

Arjen_Markus
Honored Contributor II
253 Views

I have been trying to use OpenMP with offloading to a GPU. The program is quite simple, but I run into a problem that I cannot diagnose. It runs fine if the size of the matrix is 128x128 (n = 128 in the program). If I use a larger value the result is a crash:

--- failure if n > 128 ---
...>ifx diffu_gpu.f90 -Qopenmp -Qopenmp-targets=spir64
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.0.0 Build 20241008
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.44.35217.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:diffu_gpu.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
C:\Users\markus\AppData\Local\Temp\17608731760846.obj
C:\Users\markus\AppData\Local\Temp\17608414llc.o
-defaultlib:omptarget.lib

...>diffu_gpu
 Start time loop ...
omptarget error: Executing target region abort target.
omptarget error: Run with
omptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
omptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
omptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory

I have attached the program. I have also tried to use GPU teams, but I am afraid I simply do not understand how to use the directives. In any case, it made no difference.
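For readers unfamiliar with the "teams" directives, here is a minimal sketch of how a diffusion-type update is commonly offloaded with OpenMP. It is not the attached program: the names (diffu_sketch, u, unew, coef, nsteps), the stencil and the sizes are assumptions made for illustration, and it is not claimed to cure the crash above. The arrays are allocatable and single precision; the `target data` region keeps both arrays on the device for the whole time loop, so they are copied once in and once out rather than at every step.

```fortran
program diffu_sketch
    ! Hypothetical example, not the program attached to the post.
    use iso_fortran_env, only: real32
    implicit none
    integer, parameter :: n = 256, nsteps = 100
    real(real32), parameter :: coef = 0.1_real32
    real(real32), allocatable :: u(:,:), unew(:,:)
    integer :: i, j, step

    allocate(u(n,n), unew(n,n))
    u = 0.0_real32
    u(n/2, n/2) = 1.0_real32      ! single hot spot as initial condition
    unew = u

    ! Copy u to the device once and back once; unew is only needed on the device.
    !$omp target data map(tofrom: u) map(to: unew)
    do step = 1, nsteps
        ! Distribute the interior points over teams and threads on the device.
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                unew(i,j) = u(i,j) + coef * (u(i-1,j) + u(i+1,j) &
                          + u(i,j-1) + u(i,j+1) - 4.0_real32 * u(i,j))
            end do
        end do
        !$omp end target teams distribute parallel do

        ! Copy the new values back into u, still on the device.
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                u(i,j) = unew(i,j)
            end do
        end do
        !$omp end target teams distribute parallel do
    end do
    !$omp end target data

    write(*,*) 'Sum of u after the time loop: ', sum(u)
end program diffu_sketch
```

It would be built with the same command line as shown in the post (ifx with -Qopenmp -Qopenmp-targets=spir64). Setting LIBOMPTARGET_DEBUG=1 before running, as the error message suggests, should show which mapping or kernel launch fails.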

4 Replies
JohnNichols
Valued Contributor III
208 Views

This works, I think, on Windows 11 with VS 2022. I had to add the call that sets the number of threads and turn on OpenMP in the properties page.

For a size of 2280 it runs in 25 seconds with OpenMP on and 66 seconds with it off; the CPU time is about the same, as you would expect. With six threads it is not much faster, about 23 seconds. As I understand it, there are diminishing returns with more threads.

It runs in 7 seconds for 1180, as I have 4 threads.

I have no idea about the GPU part. I thought we needed CUDA for that, as I have an NVIDIA card.

But Jim is the expert.  

JohnNichols
Valued Contributor III
204 Views

Screenshot 2026-01-19 213948.png

With one thread the CPU time is a little less, but the clock time is three times as long, and I remember something from Jim about efficiency decreasing as the number of threads increases. But for IFX you need to set an environment variable or call a routine to set the number of threads. I prefer the Fortran way, it is easier. So for a Core i7 Dell the times are: 1000 takes 7 seconds on 4 threads and 2000 takes 23, so your 10,000 will take about 11 minutes.
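A minimal sketch of "the Fortran way", assuming the standard OpenMP runtime routine omp_set_num_threads is what is meant; the program name and the count of 4 are only illustrative. The environment-variable route would be OMP_NUM_THREADS set before the program starts.

```fortran
program threads_sketch
    ! Hypothetical example: request a thread count from within the program.
    use omp_lib
    implicit none
    integer :: nthreads

    ! Request 4 threads for subsequent parallel regions.
    call omp_set_num_threads(4)

    !$omp parallel
    !$omp single
    ! One thread records how many threads the region actually got.
    nthreads = omp_get_num_threads()
    !$omp end single
    !$omp end parallel

    write(*,*) 'Threads in parallel region: ', nthreads
end program threads_sketch
```

The runtime may still provide fewer threads than requested, so printing the actual number is a useful check when comparing timings.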

Thanks, I have not done this before.

Screenshot 2026-01-19 214335.png

Arjen_Markus
Honored Contributor II
52 Views

Thanks for these experiments. Meanwhile, I got a suggestion from Damian Rouson (as a follow-up to his presentation "Please, No More Loops (Than Necessary): New Patterns in Fortran 2023" yesterday) to use a DO CONCURRENT loop instead. This works, and I can see that the GPU is very busy with my program. The clear advantage is that you do not need all these OpenMP directives, but I am currently a bit puzzled about how to control the data transfer. Anyway, the fact that this version of the program does run is a big step forward :).
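For illustration, a DO CONCURRENT version of a diffusion-type time loop might look like the sketch below. This is not the actual program: the names (diffu_dc_sketch, u, unew, coef, nsteps) and the stencil are assumptions. The loop body is plain standard Fortran; whether it runs on the GPU depends on the compiler options. As far as I know, ifx needs -Qopenmp-target-do-concurrent (Windows) or -fopenmp-target-do-concurrent (Linux) in addition to the OpenMP target options for that.

```fortran
program diffu_dc_sketch
    ! Hypothetical example, not the program discussed in the thread.
    use iso_fortran_env, only: real32
    implicit none
    integer, parameter :: n = 256, nsteps = 100
    real(real32), parameter :: coef = 0.1_real32
    real(real32), allocatable :: u(:,:), unew(:,:)
    integer :: i, j, step

    allocate(u(n,n), unew(n,n))
    u = 0.0_real32
    u(n/2, n/2) = 1.0_real32      ! single hot spot as initial condition
    unew = u

    do step = 1, nsteps
        ! Each interior point is independent, so the iterations may run in any order.
        do concurrent (j = 2:n-1, i = 2:n-1)
            unew(i,j) = u(i,j) + coef * (u(i-1,j) + u(i+1,j) &
                      + u(i,j-1) + u(i,j+1) - 4.0_real32 * u(i,j))
        end do

        ! Second loop: copy the new values back once all points are updated.
        do concurrent (j = 2:n-1, i = 2:n-1)
            u(i,j) = unew(i,j)
        end do
    end do

    write(*,*) 'Sum of u after the time loop: ', sum(u)
end program diffu_dc_sketch
```

On the data-transfer question: without further directives the compiler decides when the arrays move between host and device. Enclosing the time loop in an OpenMP target data region is one way that might keep them resident, but I have not verified that here, so check it against the compiler documentation.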

PGC
Beginner
23 Views

I modified the program with the help of Gemini 3 to run on my T14s ThinkPad with an Intel(R) Iris(R) Xe Graphics 12.0.0.

I had to scale back to single precision, real(4), because the GPU cannot do double precision.

Apparently you need the Codeplay oneAPI Plugins to use an NVIDIA GPU. I have not done that; it seems complicated. Perhaps someone can put together a simple working example.

 

On my laptop with the Iris(R) Xe I get this:

Starting Performance Comparison...
Array Size: 100000
Math intensity: 20000 operations per element

Running on CPU...
CPU Time: 3.9638 seconds
Running on GPU (Iris Xe)...
GPU Time: 0.0577 seconds

Speedup: 68.66x
