I have been trying to use OpenMP with offloading to a GPU. The program is quite simple, but I run into a problem that I cannot diagnose. It runs fine if the size of the matrix is 128x128 (n = 128 in the program). If I use a larger value the result is a crash:
--- failure if n > 128 ---
...>ifx diffu_gpu.f90 -Qopenmp -Qopenmp-targets=spir64
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.0.0 Build 20241008
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.44.35217.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:diffu_gpu.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
C:\Users\markus\AppData\Local\Temp\17608731760846.obj
C:\Users\markus\AppData\Local\Temp\17608414llc.o
-defaultlib:omptarget.lib
...>diffu_gpu
Start time loop ...
omptarget error: Executing target region abort target.
omptarget error: Run with
omptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
omptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
omptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory
I have attached the program. I have also tried to use GPU teams, but I am afraid I simply do not understand how to use the directives. In any case, it made no difference.
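For reference, here is a minimal sketch of how the offload directives are commonly combined for a diffusion-type stencil. It is not the attached diffu_gpu.f90 and it is not a diagnosis of the crash; the program name, the array names u and unew, and the values of n, nsteps and coef are all assumptions made for illustration. It should build with the same flags as above (-Qopenmp -Qopenmp-targets=spir64).

```fortran
! Hypothetical sketch, NOT the attached program: one explicit diffusion
! update per time step, offloaded with "target teams distribute parallel do".
program diffu_sketch
    implicit none
    integer, parameter :: n = 256        ! grid size (assumed)
    integer, parameter :: nsteps = 100   ! number of time steps (assumed)
    real, parameter    :: coef = 0.1     ! dt*D/dx**2 (assumed; below 0.25 for stability)
    real, allocatable  :: u(:,:), unew(:,:)
    integer :: i, j, step

    allocate(u(n,n), unew(n,n))
    u = 0.0
    u(n/2, n/2) = 1.0                    ! point source in the middle
    unew = u

    ! Keep both arrays on the device for the whole time loop, so they
    ! are not copied back and forth at every step.
    !$omp target data map(tofrom: u) map(to: unew)
    do step = 1, nsteps
        ! Teams split the collapsed (i,j) iteration space across the GPU;
        ! the threads inside each team share that team's chunk.
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                unew(i,j) = u(i,j) + coef * (u(i-1,j) + u(i+1,j) &
                          + u(i,j-1) + u(i,j+1) - 4.0*u(i,j))
            end do
        end do
        !$omp end target teams distribute parallel do

        ! Copy the interior back on the device for the next step.
        !$omp target teams distribute parallel do collapse(2)
        do j = 2, n-1
            do i = 2, n-1
                u(i,j) = unew(i,j)
            end do
        end do
        !$omp end target teams distribute parallel do
    end do
    !$omp end target data

    write(*,*) 'Sum of u: ', sum(u)
end program diffu_sketch
```

The target data region is the part that usually matters most for run time: without it, each target construct maps the arrays to the device and back again.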
This works, I think, on Windows 11 with VS 2022. I had to add the set-threads call and turn on OpenMP in the project properties page.
For a size of 2280 it runs in 25 seconds with OpenMP on and 66 seconds with it off; the CPU time is about the same, as you would expect. With six threads it is not much faster, about 23 seconds. As I understand it, there are diminishing returns with more threads.
For a size of 1180 it runs in 7 seconds, as I have 4 threads.
I have no idea about the GPU part. I thought we needed CUDA for that, as I have an NVIDIA card.
But Jim is the expert.
With one thread the CPU time is a little less, but the clock time is three times as long, and I remember something from Jim about the efficiency decreasing as the number of threads increases. For IFX you need to either set an environment variable (OMP_NUM_THREADS) or call omp_set_num_threads; I prefer the Fortran way, it is easier. On a Core i7 Dell, a size of 1000 takes 7 seconds on 4 threads and 2000 takes 23 seconds, so your 10,000 will take about 11 minutes.
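In case it helps, this is roughly what the Fortran way looks like. It is only a sketch: the program name and the thread count of 4 are assumptions chosen to match the timings quoted here, not something taken from the attached program.

```fortran
! Hypothetical sketch: set the number of OpenMP threads from Fortran
! and check how many threads a parallel region actually gets.
program set_threads_sketch
    use omp_lib
    implicit none
    integer :: nthreads

    call omp_set_num_threads(4)          ! request 4 threads (assumed value)

    !$omp parallel
    !$omp single
    nthreads = omp_get_num_threads()     ! one thread records the team size
    !$omp end single
    !$omp end parallel

    write(*,*) 'Threads in parallel region: ', nthreads
end program set_threads_sketch
```

The alternative is to leave the source alone and set OMP_NUM_THREADS=4 in the environment before starting the program. A call to omp_set_num_threads in the code takes precedence over the environment variable.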
Thanks, I have not done this before.