How can I enable the GPU to run a parallel do loop in a FORTRAN program? I have tried: "!$omp target teams distribute parallel do" but the GPU does not run. Any advice?
Ah! You use the matmul() intrinsic. When you compile with /Qopenmp on Windows or -qopenmp on Linux, the matmul() intrinsic is automatically parallelized using OpenMP.
That can be disabled with /Qopt-matmul- on Windows or -qno-opt-matmul on Linux. See the Developer Guide.
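For readers without the attachment, here is a minimal sketch of a program of this kind. The attached z.r1.for is not shown in the thread, so the program name, array size, and values below are assumptions for illustration only, and the sketch is written in free form (save it as .f90 rather than .for).

```fortran
! Sketch only: not the attached z.r1.for. The size n and the fill values
! are hypothetical.
program matmul_cpu
  use omp_lib                      ! provides omp_get_num_procs
  implicit none
  integer, parameter :: n = 3      ! hypothetical matrix size
  double precision :: a(n,n), b(n,n), c(n,n)

  a = 1.0d0
  b = 2.0d0

  ! When built with /Qopenmp (or -qopenmp), the compiler may replace this
  ! intrinsic call with a threaded library routine.
  c = matmul(a, b)

  print *, 'Number of procs is', omp_get_num_procs()
  print '(3f8.2)', c               ! every element is n * 1 * 2
end program matmul_cpu
```

Building it once with the matmul optimization enabled and once with it disabled is one way to see whether the threaded matmul matters for a given problem size.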
So far I've just run this on CPU. iGPU is next.
Attached is an update of z.for called z.r1.for. Here's the output I got for CPU:
Q:\06129705>ifx /Qopenmp z.r1.for
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240103
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.38.33130.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:z.r1.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
z.r1.obj
Q:\06129705>z.r1.exe
Number of procs is 8
6000000000000.00 6000000000000.00 6000000000000.00
6000000000000.00 6000000000000.00 6000000000000.00
6000000000000.00 6000000000000.00 6000000000000.00
I have the Intel GPU version running and printing the same answers as the CPU version. It's attached as z.r2.for.
Two MAJOR changes:
- double precision is not supported on Intel(R) Iris(R) Xe Graphics. I made the arrays real.
- It's poor coding practice to use the same variable for the loop index and the end of the loop. This loop construct yields the wrong answer on Intel GPU.
! do i=1,i
! Using this construct gives the same answer as running on CPU.
do j=1,i
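The two changes can be sketched together as follows. This is not the attached z.r2.for; the program name, array names, and problem size are hypothetical, and the map clauses are one reasonable choice rather than the only one. It is written in free form (save as .f90).

```fortran
! Sketch only: single-precision arrays and a loop index that is distinct
! from the loop bound, offloaded with an OpenMP target construct.
program offload_sketch
  implicit none
  integer, parameter :: n = 1000   ! hypothetical problem size
  real :: a(n), b(n), c(n)         ! real, not double precision
  integer :: i, j

  a = 1.0
  b = 2.0
  i = n                            ! i holds the upper bound only

  ! j is the loop index; i is never used as both index and bound.
  !$omp target teams distribute parallel do map(to: a, b) map(from: c)
  do j = 1, i
    c(j) = a(j) + b(j)
  end do
  !$omp end target teams distribute parallel do

  print *, 'c(1) =', c(1), ' c(n) =', c(n)
end program offload_sketch
```

On Windows this would be compiled the same way as z.r2.for, with /Qopenmp /Qopenmp-targets=spir64.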
When I run it in a oneAPI command window I type:
set LIBOMPTARGET_PLUGIN_PROFILE=T
to get a profile printout. I set it to confirm that the code actually runs on the Intel GPU: if no profile is printed, the code did not offload.
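A complementary check can be made from inside the program with the standard OpenMP runtime routines omp_get_num_devices and omp_is_initial_device. The sketch below is an assumption about how one might use them here and is not part of the attached files; the program and variable names are hypothetical.

```fortran
! Sketch only: reports whether a target region executed on a device
! or fell back to the host.
program check_offload
  use omp_lib                      ! omp_get_num_devices, omp_is_initial_device
  implicit none
  logical :: on_host

  print *, 'Offload devices visible:', omp_get_num_devices()

  on_host = .true.
  !$omp target map(tofrom: on_host)
  on_host = omp_is_initial_device()  ! .false. when running on the device
  !$omp end target

  if (on_host) then
    print *, 'Target region ran on the host (no offload).'
  else
    print *, 'Target region ran on a device.'
  end if
end program check_offload
```

This does not replace the profile printout, which also shows kernel and data-transfer timings, but it gives a quick yes/no answer.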
>set LIBOMPTARGET_PLUGIN_PROFILE=T
>ifx /Qopenmp /Qopenmp-targets=spir64 z.r2.for
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240103
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.38.33130.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:z.r2.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
C:\Users\bperz\AppData\Local\Temp\1932873.obj
C:\Users\bperz\AppData\Local\Temp\19328414.o
-defaultlib:omptarget.lib
>z.r2.exe
Number of procs is 8
6.0000001E+12 6.0000001E+12 6.0000001E+12 6.0000001E+12 6.0000001E+12
6.0000001E+12 6.0000001E+12 6.0000001E+12 6.0000001E+12
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) Iris(R) Xe Graphics, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0 : __omp_offloading_80ff4b02_5bc1261f_MAIN___l27
----------------------------------------------------------------------------------------------------------------------
: Host Time (msec) Device Time (msec)
Name : Total Average Min Max Total Average Min Max Count
----------------------------------------------------------------------------------------------------------------------
Compiling : 397.03 397.03 397.03 397.03 0.00 0.00 0.00 0.00 1.00
DataAlloc : 0.58 0.06 0.00 0.42 0.00 0.00 0.00 0.00 9.00
DataRead (Device to Host) : 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
DataWrite (Host to Device): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
Kernel 0 : 0.51 0.51 0.51 0.51 0.10 0.10 0.10 0.10 1.00
Linking : 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
OffloadEntriesInit : 3.88 3.88 3.88 3.88 0.00 0.00 0.00 0.00 1.00
======================================================================================================================
There are some other comments in z.r2.for.
@eliopoulos, let me know how this works for you.
It works. I get the same output. Do higher-end Intel GPUs support double precision?
Yes, the higher end Intel GPUs support double precision.
The Intel(R) Iris(R) Xe Graphics GPUs are designed for gaming, not HPC workloads. Gamers don't need those extra bits for their graphics.
But /Qmkl is not supported because /size-llp64 is not supported by ifx, I guess.
I don't know how MKL works on Intel(R) Iris(R) Xe Graphics. You might ask on the MKL Forum. Have a simple reproducer handy. Here's the link to the MKL Forum.
ifx definitely supports double precision on CPU and Intel GPUs that support double precision.