Re: How to enable the GPU to run a FORTRAN program? – Page 3

eliopoulos · 01-23-2024

How can I enable the GPU to run a parallel do loop in a FORTRAN program? I have tried: "!$omp target teams distribute parallel do" but the GPU does not run. Any advice?
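
For reference, here is a minimal sketch of a loop offloaded with that directive. It is not the poster's program: the program name, array names, and size are hypothetical. It assumes the build also passes an offload target flag, as in the compile line used later in this thread (ifx /Qopenmp /Qopenmp-targets=spir64); with /Qopenmp alone the target region is expected to run on the host.

```fortran
program offload_sketch
  implicit none
  integer, parameter :: n = 1000   ! hypothetical problem size
  real :: a(n), b(n)
  integer :: i

  b = 1.0

  ! Offload the loop: copy b to the device, copy a back to the host.
  !$omp target teams distribute parallel do map(to: b) map(from: a)
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do
  !$omp end target teams distribute parallel do

  print *, 'a(1) =', a(1), ' a(n) =', a(n)
end program offload_sketch
```

The explicit map clauses make the data movement visible, which helps when checking whether anything was actually transferred to the device.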

eliopoulos · 02-02-2024

I have created a reproducer that produces the same errors with my program.

Barbara_P_Intel · 02-05-2024

Ah! You use the matmul() intrinsic. When you compile with /Qopenmp on Windows or -qopenmp on Linux, the matmul() intrinsic is automatically parallelized using OpenMP.

That can be disabled with /Qopt-matmul- on Windows or -no-opt-matmul on Linux. See the Developer Guide.
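
A small sketch of the kind of code this applies to, assuming a hypothetical 3x3 case (the names and values are illustrative, not taken from z.for). The only point is that a plain matmul() call, with no OpenMP directive around it, is what the compiler may parallelize when the OpenMP option is on.

```fortran
program matmul_sketch
  implicit none
  integer, parameter :: n = 3      ! hypothetical size
  real :: a(n,n), b(n,n), c(n,n)

  a = 1.0
  b = 2.0

  ! No directive here: per the note above, compiling with /Qopenmp or
  ! -qopenmp is enough for the intrinsic to be parallelized automatically.
  c = matmul(a, b)

  ! Each element of c is n * 1.0 * 2.0 = 6.0.
  print *, 'c(1,1) =', c(1,1)
end program matmul_sketch
```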

So far I've just run this on CPU. iGPU is next.

Attached is an update of z.for called z.r1.for. Here's the output I got for CPU:

Q:\06129705>ifx /Qopenmp z.r1.for
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240103
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.38.33130.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:z.r1.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
z.r1.obj

Q:\06129705>z.r1.exe
 Number of procs is            8
   6000000000000.00        6000000000000.00        6000000000000.00
   6000000000000.00        6000000000000.00        6000000000000.00
   6000000000000.00        6000000000000.00        6000000000000.00

Barbara_P_Intel · 02-06-2024

I have the Intel GPU version running and printing the same answers as the CPU version. It's attached as z.r2.for.

Two MAJOR changes:

1. Double precision is not supported on Intel(R) Iris(R) Xe Graphics, so I made the arrays real.
2. It's poor coding practice to use the same variable for the loop index and the loop's upper bound. This loop construct yields the wrong answer on Intel GPU:

! do i=1,i

! Using this construct gives the same answer as running on CPU.

do j=1,i
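
The second change can be sketched in isolation as follows. This is a host-only illustration with hypothetical values, not the loop from z.r2.for: the upper bound i is only read by the loop, and a separate variable j serves as the index.

```fortran
program loop_bound_sketch
  implicit none
  integer :: i, j, total

  i = 4          ! hypothetical upper bound
  total = 0

  ! j is the loop index; i is read as the bound and never modified.
  do j = 1, i
     total = total + j
  end do

  print *, 'total =', total   ! 1 + 2 + 3 + 4 = 10
end program loop_bound_sketch
```

Keeping the index and the bound in separate variables also leaves the bound unchanged after the loop, which matters if it is used again later.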

When I run it in a oneAPI command window I type:

set LIBOMPTARGET_PLUGIN_PROFILE=T

to get a profile printout. I set that to confirm that the program runs on the Intel GPU: if no profile is printed, it didn't offload.

>set LIBOMPTARGET_PLUGIN_PROFILE=T
>ifx /Qopenmp /Qopenmp-targets=spir64 z.r2.for
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240103
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.38.33130.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:z.r2.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
C:\Users\bperz\AppData\Local\Temp\1932873.obj
C:\Users\bperz\AppData\Local\Temp\19328414.o
-defaultlib:omptarget.lib

>z.r2.exe
 Number of procs is            8
  6.0000001E+12  6.0000001E+12  6.0000001E+12  6.0000001E+12  6.0000001E+12
  6.0000001E+12  6.0000001E+12  6.0000001E+12  6.0000001E+12
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) Iris(R) Xe Graphics, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_80ff4b02_5bc1261f_MAIN___l27
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :     397.03    397.03    397.03    397.03      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       0.58      0.06      0.00      0.42      0.00      0.00      0.00      0.00      9.00
DataRead (Device to Host) :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
DataWrite (Host to Device):       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
Kernel 0                  :       0.51      0.51      0.51      0.51      0.10      0.10      0.10      0.10      1.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :       3.88      3.88      3.88      3.88      0.00      0.00      0.00      0.00      1.00
======================================================================================================================
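
As a cross-check on the profile printout, the program itself can report where a target region ran. The sketch below uses the standard OpenMP runtime routines omp_get_num_devices() and omp_is_initial_device(); the program and variable names are hypothetical, and it assumes the same offload compile flags as above.

```fortran
program device_check_sketch
  use omp_lib
  implicit none
  logical :: on_host

  print *, 'Devices visible to OpenMP:', omp_get_num_devices()

  on_host = .true.
  ! tofrom brings back the value assigned inside the target region.
  !$omp target map(tofrom: on_host)
  on_host = omp_is_initial_device()
  !$omp end target

  if (on_host) then
     print *, 'Target region ran on the host (no offload).'
  else
     print *, 'Target region ran on a device.'
  end if
end program device_check_sketch
```

Setting the standard OpenMP environment variable OMP_TARGET_OFFLOAD=MANDATORY is another option: the run should then fail instead of silently falling back to the host.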

There are some other comments in z.r2.for.

@eliopoulos, let me know how this works for you.

eliopoulos · 02-06-2024

It works. I get the same output. Do higher-end Intel GPUs support double precision?

Barbara_P_Intel · 02-06-2024

Yes, the higher end Intel GPUs support double precision.

The Intel(R) Iris(R) Xe Graphics GPUs are designed for gaming, not HPC workloads. Gamers don't need those extra bits for their graphics.
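
One way to make the precision easy to switch between GPUs is a single kind parameter. This is a sketch only: the module name precision_sketch and the parameter wp are hypothetical and do not come from z.r2.for.

```fortran
module precision_sketch
  implicit none
  ! Hypothetical working precision: single precision for GPUs without
  ! double-precision support. Change to kind(1.0d0) where it is supported.
  integer, parameter :: wp = kind(1.0)
end module precision_sketch

program use_precision
  use precision_sketch, only: wp
  implicit none
  real(wp) :: x(3)

  ! Literals carry the same kind, so nothing is silently single precision.
  x = 2.0e12_wp
  print *, 'kind =', wp, ' sum =', sum(x)
end program use_precision
```

With this layout, moving to a GPU that supports double precision is a one-line change rather than an edit of every declaration.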

eliopoulos · 02-06-2024

But /Qmkl is not supported because /size-llp64 is not supported by ifx, I guess.

Barbara_P_Intel · 02-06-2024

I don't know how MKL works on Intel(R) Iris(R) Xe Graphics. You might ask on the MKL Forum, and have a simple reproducer handy when you do.

ifx definitely supports double precision on CPU and Intel GPUs that support double precision.

eliopoulos · 02-06-2024

Thank you.