Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Questions regarding GPUs and OCLOC

Arjen_Markus
Honored Contributor I
5,861 Views

I want to experiment a bit with GPUs, but I am getting lost wrt the actual hardware and the support from ifx. Here is the situation:

I work with a laptop running Windows. According to the task manager it has two GPUs, Intel UHD Graphics and NVIDIA RTX A1000 Laptop GPU. I have no idea if ifx supports the first (the second certainly is not supported). So, I try to build a program that exploits GPUs via OpenMP offloading. So far so good.
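For reference, the kind of minimal check I start from looks roughly like this (just a sketch, not the actual program I am building):

program offload_check
    use omp_lib
    implicit none

    ! Ask the runtime how many offload devices it can see
    print *, 'Number of OpenMP devices: ', omp_get_num_devices()

    ! Run a trivial region on the default device (falls back to the host if there is none)
!$omp target
    if (omp_is_initial_device()) then
        print *, 'Running on the host'
    else
        print *, 'Running on a device'
    endif
!$omp end target
end program offload_check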

The option -Qopenmp-targets:spir64 does have an effect, in that with the environment variable LIBOMPTARGET_DEBUG set to 1 I get a lot of debugging information. If I unset that variable the program hangs and after an interruption via control-C, I get the message:

forrtl: error (200): program aborting due to control-C event
Image              PC                Routine            Line        Source
KERNELBASE.dll     00007FFE76522943  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFE76FB7614  Unknown               Unknown  Unknown
ntdll.dll          00007FFE787E26F1  Unknown               Unknown  Unknown
Libomptarget error: Host ptr 0x00007ff6f3f795ec does not have a matching target pointer.
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

My interpretation is that the Intel GPU is not actually used, cannot be connected, or is simply not supported. Well, that can happen. But looking for an alternative (or better: looking for the list of devices that are supported), I came across the option -Qopenmp-targets:spir64_gen.

If I try that, I get:

Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

ifx: warning #10441: The OpenCL offline compiler could not be found and is required for AOT compilation.See "https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/compilation/ahead-of-time-compilation.html" for more information.
ifx: error #10037: could not find 'ocloc'
ifx: error #10401: error running 'Offline Compiler'

So I try to find out how to get ocloc. For Windows it ought to be part of the Intel DPC++/C++ installation. As far as I can tell from the output of icx on my laptop, that has been installed:

Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

icx: error: no input files

But I cannot find a program "ocloc.exe" on the laptop. Or anything that resembles that name.

So I am left with a couple of questions:

  • Is the Intel GPU I have supported by ifx or icx?
  • What do I need to do to get ocloc and thereby enable "spir64_gen", if that would be a solution?
0 Kudos
1 Solution
jimdempseyatthecove
Honored Contributor III
5,481 Views

>> Intel® HD Graphics 630

That "GPU" has native support for double precision, many other GPU's do not. IIF the simulate douple precision via software, then expect a slowdown. IIF they do not have software support for double precision, then they should report this if failing.

Arjen, you could experiment with setting your reals to real(4). If this works, then it indicates a lack of (even simulated) double precision.
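For instance (just a sketch, borrowing the declarations from the matmul example later in this thread), change

    real(8), allocatable, dimension(:,:) :: a, b, c, c_serial

to

    real(4), allocatable, dimension(:,:) :: a, b, c, c_serial

and see whether the offloaded run then completes.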

Jim Dempsey

View solution in original post

26 Replies
Barbara_P_Intel
Employee
4,360 Views

Did you install the Intel GPU device driver? Information on how to do that is in the System Requirements article. Supported Intel CPUs are listed with the driver information.

 

0 Kudos
Arjen_Markus
Honored Contributor I
4,347 Views

Well, to be sure (I did do an explicit update before) I followed the instructions from that page and hoped I got the right one, as none of the entries lists exactly the GPU my system apparently has. But that was unsuccessful in the sense that I get the same sort of error. The program hangs and upon control-C I get similar messages.

0 Kudos
Arjen_Markus
Honored Contributor I
4,344 Views

Judging from the debug output I would say it is working:

Libomptarget --> Loading library 'omptarget.rtl.level0.dll'...
Target LEVEL0 RTL --> Init Level0 plugin!
Target LEVEL0 RTL --> omp_get_thread_limit() returned 2147483647
Target LEVEL0 RTL --> omp_get_max_teams() returned 0
Libomptarget --> Successfully loaded library 'omptarget.rtl.level0.dll'!
Target LEVEL0 RTL --> Looking for Level0 devices...
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) UHD Graphics 770
Target LEVEL0 RTL --> Found 1 root devices, 1 total devices.
Target LEVEL0 RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL0 RTL --> -- 0
Target LEVEL0 RTL --> Root Device Information
Target LEVEL0 RTL --> Device 0
Target LEVEL0 RTL --> -- Name                         : Intel(R) UHD Graphics 770
Target LEVEL0 RTL --> -- PCI ID                       : 0x4688

and lots more, but I get no output and I have to terminate the program because I am running out of patience (the same program with classic OpenMP statements runs in half a second; the program without any OpenMP takes several seconds).

0 Kudos
Arjen_Markus
Honored Contributor I
4,339 Views

Oh, I misinterpreted the behaviour of the program! I added a write statement to the time loop (the body of that loop is in the target section) and I see that it is running, albeit very slowly. The task manager indeed indicates that the GPU is doing a lot of work, but I have succeeded in slowing down the program by at least a factor of 100. Not entirely the result I expected :).

0 Kudos
Ron_Green
Moderator
4,309 Views

you can set the env var

LIBOMPTARGET_PLUGIN_PROFILE=T

to get an idea of how much data movement is occurring, how much time is spent in the kernel, etc.

 

Is the kernel small enough to share?  For DO loops, did you use

!$omp target teams distribute parallel do
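For a doubly nested loop that would look roughly like this (a sketch with made-up arrays; the collapse clause is optional but usually helps occupancy):

program nested_loop_offload
    implicit none
    integer, parameter :: n = 1024
    integer :: i, j
    real, allocatable :: a(:,:), b(:,:)

    allocate( a(n,n), b(n,n) )
    a = 1.0

    ! Offload the loop nest and spread it over teams and threads on the device;
    ! collapse(2) merges both loops into a single iteration space.
!$omp target teams distribute parallel do collapse(2) map(to: a) map(from: b)
    do j = 1, n
        do i = 1, n
            b(i,j) = 2.0 * a(i,j) + 1.0
        enddo
    enddo

    print *, 'b(1,1) = ', b(1,1)
end program nested_loop_offload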

 

 

0 Kudos
Arjen_Markus
Honored Contributor I
4,275 Views

No, I am still learning how exactly to use the various keywords. I know my way around the classic OpenMP keywords, but these are new and I have to experiment. So I did. Using the keywords you suggested does improve the performance of the program, but it is still very much slower than the sequential version. I have copied the code below (attaching did not work) - it is a toy program, easy enough for experimentation.

! diffu.f90 --
!     Solve a diffusion-reaction equation: nabla2 u = alpha * exp(u)
!
!     Note:
!     The program is much slower than without OpenMP offloading. This clearly
!     requires more fine-tuning.
!
!
program diffu
    use omp_lib

    implicit none
    real, allocatable :: u(:,:), unew(:,:)
    real              :: alpha, delt
    integer           :: i, j, k, n
    real              :: time1, time2
    integer           :: cnt1, cnt2, cnt_rate

    open( 10, file = 'diffu.out' )

    n     = 1280
    allocate( u(n,n), unew(n,n) )

    delt  = 0.1
    alpha = 0.01
    u     = 0.0

!!  u(1,:) = 1.0

!!    call omp_set_num_threads(128)

    call system_clock( cnt1, cnt_rate )
    call cpu_time( time1 )

    write(*,*) 'Start time loop ...'

    do k = 1,1000
!!        write(*,*) k
!XXXX !$omp target map(tofrom: u) map(from:unew)
!XXXX !$omp teams

!$omp target teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
            enddo
        enddo
!$omp target teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                u(i,j) = unew(i,j)
            enddo
        enddo
    enddo

    call cpu_time( time2 )
    call system_clock( cnt2 )

    do j =1,n
        write( 10, '(*(f10.4))' ) u(:,j)
    enddo

    write(*,*) 'CPU time:   ', time2 - time1
    write(*,*) 'Clock time: ', (cnt2 - cnt1) / real(cnt_rate)
end program

As you can see, it contains some experiments - the commented-out map(tofrom:) and map(from:) clauses.

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,260 Views

Arjen,

I am inexperienced with GPU programming. I do have some observations on your coding example.

 

Your 1st !XXXX commented section (when uncommented) can be thought of as an analog of !$omp parallel (without the DO). IOW it starts the encapsulation of a parallel region (terminated with !$omp end parallel). In this case of target (without teams), the encapsulated code (through !$omp end target) is intended to be offloaded to the GPU.

Your code as written is using !$omp target teams ... within the do k= loop.

Meaning each of the two instances specifies both an offload region of code and an offload teams distribution.

IOW each instance performs a copy in from host to GPU and a copy back from GPU to host.

 

Perhaps a better approach is to place the do k= loop, or a portion of that loop, inside the offload region.

...
stride = 10
do k=1,1000, stride
  write(*,*) k
  !$omp target map(tofrom: u) map(from:unew)
  do kk=k,min(k+stride-1,1000)
    !$omp teams distribute parallel do
    do j=2,n-1
      do i=2,n-1
        unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
      enddo
    enddo
    !$omp teams distribute parallel do
    do j=2,n-1
      do i=2,n-1
        u(i,j) = unew(i,j)
      enddo
    enddo
  enddo
  !$omp end target
end do
call cpu_time( time2 )
call system_clock( cnt2 )
...

Or place the entire k loop inside the target region.

The code above prints a progress report every 10 steps.

Also, you can use persistent data within the GPU and copy out what is needed when needed.
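As a rough, untested sketch, a target data region would keep u and unew resident on the GPU for the whole time loop and copy u back only once at the end:

!$omp target data map(tofrom: u) map(alloc: unew)
do k = 1, 1000
    !$omp target teams distribute parallel do collapse(2)
    do j = 2, n-1
        do i = 2, n-1
            unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) &
                                         - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
        enddo
    enddo
    !$omp target teams distribute parallel do collapse(2)
    do j = 2, n-1
        do i = 2, n-1
            u(i,j) = unew(i,j)
        enddo
    enddo
enddo
!$omp end target data

Because u and unew are already present on the device, the inner target constructs reuse the device copies instead of transferring the arrays every iteration.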

 

I wish the documentation included examples, or links to examples of the various offload features.

 

Jim Dempsey

 

 

0 Kudos
Arjen_Markus
Honored Contributor I
4,202 Views

That is pretty much what I had in mind: copy the data to the GPU, do all the calculations there and then bring back the results. It is indeed the lack of clear examples that makes it an exercise in patience, trial and error. At the very least I am glad I could establish that the GPU is recognised and is doing the work I wanted it to do. Now I need to figure out what the right invocation is to make it worthwhile - there are quite a few permutations possible. I will look into your suggestions, thanks.

 

0 Kudos
Barbara_P_Intel
Employee
4,193 Views

I wrote a Fortran OpenMP offload tutorial that will be published in the oneAPI Samples GitHub when oneAPI 2023.2 is released later this month. It's based on a matrix multiply. 

This is what the code should look like when the tutorial is complete.  The size of the matrix may need to be changed depending on the memory available in the GPU.

ARGH! The ELSE for line 13 doesn't show. You'll need that.

program matrix_multiply
   use omp_lib
   implicit none
   integer :: i, j, k, myid, m, n
   real(8), allocatable, dimension(:,:) :: a, b, c, c_serial

   n = 2600

    myid = OMP_GET_THREAD_NUM()
    if (myid .eq. 0) then
      print *, 'matrix size ', n
      print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()

      print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
!$omp target 
      if (OMP_IS_INITIAL_DEVICE()) then
        print *, ' Running on CPU'
        else
        print *, ' Running on GPU'
      endif
!$omp end target 
    endif

      allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n))

! Initialize matrices
      do j=1,n
         do i=1,n
            a(i,j) = i + j - 1
            b(i,j) = i - j + 1
         enddo
      enddo
      c = 0.0
      c_serial = 0.0

!$omp target teams map(to: a, b) map(tofrom: c)
!$omp distribute parallel do SIMD private(j, i, k)
! parallel compute matrix multiplication.
      do j=1,n
         do i=1,n
            do k=1,n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo
!$omp end target teams

! serial compute matrix multiplication
      do j=1,n
         do i=1,n
            do k=1,n
                c_serial(i,j) = c_serial(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo

! verify result
      do j=1,n
         do i=1,n
            if (c_serial(i,j) .ne. c(i,j)) then
               print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
            exit
            endif
         enddo
      enddo

      print *,'PASSED'

end program matrix_multiply

 

0 Kudos
Arjen_Markus
Honored Contributor I
4,188 Views

Ah, thanks - again, I will study this. It will certainly be worthwhile to see the directives in action.

0 Kudos
JohnNichols
Valued Contributor III
3,934 Views

My main computer is chugging along on a FEM problem, so I used the old NUC core i3 to solve this problem.  

It hangs at line 42 with a "c is not allocated" error.

0 Kudos
Arjen_Markus
Honored Contributor I
4,192 Views

If I look at the suggested code change, the k-loop itself becomes parallelised, but that cannot be done, because it represents a development in time, so it has to be sequential. Or do I misunderstand it?

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,164 Views

>> If I look at the suggested code change, the k-loop itself becomes parallelised, but that cannot be done, because it represents a development in time, so it has to be sequential. Or do I misunderstand it?

 

Perhaps Barbara can correct me should I be wrong

 

!$omp target (without teams), through !$omp end target...

can be used to install code and data into the GPU (or reuse code from a prior use) .AND. begin a serial sequence within the GPU.

Within the above target region, !$omp teams distribute can be used to form a team and distribute DO loop(s) for parallel processing of the loop(s).

 

In Barbara's example, she is using !$omp target teams ... to enter the offload region with a parallel team running, and then, within the parallel offload region, uses !$omp distribute to partition the DO loop.

 

Jim Dempsey

 

0 Kudos
Arjen_Markus
Honored Contributor I
4,124 Views

This shows the importance of examples :). I hope that the tutorial I found will bring me the correct understanding.

 

In the meantime, things are not entirely straightforward. Jim's suggestion leads to the following error messages from the compiler:

Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

diffu_v5_jim.f90(39): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
        do kk=k,min(k+stride-1,1000)
--------^
diffu_v5_jim.f90(46): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
            !$omp teams distribute parallel do
------------------^
diffu_v5_jim.f90(52): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
        enddo
--------^
compilation aborted for diffu_v5_jim.f90 (code 1)

whereas Barbara's example leads to run-time errors:

 matrix size         2600
 Number of CPU procs is            1
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

 

0 Kudos
Barbara_P_Intel
Employee
4,106 Views

>> whereas Barbara's example leads to run-time errors:

Can you try a smaller matrix? I had to do that for one Intel GPU I tested.

 

0 Kudos
Arjen_Markus
Honored Contributor I
4,100 Views

I reduced the size by a factor of 10 and got a very similar error message. I am also a trifle puzzled by the statement that there is only one CPU. A "hello" program clearly showed 24 CPUs (or better: a default of 24 threads being started).

0 Kudos
Barbara_P_Intel
Employee
4,093 Views

With the matmul, only 1 CPU is reported because the OpenMP directives are all for TARGET, not CPU. You can modify the directives to run on the CPU only.
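For example, a host-only version of the compute loop could use a plain parallel do instead (a sketch, not part of the tutorial):

! parallel compute matrix multiplication on the CPU only
!$omp parallel do private(i, k)
      do j=1,n
         do i=1,n
            do k=1,n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo
!$omp end parallel do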

I just copied what I posted, added the "else" and ran it successfully on Linux with PVC. However, the output didn't say I ran on GPU. So I removed the "else" and got this.

 

+ a.out
 matrix size         2600
 Number of CPU procs is            1
 Number of OpenMP Device Available:           2
 Running on GPU
 PASSED

 

My next step is to load the GPU driver on my laptop and see what's up.

I also set the environment variable LIBOMPTARGET_PLUGIN_PROFILE=T and got this profiling output.

======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) Data Center GPU Max 1100, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_45_810c026c_MAIN___l14
Kernel 1                  : __omp_offloading_45_810c026c_MAIN___l35
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :     421.63    421.63    421.63    421.63      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       3.11      0.22      0.00      0.81      0.00      0.00      0.00      0.00     14.00
DataRead (Device to Host) :       0.00      0.00      0.00      0.00      2.38      2.38      2.38      2.38      1.00
DataWrite (Host to Device):       4.39      0.49      0.01      1.71      7.39      0.82      0.00      2.47      9.00
Kernel 0                  :       1.87      1.87      1.87      1.87      0.01      0.01      0.01      0.01      1.00
Kernel 1                  :       0.07      0.07      0.07      0.07   2865.43   2865.43   2865.43   2865.43      1.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :       1.51      1.51      1.51      1.51      0.00      0.00      0.00      0.00      1.00
======================================================================================================================

 

0 Kudos
Arjen_Markus
Honored Contributor I
4,072 Views

I removed the "else" statement and got the message that one OpenMP device is available. And then the error message.

 

With one version of the original diffusion program I get the following profile:

 

 Start time loop ...
 CPU time:      10.23438
 Clock time:    10.23900
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) UHD Graphics 770, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_f2b3f24a_efe1e_MAIN___l34
Kernel 1                  : __omp_offloading_f2b3f24a_efe1e_MAIN___l40
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :     604.39    604.39    604.39    604.39      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       1.49      0.00      0.00      0.05      0.00      0.00      0.00      0.00   8008.00
DataRead (Device to Host) :     797.53      0.20      0.16      0.90      0.00      0.00      0.00      0.00   4000.00
DataWrite (Host to Device):     902.03      0.09      0.00      1.66      0.00      0.00      0.00      0.00  10000.00
Kernel 0                  :    2758.64      2.76      2.51     11.12   2582.38      2.58      2.43      3.55   1000.00
Kernel 1                  :    5127.18      5.13      4.83      6.22   4958.39      4.96      4.75      6.06   1000.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :      12.91     12.91     12.91     12.91      0.00      0.00      0.00      0.00      1.00
======================================================================================================================

 

0 Kudos
Barbara_P_Intel
Employee
4,050 Views

That's great that you have something that runs on the GPU! That table is proof! So it seems your environment is set up.

I don't know why the matmul is failing. I just ran it successfully on an Intel Core i7-8809G 3.10GHz with Intel® HD Graphics 630. An older machine, but it worked.

 

0 Kudos
Reply