<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Questions regarding GPUs and OCLOC in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503662#M167197</link>
    <description>&lt;P&gt;Actually, when I turned on the debug output (with double-precision matrices of a rather limited size, n=160), I got the very informative message:&lt;/P&gt;&lt;LI-CODE lang="none"&gt;Libomptarget --&amp;gt; Device 0 is ready to use.
Target LEVEL0 RTL --&amp;gt; Device 0: Loading binary from 0x00007ff63879c000
Target LEVEL0 RTL --&amp;gt; Expecting to have 2 entries defined
Target LEVEL0 RTL --&amp;gt; Base L0 module compilation options: -cl-std=CL2.0  
Target LEVEL0 RTL --&amp;gt; Found a single section in the image
Target LEVEL0 RTL --&amp;gt; Error: addModule:zeModuleCreate failed with error code 1879048196, ZE_RESULT_ERROR_MODULE_BUILD_FAILURE
Target LEVEL0 RTL --&amp;gt; Error: module creation failed
LEVEL0 message: Target build log:
LEVEL0 message:   ''
LEVEL0 message:   'error: Double type is not supported on this platform.'
LEVEL0 message:   'in kernel: 'MAIN__''
LEVEL0 message:   'error: backend compiler failed build.'
LEVEL0 message:   ''&lt;/LI-CODE&gt;&lt;P&gt;I have attached the entire output, but this was the missing information that could have pointed us to the problem directly.&lt;/P&gt;</description>
    <pubDate>Tue, 11 Jul 2023 14:47:42 GMT</pubDate>
    <dc:creator>Arjen_Markus</dc:creator>
    <dc:date>2023-07-11T14:47:42Z</dc:date>
    <item>
      <title>Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502641#M167114</link>
      <description>&lt;P&gt;I want to experiment a bit with GPUs, but I am getting lost wrt the actual hardware and the support from ifx. Here is the situation:&lt;/P&gt;&lt;P&gt;I work with a laptop running Windows. According to the task manager it has two GPUs, Intel UHD Graphics and NVIDIA RTX A1000 Laptop GPU. I have no idea if ifx supports the first (the second certainly is not supported). So, I try to build a program that exploits GPUs via OpenMP offloading. So far so good.&lt;/P&gt;&lt;P&gt;The option -Qopenmp-targets:spir64 does have an effect, in that with the environment variable LIBOMPTARGET_DEBUG set to 1 I get a lot of debugging information. If I unset that variable the program hangs and after an interruption via control-C, I get the message:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;forrtl: error (200): program aborting due to control-C event
Image              PC                Routine            Line        Source
KERNELBASE.dll     00007FFE76522943  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFE76FB7614  Unknown               Unknown  Unknown
ntdll.dll          00007FFE787E26F1  Unknown               Unknown  Unknown
Libomptarget error: Host ptr 0x00007ff6f3f795ec does not have a matching target pointer.
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory&lt;/LI-CODE&gt;&lt;P&gt;My interpretation is that the Intel GPU is not actually used or cannot be connected or is simply not supported. Well, that can happen. But looking for an alternative (or better: looking for the list of devices that are supported), I came across the option -Qopenmp-targets:spir64_gen.&lt;/P&gt;&lt;P&gt;If I try that, I get:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

ifx: warning #10441: The OpenCL offline compiler could not be found and is required for AOT compilation.See "https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/compilation/ahead-of-time-compilation.html" for more information.
ifx: error #10037: could not find 'ocloc'
ifx: error #10401: error running 'Offline Compiler'&lt;/LI-CODE&gt;&lt;P&gt;So I try to find out how to get ocloc. For Windows it ought to be part of the Intel DPC++/C++ installation. As far as I can tell from the output of icx on my laptop, that has been installed:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

icx: error: no input files&lt;/LI-CODE&gt;&lt;P&gt;But I cannot find a program "ocloc.exe" on the laptop. Or anything that resembles that name.&lt;/P&gt;&lt;P&gt;So I am left with a couple of questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Is the Intel GPU I have supported by ifx or icx?&lt;/LI&gt;&lt;LI&gt;What do I need to do to get ocloc and thereby enable "spir64_gen", if that would be a solution?&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Fri, 07 Jul 2023 10:50:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502641#M167114</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-07T10:50:51Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502650#M167117</link>
      <description>&lt;P&gt;Did you install the Intel GPU device driver? Information on how to do that is in the &lt;A href="https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-fortran-compiler-system-requirements.html" target="_blank" rel="noopener"&gt;System Requirements&lt;/A&gt; article. Supported Intel GPUs are listed with the driver information.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 11:19:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502650#M167117</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-07T11:19:40Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502659#M167120</link>
      <description>&lt;P&gt;Well, to be sure (I did do an explicit update before) I followed the instructions from that page and hoped I got the right one, as none of the entries lists exactly the GPU my system apparently has. But that was unsuccessful in the sense that I get the same sort of error. The program hangs and upon control-C I get similar messages.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 11:47:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502659#M167120</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-07T11:47:27Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502662#M167121</link>
      <description>&lt;P&gt;Judging from the debug output I would say it is working:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Libomptarget --&amp;gt; Loading library 'omptarget.rtl.level0.dll'...
Target LEVEL0 RTL --&amp;gt; Init Level0 plugin!
Target LEVEL0 RTL --&amp;gt; omp_get_thread_limit() returned 2147483647
Target LEVEL0 RTL --&amp;gt; omp_get_max_teams() returned 0
Libomptarget --&amp;gt; Successfully loaded library 'omptarget.rtl.level0.dll'!
Target LEVEL0 RTL --&amp;gt; Looking for Level0 devices...
Target LEVEL0 RTL --&amp;gt; Found a GPU device, Name = Intel(R) UHD Graphics 770
Target LEVEL0 RTL --&amp;gt; Found 1 root devices, 1 total devices.
Target LEVEL0 RTL --&amp;gt; List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL0 RTL --&amp;gt; -- 0
Target LEVEL0 RTL --&amp;gt; Root Device Information
Target LEVEL0 RTL --&amp;gt; Device 0
Target LEVEL0 RTL --&amp;gt; -- Name                         : Intel(R) UHD Graphics 770
Target LEVEL0 RTL --&amp;gt; -- PCI ID                       : 0x4688&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;and lots more, but I get no output and I have to terminate the program because I am running out of patience (the same program with classic OpenMP statements runs in half a second, the program without any OpenMP takes several seconds).&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 11:57:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502662#M167121</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-07T11:57:47Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502664#M167122</link>
      <description>&lt;P&gt;Oh, I misinterpreted the behaviour of the program! I added a write statement to the time loop (the body of that loop is in the target section) and I see that this is running, albeit very slowly. The task manager indeed indicates that the GPU is doing a lot of work, but I have succeeded in slowing down the program by at least a factor of 100. Not entirely the result I expected :).&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 12:05:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502664#M167122</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-07T12:05:10Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502805#M167130</link>
      <description>&lt;P&gt;you can set the env var&lt;/P&gt;
&lt;P&gt;LIBOMPTARGET_PLUGIN_PROFILE=T&lt;/P&gt;
&lt;P&gt;to get an idea of how much data movement is occurring, how much time is spent in the kernel, etc.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is the kernel small enough to share?&amp;nbsp; For DO loops, did you use&lt;/P&gt;
&lt;P&gt;!$omp target teams distribute parallel do&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 19:18:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502805#M167130</guid>
      <dc:creator>Ron_Green</dc:creator>
      <dc:date>2023-07-07T19:18:56Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502950#M167143</link>
      <description>&lt;P&gt;No, I am still learning how exactly to use the various keywords. I know my way around the classic OpenMP keywords, but these are new and I have to experiment. So I did. Using the keywords you suggested does improve the performance of the program, but it is still very much slower than the sequential version. I have copied the code below (attaching did not work &lt;LI-EMOJI id="lia_disappointed-face" title=":disappointed_face:"&gt;&lt;/LI-EMOJI&gt; ) - it is a toy program, easy enough for experimentation.&lt;/P&gt;&lt;LI-CODE lang="fortran"&gt;! diffu.f90 --
!     Solve a diffusion-reaction equation: nabla2 u = alpha * exp(u)
!
!     Note:
!     The program is much slower than without OpenMP offloading. This clearly
!     requires more fine-tuning.
!
!
program diffu
    use omp_lib

    implicit none
    real, allocatable :: u(:,:), unew(:,:)
    real              :: alpha, delt
    integer           :: i, j, k, n
    real              :: time1, time2
    integer           :: cnt1, cnt2, cnt_rate

    open( 10, file = 'diffu.out' )

    n     = 1280
    allocate( u(n,n), unew(n,n) )

    delt  = 0.1
    alpha = 0.01
    u     = 0.0

!!  u(1,:) = 1.0

!!    call omp_set_num_threads(128)

    call system_clock( cnt1, cnt_rate )
    call cpu_time( time1 )

    write(*,*) 'Start time loop ...'

    do k = 1,1000
!!        write(*,*) k
!XXXX !$omp target map(tofrom: u) map(from:unew)
!XXXX !$omp teams

!$omp target teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
            enddo
        enddo
!$omp target teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                u(i,j) = unew(i,j)
            enddo
        enddo
    enddo

    call cpu_time( time2 )
    call system_clock( cnt2 )

    do j =1,n
        write( 10, '(*(f10.4))' ) u(:,j)
    enddo

    write(*,*) 'CPU time:   ', time2 - time1
    write(*,*) 'Clock time: ', (cnt2 - cnt1) / real(cnt_rate)
end program&lt;/LI-CODE&gt;&lt;P&gt;As you can see, it contains some experiments - the to and tofrom clauses.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Jul 2023 13:51:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502950#M167143</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-08T13:51:15Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502975#M167147</link>
      <description>&lt;P&gt;Arjen,&lt;/P&gt;&lt;P&gt;I am inexperienced with GPU programming. I do have some observations on your coding example.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Your 1st !XXXX commented section (when uncommented) can be thought of as an analog of !$omp parallel (without the DO). IOW it starts the encapsulation of a parallel region (terminated with !$omp end parallel). In this case of target (without teams), the encapsulated code (through !$omp end target) is intended to be offloaded to the GPU.&lt;/P&gt;&lt;P&gt;Your code as written is using !$omp target teams ... within the do k= loop.&lt;/P&gt;&lt;P&gt;Meaning each of the two instances specifies both an offload region of code plus an offload teams distribution.&lt;/P&gt;&lt;P&gt;IOW each instance performs a copy in from host to GPU and copy back from GPU to host.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Perhaps a better approach is to place the do k= loop, or a portion of that loop, inside the offload region.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;...
stride = 10
do k=1,1000, stride
  write(*,*) k
  !$omp target map(tofrom: u) map(from:unew)
  do kk=k,min(k+stride-1,1000)
    !$omp teams distribute parallel do
    do j=2,n-1
      do i=2,n-1
        unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
      enddo
    enddo
    !$omp teams distribute parallel do
    do j=2,n-1
      do i=2,n-1
        u(i,j) = unew(i,j)
      enddo
    enddo
  enddo
  !$omp end target
end do
call cpu_time( time2 )
call system_clock( cnt2 )
...&lt;/LI-CODE&gt;&lt;P&gt;Or place the entire k loop inside the target region.&lt;/P&gt;&lt;P&gt;The code above performs a progress report every 10 steps.&lt;/P&gt;&lt;P&gt;Also, you can use persistent data within the GPU and copy out what is needed when needed.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I wish the documentation included examples, or links to examples of the various offload features.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 08 Jul 2023 18:23:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502975#M167147</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2023-07-08T18:23:17Z</dc:date>
    </item>
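Jim's persistent-data suggestion can be sketched with a `!$omp target data` region: `u` and `unew` stay resident on the device for the whole time loop, and each enclosed combined construct contains nothing besides the `teams` part, which stays within the TARGET/TEAMS nesting restriction the compiler enforces (error #8699). This is an illustrative sketch, not code from the thread; names and sizes follow the diffu.f90 example, and the `collapse(2)` clause is an addition of mine, not something proposed by the posters.

```fortran
! Sketch: keep u/unew resident on the GPU across the whole time loop,
! so only kernel launches (not array transfers) happen per step.
program diffu_data
    implicit none
    real, allocatable :: u(:,:), unew(:,:)
    real              :: alpha, delt
    integer           :: i, j, k, n

    n     = 1280
    allocate( u(n,n), unew(n,n) )
    delt  = 0.1
    alpha = 0.01
    u     = 0.0

!$omp target data map(tofrom: u) map(alloc: unew)
    do k = 1,1000
        ! Each target region below contains only the teams construct,
        ! satisfying the TARGET/TEAMS nesting restriction.
!$omp target teams distribute parallel do collapse(2)
        do j = 2,n-1
            do i = 2,n-1
                unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) &
                          + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j)   &
                          + alpha * exp(u(i,j)) )
            enddo
        enddo
!$omp target teams distribute parallel do collapse(2)
        do j = 2,n-1
            do i = 2,n-1
                u(i,j) = unew(i,j)
            enddo
        enddo
    enddo
!$omp end target data

    write(*,*) 'u(2,2) = ', u(2,2)   ! u is copied back at end target data
end program diffu_data
```

The same structure compiles and runs serially when built without offload options, so it degrades gracefully on machines without a supported GPU.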
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503005#M167150</link>
      <description>&lt;P&gt;That is pretty much what I had in mind: copy the data to the GPU, do all the calculations there and then bring back the results. It is indeed the lack of clear examples that makes it an exercise in patience, trial and error. At the very least I am glad I could establish that the GPU is recognised and is doing the work I wanted it to do. Now I need to figure out what the right invocation is to make it worthwhile - there are quite a few permutations possible. I will look into your suggestions, thanks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 10:35:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503005#M167150</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-09T10:35:46Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503007#M167151</link>
      <description>&lt;P&gt;I wrote a Fortran OpenMP offload tutorial that will be published in the oneAPI Samples GitHub when oneAPI 2023.2 is released later this month. It's based on a matrix multiply.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is what the code should look like when the tutorial is complete.&amp;nbsp; The size of the matrix may need to be changed depending on the memory available in the GPU.&lt;/P&gt;
&lt;P&gt;ARGH! The ELSE for line 13 doesn't show. You'll need that.&lt;/P&gt;
&lt;LI-CODE lang="fortran"&gt;program matrix_multiply
   use omp_lib
   implicit none
   integer :: i, j, k, myid, m, n
   real(8), allocatable, dimension(:,:) :: a, b, c, c_serial

   n = 2600

    myid = OMP_GET_THREAD_NUM()
    if (myid .eq. 0) then
      print *, 'matrix size ', n
      print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()

      print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
!$omp target 
      if (OMP_IS_INITIAL_DEVICE()) then
        print *, ' Running on CPU'
        else
        print *, ' Running on GPU'
      endif
!$omp end target 
    endif

      allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n))

! Initialize matrices
      do j=1,n
         do i=1,n
            a(i,j) = i + j - 1
            b(i,j) = i - j + 1
         enddo
      enddo
      c = 0.0
      c_serial = 0.0

!$omp target teams map(to: a, b) map(tofrom: c)
!$omp distribute parallel do SIMD private(j, i, k)
! parallel compute matrix multiplication.
      do j=1,n
         do i=1,n
            do k=1,n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo
!$omp end target teams

! serial compute matrix multiplication
      do j=1,n
         do i=1,n
            do k=1,n
                c_serial(i,j) = c_serial(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo

! verify result
      do j=1,n
         do i=1,n
            if (c_serial(i,j) .ne. c(i,j)) then
               print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
            exit
            endif
         enddo
      enddo

      print *,'PASSED'

end program matrix_multiply&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 11:03:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503007#M167151</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-09T11:03:16Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503008#M167152</link>
      <description>&lt;P&gt;If I look at the suggested code change, then the k-loop itself becomes parallelised, but that cannot be done, because it represents an evolution in time, so it has to be sequential. Or do I misunderstand it?&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 11:05:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503008#M167152</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-09T11:05:29Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503009#M167153</link>
      <description>&lt;P&gt;Ah, thanks - again, I will study this. It will certainly be worthwhile to see the directives in action.&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 11:08:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503009#M167153</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-09T11:08:38Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503014#M167155</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&lt;SPAN&gt;If I look at the suggested code change, then the k-loop itself becomes parallelised, but that cannot be done, because it represents an evolution in time, so it has to be sequential. Or do I misunderstand it?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Perhaps Barbara can correct me should I be wrong.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;!$omp target (without teams), through !$omp end target...&lt;/P&gt;&lt;P&gt;can be used to install code and data into the GPU (or reuse code from prior use) .AND. begin a serial sequence within the GPU.&lt;/P&gt;&lt;P&gt;Within the above target region, !$omp teams distribute can be used to form a team and distribute a DO loop (or loops) for parallel processing.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In Barbara's example, she is using !$omp target teams ... to enter the offload region with a parallel team running, and then within the parallel offload region uses !$omp distribute to partition the DO loop.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 12:55:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503014#M167155</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2023-07-09T12:55:46Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503117#M167160</link>
      <description>&lt;P&gt;This shows the importance of examples :). I hope that the tutorial I found will bring me the correct understanding.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the meantime, things are not entirely straightforward. Jim's suggestion leads to the following error messages from the compiler:&lt;/P&gt;&lt;LI-CODE lang="none"&gt;Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

diffu_v5_jim.f90(39): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
        do kk=k,min(k+stride-1,1000)
--------^
diffu_v5_jim.f90(46): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
            !$omp teams distribute parallel do
------------------^
diffu_v5_jim.f90(52): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
        enddo
--------^
compilation aborted for diffu_v5_jim.f90 (code 1)&lt;/LI-CODE&gt;&lt;P&gt;whereas Barbara's example leads to run-time errors:&lt;/P&gt;&lt;LI-CODE lang="none"&gt; matrix size         2600
 Number of CPU procs is            1
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 07:13:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503117#M167160</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-10T07:13:32Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503156#M167161</link>
      <description>&lt;P&gt;Some OpenMP offload references:&lt;/P&gt;
&lt;UL class="added rich-diff-level-zero" dir="auto"&gt;
&lt;LI class="rich-diff-level-one"&gt;
&lt;P&gt;&lt;A href="https://www.intel.com/content/www/us/en/developer/videos/three-quick-practical-examples-openmp-offload-gpus.html" target="_blank" rel="nofollow noopener"&gt;Three Quick, Practical Examples of OpenMP Offload to GPUs&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(Intel webinar)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="rich-diff-level-one"&gt;
&lt;P&gt;&lt;A href="https://app.plan.intel.com/e/er?cid=em&amp;amp;source=elo&amp;amp;campid=satg_WW_satgobmcdn_EMNL_EN_2023_Dev%20Newsletter%20April%202023_C-MKA-30705_T-MKA-36702&amp;amp;content=satg_WW_satgobmcdn_EMNL_EN_2023_Dev%20Newsletter%20April%202023_C-MKA-30705_T-MKA-36702_HPC&amp;amp;elq_cid=5093974&amp;amp;em_id=91065&amp;amp;elqrid=84ff8ebfeeda4abeb600f9e3c1073d93&amp;amp;elqcampid=56326&amp;amp;erpm_id=7990181&amp;amp;s=334284386&amp;amp;lid=623351&amp;amp;elqTrackId=36debf346d1a471e9c1e5e84630bea8e&amp;amp;elq=84ff8ebfeeda4abeb600f9e3c1073d93&amp;amp;elqaid=91065&amp;amp;elqat=1" target="_blank" rel="nofollow noopener"&gt;GPU Offloading: The Next Chapter for Intel® Fortran Compiler&lt;/A&gt;&amp;nbsp;(Intel webinar)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="rich-diff-level-one"&gt;
&lt;P&gt;&lt;A href="https://direct.mit.edu/books/book/4482/Using-OpenMP-The-Next-StepAffinity-Accelerators" target="_blank" rel="nofollow noopener"&gt;Using OpenMP—The Next Step: Affinity, Accelerators, Tasking, and SIMD&amp;nbsp;&lt;/A&gt;(book)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="rich-diff-level-one"&gt;&lt;A href="https://www.openmp.org/wp-content/uploads/openmp-examples-5-2.pdf" target="_blank" rel="noopener"&gt;Examples&lt;/A&gt; from &lt;A href="http://openmp.org" target="_blank" rel="noopener"&gt;openmp.org&lt;/A&gt;. Search for TARGET.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 10:50:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503156#M167161</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-10T10:50:41Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503158#M167162</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;nbsp;&lt;SPAN&gt;whereas Barbara's example leads to run-time errors:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Can you try a smaller matrix? I had to do that for one Intel GPU I tested.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 10:55:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503158#M167162</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-10T10:55:01Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503163#M167163</link>
      <description>&lt;P&gt;I reduced the size by a factor of 10 and got a very similar error message. I am also a trifle puzzled by the statement that there is only one CPU. A "hello" program clearly showed 24 CPUs (or better: a default of 24 threads being started).&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 11:10:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503163#M167163</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-10T11:10:40Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503169#M167164</link>
      <description>&lt;P&gt;With the matmul, there is only 1 CPU because the OpenMP directives are all for TARGET, not CPU. You can modify the directives to run on the CPU only.&lt;/P&gt;
&lt;P&gt;I just copied what I posted, added the "else" and ran it successfully on Linux with PVC. However, the output didn't say I ran on GPU. So I removed the "else" and got this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;+ a.out
 matrix size         2600
 Number of CPU procs is            1
 Number of OpenMP Device Available:           2
 Running on GPU
 PASSED
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My next step is to load the GPU driver on my laptop and see what's up.&lt;/P&gt;
&lt;P&gt;I also set the environment variable&amp;nbsp;LIBOMPTARGET_PLUGIN_PROFILE=T and got the following profile output.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) Data Center GPU Max 1100, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_45_810c026c_MAIN___l14
Kernel 1                  : __omp_offloading_45_810c026c_MAIN___l35
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :     421.63    421.63    421.63    421.63      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       3.11      0.22      0.00      0.81      0.00      0.00      0.00      0.00     14.00
DataRead (Device to Host) :       0.00      0.00      0.00      0.00      2.38      2.38      2.38      2.38      1.00
DataWrite (Host to Device):       4.39      0.49      0.01      1.71      7.39      0.82      0.00      2.47      9.00
Kernel 0                  :       1.87      1.87      1.87      1.87      0.01      0.01      0.01      0.01      1.00
Kernel 1                  :       0.07      0.07      0.07      0.07   2865.43   2865.43   2865.43   2865.43      1.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :       1.51      1.51      1.51      1.51      0.00      0.00      0.00      0.00      1.00
======================================================================================================================
&lt;/LI-CODE&gt;
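&lt;P&gt;For anyone reproducing this, a typical way to enable that profile table (assuming a bash shell and an executable named a.out) is:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Print the Level Zero plugin profile table when the program exits
export LIBOMPTARGET_PLUGIN_PROFILE=T
./a.out&lt;/LI-CODE&gt;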
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 11:36:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503169#M167164</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-10T11:36:16Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503187#M167166</link>
      <description>&lt;P&gt;I removed the "else" statement and got the message that one OpenMP device is available, followed by the error message.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With one version of the original diffusion program, I get the following profile:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="none"&gt; Start time loop ...
 CPU time:      10.23438
 Clock time:    10.23900
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) UHD Graphics 770, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_f2b3f24a_efe1e_MAIN___l34
Kernel 1                  : __omp_offloading_f2b3f24a_efe1e_MAIN___l40
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :     604.39    604.39    604.39    604.39      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       1.49      0.00      0.00      0.05      0.00      0.00      0.00      0.00   8008.00
DataRead (Device to Host) :     797.53      0.20      0.16      0.90      0.00      0.00      0.00      0.00   4000.00
DataWrite (Host to Device):     902.03      0.09      0.00      1.66      0.00      0.00      0.00      0.00  10000.00
Kernel 0                  :    2758.64      2.76      2.51     11.12   2582.38      2.58      2.43      3.55   1000.00
Kernel 1                  :    5127.18      5.13      4.83      6.22   4958.39      4.96      4.75      6.06   1000.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :      12.91     12.91     12.91     12.91      0.00      0.00      0.00      0.00      1.00
======================================================================================================================&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 13:16:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503187#M167166</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-10T13:16:22Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503301#M167173</link>
      <description>&lt;P&gt;That's great that you have something running on the GPU! That table is proof, so it seems your environment is set up correctly.&lt;/P&gt;
&lt;P&gt;I don't know why the matmul is failing. I just ran it successfully on an Intel Core i7-8809G @ 3.10 GHz with Intel® HD Graphics 630. It's an older machine, but it worked.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 18:50:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503301#M167173</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-10T18:50:45Z</dc:date>
    </item>
  </channel>
</rss>