I want to experiment a bit with GPUs, but I am getting lost wrt the actual hardware and the support from ifx. Here is the situation:
I work with a laptop running Windows. According to the task manager it has two GPUs, Intel UHD Graphics and NVIDIA RTX A1000 Laptop GPU. I have no idea if ifx supports the first (the second certainly is not supported). So, I try to build a program that exploits GPUs via OpenMP offloading. So far so good.
The option -Qopenmp-targets:spir64 does have an effect, in that with the environment variable LIBOMPTARGET_DEBUG set to 1 I get a lot of debugging information. If I unset that variable the program hangs and after an interruption via control-C, I get the message:
forrtl: error (200): program aborting due to control-C event
Image PC Routine Line Source
KERNELBASE.dll 00007FFE76522943 Unknown Unknown Unknown
KERNEL32.DLL 00007FFE76FB7614 Unknown Unknown Unknown
ntdll.dll 00007FFE787E26F1 Unknown Unknown Unknown
Libomptarget error: Host ptr 0x00007ff6f3f795ec does not have a matching target pointer.
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
My interpretation is that the Intel GPU is not actually used, cannot be connected, or is simply not supported. Well, that can happen. But looking for an alternative (or better: looking for the list of devices that are supported), I came across the option -Qopenmp-targets:spir64_gen.
If I try that, I get:
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
ifx: warning #10441: The OpenCL offline compiler could not be found and is required for AOT compilation.See "https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/compilation/ahead-of-time-compilation.html" for more information.
ifx: error #10037: could not find 'ocloc'
ifx: error #10401: error running 'Offline Compiler'
So I try to find out how to get ocloc. For Windows it ought to be part of the Intel DPC++/C++ installation. As far as I can tell from the output of icx on my laptop, that has been installed:
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
icx: error: no input files
But I cannot find a program "ocloc.exe" on the laptop. Or anything that resembles that name.
So I am left with a couple of questions:
- Is the Intel GPU I have supported by ifx or icx?
- What do I need to do to get ocloc and thereby enable "spir64_gen", if that would be a solution?
>> Intel® HD Graphics 630
That "GPU" has native support for double precision; many other GPUs do not. If they simulate double precision in software, then expect a slowdown. If they do not have software support for double precision, then they should report this when failing.
Arjen, you could experiment with setting your real's to real(4). If this works, then it indicates lack of (simulated) double precision.
Jim Dempsey
Did you install the Intel GPU device driver? Information on how to do that is in the System Requirements article. Supported Intel CPUs are listed with the driver information.
Well, to be sure (I did do an explicit update before) I followed the instructions from that page and hoped I got the right one, as none of the entries lists exactly the GPU my system apparently has. But that was unsuccessful in the sense that I get the same sort of error. The program hangs and upon control-C I get similar messages.
Judging from the debug output I would say it is working:
Libomptarget --> Loading library 'omptarget.rtl.level0.dll'...
Target LEVEL0 RTL --> Init Level0 plugin!
Target LEVEL0 RTL --> omp_get_thread_limit() returned 2147483647
Target LEVEL0 RTL --> omp_get_max_teams() returned 0
Libomptarget --> Successfully loaded library 'omptarget.rtl.level0.dll'!
Target LEVEL0 RTL --> Looking for Level0 devices...
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) UHD Graphics 770
Target LEVEL0 RTL --> Found 1 root devices, 1 total devices.
Target LEVEL0 RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL0 RTL --> -- 0
Target LEVEL0 RTL --> Root Device Information
Target LEVEL0 RTL --> Device 0
Target LEVEL0 RTL --> -- Name : Intel(R) UHD Graphics 770
Target LEVEL0 RTL --> -- PCI ID : 0x4688
and lots more, but I get no output and I have to terminate the program because I am running out of patience (the same program with classic OpenMP statements runs in half a second, the program without any OpenMP takes several seconds).
Oh, I misinterpreted the behaviour of the program! I added a write statement to the time loop (the body of that loop is in the target section) and I see that it is running, albeit very slowly. The task manager indeed indicates that the GPU is doing a lot of work, but I have succeeded in slowing down the program by at least a factor of 100. Not entirely the result I expected :).
You can set the environment variable
LIBOMPTARGET_PLUGIN_PROFILE=T
to get an idea of how much data movement is occurring, how much time is spent in the kernel, etc.
Is the kernel small enough to share? For DO loops, did you use
!$omp target teams distribute parallel do
No, I am still learning how exactly to use the various keywords. I know my way around the classic OpenMP keywords, but these are new and I have to experiment. So I did. Using the keywords you suggested does improve the performance of the program, but it is still very much slower than the sequential version. I have copied the code below (attaching did not work):
! diffu.f90 --
!     Solve a diffusion-reaction equation: nabla2 u = alpha * exp(u)
!
!     Note:
!     The program is much slower than without OpenMP offloading. This clearly
!     requires more fine-tuning.
!
program diffu
    use omp_lib
    implicit none

    real, allocatable :: u(:,:), unew(:,:)
    real              :: alpha, delt
    integer           :: i, j, k, n
    real              :: time1, time2
    integer           :: cnt1, cnt2, cnt_rate

    open( 10, file = 'diffu.out' )

    n = 1280
    allocate( u(n,n), unew(n,n) )

    delt  = 0.1
    alpha = 0.01
    u     = 0.0
    !! u(1,:) = 1.0
    !! call omp_set_num_threads(128)

    call system_clock( cnt1, cnt_rate )
    call cpu_time( time1 )

    write(*,*) 'Start time loop ...'
    do k = 1,1000
        !! write(*,*) k
        !XXXX !$omp target map(tofrom: u) map(from:unew)
        !XXXX !$omp teams

        ! First offloaded sweep: compute the new values in the interior
        !$omp target teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
            enddo
        enddo

        ! Second offloaded sweep: copy the new values back into u
        !$omp target teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                u(i,j) = unew(i,j)
            enddo
        enddo
    enddo

    call cpu_time( time2 )
    call system_clock( cnt2 )

    do j =1,n
        write( 10, '(*(f10.4))' ) u(:,j)
    enddo

    write(*,*) 'CPU time:   ', time2 - time1
    write(*,*) 'Clock time: ', (cnt2 - cnt1) / real(cnt_rate)
end program
As you can see, it contains some experiments - the to and tofrom clauses.
I am inexperienced with GPU programming. I do have some observations on your coding example.
Your 1st !XXXX commented section (when uncommented) can be thought of as an analog of !$omp parallel (without the DO). IOW it starts the encapsulation of a parallel region (terminated with !$omp end parallel). In this case of target (without teams), the encapsulated code (through !$omp end target) is intended to be offloaded to the GPU.
Your code as written is using !$omp target teams ... within the do k= loop.
Meaning each (of the two) instance specifies both an offload region of code plus an offload teams distribution.
IOW each instance performs a copy in from host to GPU and copy back from GPU to host.
Perhaps a better approach is to place the do k= loop, or a portion of that loop inside the offload region.
stride = 10
do k=1,1000, stride
    write(*,*) k
    !$omp target map(tofrom: u) map(from:unew)
    do kk=k,min(k+stride-1,1000)
        !$omp teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
            end do
        end do
        !$omp teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                u(i,j) = unew(i,j)
            end do
        end do
    end do
    !$omp end target
end do
call cpu_time( time2 )
call system_clock( cnt2 )
Or place the entire k loop inside the target region.
The code above prints a progress report every 10 steps.
Also, you can use persistent data within the GPU and copy out what is needed when needed.
I wish the documentation included examples, or links to examples of the various offload features.
Jim Dempsey
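A minimal sketch of the "persistent data" idea, assuming an OpenMP `target data` region is an acceptable way to keep `u` and `unew` on the device for the whole time loop. The module and subroutine names (`diffu_offload`, `diffuse_steps`) are hypothetical, and `collapse(2)` and `map(alloc: ...)` are choices made for this illustration, not something taken from the thread. Each `teams` construct sits directly inside its own `target`, so the time loop stays sequential on the host and only the grid sweeps are offloaded. It has not been timed on the hardware discussed here.

```fortran
module diffu_offload
    implicit none
contains
    ! Advance u by nsteps explicit time steps.
    ! Assumption: u and the scratch array unew stay resident on the device
    ! for the whole time loop via a "target data" region.
    subroutine diffuse_steps(u, nsteps, delt, alpha)
        real, intent(inout) :: u(:,:)
        integer, intent(in) :: nsteps
        real, intent(in)    :: delt, alpha

        real, allocatable   :: unew(:,:)
        integer             :: i, j, k, n1, n2

        n1 = size(u, 1)
        n2 = size(u, 2)
        allocate( unew(n1,n2) )

        ! u is copied to the device once and copied back once;
        ! unew is only allocated on the device, never transferred.
        !$omp target data map(tofrom: u) map(alloc: unew)
        do k = 1, nsteps            ! sequential time loop, runs on the host
            !$omp target teams distribute parallel do collapse(2)
            do j = 2, n2-1
                do i = 2, n1-1
                    unew(i,j) = u(i,j) + delt * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) &
                                - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
                end do
            end do
            !$omp end target teams distribute parallel do

            !$omp target teams distribute parallel do collapse(2)
            do j = 2, n2-1
                do i = 2, n1-1
                    u(i,j) = unew(i,j)
                end do
            end do
            !$omp end target teams distribute parallel do
        end do
        !$omp end target data

        deallocate( unew )
    end subroutine diffuse_steps
end module diffu_offload
```

A caller would do `call diffuse_steps(u, 1000, delt, alpha)` once. Compiled without OpenMP the directives are plain comments and the routine runs serially, which gives a reference result to compare against.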
That is pretty much what I had in mind: copy the data to the GPU, do all the calculations there and then bring back the results. It is indeed the lack of clear examples that makes it an exercise in patience, trial and error. At the very least I am glad I could establish that the GPU is recognised and is doing the work I wanted it to do. Now I need to figure out what the right invocation is to make it worthwhile - there are quite a few permutations possible. I will look into your suggestions, thanks.
I wrote a Fortran OpenMP offload tutorial that will be published in the oneAPI Samples GitHub when oneAPI 2023.2 is released later this month. It's based on a matrix multiply.
This is what the code should look like when the tutorial is complete. The size of the matrix may need to be changed depending on the memory available in the GPU.
ARGH! The ELSE for line 13 doesn't show. You'll need that.
program matrix_multiply
    use omp_lib
    implicit none
    integer :: i, j, k, myid, m, n
    real(8), allocatable, dimension(:,:) :: a, b, c, c_serial

    n = 2600

    ! Restored: myid has to be set before it is tested
    myid = omp_get_thread_num()
    if (myid .eq. 0) then
        print *, 'matrix size ', n
        print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()
        print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
        !$omp target
        ! Restored IF/ELSE/ENDIF around the two messages (not shown in the post)
        if (omp_is_initial_device()) then
            print *, ' Running on CPU'
        else
            print *, ' Running on GPU'
        endif
        !$omp end target
    endif

    allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n))

    ! Initialize matrices
    do j=1,n
        do i=1,n
            a(i,j) = i + j - 1
            b(i,j) = i - j + 1
        enddo
    enddo
    c = 0.0
    c_serial = 0.0

    !$omp target teams map(to: a, b) map(tofrom: c)
    !$omp distribute parallel do SIMD private(j, i, k)
    ! parallel compute matrix multiplication.
    do j=1,n
        do i=1,n
            do k=1,n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
        enddo
    enddo
    !$omp end target teams

    ! serial compute matrix multiplication
    do j=1,n
        do i=1,n
            do k=1,n
                c_serial(i,j) = c_serial(i,j) + a(i,k) * b(k,j)
            enddo
        enddo
    enddo

    ! verify result; stop at the first mismatch so that PASSED is not printed
    do j=1,n
        do i=1,n
            if (c_serial(i,j) .ne. c(i,j)) then
                print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
                stop 1
            endif
        enddo
    enddo

    print *,'PASSED'
end program matrix_multiply
Ah, thanks - again, I will study this. It will certainly be worthwhile to see the directives in action.
My main computer is chugging along on a FEM problem, so I used the old NUC Core i3 to try this problem.
It hangs at line 42 with a "c is not allocated" message.
If I look at the suggested code change, then the k-loop itself becomes parallelised, but that cannot be done, because it represents a development in time, so it has to be sequential. Or do I misunderstand it?
>>If I look at the suggested code change, then the k-loop itself becomes parallelised, but that cannot be done, because it represents a development in time, so it has to be sequential. Or do I misunderstand it?
Perhaps Barbara can correct me should I be wrong
!$omp target (without teams), through !$omp end target...
can be used to install code and data into the GPU (or reuse code from prior use) .AND. begin a serial sequence within the GPU.
Within the above target region, !$omp teams distribute can be used to form a team and distribute DO loop(s) for parallel processing of the loop(s).
In Barbara's example, she is using !$omp target teams ... to enter the offload region with a parallel team running, and then within the parallel offload region use !$omp distribute to partition the DO loop.
Jim Dempsey
This shows the importance of examples :). I hope that the tutorial I found will bring me the correct understanding.
In the meantime, things are not entirely straightforward. Jim's suggestion leads to the following error messages from the compiler:
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
diffu_v5_jim.f90(39): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
do kk=k,min(k+stride-1,1000)
diffu_v5_jim.f90(46): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
!$omp teams distribute parallel do
diffu_v5_jim.f90(52): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
compilation aborted for diffu_v5_jim.f90 (code 1)
whereas Barbara's example leads to run-time errors:
matrix size 2600
Number of CPU procs is 1
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
>> whereas Barbara's example leads to run-time errors:
Can you try a smaller matrix? I had to do that for one Intel GPU I tested.
I reduced the size by a factor of 10 and got a very similar error message. I am also a trifle puzzled by the statement that there is only one CPU. A "hello" program clearly showed 24 CPUs (or better: a default of 24 threads being started).
With the matmul, there is only 1 CPU because the OpenMP directives are all for TARGET, not CPU. You can modify the directives to run on CPU only.
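One way a CPU-only variant could look, as a sketch only: replacing the `target teams` / `distribute` pair with a classic `parallel do` keeps the multiplication on the host threads. The names `matmul_cpu_mod` and `matmul_cpu` are hypothetical, and the loop order simply follows the matmul example posted earlier in the thread.

```fortran
module matmul_cpu_mod
    implicit none
contains
    ! Host-only matrix multiply: c = a * b, parallelised over the columns of c.
    ! Assumption: plain "parallel do" instead of any target/teams directive.
    subroutine matmul_cpu(a, b, c)
        real(8), intent(in)  :: a(:,:), b(:,:)
        real(8), intent(out) :: c(:,:)
        integer :: i, j, k

        c = 0.0d0

        ! Each thread handles a set of columns j, so no two threads
        ! write to the same element of c.
        !$omp parallel do private(i, k)
        do j = 1, size(c, 2)
            do i = 1, size(c, 1)
                do k = 1, size(a, 2)
                    c(i,j) = c(i,j) + a(i,k) * b(k,j)
                end do
            end do
        end do
        !$omp end parallel do
    end subroutine matmul_cpu
end module matmul_cpu_mod
```

With OpenMP enabled at compile time the outer loop is shared among the host threads; without it the directives are treated as comments and the routine runs serially.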
I just copied what I posted, added the "else" and ran it successfully on Linux with PVC. However, the output didn't say I ran on GPU. So I removed the "else" and got this.
+ a.out
matrix size 2600
Number of CPU procs is 1
Number of OpenMP Device Available: 2
Running on GPU
My next step is to load the GPU driver on my laptop and see what's up.
I also set the environment variable LIBOMPTARGET_PLUGIN_PROFILE=T and got this additional output:
Kernel 0 : __omp_offloading_45_810c026c_MAIN___l14
Kernel 1 : __omp_offloading_45_810c026c_MAIN___l35
                           :          Host Time (msec)                   Device Time (msec)
Name                       :    Total  Average      Min      Max    Total  Average      Min      Max    Count
Compiling                  :   421.63   421.63   421.63   421.63     0.00     0.00     0.00     0.00     1.00
DataAlloc                  :     3.11     0.22     0.00     0.81     0.00     0.00     0.00     0.00    14.00
DataRead (Device to Host)  :     0.00     0.00     0.00     0.00     2.38     2.38     2.38     2.38     1.00
DataWrite (Host to Device) :     4.39     0.49     0.01     1.71     7.39     0.82     0.00     2.47     9.00
Kernel 0                   :     1.87     1.87     1.87     1.87     0.01     0.01     0.01     0.01     1.00
Kernel 1                   :     0.07     0.07     0.07     0.07  2865.43  2865.43  2865.43  2865.43     1.00
Linking                    :     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     1.00
OffloadEntriesInit         :     1.51     1.51     1.51     1.51     0.00     0.00     0.00     0.00     1.00
I removed the "else" statement and got the message that one OpenMP device is available. And then the error message.
With one version of the original diffusion program I get the following profile:
Start time loop ...
CPU time: 10.23438
Clock time: 10.23900
Kernel 0 : __omp_offloading_f2b3f24a_efe1e_MAIN___l34
Kernel 1 : __omp_offloading_f2b3f24a_efe1e_MAIN___l40
                           :          Host Time (msec)                   Device Time (msec)
Name                       :    Total  Average      Min      Max    Total  Average      Min      Max    Count
Compiling                  :   604.39   604.39   604.39   604.39     0.00     0.00     0.00     0.00     1.00
DataAlloc                  :     1.49     0.00     0.00     0.05     0.00     0.00     0.00     0.00  8008.00
DataRead (Device to Host)  :   797.53     0.20     0.16     0.90     0.00     0.00     0.00     0.00  4000.00
DataWrite (Host to Device) :   902.03     0.09     0.00     1.66     0.00     0.00     0.00     0.00 10000.00
Kernel 0                   :  2758.64     2.76     2.51    11.12  2582.38     2.58     2.43     3.55  1000.00
Kernel 1                   :  5127.18     5.13     4.83     6.22  4958.39     4.96     4.75     6.06  1000.00
Linking                    :     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     1.00
OffloadEntriesInit         :    12.91    12.91    12.91    12.91     0.00     0.00     0.00     0.00     1.00
That's great that you have something that runs on the GPU! That table is proof! So it seems your environment is set up.
I don't know why the matmul is failing. I just ran it successfully on an Intel Core i7-8809G 3.10GHz with Intel® HD Graphics 630. An older machine, but it worked.
Some OpenMP offload references:
- Three Quick, Practical Examples of OpenMP Offload to GPUs (Intel webinar)
- GPU Offloading: The Next Chapter for Intel® Fortran Compiler (Intel webinar)
- Using OpenMP—The Next Step: Affinity, Accelerators, Tasking, and SIMD (book)
- Examples from openmp.org. Search for TARGET.