Is the following precision loss expected from the OpenCL plugin?
$ ifx -O0 -fopenmp-targets=spir64 -fiopenmp opencl_accuracy.f90 -o opencl_accuracy
$ OMP_DEFAULT_DEVICE=1 ./opencl_accuracy
ap 2.080958
a 1.049477
z 1.000000
CPU cons .3576279E-05
GPU cons .3814697E-05
CPU cons 0XF.P-22
GPU cons 0X8.P-21
program opencl_accuracy
  implicit none
  integer :: n_a, i
  real :: R, ap, a, z, h, cons_cpu, cons_gpu

  n_a = 500
  R = 1.03e0
  h = (4.0 - 0.01)/real(n_a - 1)
  ap = 0.01
  a = 0.01
  do i = 2, 260
    ap = ap + h
  end do
  do i = 2, 131
    a = a + h
  end do
  z = 1.0000

  print *, "ap", ap
  print *, "a ", a
  print *, "z ", z

  cons_cpu = R*a + z - ap
  !$omp target map(from: cons_gpu)
  cons_gpu = R*a + z - ap
  !$omp end target

  write(*,'(A,G0)') "CPU cons ", cons_cpu
  write(*,'(A,G0)') "GPU cons ", cons_gpu
  write(*,'(A,EX0.0)') "CPU cons ", cons_cpu
  write(*,'(A,EX0.0)') "GPU cons ", cons_gpu
end program
The device used is Intel(R) UHD Graphics 750:
Device Name Intel(R) UHD Graphics 750
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 3.0 NEO
Driver Version 24.05.28454.6
Device OpenCL C Version OpenCL C 1.2
The compiler version is ifx (IFX) 2024.0.2 20231213.
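A quick reading of the hexadecimal output above, added as a worked check rather than anything stated in the thread: the two results are adjacent multiples of 2^-22.

```latex
\begin{align*}
\texttt{0XF.P-22} &= 15 \times 2^{-22} \approx 3.576279 \times 10^{-6} \\
\texttt{0X8.P-21} &= 8 \times 2^{-21} = 16 \times 2^{-22} \approx 3.814697 \times 10^{-6} \\
\text{difference} &= 2^{-22} \approx 2.384 \times 10^{-7}
\end{align*}
```

2^-22 is the spacing between adjacent single-precision values in [2, 4), the range that holds both ap and R*a + z. The CPU and GPU results are therefore consistent with the intermediate sum R*a + z landing on neighbouring floats before ap is subtracted. Which transformation produces that on the GPU (a fused multiply-add, for example) is an assumption here; the output alone does not show it.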
You can use -fp-model precise to control the default behavior of the GPU computation:
$ ifx -O0 -fopenmp-targets=spir64 -fiopenmp opencl_accuracy.f90 -o opencl_accuracy
$ export LIBOMPTARGET_PLUGIN_PROFILE=T
$ ./opencl_accuracy
ap 2.080958
a 1.049477
z 1.000000
CPU cons .3576279E-05
GPU cons .3814697E-05
CPU cons 0XF.P-22
GPU cons 0X8.P-21
=====================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) UHD Graphics 630, Thread 0
---------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_3a_4bd7e1fe_MAIN___l26
---------------------------------------------------------------------------------------------------------------------
                          :            Host Time (msec)                       Device Time (msec)
Name                      :     Total   Average       Min       Max     Total   Average       Min       Max     Count
---------------------------------------------------------------------------------------------------------------------
Compiling                 :    258.30    258.30    258.30    258.30      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :      2.50      0.28      0.00      2.40      0.00      0.00      0.00      0.00      9.00
DataRead (Device to Host) :      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
Kernel 0                  :     11.79     11.79     11.79     11.79      0.03      0.03      0.03      0.03      1.00
Linking                   :      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :      2.86      2.86      2.86      2.86      0.00      0.00      0.00      0.00      1.00
=====================================================================================================================
$ ifx -O0 -fopenmp-targets=spir64 -fiopenmp opencl_accuracy.f90 -o opencl_accuracy -fp-model precise
$ ./opencl_accuracy
ap 2.080958
a 1.049477
z 1.000000
CPU cons .3576279E-05
GPU cons .3576279E-05
CPU cons 0XF.P-22
GPU cons 0XF.P-22
=====================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) UHD Graphics 630, Thread 0
---------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_3a_4bd7e1fe_MAIN___l26
---------------------------------------------------------------------------------------------------------------------
                          :            Host Time (msec)                       Device Time (msec)
Name                      :     Total   Average       Min       Max     Total   Average       Min       Max     Count
---------------------------------------------------------------------------------------------------------------------
Compiling                 :    253.34    253.34    253.34    253.34      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :      0.06      0.01      0.00      0.02      0.00      0.00      0.00      0.00      9.00
DataRead (Device to Host) :      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
Kernel 0                  :     11.61     11.61     11.61     11.61      0.03      0.03      0.03      0.03      1.00
Linking                   :      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :      2.69      2.69      2.69      2.69      0.00      0.00      0.00      0.00      1.00
=====================================================================================================================
It's probably not obvious, but the GPU code is built by a different compiler than ifx. The ifx driver invokes that GPU compiler, IGC, and passes it options. Like me, you probably assumed -O0 applied to the GPU code as well as the CPU code. Evidently it does not. Let me consult with the ifx driver team and the IGC people to see how we can tell the GPU compiler to use -O0, or an -fp-model setting, independently of the ifx CPU compiler.
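When trying different flags, comparing the hexadecimal (EX0.0) lines is less error-prone than eyeballing the decimal ones. Here is a minimal sketch of a helper for that; `check_cons` is a hypothetical name, not part of any Intel tool, and it assumes the program prints its "CPU cons" and "GPU cons" hex lines exactly as shown in the runs above.

```shell
# Hypothetical helper (not from the thread): read opencl_accuracy output on
# stdin and compare the hexadecimal "CPU cons" and "GPU cons" lines.
# Exit status: 0 = match, 1 = differ, 2 = a hex line was not found.
check_cons() {
    awk '
        /^ *CPU cons 0X/ { cpu = $3 }   # e.g. "CPU cons 0XF.P-22"
        /^ *GPU cons 0X/ { gpu = $3 }   # e.g. "GPU cons 0X8.P-21"
        END {
            if (cpu == "" || gpu == "") { print "missing"; exit 2 }
            if (cpu == gpu) { print "match " cpu; exit 0 }
            print "differ CPU=" cpu " GPU=" gpu
            exit 1
        }'
}
```

Typical use would be `./opencl_accuracy | check_cons` after each rebuild, for example once with and once without -fp-model precise.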
Thanks Ron, it's good to know it's possible to control the floating point model also on the GPU, although not obvious to me from the start. I was aware of the IGC, as I had to install it separately, and also tested it with some OpenCL programs.
Querying the OpenCL device properties, I noticed that divide and sqrt aren't correctly rounded according to IEEE rules:
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
I wasn't sure if or how this carries over to OpenMP offloading. I noticed that the variable LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS can be used to pass options to the OpenCL compiler (https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#compiler-options).
It would be helpful for Fortran programmers if some information about controlling the rounding behavior when using OpenMP offloading were incorporated into the Intel Fortran compiler documentation.
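As a rough sketch of how that variable might be used: the wrapper below sets LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS for a single command only. `run_with_cl_opts` is a hypothetical helper name. `-cl-opt-disable` is one of the standard build options in the Khronos list linked above; whether it changes the result of this particular test is an assumption that the thread does not confirm.

```shell
# Hypothetical wrapper (not from the thread): run one command with the given
# OpenCL build options exported only for that command.
#   $1      option string handed to the OpenCL compiler
#   $2...   command to run, with its arguments
run_with_cl_opts() {
    cl_opts=$1
    shift
    env LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS="$cl_opts" "$@"
}
```

For example, `OMP_DEFAULT_DEVICE=1 run_with_cl_opts "-cl-opt-disable" ./opencl_accuracy` would run the test program from the first post with optimizations disabled in the OpenCL compiler, assuming the OpenCL plugin is the one in use.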
For this specific problem, we do think we found a bug in our Fortran front end. At -O0 we should be preventing the GPU compiler from applying optimizations such as assuming no infinities, forming FMAs, contracting expressions, and reassociating them. The -fp-model option can control these, but I think -O0 should imply that none of those optimizations are used. There are two ways to fix this: one in the driver code, the other in the front end. We are debating which path is best for the long run.
I'm opening a bug report on this. I'll have the bug ID shortly, after I roll up my report to the developers.
As for the documentation - agreed. FP control of kernels should be documented. I'll open a Documentation Feature Request for this topic.
Perhaps the following page is the amendment requested?
oneAPI GPU Optimization Guide - Accuracy versus Performance Tradeoffs in Floating-Point Computations:
https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-2/accuracy-versus-performance-tradeoffs-in-floating.html
Btw, in my investigation I found what I think is the most explicit compiler option to control the GPU computation for this specific example:
-fp-model source
which says to evaluate expressions using standard Fortran expression evaluation rules and ordering. I think this is more explicit than something vague like "-fp-model precise", which bundles a whole list of actions.
The bug ID for this issue is CMPLRLLVM-57897.
