Precision loss with OpenMP offload (OpenCL Plugin)

ivanp · 04-12-2024

Is the following precision loss expected from the OpenCL plugin?

$ ifx -O0 -fopenmp-targets=spir64 -fiopenmp opencl_accuracy.f90 -o opencl_accuracy
$ OMP_DEFAULT_DEVICE=1 ./opencl_accuracy 
 ap   2.080958    
 a    1.049477    
 z    1.000000    
CPU cons .3576279E-05
GPU cons .3814697E-05
CPU cons 0XF.P-22
GPU cons 0X8.P-21

program opencl_accuracy
implicit none
integer :: n_a, i
real :: R, ap, a, z, h, cons_cpu, cons_gpu

n_a = 500
R = 1.03e0
h = (4.0 - 0.01)/real(n_a - 1)

ap = 0.01
a = 0.01
do i = 2, 260
   ap = ap + h
end do 
do i = 2, 131
   a = a + h
end do
z = 1.0000

print *, "ap", ap
print *, "a ", a
print *, "z ", z

cons_cpu = R*a + z - ap

!$omp target map(from: cons_gpu)
cons_gpu = R*a + z - ap
!$omp end target

write(*,'(A,G0)') "CPU cons ", cons_cpu
write(*,'(A,G0)') "GPU cons ", cons_gpu

write(*,'(A,EX0.0)') "CPU cons ", cons_cpu
write(*,'(A,EX0.0)') "GPU cons ", cons_gpu
end program

The device used is Intel(R) UHD Graphics 750:

  Device Name                                     Intel(R) UHD Graphics 750
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 3.0 NEO 
  Driver Version                                  24.05.28454.6
  Device OpenCL C Version                         OpenCL C 1.2

The compiler version is ifx (IFX) 2024.0.2 20231213.
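
For reference, the two hexadecimal values in the output above can be decoded by hand: 0XF.P-22 is 15 * 2^-22 and 0X8.P-21 is 8 * 2^-21 = 16 * 2^-22, so the two results differ by exactly 2^-22. That is one unit in the last place for a single-precision value between 2 and 4, such as the intermediate R*a + z (about 2.08) here. This is consistent with a single rounding difference in that intermediate, although it does not prove it. Below is a minimal sketch that checks the arithmetic; the program and variable names are made up for illustration, and default real is assumed to be IEEE single precision.

```fortran
program ulp_gap_sketch
   implicit none
   real :: cpu, gpu, ap, gap

   cpu = 15.0*scale(1.0, -22)   ! 0XF.P-22 from the "CPU cons" line
   gpu = 8.0*scale(1.0, -21)    ! 0X8.P-21 from the "GPU cons" line
   ap  = 2.080958               ! printed value of ap; same binade as R*a + z

   gap = gpu - cpu              ! exact: both values are small multiples of 2**(-22)

   ! spacing(x) is the distance between adjacent representable numbers near x.
   print *, "difference in units of spacing(ap):", gap/spacing(ap)
end program ulp_gap_sketch
```

Under those assumptions the printed ratio is 1, i.e. the CPU and GPU results are one ulp of the intermediate apart.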

Ron_Green · 04-16-2024

You can use -fp-model precise to control the default behavior of the GPU computation:

$ ifx -O0 -fopenmp-targets=spir64 -fiopenmp opencl_accuracy.f90 -o opencl_accuracy
$ export LIBOMPTARGET_PLUGIN_PROFILE=T
$ ./opencl_accuracy
 ap   2.080958    
 a    1.049477    
 z    1.000000    
CPU cons .3576279E-05
GPU cons .3814697E-05
CPU cons 0XF.P-22
GPU cons 0X8.P-21
=====================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) UHD Graphics 630, Thread 0
---------------------------------------------------------------------------------------------------------------------
Kernel 0                 : __omp_offloading_3a_4bd7e1fe_MAIN___l26
---------------------------------------------------------------------------------------------------------------------
                         : Host Time (msec)                        Device Time (msec)                      
Name                     :      Total   Average       Min       Max     Total   Average       Min       Max     Count
---------------------------------------------------------------------------------------------------------------------
Compiling                :     258.30    258.30    258.30    258.30      0.00      0.00      0.00      0.00      1.00
DataAlloc                :       2.50      0.28      0.00      2.40      0.00      0.00      0.00      0.00      9.00
DataRead (Device to Host):       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
Kernel 0                 :      11.79     11.79     11.79     11.79      0.03      0.03      0.03      0.03      1.00
Linking                  :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit       :       2.86      2.86      2.86      2.86      0.00      0.00      0.00      0.00      1.00
=====================================================================================================================


$ ifx -O0 -fopenmp-targets=spir64 -fiopenmp opencl_accuracy.f90 -o opencl_accuracy -fp-model precise
$ ./opencl_accuracy 
 ap   2.080958    
 a    1.049477    
 z    1.000000    
CPU cons .3576279E-05
GPU cons .3576279E-05
CPU cons 0XF.P-22
GPU cons 0XF.P-22
=====================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) UHD Graphics 630, Thread 0
---------------------------------------------------------------------------------------------------------------------
Kernel 0                 : __omp_offloading_3a_4bd7e1fe_MAIN___l26
---------------------------------------------------------------------------------------------------------------------
                         : Host Time (msec)                        Device Time (msec)                      
Name                     :      Total   Average       Min       Max     Total   Average       Min       Max     Count
---------------------------------------------------------------------------------------------------------------------
Compiling                :     253.34    253.34    253.34    253.34      0.00      0.00      0.00      0.00      1.00
DataAlloc                :       0.06      0.01      0.00      0.02      0.00      0.00      0.00      0.00      9.00
DataRead (Device to Host):       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
Kernel 0                 :      11.61     11.61     11.61     11.61      0.03      0.03      0.03      0.03      1.00
Linking                  :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit       :       2.69      2.69      2.69      2.69      0.00      0.00      0.00      0.00      1.00
=====================================================================================================================

It's probably not obvious, but the GPU code is compiled by a different compiler than ifx. The ifx driver invokes that GPU compiler, IGC, and passes it options. Like me, you probably assumed that -O0 applied to the GPU code as well as the CPU code. As the output above shows, it does not. Let me consult with the ifx driver team and the IGC people to see how we can tell the GPU compiler to use -O0 or an -fp-model setting independently of the ifx CPU compiler.

ivanp · 04-17-2024

Thanks Ron, it's good to know that the floating-point model can also be controlled on the GPU, although that was not obvious to me from the start. I was aware of IGC, as I had to install it separately, and I have also tested it with some OpenCL programs.

Querying the OpenCL device properties, I noticed that divide and sqrt aren't correctly rounded according to IEEE rules:

  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No

I wasn't sure if or how this carries over to OpenMP offloading. I noticed that the variable LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS can be used to pass options to the OpenCL compiler (https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#compiler-options).

It would be helpful for Fortran programmers if some information about controlling the rounding behavior when using OpenMP offloading were incorporated into the Intel Fortran compiler documentation.

Ron_Green · 04-17-2024

For this specific problem we do think we found a bug in our Fortran front end. At -O0 we should be preventing the GPU compiler from applying floating-point optimizations such as assuming no infinities, forming FMAs, contracting operations, and reassociating. The -fp-model option can control these, but I think -O0 should imply that none of those optimizations are used. There are two ways to fix this: one is in the driver code, the other is in the front end. We are debating which path is best for the long run.

I'm opening a bug report on this. I'll have the bug ID shortly, after I roll up my report to the developers.

As for the documentation - agreed. FP control of kernels should be documented. I'll open a Documentation Feature Request for this topic.
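
To illustrate why contraction into an FMA, one of the optimizations listed above, can change a result in the last bit, here is a minimal sketch. It is not the reproducer from this thread and not a confirmed diagnosis of it. It assumes default real is IEEE single precision with round-to-nearest, and a compiler that provides the Fortran 2018 ieee_fma function from the ieee_arithmetic module; the program and variable names are made up for illustration.

```fortran
program fma_contraction_sketch
   use, intrinsic :: ieee_arithmetic, only: ieee_fma
   implicit none
   real :: a, c, unfused, fused
   real, volatile :: p          ! volatile: the rounded product must be stored and reloaded

   a = 1.0 + scale(1.0, -12)    ! 1 + 2**(-12), exactly representable
   c = 1.0 + scale(1.0, -11)    ! 1 + 2**(-11), exactly representable

   ! The exact product a*a is 1 + 2**(-11) + 2**(-24).
   p = a*a                      ! rounded to single precision: the 2**(-24) term is lost
   unfused = p - c              ! two roundings: multiply, then subtract

   fused = ieee_fma(a, a, -c)   ! one rounding of the exact value a*a - c

   write(*,'(A,ES13.6)') "unfused a*a - c: ", unfused
   write(*,'(A,ES13.6)') "fused   a*a - c: ", fused
end program fma_contraction_sketch
```

Under those assumptions the unfused form gives 0 and the fused form gives 2**(-24), about 5.96E-08. Neither is wrong in itself; they simply round at different points, which is the kind of CPU versus GPU discrepancy that an -fp-model setting is meant to rule out.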

ivanp · 09-20-2024

Perhaps the following page is the amendment requested?

oneAPI GPU Optimization Guide - Accuracy versus Performance Tradeoffs in Floating-Point Computations:
https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-2/accuracy-versus-performance-tradeoffs-in-floating.html

Ron_Green · 04-17-2024

By the way, in my investigation I found what I think is the most explicit compiler option to control the GPU computation for this specific example:

-fp-model source

which says to evaluate expressions using standard Fortran expression evaluation rules and ordering. I think this is more explicit than something vague like "-fp-model precise", which bundles a whole list of actions.

Ron_Green · 04-22-2024

The bug ID for this issue is CMPLRLLVM-57897.