Re: Different results between icpx vs sycl math library

Harshada · 02-14-2024

Hi,

I am writing SYCL for the first time and have ported my sequential C++ code to a kernel-based SYCL implementation. I see differences on the order of 1e-17 between the sequential and SYCL results. I compile with icpx using the flags -g -O0 -fp-model=precise -fimf-arch-consistency=true -no-fma.

I am fairly sure that my code is correct and that the two math libraries produce different results, which raises these questions:

1. The sequential version uses the C math library (math.h), while the SYCL version uses the sycl:: math functions. Can I expect bit-identical results with optimization disabled? If not, why?

2. Even if the SYCL and sequential versions do not produce bit-identical results, shouldn't the CPU and GPU runs of the SYCL version agree bit for bit? For me they do not.

3. A double carries only about 16 significant decimal digits (machine epsilon is about 2.2e-16), so any digits beyond that are noise. How can I force the CPU to use only 64-bit registers so that I get identical results? We had a similar issue with a Kokkos vs. sequential implementation built with GNU compilers, and the -mpc64 flag helped. With icpx I have already tried -pc64, -mdouble, and -mfp64, and none of these flags is recognized.

I have attached a very small snippet of my seq vs sycl code just to give you an idea.

Sequential

TARGET real_t cloud_to_rain(real_t t, real_t qc, real_t qr, real_t nc) {
  const real_t au_kernel =
      x1 / (static_cast<real_t>(20.0) * x2) * (x3 + static_cast<real_t>(2.0)) *
      (x3 + static_cast<real_t>(4.0)) /
      pow((x3 + static_cast<real_t>(1.0)), static_cast<real_t>(2.0));
  real_t result = 0.0;

  if (qc > qmin_ac && t > graupel_ct::tfrz_hom) {
    real_t tau = fmax(tau_min, fmin(static_cast<real_t>(1.0) - qc / (qc + qr),
                                    tau_max)); // time-scale
    real_t phi = pow(tau, b_phi); // similarity function for autoconversion
    phi = a_phi * phi *
          pow((static_cast<real_t>(1.0) - phi), static_cast<real_t>(3.0));
    real_t xau = au_kernel * pow(qc * qc / nc, static_cast<real_t>(2.)) *
                 (static_cast<real_t>(1.0) +
                  phi / pow(static_cast<real_t>(1.0) - tau,
                            static_cast<real_t>(2.0))); // autoconversion rate
    real_t xac =
        ac_kernel * qc * qr *
        pow((tau / (tau + c_phi)), static_cast<real_t>(4.0)); // accretion rate
    result = xau + xac;
  }

  return result;
}

SYCL

TARGET real_t cloud_to_rain(real_t t, real_t qc, real_t qr, real_t nc) {
  const real_t au_kernel =
      x1 / (static_cast<real_t>(20.0) * x2) * (x3 + static_cast<real_t>(2.0)) *
      (x3 + static_cast<real_t>(4.0)) /
      sycl::pow((x3 + static_cast<real_t>(1.0)), static_cast<real_t>(2.0));
  real_t result = 0.0;

  if (qc > qmin_ac && t > graupel_ct::tfrz_hom) {
    real_t tau = sycl::fmax(tau_min, sycl::fmin(static_cast<real_t>(1.0) - qc / (qc + qr),
                                                tau_max)); // time-scale
    real_t phi = sycl::pow(tau, b_phi); // similarity function for autoconversion
    phi = a_phi * phi *
          sycl::pow((static_cast<real_t>(1.0) - phi), static_cast<real_t>(3.0));
    real_t xau = au_kernel * sycl::pow(qc * qc / nc, static_cast<real_t>(2.)) *
                 (static_cast<real_t>(1.0) +
                  phi / sycl::pow(static_cast<real_t>(1.0) - tau,
                                  static_cast<real_t>(2.0))); // autoconversion rate
    real_t xac =
        ac_kernel * qc * qr *
        sycl::pow((tau / (tau + c_phi)), static_cast<real_t>(4.0)); // accretion rate
    result = xau + xac;
  }

  return result;
}

Alex_Y_Intel · 04-05-2024

@Harshada, to investigate this issue further we need a runnable test case. Could you please supply a reproducer that we can compile and run to see the results you mentioned, along with the exact steps to reproduce them?

Harshada · 06-07-2024

https://github.com/habalasub/sycl_math_library_bug

I have attached a link to a public GitHub repository with all the instructions on how to compile.

In case the link is not accessible, the instructions are reproduced below.

Steps to reproduce the error:

1. Initialise the oneAPI environment on your machine.
2. Compile and run the code (example below). These flags were chosen because the goal is precision, not performance.

icpx -fsycl -fp-model=precise -O0 power.cpp -o power
./power $INPUT

where $INPUT is the number of elements to test.

Note: The code can be run with either floats or doubles by manually changing the definition of 'real_t' in the code.

Alex_Y_Intel · 06-07-2024

I'm escalating your issue to our internal team and will work on it.

Alex_Y_Intel · 06-08-2024

I ran the reproducer on two different machines:

sdp714857:~/power$ ./power 80
Device name: Intel(R) Xeon(R) Platinum 8480+
Total data volume: 640 bytes.
-------------- POW FUNCTION ----------------------
Maximum difference between SYCL and CPU(SYCL): 4.44089e-16
Maximum difference between SYCL and CPU(STD): 4.44089e-16
-------------- EXP FUNCTION ----------------------
Maximum difference between SYCL and CPU(SYCL): 0
Maximum difference between SYCL and CPU(STD): 0

sdp4450:~/power$ ./power 80
Device name: Intel(R) Data Center GPU Max 1550
Total data volume: 640 bytes.
-------------- POW FUNCTION ----------------------
Maximum difference between SYCL and CPU(SYCL): 1.52588e-05
Maximum difference between SYCL and CPU(STD): 1.52588e-05
-------------- EXP FUNCTION ----------------------
Maximum difference between SYCL and CPU(SYCL): 0.000976562
Maximum difference between SYCL and CPU(STD): 0.000976562

Which version of the icx/icpx compiler did you use?

Harshada · 06-09-2024

I would use more than 80 elements to verify my claims, as the example below shows. This was run on my local desktop in single precision, using Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320).

./power 8000

Device name: 11th Gen Intel(R) Core(TM) i5-11320H @ 3.20GHz
Total data volume: 64000 bytes.
-------------- POW FUNCTION ----------------------
Maximum difference between SYCL and CPU(SYCL): 7.62939e-06
Maximum difference between SYCL and CPU(STD): 7.62939e-06
-------------- EXP FUNCTION ----------------------
Maximum difference between SYCL and CPU(SYCL): 0.00195312
Maximum difference between SYCL and CPU(STD): 0.00195312

./power 80
Device name: 11th Gen Intel(R) Core(TM) i5-11320H @ 3.20GHz
Total data volume: 640 bytes.
-------------- POW FUNCTION ----------------------
Maximum difference between SYCL and CPU(SYCL): 4.44089e-16
Maximum difference between SYCL and CPU(STD): 4.44089e-16
-------------- EXP FUNCTION ----------------------
Maximum difference between SYCL and CPU(SYCL): 0
Maximum difference between SYCL and CPU(STD): 0

Alex_Y_Intel · 06-21-2024

While the DPC++ compiler aims to support the SYCL specification, there are inherent differences in floating-point control and in library function implementations across devices, which can lead to variations in computed results. We pass most floating-point control settings to the device compiler, with these exceptions: the IMF controls (the -fimf-* options for the Intel math library) aren't supported for SYCL or for non-CPU OpenMP offload, and -fp-model=strict isn't supported for device compilation. Different devices, including CPUs used as SYCL devices, may use implementations of library functions that differ from those used by the host compiler, so numeric results can be inconsistent across devices. With -fp-model=fast, device compilers may select less accurate implementations of math functions. With -fp-model=precise, math functions meet the accuracy requirements of the OpenCL SPIR-V Environment Specification, but results can still vary within the allowed accuracy bounds.