
Hi,

I am writing SYCL for the first time and have ported my sequential C++ code to a kernel-based SYCL code. I see a difference on the order of 1e-17 between the sequential and SYCL implementations. The compiler flags I used are **-g -O0 -fp-model=precise -fimf-arch-consistency=true -no-fma**. I use icpx.

I am quite sure that my code is correct, but the math libraries used by the two versions produce different results. This raises the following questions:

1. The sequential version uses the C math library (math.h) while the SYCL version uses the sycl:: math functions. Can I expect bit-identical results with optimization disabled? (If not, why?)

2. Even if the SYCL and sequential versions do not produce bit-identical results, shouldn't the CPU and GPU runs of the SYCL code produce bit-identical results? They do not for me!

3. Even though machine precision for a double is only about 1e-16 and all further digits are garbage, how can I force the CPU to use only 64-bit registers so that I get identical results? We had a similar issue with a Kokkos vs. sequential implementation with GNU, and the -mpc64 flag helped. With icpx, however, I have already tried -pc64, -mdouble and -mfp64, and none of these flags is recognized.
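On question 3, one way to check whether extended-precision registers are involved at all is the standard `FLT_EVAL_METHOD` macro from `<cfloat>`. This is a minimal sketch, not something taken from the original code: the two helper names are hypothetical. A value of 0 means intermediates are evaluated in their declared type, which is what I would expect on x86-64 builds that use SSE registers, although that is an assumption about the target and not something verified here. In that case a flag like -mpc64, which controls x87 precision, would have nothing to change.

```cpp
#include <cassert>   // for the optional check shown in the usage note
#include <cfloat>    // FLT_EVAL_METHOD
#include <limits>

// FLT_EVAL_METHOD reports how floating-point intermediates are evaluated:
//    0  -> in the declared type (double stays 64-bit)
//    1  -> float and double are evaluated as double
//    2  -> everything is evaluated as long double (x87-style)
//   <0  -> indeterminate
// Hypothetical helper, introduced only for this illustration.
inline bool intermediates_use_declared_precision() {
    return FLT_EVAL_METHOD == 0;
}

// Machine epsilon of double: the gap between 1.0 and the next
// representable double (2^-52, about 2.2e-16).
inline double double_epsilon() {
    return std::numeric_limits<double>::epsilon();
}
```

Calling `assert(intermediates_use_declared_precision());` at program start would document the assumption in the host code.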

I have attached a very small snippet of my seq vs sycl code just to give you an idea.

**Sequential**

    TARGET real_t cloud_to_rain(real_t t, real_t qc, real_t qr, real_t nc) {
      const real_t au_kernel =
          x1 / (static_cast<real_t>(20.0) * x2) * (x3 + static_cast<real_t>(2.0)) *
          (x3 + static_cast<real_t>(4.0)) /
          pow((x3 + static_cast<real_t>(1.0)), static_cast<real_t>(2.0));
      real_t result = 0.0;
      if (qc > qmin_ac && t > graupel_ct::tfrz_hom) {
        real_t tau = fmax(tau_min, fmin(static_cast<real_t>(1.0) - qc / (qc + qr),
                                        tau_max)); // time-scale
        real_t phi = pow(tau, b_phi); // similarity function for autoconversion
        phi = a_phi * phi *
              pow((static_cast<real_t>(1.0) - phi), static_cast<real_t>(3.0));
        real_t xau = au_kernel * pow(qc * qc / nc, static_cast<real_t>(2.)) *
                     (static_cast<real_t>(1.0) +
                      phi / pow(static_cast<real_t>(1.0) - tau,
                                static_cast<real_t>(2.0))); // autoconversion rate
        real_t xac =
            ac_kernel * qc * qr *
            pow((tau / (tau + c_phi)), static_cast<real_t>(4.0)); // accretion rate
        result = xau + xac;
      }
      return result;
    }

**SYCL**

    TARGET real_t cloud_to_rain(real_t t, real_t qc, real_t qr, real_t nc) {
      const real_t au_kernel =
          x1 / (static_cast<real_t>(20.0) * x2) * (x3 + static_cast<real_t>(2.0)) *
          (x3 + static_cast<real_t>(4.0)) /
          sycl::pow((x3 + static_cast<real_t>(1.0)), static_cast<real_t>(2.0));
      real_t result = 0.0;
      if (qc > qmin_ac && t > graupel_ct::tfrz_hom) {
        real_t tau = sycl::fmax(tau_min, sycl::fmin(static_cast<real_t>(1.0) - qc / (qc + qr),
                                                    tau_max)); // time-scale
        real_t phi = sycl::pow(tau, b_phi); // similarity function for autoconversion
        phi = a_phi * phi *
              sycl::pow((static_cast<real_t>(1.0) - phi), static_cast<real_t>(3.0));
        real_t xau = au_kernel * sycl::pow(qc * qc / nc, static_cast<real_t>(2.)) *
                     (static_cast<real_t>(1.0) +
                      phi / sycl::pow(static_cast<real_t>(1.0) - tau,
                                      static_cast<real_t>(2.0))); // autoconversion rate
        real_t xac =
            ac_kernel * qc * qr *
            sycl::pow((tau / (tau + c_phi)), static_cast<real_t>(4.0)); // accretion rate
        result = xau + xac;
      }
      return result;
    }
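A side note on the two snippets: apart from b_phi, every exponent passed to pow is a small integer (2, 3 or 4). A minimal sketch, assuming a slight change in the arithmetic is acceptable, is to write those powers as explicit products. Each multiplication is a single correctly rounded IEEE 754 operation, so with FMA contraction disabled the host and the device should agree on them regardless of which pow implementation each one links against. The helper names (pow2, pow3, pow4, accretion_factor) are hypothetical and not part of the original code, and the products may differ in the last bit from what pow returns.

```cpp
#include <cassert>

// Hypothetical helpers: small integer powers as explicit products, so
// the result does not depend on any library pow implementation.
template <typename T>
constexpr T pow2(T x) { return x * x; }

template <typename T>
constexpr T pow3(T x) { return x * x * x; }

template <typename T>
constexpr T pow4(T x) {
    const T x2 = x * x;   // one rounding
    return x2 * x2;       // second rounding
}

// Illustration only: the factor pow(tau / (tau + c_phi), 4) from the
// accretion-rate term, rewritten without a pow call.
template <typename T>
inline T accretion_factor(T tau, T c_phi) {
    assert(tau + c_phi != T(0));   // guard against division by zero
    return pow4(tau / (tau + c_phi));
}
```

The call with the non-integer exponent, pow(tau, b_phi), still needs the library function, so this only narrows down where the remaining differences can come from.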


@Harshada in order for us to investigate this issue further, we need a runnable test case. Can you please supply a reproducer so we can compile, execute, and see the results you mentioned? Please also add exact reproducing steps.


https://github.com/habalasub/sycl_math_library_bug

I have attached a link to a public GitHub repository with all the instructions on how to compile.

Additionally, in case the link is not accessible, the steps are given below.

Steps to reproduce the error:

1. Initialise the oneAPI environment on your machine.

2. Compile and run the code (example below). These flags were used because the goal is precision, not performance.

    icpx -fsycl -fp-model=precise -O0 power.cpp -o power
    ./power $INPUT

where $INPUT is the number of elements to test.

Note: The code can be run with either floats or doubles by manually changing the definition of 'real_t' in the code.


I'm escalating your issue to our internal team and will work on it.


I used two different machines:

    sdp714857:~/power$ ./power 80
    Device name: Intel(R) Xeon(R) Platinum 8480+
    Total data volume: 640 bytes.
    -------------- POW FUNCTION ----------------------
    Maximum difference between SYCL and CPU(SYCL): 4.44089e-16
    Maximum difference between SYCL and CPU(STD): 4.44089e-16
    -------------- EXP FUNCTION ----------------------
    Maximum difference between SYCL and CPU(SYCL): 0
    Maximum difference between SYCL and CPU(STD): 0

    sdp4450:~/power$ ./power 80
    Device name: Intel(R) Data Center GPU Max 1550
    Total data volume: 640 bytes.
    -------------- POW FUNCTION ----------------------
    Maximum difference between SYCL and CPU(SYCL): 1.52588e-05
    Maximum difference between SYCL and CPU(STD): 1.52588e-05
    -------------- EXP FUNCTION ----------------------
    Maximum difference between SYCL and CPU(SYCL): 0.000976562
    Maximum difference between SYCL and CPU(STD): 0.000976562

Which compiler version did you use?


I would use more than 80 elements to verify my claims, as the example below shows. This was run on my local desktop in single precision, using Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320).

    ./power 8000
    Device name: 11th Gen Intel(R) Core(TM) i5-11320H @ 3.20GHz
    Total data volume: 64000 bytes.
    -------------- POW FUNCTION ----------------------
    Maximum difference between SYCL and CPU(SYCL): 7.62939e-06
    Maximum difference between SYCL and CPU(STD): 7.62939e-06
    -------------- EXP FUNCTION ----------------------
    Maximum difference between SYCL and CPU(SYCL): 0.00195312
    Maximum difference between SYCL and CPU(STD): 0.00195312

    ./power 80
    Device name: 11th Gen Intel(R) Core(TM) i5-11320H @ 3.20GHz
    Total data volume: 640 bytes.
    -------------- POW FUNCTION ----------------------
    Maximum difference between SYCL and CPU(SYCL): 4.44089e-16
    Maximum difference between SYCL and CPU(STD): 4.44089e-16
    -------------- EXP FUNCTION ----------------------
    Maximum difference between SYCL and CPU(SYCL): 0
    Maximum difference between SYCL and CPU(STD): 0
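A note on reading these numbers: the reported maxima are exact powers of two (4.44089e-16 is 2^-51, 7.62939e-06 is 2^-17, 0.00195312 is 2^-9), which is what a last-bit difference looks like when it is printed as an absolute value. For example, 2^-9 is the spacing between adjacent floats in the range [16384, 32768). Whether the test values actually reach that magnitude is an assumption, since the thread does not show the inputs. A minimal sketch for expressing a difference in units of the local spacing (ULPs) instead of as an absolute value could look like this; ulp_at and diff_in_ulps are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Spacing between x and the next representable value above it
// (one "unit in the last place" at the magnitude of x).
template <typename T>
T ulp_at(T x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<T>::infinity()) - x;
}

// Absolute difference between a and b, expressed as a multiple of the
// spacing at the larger of the two magnitudes.
template <typename T>
T diff_in_ulps(T a, T b) {
    const T magnitude = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) / ulp_at(magnitude);
}
```

With this, an absolute difference of 0.00195312 between two floats near 20000 comes out as a single ULP, which puts the larger-looking numbers from the 8000-element run on the same footing as the 4.44089e-16 from the 80-element run.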


While the DPC++ compiler aims to support the SYCL specification, floating-point controls and library function implementations differ across devices, which can lead to variations in computed results. We pass most fp control settings to the device compiler, with two exceptions: the IMF (Intel math function library, i.e. the -fimf-* options) controls aren't supported for SYCL and non-CPU OpenMP, and fp-model strict isn't supported for device compilation.

Different devices, including CPUs, may use library function implementations that differ from the host compiler's, so numeric results can be inconsistent across devices. When fp-model fast is used, device compilers may select less accurate implementations of math functions. Using fp-model precise ensures that math functions meet the accuracy requirements of the OpenCL SPIR-V Environment specification, but results can still vary within the allowed accuracy bounds, so bit-identical output across host and devices is not guaranteed.
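Given that results are only guaranteed to agree within an accuracy bound, a practical consequence is to compare host and device output with a ULP tolerance instead of testing for bit equality. The following is a minimal sketch of such a comparison for doubles. The helper names (ordered_bits, ulp_distance, within_ulps) are hypothetical, and the tolerance passed to within_ulps is a placeholder that should be taken from the accuracy requirement of the function being tested.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Map a double onto a signed integer line on which adjacent representable
// doubles differ by exactly 1 and +0.0 / -0.0 coincide.
inline std::int64_t ordered_bits(double x) {
    std::int64_t i;
    std::memcpy(&i, &x, sizeof(double));   // well-defined way to read the bits
    // Negative doubles are stored as sign plus magnitude; mirror them so
    // that integer order matches floating-point order.
    return (i < 0) ? (std::numeric_limits<std::int64_t>::min() - i) : i;
}

// Number of representable doubles between a and b (0 if they are equal).
inline std::uint64_t ulp_distance(double a, double b) {
    assert(!std::isnan(a) && !std::isnan(b));
    const std::uint64_t ua = static_cast<std::uint64_t>(ordered_bits(a));
    const std::uint64_t ub = static_cast<std::uint64_t>(ordered_bits(b));
    // Unsigned subtraction wraps modulo 2^64, which yields the true distance.
    return (ordered_bits(a) > ordered_bits(b)) ? ua - ub : ub - ua;
}

// True if a and b are at most max_ulps representable values apart.
inline bool within_ulps(double a, double b, std::uint64_t max_ulps) {
    return ulp_distance(a, b) <= max_ulps;
}
```

A check such as `within_ulps(host_result, device_result, tolerance)` then passes for results that differ only within the permitted accuracy, while still flagging genuine discrepancies.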
