OpenCL 1.2 and Floating Point Precision

dornas__joao
Beginner

Dear Friends,

 

I am using the following hardware/software:

 

1. Intel GPU Iris Pro Graphics 5200

2. C++ (Visual Studio 2017) with Intel OpenCL SDK 2.0

3. MATLAB 2018

 

I have a question about the precision limits of this hardware. From its documentation I understand that it supports only OpenCL 1.2, which as far as I know permits larger floating-point rounding errors than later versions (e.g. 2.0).

When I compute a covariance matrix on the GPU (using C++/OpenCL) and compare it with the same computation done on the CPU (using MATLAB), with the same data and the same equation, I get a mean error of around 10^(-9).

But when I compute a matrix inverse on the GPU and compare it with the same computation on the CPU, the error is around 10^(-2), which is too large for the final results of the whole computation to agree.

I am using the Gauss-Jordan method to invert a matrix of around 10^4 cells.

Does anybody have experience with this kind of situation and could help me solve the floating-point precision problem?
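
For reference, the Gauss-Jordan scheme I am using follows the textbook pattern, roughly like the simplified single-precision sketch below (this is only an illustration, not my actual kernel code; the partial-pivoting step is included here just to make the example complete):

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Invert an n x n matrix by Gauss-Jordan elimination with partial
// pivoting. 'a' holds the matrix in row-major order and is overwritten
// with its inverse. Returns false on a zero pivot.
bool gauss_jordan_inverse(std::vector<float>& a, std::size_t n)
{
    std::vector<float> inv(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) inv[i * n + i] = 1.0f;

    for (std::size_t col = 0; col < n; ++col) {
        // Partial pivoting: use the row with the largest |entry| in this
        // column to limit error growth during elimination.
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(a[r * n + col]) > std::fabs(a[pivot * n + col]))
                pivot = r;
        if (std::fabs(a[pivot * n + col]) == 0.0f) return false;

        if (pivot != col)
            for (std::size_t c = 0; c < n; ++c) {
                std::swap(a[pivot * n + c], a[col * n + c]);
                std::swap(inv[pivot * n + c], inv[col * n + c]);
            }

        // Normalize the pivot row, then eliminate this column in all
        // other rows of both the matrix and the identity copy.
        const float p = a[col * n + col];
        for (std::size_t c = 0; c < n; ++c) { a[col * n + c] /= p; inv[col * n + c] /= p; }
        for (std::size_t r = 0; r < n; ++r) {
            if (r == col) continue;
            const float f = a[r * n + col];
            for (std::size_t c = 0; c < n; ++c) {
                a[r * n + c]   -= f * a[col * n + c];
                inv[r * n + c] -= f * inv[col * n + c];
            }
        }
    }
    a = inv;
    return true;
}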

 

thank you very much,

 

best regards,

 

Joao V. Dornas

Michael_C_Intel1
Moderator

Hi JoaoD,

Thanks for the question and the interest... Your question can branch off in a few different directions, so I want to add some heuristic approaches and references in this response.

Floating point error can vary by:

  • dynamic range of floats in use
  • order of operands / operations (a short example follows this list)
  • device and microarchitecture
  • memory alignment of floating point data (which can affect order of operations)
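
A quick standalone illustration of the first two bullets (plain host-side C++, nothing device-specific assumed):

#include <cstdio>

int main()
{
    // Dynamic range: a small addend is absorbed entirely by a large float.
    float big = 1.0e8f, tiny = 1.0e-3f;
    std::printf("%.3f\n", double(big + tiny) - 1.0e8);  // prints 0.000

    // Order of operations: the same three values summed in a different
    // order give different single-precision results.
    float left  = (1.0e8f + 1.0f) - 1.0e8f;  // 0.0f -- the 1.0f was absorbed
    float right = (1.0e8f - 1.0e8f) + 1.0f;  // 1.0f -- reassociation keeps it
    std::printf("%g %g\n", left, right);
    return 0;
}

Compounded over the many operations of an elimination-based inverse (and amplified by the matrix condition number), this kind of absorption can plausibly turn 10^(-7)-level single-precision rounding into the 10^(-2) difference you are seeing.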

I don't see a good order to address these bullet points... so I'll start from the top.

  • If a floating point operand's exponent is significantly different from another's, the smaller operand may not be represented completely during the operation. Did you try double precision on the host side, the device side, or both? (A kernel-side sketch of enabling double precision follows this list.)
  • I haven't been too engrossed with that algorithm personally since school, but on a quick review it does not appear to demand a consistent order of operations on its face... i.e. it can add before it multiplies or multiply before it adds... This order can affect error propagation. Off-the-shelf software may marshal operations differently than a hand-written program.
    • If, in source, the multiply part is sufficiently separated from the addition part, this could result in a different instruction sequence... yielding different error propagation.
      • In x87 terms such a sequence of operations would be accomplished by two instructions. On processors with newer ISA features a fused multiply-add instruction can accelerate this further, but it rounds differently than a separate multiply and add, so results can change.
  • Compilers may alter the order of operands (see /fp:precise in Visual C++). This can alter how errors are propagated, particularly in operations packed into vector instructions.
  • Host compilers may target vector instructions of different fixed widths... this may mean partitioning an algorithm across several such instructions, which in turn can be affected by memory alignment. For this reason one mitigation is to align memory allocations, for example along the page-size boundary (an aligned-allocation sketch follows this list). Different devices have instruction sets with different vector widths, and different instructions in different orders mean different error propagation.
  • In the OpenCL™ on Intel® GEN Graphics uArch context:
    • The architecture can leverage differing vector widths at the runtime's discretion. The OpenCL™ runtime and dispatch process has some flexibility in how it schedules work on the device. Again, this can lead to erratic error propagation.
      • There really aren't avenues to control vector widths directly at this time. For this reason, operating from aligned memory can also help here... as can keeping consistently behaving compiler options.
    • Consider evaluating the standard OpenCL compiler build options passed to clBuildProgram(...), as well as -bo="" for the offline compiler, 'ioc64', included with the SDK (a build-options sketch follows this list).
      • See ioc64 --help for more information on sending build options through to the kernel.
      • IOC or the OpenCL™ API can show a build log that might be useful to monitor as part of your development (ioc64 -output / clGetProgramBuildInfo(...)).
  • Off-the-shelf compute solutions like the one you've mentioned, or like Intel® MKL, have developers spending a great deal of effort with compiler and architecture experts to understand performance versus numerical-consistency trade-offs and expectations... In effect, a hand-written program may need to replicate some or all of that effort to see the desired results...
  • The program could be attempting to compute with denormal values. Depending on the algorithm, this could increase the error by orders of magnitude (a quick denormal check is sketched after this list).
  • To keep it simple... some recommendations:
    • Ensure that the algorithm, input, and intermediate datatypes are the same at each step of each host program and each kernel program. Try double precision?
    • Use aligned allocation for any floating point data... especially data that *could* get packed into vector instructions.
    • Then, ensure higher machine floating-point precision is in use where possible, via toggles like /fp:precise on the host, or by simplifying the clBuildProgram(...) / ioc64 -bo= build options in use.
      • Also, leverage compiler optimization toggles to ensure reproducibility of results on both the host and device. Does disabling optimization help?
    • Is one device producing denormals during the computation? Is there any way that can be avoided by filtering the input data?
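
To make a few of these suggestions concrete, some hedged sketches follow. First, requesting double precision inside a kernel (a generic illustration; the kernel and buffer names are invented, and the host should first confirm the device reports the cl_khr_fp64 extension, e.g. via clGetDeviceInfo with CL_DEVICE_DOUBLE_FP_CONFIG):

// example_fp64.cl -- hypothetical kernel, for illustration only.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Row-by-row dot product accumulated in double while the stored data
// stays float: a common way to cut accumulation error without doubling
// memory traffic.
__kernel void dot_rows(__global const float* a,
                       __global const float* b,
                       __global double* out,
                       const int n)
{
    const int row = get_global_id(0);
    double acc = 0.0;                       // double accumulator
    for (int i = 0; i < n; ++i)
        acc += (double)a[row * n + i] * (double)b[row * n + i];
    out[row] = acc;
}

If CL_DEVICE_DOUBLE_FP_CONFIG comes back zero the device has no usable fp64 and this build will fail, so keep a single-precision path as a fallback.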
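
Second, a minimal host-side sketch of experimenting with build options and reading the build log (the option list is only a starting point, and -cl-fp32-correctly-rounded-divide-sqrt additionally requires the device to report that capability in CL_DEVICE_SINGLE_FP_CONFIG):

#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Build 'program' for 'device' with explicit floating-point options and
// dump the build log on failure. Which options help depends on the
// kernel, so treat the list as an experiment, not a recommendation.
cl_int build_with_fp_options(cl_program program, cl_device_id device)
{
    const char* options =
        "-cl-opt-disable "                        // rule out optimization effects
        "-cl-fp32-correctly-rounded-divide-sqrt " // tighter divide/sqrt rounding
        "-cl-denorms-are-zero";                   // flush denormals, if acceptable

    cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        std::vector<char> log(log_size + 1, '\0');
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              log_size, log.data(), NULL);
        std::fprintf(stderr, "Build log:\n%s\n", log.data());
    }
    return err;
}

The offline equivalent is to pass the same option string to ioc64 through -bo="..."; check ioc64 --help for the exact syntax in your SDK version.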
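
Third, the aligned-allocation recommendation on the host (4096 is the usual page size; _aligned_malloc is the Visual C++ routine, with posix_memalign or C++17 aligned new as portable alternatives):

#include <malloc.h>   // _aligned_malloc / _aligned_free (Visual C++ specific)
#include <cstddef>

// Allocate 'count' floats aligned to 'alignment' bytes (page-sized here).
float* alloc_aligned_floats(std::size_t count, std::size_t alignment = 4096)
{
    return static_cast<float*>(_aligned_malloc(count * sizeof(float), alignment));
}

void free_aligned_floats(float* p)
{
    _aligned_free(p);
}

Page-aligned host buffers whose size is a multiple of 64 bytes also give the Intel® runtime its best chance of using them zero-copy when passed with CL_MEM_USE_HOST_PTR.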
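
Finally, a quick host-side check for denormal (subnormal) values in the data, which can guide whether filtering the input or building with -cl-denorms-are-zero is worth trying:

#include <cmath>
#include <cstddef>

// Count subnormal ("denormal") values in a buffer. If the count is
// non-zero, consider clamping tiny values to zero before upload, or
// building with -cl-denorms-are-zero as shown above.
std::size_t count_denormals(const float* data, std::size_t n)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (std::fpclassify(data[i]) == FP_SUBNORMAL)
            ++count;
    return count;
}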


 

-MichaelC
