Vector processing needs better NAN propagation

AFog0 · ‎03-19-2018

Allow me to start a discussion of NAN propagation, why SIMD code needs it, and some problems that need to be solved.

There are two ways of detecting floating point errors: (1) fault trapping and (2) propagation of INF and NAN.

Fault trapping can cause inconsistent behavior in SIMD code. Consider, for example, an example where we are looping through an array of 16 floats where the last value generates a fault, and we are relying on fault trapping to detect such faults. In linear code, we would have the trap in the last iteration of the loop. If we use ZMM registers, we will have all 16 values in a single register and the trap is happening in the first (and only) iteration of the loop. If the loop has any side effects, for example writing something to a file, then the program will produce different outputs depending on whether the code is vectorized or not. There may be further complications if multiple faults happen in the same vector instruction. Multiple faults in the same instruction generate only one trap, even if the faults are of different types, e.g. 0/0 -> NAN, and 1/0 -> INF. This means that we can have different number of traps depending on the size of vector registers. With bigger vector registers we will have fewer trap events when multiple faults happen simultaneously.

If we want program behavior to be the same for linear and vectorized code - and the same for different vector register sizes - then it may be better to rely on NAN propagation than on fault traps. A further benefit of NAN propagation is that it is faster than fault trapping.

However, there are a few problems with NAN propagation that need to be solved. But first an introduction for those readers who are not familiar with NAN propagation:

Numerical errors such as 0/0 and sqrt(-1) in floating point code generate a special kind of value called NAN (Not a Number). A computation involving a NAN generates a NAN, so that the NAN will propagate through a sequence of calculations to the final result. A NAN can contain a so-called payload which is a sequence of bits that can contain arbitrary information about the error. A calculation error such as 0/0 gives zero payload on most CPUs, but function libraries and custom code may put useful diagnostic information into the payload. The NAN and its payload will propagate through a sequence of calculations to the final result, where it can be detected.

However, there is a problem when two NANs with different payloads are combined, for example in an expression like a+b. The IEEE 754 floating point standard says that when two NANs are combined, the result should be one of the NANs, but the standard does not say which one. Most CPUs, including Intel's, just propagate the first of the two NANs. In other words, NAN1 + NAN2 = NAN1. This is a problem because we do not know which of the two operands the compiler puts first. If one compiler codes an expression as a+b and another compiler makes it b+a then we may get different end results. The result is not predictable or reproducible when you compile the same code with different compilers or different optimization options.

This problem needs a solution. a+b and b+a must give the same results if we want NAN payload propagation to give consistent and reproducible results. There are several possible solutions if we want the combination of two NANs to be consistent:

Make a bitwise OR of the two NAN payloads. This solution is simple, but it limits the amount of information you can have in the payload to one bit for each error type. Some people want to include information in the payload about where the fault occurred, and this will not work when payloads are OR'ed.
Propagate the biggest of the two payloads. The advantage of this is that you can define priorities and propagate the NAN with the highest priority. The disadvantage is that the other payload is lost.
Propagate the smallest of the two payloads. This has the disadvantage that it will propagate empty payloads containing no information.
Make a new unique payload. This is a complicated solution and it doesn't make it easier to find the cause of the error.
Generate a trap when two NANs with different payloads are combined. This makes it possible to detect both payloads, but it defies the purpose of using NAN propagation rather then fault trapping.

The conclusion is that solution 2 should be preferred. Propagate the biggest of the two payloads. This solution will not violate the IEEE 754 standard and it will not break existing applications, but it requires a change in the CPU hardware.

A revision of the IEEE 754 standard is on the way. I have had a long discussion with the people who are writing the revised standard, but they do not want to change anything about NAN1+NAN2. The forthcoming revision will still be undecided about this problem.

Now, I think that Intel has the chance to fix this problem. As the market leader in SIMD processors, you are in a position to propose a solution. If you make a decision and publish it, it is likely to become a de facto standard.

This is the reason why I am asking this question. Is it feasible to change the hardware so that the combination of two NANs will propagate the one with the highest payload? And what is the time frame for such a change?

There is another problem with NAN propagation, namely the max and min functions. The current version of the IEEE 754 standard says the maximum of a NAN and a normal number is the normal number. So the max and min functions will not propagate a NAN. This problem will be solved in the forthcoming revision of the standard. A new set of maximum and minimum functions will be defined that are sure to output a NAN if one of the inputs is a NAN.

The x86 instructions MAXSD etc. are not following the unfortunate old standard but propagating the last operand if one is a NAN. This is useful because it is equivalent to the common high level language expression a > b ? a : b. You probably need to define a new set of max and min instructions that generate a NAN if any of the inputs is a NAN. These new instructions will be needed to match high level language functions that follow the forthcoming revision of the standard. The existing instructions (MAXSD etc.) will still be needed for matching the high-level languate expression a > b ? a : b.

jimdempseyatthecove · ‎03-19-2018

Unless you have a SIMD loop enabled with RTM/TSX handling (of SNAN and/or QNAN) it might be practically impossible to have the SIMD loop produce the same results as would the scalar loop. IOW have the fault and.or appropriate test for QNAN abort the transaction and then fall back to scalar loop to process the SIMD vectored loop in scalar mode. This would require the compiler to generate at least two versions of the loop (this may require an appropriate #pragma or compiler option as well as (potentially) a new SIMD instruction that aborts a transaction on QNAN.

Jim Dempsey

AFog0 · ‎03-19-2018

Thanks for the reply. I don't know why you want to use RTM/TSX handling if there is only one thread, but your reply reflects the kind of complexity I want to avoid. If the code is constructed so that NANs will propagate to the end result then you will get the same result with and without vectorization. And you don't need fault trapping.

NANs will propagate to the end result in most cases. The programmer has to be aware of exceptions such as these:

compare instructions convert floats to boolean which cannot contain NAN
conversion to integer. Integers cannot be NAN
max and min instructions. These will be fixed by the forthcoming standard
pow(NAN,0)

Am I missing something?

jimdempseyatthecove · ‎03-19-2018

>>I don't know why you want to use RTM/TSX handling if there is only one thread,

The point of the transaction region is to bail out of a SIMD'd loop and into a scalar version of the loop. Thus permitting the (scalar) loop (iterations) to produce the same behavior as if the loop had been scalar in the first place. This has nothing to do with coherency issues. IOW the RTM/TSX can be extended to provide fault enhancements with SIMD loops which currently are problematic.

In lieu of, or in addition to RTM/TSX one might be able to make use of a C++ extension (an/or Fortran), implemented with intrinsic instructions with something like the following sketch

for(int i=0; i<N; ++i)
{
   A = B + C;
   if(vec_isNAN(A) // across width of active lanes
   {
      ...
      int mask = get_NAN_mask(A); // across width of active lanes
      // mask is bit mask of width of vector
      // i has base of vector
     ... fixup as desired
  }
  // resumes at vector boundary
} // for - advance to next vector

The above loop executes as SIMD vector, programmer inserts intrinsic to test for any lane for NAN (INF, ???), the enclosed section can process the width of the vector lane at a time.

While you can produce this loop now using the SIMD intrinsics, you would like to be able to use the C++ statements that get vectorized in an abstract manner. Inclusive of the fact that the statements may process a full width vector as well as a partial width vector.

Jim Dempsey

AFog0 · ‎03-19-2018

Thanks Jim, but why not move the NAN check out of the loop? If you need to check for NANs after every instruction then the advantage of vectorization evaporates. If the code contains only instructions that are sure to propagate NANs then you need only check the final result. A linear code that relies on NAN propagation and has no fault trapping will behave the same when vectorized.

My point is that I want to make NAN propagation more reliable. I have pointed out some weaknesses in NAN propagation. Most importantly the non-commutative NAN1 + NAN2 = NAN1. A hardware change is needed to fix this problem.

jimdempseyatthecove · ‎03-20-2018

Agner,

When a loop is performing a reduction then the NAN can be tested in the final result. When (as the loop sketched above) the output of the loop is presented across an array, then the test (and corrective measure) needs to be made at the appropriate spots in the loop. Depending on your needs, the test could be made after each statement or after several statements. The programmer would make this decision. Use of a signaling NAN would eliminate the test overhead, but complicates the code necessary to correct/accommodate the results.

Consider an N-Body case where you generate a unit vector between two particles, and then use this vector. An exception case in the processing arises when the two particles co-reside at the same location. At this point you could choose to generate a null vector or a random unit vector (or a vector dependent upon other property). In this situation the test needs to be placed in an the appropriate location within the loop.

Jim Dempsey

TimP · ‎03-20-2018

This question depends to some extent on compiler language standards. In the C and C++ standards, left to right evaluation of expressions is mandated. Compilers such as Intel C++ and gcc/g++ have options to enforce this, the difference being that it is not the default for the Intel compiler. Intel documentation states that the option -fp-model precise is required to observe left to right evaluation, while -fp-model precise is required to meet an additional requirement for consistent exception handling, presumably including treatment of NaN.

The question posed above seems to be whether consistent exception handling could be achieved without observing the language standard, presumably with the aim of applying some of the optimizations implied by Intel default fp-model fast or g++ -ffast-math (including some associated with aggressive vectorization), without breaking treatment of NaN.

This appears to become a question of trade-offs between new features in the hardware, along with a compromise compiler option, which may affect performance, against the present methods where giving up on consistent NaN is a consequence of full compiler optimization.

A possible analogy is in the treatment of gradual underflow, for which a hardware method for efficient addition in the case without underflow was only implemented after years of requiring abrupt underflow to achieve performance. Efficient gradual underflow remains implemented only for addition (Intel says for the most often encountered cases).

AFog0 · ‎03-20-2018

Hi Tim

The evaluation of expressions from left to right means that a+b+c is evaluated as (a+b)+c, not as a+(b+c). All compilers are doing this for floats, as far as I know, with the precise model or an equivalent option. This has consequences for precision, even when there are no faults.

The compiler is allowed to evaluate a+b as b+a and a+b+c as (b+a)+c regardless of the precise option or other optimization options, because it has no consequence for precision. Take the example float a = 1.0E20, b = -1.0E20, c = 1.0. This will give:
(a+b)+c=1.0, (b+a)+c = 1.0, a+(b+c) = 0.0, (c+b)+a = 0.0.

As you see, the commutative rule preserves precision, the associative rule does not. This is why the precise option allows a+b = b+a, but it does not allow a+b+c = c+b+a. But the commutative rule does not preserve NAN payloads in the present hardware. If a = NAN with payload 1, and b = NAN with payload 2, then a+b = NAN(1) and b+a = NAN(2). This is a hardware issue, not a software issue.

Exception options are irrelevant here because my point is to use NAN propagation instead of exceptions in order to make vectorization simpler and to make the result independent of vectorization.

The hardware change I am proposing is to propagate the NAN with the highest payload when two NANs with different payloads are combined. This has no tradeoff for performance or backward compatibility. It costs a little more silicon which is only activated in the rare case where two NANs are combined.

I am also proposing new maximum and minimum instructions that comply with the forthcoming revision of the IEEE 754 floating point standard.