Performance delays - programming with QNaN and denormals

zalia64 · ‎03-13-2018

The floating-point spectrum holds some special numbers, such as QNaN (quiet not-a-number), SNaN (signalling not-a-number) and denormalised numbers.

On the pro side, QNaN can be used to tag special cases such as missing data: you set every unknown item to QNaN. Any arithmetic calculation involving the missing item then results in QNaN. So, if result #YY is QNaN, you know it is based on missing data. The NaN property is sticky. Other special numbers may find uses, too.

MATLAB uses QNaNs to tag missing data. NaNs come as doubles (64-bit numbers) or floats (32-bit numbers). There are many QNaNs: with doubles, the 11 exponent bits are all ones and the top mantissa bit (the quiet bit) is set, which defines the QNaN condition; the sign bit is ignored, and the remaining 51 bits are a user-defined tag.
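A minimal sketch of such tagging in C, assuming IEEE 754 binary64 doubles (exponent bits all ones, quiet bit set, low 51 mantissa bits free). The helper names `make_tagged_qnan` and `qnan_tag` are made up for illustration. The bits are reinterpreted with `memcpy`, so no floating-point arithmetic touches the value.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

#define QNAN_BITS    0x7FF8000000000000ULL  /* exponent all ones + quiet bit */
#define PAYLOAD_MASK 0x0007FFFFFFFFFFFFULL  /* low 51 mantissa bits */

/* Hypothetical helper: build a quiet NaN that carries a tag of up to 51 bits. */
static double make_tagged_qnan(uint64_t tag)
{
    assert(tag <= PAYLOAD_MASK);            /* the tag must fit in the payload */
    uint64_t bits = QNAN_BITS | tag;
    double d;
    memcpy(&d, &bits, sizeof d);            /* reinterpret the bits as a double */
    return d;
}

/* Hypothetical helper: read the tag back from a NaN. */
static uint64_t qnan_tag(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits & PAYLOAD_MASK;
}
```

For example, `make_tagged_qnan(42)` gives a value for which `isnan()` is true and `qnan_tag()` returns 42. Whether a tag survives arithmetic depends on the hardware's NaN propagation rules, so it is safest to rely on it only for values that were stored and loaded, not computed.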

On the con side, both NaNs and denormal numbers may cause significant delays, because the hardware handles them as special conditions.

My question:

Are MOV instructions - MOVSD, MOVAPD, MOVHPD, etc. - delayed by NaNs and denormals?

Likewise:

Suppose an ORPD / ANDPD / XORPD operation creates a NaN / denormal / other special value. Is there any penalty?

Why does it matter?

In the ideal case, one should not touch any unverified data. But many situations involve partial data. Data sets may hold results from many sources: some include width/height/depth, some just volume, some only weight. For some uses, the weight suffices. Why throw away this data? If asked for total length, only data from the first source is acceptable. But for total weight, if the density is given, all three sets are good. That would mean coding a different loop for each result, each loop with many internal cases.

It is much simpler to code a single loop and throw out (later) the results with an 'invalid' tag, that is, those that are QNaNs. Likewise with other special numbers. Hence, the performance questions do matter.
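A rough sketch of that single-loop pattern in C. The `Item` layout and the function names are invented for illustration; the only assumption is that every unmeasured field holds `NAN`. The compute loop has no special cases, and the filtering happens afterwards.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical record: a field that was not measured holds NAN. */
typedef struct {
    double width, height, depth;
} Item;

/* One loop, no per-source cases: a missing dimension makes out[i] NaN. */
static void compute_weights(const Item *items, size_t n, double density,
                            double *out)
{
    assert(items != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = items[i].width * items[i].height * items[i].depth * density;
}

/* Later pass: sum only the results that are not tagged as invalid (NaN). */
static double sum_valid(const double *values, size_t n, size_t *valid_count)
{
    double sum = 0.0;
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!isnan(values[i])) {
            sum += values[i];
            ++count;
        }
    }
    *valid_count = count;
    return sum;
}
```

With three items of which the second has an unknown width, `compute_weights` leaves a NaN in the second slot and `sum_valid` reports two valid results. Note that this relies on `isnan` working, so it should not be built with options such as `-ffast-math` that let the compiler assume NaNs never occur.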

McCalpinJohn · ‎03-13-2018

For the SIMD instruction sets (i.e., not x87), it seems reasonably clear that denorms and NaNs only have a performance impact for floating-point computational instructions, and that other instructions that operate on registers containing floating-point values simply move the bits without any interpretation.

This is easy enough to test, but one indication that the bits are not interpreted is the Intel compiler's use of "packed single" instructions for loads and stores, even if all the computational instructions are working with doubles. My understanding is that "single" is the default precision for these instructions, so that an extra prefix would be required to use the "packed double" versions of the instructions. If the bits were interpreted while being loaded, this data type substitution would not make sense.

The instruction descriptions in Volume 2 of the Intel Architectures Software Developer's Manual (document 325383) show that no "SIMD Floating-Point Exceptions" are raised by the bitwise instructions. Searching for "denorm" in Volume 2 very quickly shows that (excluding x87 instructions) this is only associated with floating-point instructions that must interpret the floating-point value in order to operate correctly. These instructions include add, subtract, compare, convert, multiply, divide, FMA, square root, reciprocal approximations, min, max, round, a bunch of instructions for extracting parts of floating-point numbers, some very strange "range restriction" operations, and not much else....
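One way to check the exception-flag part of this on your own machine is to look at the denormal-operand flag (DE) in MXCSR after a bitwise instruction and after an arithmetic one. The sketch below assumes an x86 CPU with SSE2 and a compiler that provides `<emmintrin.h>`; the function names are made up. It shows only which instructions set the flag, not how long they take, and the compiler is free to pick a different encoding for the XOR than XORPD.

```c
#include <assert.h>
#include <float.h>
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_getcsr/_mm_setcsr */

static volatile double denormal_input = DBL_MIN / 4.0;  /* 2^-1024, a denormal */
static volatile double sink;        /* keeps the results from being discarded */

/* DE flag left behind by a bitwise XOR that flips the sign of a denormal. */
static unsigned de_flag_after_xor(void)
{
    unsigned saved = _mm_getcsr();
    _mm_setcsr(0x1F80u);            /* default MXCSR: flags clear, DAZ/FTZ off */
    __m128d x = _mm_set_sd(denormal_input);
    sink = _mm_cvtsd_f64(_mm_xor_pd(x, _mm_set_sd(-0.0)));
    unsigned de = _mm_getcsr() & _MM_EXCEPT_DENORM;
    _mm_setcsr(saved);
    return de;
}

/* DE flag left behind by an arithmetic ADDSD on the same denormal. */
static unsigned de_flag_after_add(void)
{
    unsigned saved = _mm_getcsr();
    _mm_setcsr(0x1F80u);
    __m128d x = _mm_set_sd(denormal_input);
    sink = _mm_cvtsd_f64(_mm_add_sd(x, x));
    unsigned de = _mm_getcsr() & _MM_EXCEPT_DENORM;
    _mm_setcsr(saved);
    return de;
}

/* Expected per the manual: the XOR leaves DE clear, the add sets it. */
static void check_denormal_flags(void)
{
    assert(de_flag_after_xor() == 0);
    assert(de_flag_after_add() != 0);
}
```

The two cases are kept in separate functions on purpose, so that an optimizing compiler cannot evaluate the add speculatively on the XOR path. For actual timing, the same two functions could be wrapped in a cycle-counting loop.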

zalia64 · ‎03-15-2018

I was under the impression that every number loaded into the floating-point execution port would get a sticky denorm/NaN/infinite tag upon loading.

You suggest that those checks and tags are done later on in the pipe - upon execution.

Otherwise, I do not understand why XORPD exists at all - why not just PXOR? Bits are bits! The different names could have pointed to the same machine code!

andysem · ‎03-16-2018

As I understand it, at least on some architectures INT and FP domains are separate on the die, with some execution units duplicated. The different instructions employ execution units from different domains. You can also get a cross-domain performance penalty if you mix instructions for different domains while working on the same data in registers.

zalia64 · ‎03-16-2018

Perhaps you are correct.

XORPD and XORPS both do the same thing, and both use the same execution port.

The machine code is (almost) the same: 0F 57 /r versus 66 0F 57 /r. Why the extra byte for the XORPD version? Intel should know.
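That the three XOR flavours give bit-identical results can be checked with a few intrinsics. This is only a sketch, assuming SSE2 and `<emmintrin.h>`; the function name is made up. It verifies the semantics, not the encoding: the compiler may emit whichever of XORPD, XORPS or PXOR it prefers for each intrinsic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

/* Flip the sign bit of both lanes in three ways and check that the
   resulting 128 bits are identical. */
static void check_xor_variants_agree(double lo, double hi)
{
    const __m128d x = _mm_set_pd(hi, lo);
    const __m128d m = _mm_set1_pd(-0.0);                 /* sign-bit mask */

    __m128d via_pd  = _mm_xor_pd(x, m);                  /* double flavour */
    __m128d via_ps  = _mm_castps_pd(                     /* single flavour */
        _mm_xor_ps(_mm_castpd_ps(x), _mm_castpd_ps(m)));
    __m128d via_int = _mm_castsi128_pd(                  /* integer flavour */
        _mm_xor_si128(_mm_castpd_si128(x), _mm_castpd_si128(m)));

    uint64_t a[2], b[2], c[2];
    _mm_storeu_si128((__m128i *)a, _mm_castpd_si128(via_pd));
    _mm_storeu_si128((__m128i *)b, _mm_castpd_si128(via_ps));
    _mm_storeu_si128((__m128i *)c, _mm_castpd_si128(via_int));

    assert(memcmp(a, b, sizeof a) == 0);
    assert(memcmp(a, c, sizeof a) == 0);
}
```

Calling it with ordinary values such as `check_xor_variants_agree(1.5, -2.25)`, or with a denormal in one lane, should pass silently.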

The main conclusion of the discussion: one may read/copy/write/change denormals freely without any performance penalty, as long as one doesn't use them in arithmetic expressions.

McCalpinJohn · ‎03-16-2018

The leading 66H byte is (in this case) an SSE operand size override, modifying the instruction to treat the operands as 64-bit values rather than as the default 32-bit values.

I don't know the history of the development of this feature of the Intel ISA, but one could argue that code generation is actually easier if all SIMD instructions are able to accept the operand override prefix -- even if the behavior is the same. In the current implementations (with floating-point and integer SIMD functionality accessed via the same ports), it would make sense to decode these two instructions into the same uop, since the behavior should be identical, while still allowing the instruction set to carry additional information on operand width if a future implementation needs it.

TimP · ‎03-20-2018

Some past CPUs had distinct floating-point and integer load instructions. In practice, compilers then had to defeat them by using the integer instructions wherever data was only moved or copied, since both performance penalties and breakage of risky source code were likely. As John said, it's common practice for Intel-compatible compilers to use single-precision SIMD loads on wider data types, to avoid the possible overhead of the additional instruction prefix, and this requires that no exceptions can be raised.

It's not clear to me, although John is more of an expert than most of us, that the overhead of the double width data movements is due to more than the extra length (in bytes) of object code and increased difficulty of correct alignments.

McCalpinJohn · ‎03-20-2018

For load instructions of a given SIMD register width, I don't measure any difference in performance between the various data types (byte, word, doubleword, quadword, float, double). It seems likely that some cases could be constructed for which the extra prefix byte(s) reduce the effective instruction fetch rate. Most processors have a uop cache and loop stream detectors that should negate any overhead after the first loop iteration anyway.

One case where this might apply is on KNL. I have a code that demonstrates 2 vector loads per cycle with VMOVUPS instructions. The body of the loop is a sequence of instructions like:

        vmovups   192(%rsp), %zmm1

When I run objdump on the resulting executable, each of these loads occupies 8 Bytes, so 2 loads per cycle requires the full 16 Bytes per cycle instruction fetch bandwidth. I have not tried replacing these with VMOVUPD instructions, but if an extra prefix is required, it should not be able to reach full speed.