Re: Detailed info about FTZ & DAZ

gol · ‎07-03-2008

Hi there.

When I started programming SSE, I was always wondering why there were operations that seemed to do the same thing. After all, MOVAPS, MOVAPD and MOVDQA should all result in the same thing, loading 128 bits, right?

Then I found more detail about FTZ & DAZ, and realized that DAZ forces denormals to zero (I think) when loading data, or at least before operations (but which ones?), and I then realized how bad it would be to load integers using the float versions of the MOVs.

Now my problem: I have 4 DWORDS (signed integers) at the bottom of 2 XMM registers, and I'd like to pack them into one. But I'm not seeing any quick way to do this, which is why I'm looking for details about DAZ (looks like it's very hard to find), to know if it applies to MOVLHPS or SHUFPS. Or should I use shift+OR?
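To illustrate why integer data raises the DAZ question at all, here is a minimal C sketch (the function names are hypothetical, not from any Intel document). It checks whether a 32-bit pattern, if it were interpreted as an IEEE-754 single, would be a denormal. Any signed DWORD between 1 and 8388607 falls in that range.

```c
#include <stdint.h>

/* Returns 1 if a 32-bit pattern, read as an IEEE-754 single, is a denormal:
   exponent field (bits 23..30) all zero and mantissa (bits 0..22) non-zero. */
int is_denormal_pattern(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x007FFFFFu;
    return exponent == 0 && mantissa != 0;
}

/* Hypothetical convenience wrapper for a signed DWORD as it would sit
   in one lane of an XMM register. */
int int32_is_denormal_pattern(int32_t value)
{
    return is_denormal_pattern((uint32_t)value);
}
```

So if a move or shuffle instruction did apply DAZ, small integers would be the values at risk.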

Also, is there any other reason than DAZ to avoid mixing MOVs?

Finally, this is about the IPP libraries: is there a risk of flushing denormals to zero when simply copying/moving blocks of memory using the IppsMove_32f or 64f versions? Should IppsMove_64s be preferred?
Note: this is in an audio sequencer so the DAZ flag is always forced on (when present).

Thanks.

Nicolae_P_Intel · ‎07-03-2008

gol:

Now my problem: I have 4 DWORDS (signed integers) at the bottom of 2 XMM registers, and I'd like to pack them into one. But I'm not seeing any quick way to do this, which is why I'm looking for details about DAZ (looks like it's very hard to find), to know if it applies to MOVLHPS or SHUFPS. Or should I use shift+OR?

My two cents: one other solution would be SHUFPS and PBLENDW in SSE4.1.

gol · ‎07-03-2008

Thanks, but I would avoid SSE4. It's a mainstream app, so I can only now assume that every user's system supports SSE1 (and even that's not the case). I do optionally support SSE2, but I won't consider using SSE3 & above for a couple of years. (And I would add, my compiler doesn't support SSE4 yet, and I would rather avoid machine code.)

Did you mean SHUFPS -or- PBLENDW? Because I don't know if I can use SHUFPS when DAZ is on, that's my point. In fact, I should probably just test it, but I'd rather read more about DAZ.

TimP · ‎07-03-2008

I don't understand how the subject has come up if you're "avoiding machine code." If you are using asm or intrinsics, and you are doing things which make you concerned about DAZ, you should test yourself with DAZ reset. DAZ set permits floating point arithmetic to treat subnormal operands as zeros, as they would be if they were generated with FTZ set. As most applications where DAZ could have a measurable performance effect are working mostly with internally generated data, which are direct results of floating point operations, setting FTZ will eliminate most of the effect of DAZ setting.

If you desire to use instructions of the last 5 years, I find it difficult to understand why you would avoid using a correspondingly updated compiler which could take care of these issues. If a significant number of your customers are using CPUs which are older than that, it is difficult to understand how you do anything about performance issues associated with choice of instruction set, other than live with them.

gol · ‎07-03-2008

I don't understand how the subject has come up if you're "avoiding machine code."

I wrote I wanted to avoid machine code (you know, numbers), not assembler. I don't expect to write machine code anymore in 2008. When your compiler doesn't support something, the last option is machine code.

DAZ set permits floating point arithmetic to treat subnormal operands as zeros, as they would be if they were generated with FTZ set.

Except that DAZ does it before processing, so I'd like to know if it applies to most data moving operations. And yes since then I have tested, and DAZ doesn't seem to affect MOVSS, SHUFPS, MOVHLPS, etc. But I'd like to be sure that it won't change in the future (or in 64bit programming), to read it from an official paper, to avoid confusion later.
(and in this case, I'd like to know what's the diff between MOVAPD, MOVAPS & other 128bit MOVs)

setting FTZ will eliminate most of the effect of DAZ setting

I would rather say, setting DAZ will eliminate most of the effect of FTZ, no?

If you used x87 instructions as a way of moving data, most integer data would be treated as subnormal and would be destroyed.

since when does the x87 destroy denormals?

why you would avoid using a correspondingly updated compiler

That compiler doesn't exist: it's Delphi, and the current version supports SSE3 & lower. Besides, I wrote that it was a mainstream app, and not many users have systems that support SSE3 today. We -still- have users with systems that don't support SSE1.
A simple single packing of 2 DWORDS doesn't justify using an SSE4 function in a loop that perfectly works using SSE1 or 2. I was just asking if it was safe enough to use a float packing operation to pack integer data, that's all.

gol · ‎07-04-2008

I just realized that PUNPCKLQDQ was doing about the same as MOVLHPS, but for integers, so that's what I'm gonna use.
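For readers using intrinsics rather than inline asm, the same packing can be sketched in C as follows. This is only an illustration under assumptions: `pack_low_qwords` and `pack_pairs` are hypothetical names, and `_mm_unpacklo_epi64` is the SSE2 intrinsic that normally compiles to PUNPCKLQDQ.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Packs the two low DWORDs of `lo` and of `hi` into one register:
   result = { lo[0], lo[1], hi[0], hi[1] }. Normally emitted as PUNPCKLQDQ. */
static __m128i pack_low_qwords(__m128i lo, __m128i hi)
{
    return _mm_unpacklo_epi64(lo, hi);
}

/* Hypothetical helper: two pairs of ints in, four packed ints out. */
void pack_pairs(const int32_t a[2], const int32_t b[2], int32_t out[4])
{
    /* MOVQ-style loads: low 64 bits from memory, upper 64 bits zeroed. */
    __m128i va = _mm_loadl_epi64((const __m128i *)a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);

    /* Unaligned store, so `out` needs no particular alignment. */
    _mm_storeu_si128((__m128i *)out, pack_low_qwords(va, vb));
}
```

Everything here stays in the integer domain, so FTZ and DAZ never come into the picture.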

I'd still like to know more about DAZ & why there are so many instructions doing the same thing, while explained totally differently. If PUNPCKLQDQ had been in SSE1, I don't think there would have been a need for MOVLHPS, or if the benefit of strictly float operations was made clear somewhere, there wouldn't be a need for their integer equivalents.

While searching for this stuff, all I found out were forums in which people were wondering the same about DAZ, and wondering if there was a benefit in not mixing operations of different types (mixing single, double & integer operations for the same sources), to which someone loosely replied that SSE was expecting the right types or would slow down (but I'd rather read this from an official source).

TimP · ‎07-04-2008

setting FTZ will eliminate most of the effect of DAZ setting

I would rather say, setting DAZ will eliminate most of the effect of FTZ, no?

The usual reason for setting FTZ is in order to gain performance. Storing subnormals with FTZ set is slow, and pointless, if they will be consumed with DAZ.

If you used x87 instructions as a way of moving data, most integer data would be treated as subnormal and would be destroyed.

since when does the x87 destroy denormals?

Sorry about that, the only objection is the extreme slowness. I was thinking of other past pitfalls in moving character data by floating point instructions.

gol · ‎07-04-2008

Edit: ok it's actually slower, so I don't know. Here's how my code looks:

I have 2x2 DWORDS stored in the low part of xmm1 & xmm5 (it's a double to trunc/frac converter), and 2x2 doubles packed in xmm0 & xmm4. This is the part where I store them to memory.

MOVQ [EDX],xmm1 // store integers
MOVQ [EDX+8],xmm5 // store integers
MOVAPS [EAX],xmm0 // store fracs
MOVAPS [EAX+16],xmm4 // store fracs

So I thought, let's pack xmm1 & xmm5 so that I can store them at once, plus I know that EAX is always aligned. But as always, the results of this kind of thing are more or less random and never make any sense to me (or maybe it's the random code alignment, but I can't know because my compiler doesn't allow custom code alignment).

So I try this:

PUNPCKLQDQ xmm1,xmm5
MOVAPS [EAX],xmm0 // store fracs
MOVAPS [EAX+16],xmm4 // store fracs
MOVDQA [EDX],xmm1 // store integers

Sadly it's slower than the original code(?), on my new Q6600. I could get it slightly faster with MOVLHPS as I originally wanted to use:

MOVLHPS xmm1,xmm5
MOVAPS [EAX],xmm0 // store fracs
MOVAPS [EAX+16],xmm4 // store fracs
MOVDQA [EDX],xmm1 // store integers

..but then I'm back to my original problem, I don't know if MOVLHPS will always be safe for integer manipulation in future or other CPU's, when DAZ or FTZ is enabled.
The result is the same thing (on my Q6600), but the speed is different. And since PUNPCKLQDQ is an SSE2 instruction, it must have been introduced for a good reason I guess - I don't see why Intel would add to SSE2 an instruction that does the same thing as an SSE1 instruction, but does it slower. That, or it has different latencies/pairing rules/whatever.. Or it's just there for masochists.
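A hedged C sketch of the MOVLHPS variant, run with both DAZ and FTZ enabled, is shown here as a way to check the behavior on a given machine. The helper name and the `MXCSR_*` constants are my own. The bit positions (DAZ = bit 6, FTZ = bit 15 of MXCSR) are from the architecture manuals. It assumes a CPU that supports DAZ, and the compiler is free to emit a different but bit-equivalent instruction for `_mm_movelh_ps`.

```c
#include <emmintrin.h>  /* SSE2: __m128i and the cast intrinsics */
#include <stdint.h>

#define MXCSR_DAZ 0x0040u  /* bit 6: denormals-are-zero */
#define MXCSR_FTZ 0x8000u  /* bit 15: flush-to-zero */

/* Hypothetical helper: packs two pairs of ints with the float-domain
   MOVLHPS while DAZ and FTZ are both set, then restores MXCSR. */
void pack_pairs_movlhps(const int32_t a[2], const int32_t b[2], int32_t out[4])
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | MXCSR_DAZ | MXCSR_FTZ);  /* assumes DAZ is supported */

    /* The casts only reinterpret the register type; no bits change. */
    __m128 va = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i *)a));
    __m128 vb = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i *)b));
    __m128 packed = _mm_movelh_ps(va, vb);      /* { a0, a1, b0, b1 } */

    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(packed));
    _mm_setcsr(saved);
}
```

Feeding it small integers (which are denormal bit patterns) and comparing against the PUNPCKLQDQ result is a quick test, though a test on one CPU is of course not an architectural guarantee.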

This code isn't really that important, but I wouldn't like to grow bad habits of mixing operation types if it's not officially safe to do so.

gol · ‎07-04-2008

(also, yes I've tried different pairings for the code, same results)

The usual reason for setting FTZ is in order to gain performance. Storing subnormals with FTZ set is slow, and pointless, if they will be consumed with DAZ.

So you advise to switch FTZ off if DAZ is present & enabled, to gain performances? Will try that. I thought it wouldn't have made any difference.

TimP · ‎07-04-2008

No, I was pointing out that when FTZ is set, it's often not necessary to set DAZ for performance.

gol · ‎07-04-2008

Then why does DAZ exist? It was introduced on CPU's while FTZ was already there, there must have been a reason.

And seriously, this must be documented somewhere, I've never seen any CPU instruction or flag that wasn't. But it's not in any of the docs I have here (only briefly mentioned).

This page: http://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz says "DAZ fixes the cases when denormals are used as input, either as constants or by reading invalid memory into registers."
If "invalid memory" refers to denormals, then it doesn't work on my CPU. As written above, a MOVSS (& other MOVs but I haven't tested them all) keeps denormal values (MOVSS from memory & then back to memory, denormals are preserved). A MULSS however does not (so DAZ does its job here).
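The kind of test described above can be sketched in C like this. It is a rough reconstruction under assumptions, not the exact test code: the function names and `MXCSR_*` constants are hypothetical, the CPU is assumed to support DAZ, and `volatile` is used only to discourage the compiler from folding the multiply at compile time.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE: _mm_getcsr, _mm_setcsr, _mm_load_ss, ... */

#define MXCSR_DAZ 0x0040u  /* bit 6; assumes a CPU that supports DAZ */
#define MXCSR_FTZ 0x8000u  /* bit 15 */

/* Loads a raw 32-bit pattern with a MOVSS-style load and stores it back,
   with DAZ enabled. Returns the bit pattern that came out. */
uint32_t movss_round_trip_daz(uint32_t bits)
{
    float in, out;
    unsigned int saved = _mm_getcsr();

    memcpy(&in, &bits, sizeof in);
    _mm_setcsr(saved | MXCSR_DAZ);
    _mm_store_ss(&out, _mm_load_ss(&in));
    _mm_setcsr(saved);

    memcpy(&bits, &out, sizeof bits);
    return bits;
}

/* Multiplies the pattern by 1.0f with MULSS, DAZ on or off (FTZ forced off).
   Returns the bit pattern of the product. */
uint32_t mulss_by_one(uint32_t bits, int daz)
{
    volatile uint32_t raw = bits;  /* volatile: keep the multiply at run time */
    volatile float one = 1.0f;
    volatile float result;
    uint32_t tmp;
    float x;
    unsigned int saved = _mm_getcsr();

    _mm_setcsr((saved & ~(MXCSR_DAZ | MXCSR_FTZ)) | (daz ? MXCSR_DAZ : 0u));
    tmp = raw;
    memcpy(&x, &tmp, sizeof x);
    result = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(x), _mm_set_ss(one)));
    _mm_setcsr(saved);

    x = result;
    memcpy(&tmp, &x, sizeof tmp);
    return tmp;
}
```

With the smallest denormal (bit pattern 1), the round trip returns 1, while the multiply returns 0 with DAZ on and 1 with DAZ off.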

Also this: "To avoid serialization and performance issues due to denormals and underflow numbers, use the SSE and SSE2 instructions to set Flush-to-Zero and Denormals-Are-Zero modes within the hardware to enable highest performance for floating-point applications." could be interpreted as 'use FTZ -and- DAZ when possible' (or not)

AFog0 · ‎07-14-2008

FTZ & DAZ have no effect on move-instructions such as MOVSS, MOVHLPS, SHUFPS, etc.
See "Optimizing subroutines in assembly language" chapter 13.2. www.agner.org/optimize

gol · ‎07-14-2008

Thanks.
So you found this by testing too, there is really no official doc about it?

The problem is that it may then change in the future, probably not with a DAZ flag suddenly affecting move operations (I guess this would cause trouble in existing code), but you mentioned a reformatting delay, and I can imagine a new CPU introducing something similar, slowing down existing code. After all, Intel MUST have something in mind when they offer so many different instructions to do the same thing (right now).

I mean as long as it's not carved in stone, or when it's labelled as 'freely interpretable by the CPU' (like the prefetches), it's likely to change. The huge denormalization penalty on the P4 was a really big problem for audio apps, while technically not even a bug or defect.

AFog0 · ‎07-14-2008

Yes, I did find this by testing it on all the different processors I could get access to, but you can still rely on these instructions working correctly on integer data.

The docs say that ADDSS can make denormal exceptions, MOVSS can not. If MOVSS were to behave differently on denormal operands it would violate the specification saying no denormal exception.
There will be no normalization penalty on future processors, only a possible penalty for moving between integer and floating point execution units, which is typically one clock cycle. Many shuffle instructions etc. have these penalties anyway because they use integer execution units for floating point data.
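The exception-flag argument can be checked with a small C sketch like the one below. It is only an illustration under assumptions: the function names are hypothetical, the denormal-operand flag is taken to be MXCSR bit 1 (DE), exceptions are assumed to be masked as in the default MXCSR, and DAZ is cleared because a denormal operand does not set DE while DAZ is on.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

#define MXCSR_DE    0x0002u  /* bit 1: denormal-operand exception flag */
#define MXCSR_FLAGS 0x003Fu  /* bits 0..5: all sticky exception flags */
#define MXCSR_DAZ   0x0040u  /* bit 6 */

/* Returns 1 if a MOVSS-style load/store of `bits` sets the DE flag. */
int de_after_movss(uint32_t bits)
{
    float in, out;
    unsigned int saved = _mm_getcsr(), after;

    memcpy(&in, &bits, sizeof in);
    _mm_setcsr(saved & ~(MXCSR_FLAGS | MXCSR_DAZ));  /* clear flags, DAZ off */
    _mm_store_ss(&out, _mm_load_ss(&in));
    after = _mm_getcsr();
    _mm_setcsr(saved);
    return (after & MXCSR_DE) != 0;
}

/* Returns 1 if ADDSS of `bits` and 0.0f sets the DE flag. */
int de_after_addss(uint32_t bits)
{
    volatile uint32_t raw = bits;  /* volatile: keep the add at run time */
    volatile float zero = 0.0f;
    volatile float sink;
    uint32_t tmp;
    float x;
    unsigned int saved = _mm_getcsr(), after;

    _mm_setcsr(saved & ~(MXCSR_FLAGS | MXCSR_DAZ));
    tmp = raw;
    memcpy(&x, &tmp, sizeof x);
    sink = _mm_cvtss_f32(_mm_add_ss(_mm_set_ss(x), _mm_set_ss(zero)));
    after = _mm_getcsr();
    _mm_setcsr(saved);
    return (after & MXCSR_DE) != 0;
}
```

For a denormal input such as bit pattern 1, the move leaves DE clear and the add sets it.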

SHIH_K_Intel · ‎07-22-2008

I think the main questions in this thread revolve around:

A. When does one use FTZ and/or DAZ?

B. Why are there so many different flavors of SIMD data movement instructions, and so many varieties of SIMD instructions dealing with moving data elements around?

I will only try to offer my personal observations, and don't hold my explanations accountable as the official architecture or microarchitecture spec.

Regarding the first question: a denormal value can occur in FP calculations either as an input value feeding into an FP calculation or as the result of a numerical FP calculation. From the hardware perspective, these are two different situations. If numerical IEEE compatibility is required, the hardware needs to invoke assists for special handling, and the performance impact is a significant slowdown. This is the rationale for introducing the MXCSR control bits for FTZ and DAZ.

When FTZ is set, it allows hardware to perform numerical FP calculations (such as ADDPS/ADDSS) and flush an underflowed result to zero. If the result of that FP calculation also serves as an input value for a subsequent FP calculation (MULSS/MULPS), not setting FTZ would mean MULSS/MULPS faces a denormal input value and must invoke a special assist.

In this situation, setting FTZ can prevent MULSS/MULPS from taking an assist, improving performance.

In a similar but different situation, the memory operand of MULSS/MULPS holds a denormal value fetched from memory (uninitialized memory, integer data). Having FTZ turned on will not avoid the assist caused by a denormal input value loaded from memory. DAZ serves this purpose by flushing the input denormal value to zero before performing the numerical calculation.

Based on the above, it's not difficult to extrapolate that
1. Only instructions dealing with FP numerical calculations can experience underflow, hence the interaction with FTZ and DAZ. More specifically, it is the FP numerical instruction that needs to consume a denormal value, whether that denormal value came from a previous instruction or from memory.
2. Instructions that move data in bulk, between memory and register or from register to register, are not affected.
3. Instructions that move SIMD data elements from one slot to another (no FP numerical calculations performed) are not affected.
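The output-side versus input-side distinction can be sketched in C as follows. Take it as an illustration under assumptions rather than a spec: `mulss_bits` and the `MXCSR_*` names are hypothetical, the CPU is assumed to support DAZ, and the underflow exception is assumed to be masked as in the default MXCSR.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

#define MXCSR_DAZ 0x0040u  /* bit 6: treat denormal inputs as zero */
#define MXCSR_FTZ 0x8000u  /* bit 15: flush denormal results to zero */

/* Computes a * b with MULSS under the requested mode (any combination of
   MXCSR_DAZ and MXCSR_FTZ) and returns the bit pattern of the product. */
uint32_t mulss_bits(uint32_t a_bits, uint32_t b_bits, unsigned int mode)
{
    volatile uint32_t va = a_bits, vb = b_bits;  /* keep the multiply at run time */
    volatile float result;
    uint32_t ta, tb;
    float a, b, r;
    unsigned int saved = _mm_getcsr();

    _mm_setcsr((saved & ~(MXCSR_DAZ | MXCSR_FTZ)) | mode);
    ta = va;
    tb = vb;
    memcpy(&a, &ta, sizeof a);
    memcpy(&b, &tb, sizeof b);
    result = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(a), _mm_set_ss(b)));
    _mm_setcsr(saved);

    r = result;
    memcpy(&ta, &r, sizeof ta);
    return ta;
}
```

Two cases show the difference. The smallest normal number (0x00800000) times 0.5 (0x3F000000) underflows to the denormal 0x00400000, and FTZ turns that result into 0. The denormal 0x00400000 times 2.0 (0x40000000) gives the normal 0x00800000 whether or not FTZ is set, because FTZ does not look at inputs. Only DAZ turns that product into 0.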

I believe these observations are true, but I would caution that it is difficult to make predictions about future ISA extensions, because that would require a precise definition of what is an "FP numerical calculation" vs. "moving data bits and not considered an FP numerical calculation" for instructions that haven't been defined. At a simplistic bit-by-bit level, all instructions are conceptually merely moving bits around. I think the trend of hardware design is to prefer allowing future architectural flexibility and to have the definition of specific cases driven from some kind of macroscopic software perspective. So I would caution that observations 2 and 3 should not be considered architectural in future ISAs.

AFog0 · ‎07-23-2008

There is tons of code out there that uses movaps for integer data. There is also code that uses movss, movsd, shufps and shufpd on integer data - including the xnu commpage, which is part of the Mac OS X kernel!

Intel can't change the behavior of these instructions without breaking existing code, and violating the specification that these instructions do not make floating point exceptions.

gol · ‎07-23-2008

Intel can't change the behavior of these instructions without breaking existing code,

But, as I wrote above, they could make it slower, or a lot slower, without 'breaking' existing code. Well, the FPU denormalization itself was a lot slower on the P4, this caused BIG problems in existing audio apps (which, at that time, made AMD's a much better choice for audio), while there was nothing officially 'broken'.

(& it could also be broken by mistake, remember the infamous FDIV..)

It also doesn't answer the question: what did the engineers have in mind when they introduced at least 4 instructions to do exactly the same thing? This is where you assume they may not do the same thing in the future.

About Macs, it doesn't look like it would bother Apple to adapt an OS for the hardware, they don't much care about compatibility from what I see :)