Effect of disabling FTZ mode

tran_l_ · 07-06-2015

In this thread: https://software.intel.com/en-us/forums/topic/542786

the effect is described in this sentence: "This bit doesn't affect correctness of IPP functions, in some rare cases it can affect performance only."

1. Correctness of calculations

When a program disables FTZ mode, the correctness of IPP functions is not affected.
However, is the correctness of other Intel functions (e.g., SIMD code) affected?

2. Decrease in performance
When does the decrease in performance occur?
Does it happen in float calculations when underflow occurs?
Do you have any detailed information (e.g., how many percent, how many seconds) about the decrease in performance in the following cases?
- One calculation.
- One hundred million calculations.
If this has a serious effect on performance, what should I do to resolve the problem (avoid the underflow exception)?
If you have any hints, please let me know. Thank you for your support.

Igor_A_Intel · 07-08-2015

Hello,

Please take a look at the following example and try it:

#include <stdio.h>
#include "ipp.h"

#define tapsLen 24
#define numIters 4096
#define nLOOP 10000

int main(){
    const IppLibraryVersion* lib;
    int specSize, bufSize, i;
    Ipp32f pTaps[tapsLen], *pSrc, *pDst, pDlyIn[tapsLen], pDlyOut[tapsLen];
    Ipp8u *pBuf;
    IppsFIRSpec_32f *pSpec;
    Ipp64u strt, stop;
    Ipp64f cpMAC;

    lib = ippsGetLibVersion();
    printf("%s %s %d.%d.%d.%d\n", lib->Name, lib->Version, lib->major, lib->minor, lib->majorBuild, lib->build);

    pSrc = ippsMalloc_32f( numIters );
    pDst = ippsMalloc_32f( numIters );
    ippsVectorJaehne_32f( pSrc, numIters, 0.001f ); /* init source vector */
    ippsZero_32f( pDlyIn, tapsLen );                /* init delay line */
    for( i = 0; i < tapsLen; i++ )                  /* init taps with min normalized numbers */
        pTaps[i] = IPP_MINABS_32F;

    ippsFIRSRGetSize( tapsLen, ipp32f, &specSize, &bufSize );
    pSpec = (IppsFIRSpec_32f*)ippsMalloc_8u( specSize );
    pBuf = ippsMalloc_8u( bufSize );
    ippsFIRSRInit_32f( pTaps, tapsLen, ippAlgDirect, pSpec );

    ippSetFlushToZero( 0, NULL );
    ippsFIRSR_32f( pSrc, pDst, numIters, pSpec, pDlyIn, pDlyOut, pBuf ); /* warm caches */

    /* FTZ is cleared; the taps hold the min normalized value and |data| < 1.0,
       so each multiplication produces a denormal value (underflow) */
    strt = ippGetCpuClocks();
    for( i = 0; i < nLOOP; i++ )
        ippsFIRSR_32f( pSrc, pDst, numIters, pSpec, pDlyIn, pDlyOut, pBuf );
    stop = ippGetCpuClocks();
    cpMAC = (Ipp64f)( stop - strt )/(Ipp64f)( numIters * tapsLen * nLOOP);
    printf("undeflow, FTZ=0, cpMAC = %f\n", cpMAC );

    /* FTZ is on; the taps hold the min normalized value and |data| < 1.0,
       so each multiplication produces zero (all underflows flushed to zero) due to FTZ=1 */
    ippSetFlushToZero( 1, NULL );
    ippsFIRSR_32f( pSrc, pDst, numIters, pSpec, pDlyIn, pDlyOut, pBuf ); /* warm caches */
    strt = ippGetCpuClocks();
    for( i = 0; i < nLOOP; i++ )
        ippsFIRSR_32f( pSrc, pDst, numIters, pSpec, pDlyIn, pDlyOut, pBuf );
    stop = ippGetCpuClocks();
    cpMAC = (Ipp64f)( stop - strt )/(Ipp64f)( numIters * tapsLen * nLOOP);
    printf("undeflow, FTZ=1, cpMAC = %f\n", cpMAC );

    /* FTZ is off; the taps have "normal" values and the data is also in the "normal" range,
       so the filtering operation does not depend on the FTZ state */
    ippSetFlushToZero( 0, NULL );
    for( i = 0; i < tapsLen; i++ )                  /* init taps with "good" values */
        pTaps[i] = 1.f;
    ippsFIRSRInit_32f( pTaps, tapsLen, ippAlgDirect, pSpec );
    ippsFIRSR_32f( pSrc, pDst, numIters, pSpec, pDlyIn, pDlyOut, pBuf ); /* warm caches */
    strt = ippGetCpuClocks();
    for( i = 0; i < nLOOP; i++ )
        ippsFIRSR_32f( pSrc, pDst, numIters, pSpec, pDlyIn, pDlyOut, pBuf );
    stop = ippGetCpuClocks();
    cpMAC = (Ipp64f)( stop - strt )/(Ipp64f)( numIters * tapsLen * nLOOP);
    printf("no undeflow, FTZ=0, cpMAC = %f\n", cpMAC );

    /* release IPP-allocated memory */
    ippsFree( pBuf );
    ippsFree( pSpec );
    ippsFree( pDst );
    ippsFree( pSrc );
    return 0;
}

Output from my laptop:

ippSP AVX2 (l9) 9.0.0 (r47716) 9.0.0.47716
undeflow, FTZ=0, cpMAC = 15.582019
undeflow, FTZ=1, cpMAC = 0.160289
no undeflow, FTZ=0, cpMAC = 0.152346
Press any key to continue . . .

As you can see, denormal values can slow your application down by up to about 100x.
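To relate these cpMAC (CPU clocks per multiply-accumulate) numbers to the "one hundred million calculations" case from the question, here is a rough conversion. It assumes a 3 GHz clock, which is not stated in the thread, and treats cpMAC as an average over the vectorized filter loop, not as the cost of one isolated operation:

```latex
\begin{aligned}
\text{slowdown} &= \frac{15.582019}{0.160289} \approx 97 \\
10^{8}\ \text{MACs},\ \text{FTZ}=0 &:\quad 10^{8} \times 15.58 \approx 1.56 \times 10^{9}\ \text{cycles} \approx 0.52\ \text{s at } 3\ \text{GHz} \\
10^{8}\ \text{MACs},\ \text{FTZ}=1 &:\quad 10^{8} \times 0.160 \approx 1.60 \times 10^{7}\ \text{cycles} \approx 5.3\ \text{ms at } 3\ \text{GHz}
\end{aligned}
```

The actual figures depend on the CPU, the function, and how many of the results really are denormal.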

Also take a look at the DAZ (Denormals Are Zeros) bit.

Regarding accuracy: if FTZ and DAZ are both on, accuracy at the low end of the FP range will be slightly worse, because all denormal numbers are treated as zero.

regards, Igor

tran_l_ · 07-09-2015

I can't build your sample. The error messages are below:

error: identifier "IppsFIRSpec_32f" is undefined

error: identifier "ippsFIRSRGetSize" is undefined

error: identifier "ippAlgDirect" is undefined

error: identifier "ippsFIRSRInit_32f" is undefined

error: identifier "ippsFIRSR_32f" is undefined

I looked for IppsFIRSpec_32f in the IPP include files, but I couldn't find it.

Chao_Y_Intel · 07-09-2015

Hi,

During linking, you need to include the following libraries:
ippcore.lib, ippvm.lib, ipps.lib for dynamic linkage,
or ippcoremt.lib, ippvmmt.lib, ippsmt.lib for static linkage.

Thanks,
Chao