Intel® C++ Compiler

Loop is not vectorised when it contains a call to fmod() (Windows)

Marián__VooDooMan__M
New Contributor II

Greetings,

I wonder why fmod() is not vectorised (treated as an intrinsic function) when I use the multithreaded debug DLL as the CRT. In the assembler output there is a call to the CRT's fmod().

c:\path\to\echo2.hpp(159,32): message : loop was not vectorized: statement cannot be vectorized
c:\path\to\echo2.hpp(159,32): message : vectorization support: call to function fmod cannot be vectorized

When I use the multithreaded static CRT, the assembler output contains a few floating-point instructions for fmod() and the whole loop is vectorised.

I would be fine if it were vectorised and treated as an intrinsic function, at least in release builds, and the same for the other C/C++ floating-point functions, of course.
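
For illustration, a minimal sketch of the kind of loop I mean (a hypothetical example with the same shape as my code, not the actual lines from echo2.hpp):

#include <cmath>
#include <cstddef>

// Hypothetical sketch: wrap each phase value into [0, period).
// With the static CRT this loop vectorises; with the DLL CRT the
// vectorization report says "call to function fmod cannot be vectorized".
void wrap_phases(double* phase, std::size_t n, double period)
{
    for (std::size_t i = 0; i < n; ++i)
        phase[i] = std::fmod(phase[i], period);
}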

TimP
Honored Contributor III

I suppose fmod() may be expected to prefer accuracy over speed in debug mode.

Marián__VooDooMan__M
New Contributor II

Tim Prince wrote:

I suppose fmod() may be expected to prefer accuracy over speed in debug mode.

I'm using IPO, HLO and so on even in debug mode, because my application generates sound in real time. Release-mode execution is faster, of course, but a release build takes too long to build. Once I'm confident in debug mode that there are no bugs, I switch to a release build.

In either case, I am using the "precise" floating-point model.

I'm not sure I got your point, but I expect both accuracy and speed, even in debug mode; without optimizations the application cannot even finish its internal initialization and start up in something like 5 minutes.

Marián__VooDooMan__M
New Contributor II

Maybe I was not clear enough: this issue is present even in release builds.

When I use the DLL version of the CRT, there is a call to fmod() in the CRT; but if I use the static version of the CRT, there is no such call, and fmod() is instead implemented in a few assembler instructions.

I'm afraid the CALL instruction adds overhead compared to the "inline" implementation, and, what is worse, it prevents vectorisation of the whole loop.

Bernard
Valued Contributor I

I checked Agner's instruction tables and unfortunately there is no information there about the latency of the CALL instruction in CPU cycles.

Bernard
Valued Contributor I

The Intel Optimization Manual states that the latency of the CALL instruction is 5 cycles.

Marián__VooDooMan__M
New Contributor II

CALL itself is fine. The point is that a loop containing such function calls is not vectorised.

Marián__VooDooMan__M
New Contributor II

@Intel

This is a feature request: use inline assembly for floating-point library calls instead of calls into Microsoft's DLL CRT, e.g. when "#pragma intrinsic(fmod)" is present. This is related to https://software.intel.com/en-us/forums/topic/405440 , except that ICC should emit its own assembler instructions in the DLL-CRT case (instead of CALL instructions) when it detects the intrinsic pragma before the function. This happens automatically with the static CRT, but I'd like to see it with the DLL CRT as well, because linking statically is a very time-consuming operation.

Vectorisation of the whole loop containing such intrinsic assembler code would then be automatic, I guess (unlike when CALL instructions into the CRT DLL are present).
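
To make the request concrete, a hypothetical sketch of what I have in mind (the pragma is the MSVC-style one; honouring it like this is exactly what I am asking ICC to do):

#include <math.h>

// Hypothetical: with the MSVC-style pragma the compiler would expand
// fmod in-line instead of emitting "call fmod" into the DLL CRT,
// and could then vectorise the loop.
#pragma intrinsic(fmod)

void wrap(double* x, int n, double m)
{
    for (int i = 0; i < n; ++i)
        x[i] = fmod(x[i], m);
}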

Marián__VooDooMan__M
New Contributor II

NB: the point is that when using the DLL CRT, ICC cannot vectorise loops containing functions such as fmod(), unlike with the static CRT. I take this as a performance problem, because the issue is present in release profiles as well.

JenniferJ
Moderator

Are you using the 14.0 or 15.0 compiler? I tried to duplicate the issue with a simple test case but couldn't.

It's great that you've found a workaround with "#pragma intrinsic(fmod)". Could you send a test case?

thanks again,

Jennifer

Marián__VooDooMan__M
New Contributor II

Please disregard my post: I have found that there is a "CALL fmod" after all. I must have been blind.

But I still encourage you to implement an inlined intrinsic form, without a CALL into the slow (multiversioned) MSVC library:

#pragma intrinsic(any_PF_function)

where "any_PF_function" would be compiled to inline floating-point instructions instead of a CALL into the MSVC (slow, multiversioned) library. I would like to see inline assembled instructions in the output instead of "CALL fmod". It might be faster... This is a feature request.

"#pragma intrinsic(fn)" is not supported by ICC, but it is by MSVC.

This is related to the feature request at https://software.intel.com/en-us/forums/topic/405440

Marián__VooDooMan__M
New Contributor II

jimdempseyatthecove wrote:

Maybe you can use this:

https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-DAFA16CE-DB78-4FFD-9C1E-4AE0EA96CEA7.htm

Thank you Jim, but I'd like the code to be more portable, even though most compilers have a header file for these intrinsics.

JenniferJ
Moderator

Yes, we have a feature request (DPD200042138) for "#pragma intrinsic". I will associate this thread with the existing FR.

But "icl" should use the intrinsic when possible. It is strange that the intrinsic fmod is not used. Maybe try adding the "/Qfast_transcendentals" to the compilation.

Jennifer

Marián__VooDooMan__M
New Contributor II

Jennifer J. (Intel) wrote:

But "icl" should use the intrinsic when possible. It is strange that the intrinsic fmod is not used. Maybe try adding the "/Qfast_transcendentals" to the compilation.

Jennifer

Yes, this command-line argument helps a lot. But I still have "call fmod" instructions in the assembly output. Worse, I was not able to create a reproducer for this issue.

Marián__VooDooMan__M
New Contributor II

Jennifer J. (Intel) wrote:

But "icl" should use the intrinsic when possible. It is strange that the intrinsic fmod is not used. Maybe try adding the "/Qfast_transcendentals" to the compilation.

Jennifer

Great news! I was able to make a reproducer. Please see the attachment "fmod.7z", select the "Release|x64" profile and observe the diagnostics:

1>c:\Users\vdmn\Documents\develop\Recorder7.1\tmp\fmod\fmod\fmod.cpp(56,2): message : loop was not vectorized: unsupported loop structure
1>c:\Users\vdmn\Documents\develop\Recorder7.1\tmp\fmod\fmod\fmod.cpp(196,9): message : vectorization support: call to function fmod cannot be vectorized
1>c:\Users\vdmn\Documents\develop\Recorder7.1\tmp\fmod\fmod\fmod.cpp(196,9): message : loop was not vectorized: type conversion prohibits vectorization

What worries me is the message "vectorization support: call to function fmod cannot be vectorized" above, which shows up in the .asm output as:

;c:\Users\vdmn\Documents\develop\Recorder7.1\tmp\fmod\fmod\fmod.cpp:196.9
$LN246:
  0007a e8 fc ff ff ff   call fmod                              

This should be computed intrinsically instead of via a call to the MSVC CRT library, which could be slow given that I use /QxHOST. I am on an i7 Haswell, ICC 14.0.

I believe an intrinsic implementation would be much faster than the MSVC CRT library call.
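
For readers who cannot open the attachment: a hypothetical loop of roughly this shape, mixing an int-to-double conversion with a call to fmod, is the kind of code that can produce such messages (it is a simplified sketch, not the literal code from fmod.cpp):

#include <cmath>

// Simplified, hypothetical sketch only, not the attached fmod.cpp.
// The int -> double conversion and the fmod call are the kind of
// constructs the two diagnostics ("type conversion prohibits
// vectorization" and "call to function fmod cannot be vectorized")
// may refer to.
void fill(double* out, int n, double step, double period)
{
    for (int i = 0; i < n; ++i)
        out[i] = std::fmod(static_cast<double>(i) * step, period);
}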

Marián__VooDooMan__M
New Contributor II

Jennifer J. (Intel) wrote:

But "icl" should use the intrinsic when possible. It is strange that the intrinsic fmod is not used. Maybe try adding the "/Qfast_transcendentals" to the compilation.

Jennifer

Adding this didn't help; the *.asm dumps read the same. Moreover, I have read that this option causes the loss of the last few ULPs in floating-point calculations, which is not desirable in my case.

Marián__VooDooMan__M
New Contributor II

Just a side note: there are more transcendentals that are not computed in-line as if they were intrinsics (not only fmod); calls into the Windows CRT are emitted instead. I'd like to exploit AVX2 (and earlier instruction sets) to compute them, or even vectorise them if possible, and also to get rid of the CALL instruction, which could flush the instruction cache and which computes the result slowly compared to what my CPU can do, since the CRT is "universal".
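
As an illustration of the kind of in-line computation I mean, here is a minimal sketch that rewrites fmod by hand with the usual truncate-the-quotient formula; the loop body is then plain arithmetic the vectoriser can work with, at the cost of accuracy for large quotients and of fmod's special-case behaviour:

#include <cmath>

// Minimal sketch: fmod(x, period) rewritten as x - trunc(x/period) * period.
// Caveats: accuracy degrades when x/period is large, and fmod's special
// cases (signed zero, infinities, NaN) are not reproduced.
void wrap_inline(double* x, int n, double period)
{
    for (int i = 0; i < n; ++i)
        x[i] = x[i] - std::trunc(x[i] / period) * period;
}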

TimP
Honored Contributor III

This thread has become rather confusing.  The built-in partial support for math functions (other than sqrt) is for x87 long double non-vector.  I doubt there is any feasible way or incentive to make simd math functions in-line.  Even scalar simd math functions from external library are likely to run faster than x87 intrinsics.

/Qfast-transcendentals is usually associated with vectorization using the Intel SVML (short vector math) library.  It wouldn't be expected to produce in-lining, but it should produce significant performance gains for any reasonable loop length (even for loops shorter or longer than is optimum for in-line vector code).  A reason for making it optional is the possibility that SVML is less accurate (possibly up to 4 ULP error).  fast-transcendentals is turned off by options like /fp:source but can then be re-enabled.

As Intel wrote off high performance x87 math intrinsics with the introduction of SSE2, which now has become the default architecture for Intel compilers even in ia32 mode, the possibility of optimizing x87 math functions seems difficult to support.

If there are specific math functions for which ICL currently links to the Microsoft math library (if that is what was meant above), it may take some real evidence that improvements are possible, for example if there is an opportunity to augment SVML.
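
For what it's worth, the Intel compiler also exposes SVML directly through intrinsics; a minimal sketch, assuming the SVML intrinsic _mm256_fmod_pd is available (it is listed in the Intel Intrinsics Guide but provided only by the Intel compiler, not by MSVC or gcc):

#include <immintrin.h>
#include <math.h>

// Minimal sketch, Intel compiler only: process 4 doubles per iteration
// through the SVML fmod kernel, then handle the remainder in scalar code.
void wrap_svml(double* x, int n, double period)
{
    __m256d p = _mm256_set1_pd(period);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(x + i);
        _mm256_storeu_pd(x + i, _mm256_fmod_pd(v, p));
    }
    for (; i < n; ++i)
        x[i] = fmod(x[i], period);
}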

Marián__VooDooMan__M
New Contributor II

Tim Prince wrote:

The built-in partial support for math functions (other than sqrt) is for x87 long double non-vector.  I doubt there is any feasible way or incentive to make simd math functions in-line.

Yes, long double is a problem. But what about double? That was my question.

Tim Prince wrote:

Even scalar simd math functions from external library are likely to run faster than x87 intrinsics.

Really? Even with /QxHOST, when ICC knows my CPU's (Haswell, x64 target) "metrics"? I don't deploy my application; I am bound to the build on my own machine (though it is written in a portable way).

Tim Prince wrote:

/Qfast-transcendentals is usually associated with vectorization using the Intel SVML (short vector math) library.  It wouldn't be expected to produce in-lining, but it should produce significant performance gains for any reasonable loop length (even for loops shorter or longer than is optimum for in-line vector code).  A reason for making it optional is the possibility that SVML is less accurate (possibly up to 4 ULP error).  fast-transcendentals is turned off by options like /fp:source but can then be re-enabled.

Jennifer from Intel recommended that above... but I finally dropped it from my command line.

Thanks!

Marián__VooDooMan__M
New Contributor II

Tim Prince wrote:

As Intel wrote off high performance x87 math intrinsics with the introduction of SSE2, which now has become the default architecture for Intel compilers even in ia32 mode, the possibility of optimizing x87 math functions seems difficult to support.

I am not speaking of the ia32 mode you mentioned, but about Intel64 mode (x64 on the Haswell architecture).

Or did you mean by ia32 all Intel-based architectures, even the x64 one?

Tim Prince wrote:

If there are specific math functions for which ICL currently links to the Microsoft math library (if that is what was meant above), it may take some real evidence that improvements are possible, for example if there is an opportunity to augment SVML.

I will try to play with SVML. Can you point me to better references and tutorials than Google turns up, or is Googling it the best way?

TIA!
