I am trying to use Intel C++ Composer XE 2013 to autovectorize a loop.
My loop contains a number of float to int conversions that I need to do in round down (toward negative infinity) mode. My initial implementation used floor() to achieve this, but it is slow and causes problems for vectorization. If I drop the floor() and use a simple cast then performance improves, but presumably I no longer get round down behaviour for negative numbers? Is there a way to do a vectorized float to int conversion in round down mode whilst still relying on autovectorization?
With a default implicit cast you would get round toward zero, unless you set the rounding mode (which might be feasible if you reset the mode after your loop and the loop is big enough to amortize the overhead). If you wish to vectorize floor(), you rely on auto-vectorization through the SVML library, which requires setting fast-transcendentals if you have set an option that disables it.
I don't know whether SVML has floor() support (or for which ISA targets). If you find that PSXE 2016 doesn't have it, and want a detailed answer about whether it could be added, you could submit your case through your IPS support account. You might also try writing your own vector floor() with rounding mode changes inside. I haven't tried setting the rounding mode in Intel C++ on the various operating systems.
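For reference, one way to write your own floor() without touching the rounding mode at all is to truncate and then correct. This is a different technique from the rounding mode change suggested above, and the helper names floor_to_int and floor_array are made up for illustration. It is only a sketch, and it assumes every input fits in the range of int:

```cpp
#include <cstddef>

// Hypothetical helper (not from this thread): floor-to-int built on a plain cast.
// Assumes x is finite and within the range of int.
inline int floor_to_int(float x) {
    int i = static_cast<int>(x);  // the cast rounds toward zero
    // For negative non-integers the cast rounded up, so step down by one.
    return i - (static_cast<float>(i) > x ? 1 : 0);
}

// Simple loop over the helper; there is no library call and no branch in the body.
void floor_array(const float* in, int* out, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        out[k] = floor_to_int(in[k]);
    }
}
```

The loop body is just a convert, a compare and a subtract, which is the kind of code autovectorizers usually handle, but whether a given compiler version actually vectorizes it would need to be checked in the vectorization report.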
The autovectorized code appears to be outputting vcvttss2si which means explicit truncation. Setting the SSE rounding mode before the loop would therefore not be sufficient unless the compiler is smart enough to understand that I have done so and output a different instruction.
I notice that IPP has a float to int conversion with rounding mode. Maybe my best approach is to take the float to int conversions out of the loop and use that.
If your entire loop operates correctly in "round down" mode, then the overhead of changing the rounding mode should be negligible. The LDMXCSR and STMXCSR instructions are not privileged. According to http://www.agner.org/optimization/instruction_tables.pdf, the LDMXCSR instruction has a 3 cycle latency while the STMXCSR has a 1 cycle latency on recent Intel processors (SNB, IVB, HSW).
Even if you need to change the mode inside the loop it should not be too bad: you could execute the STMXCSR instruction before the loop to read the baseline value, then use one GPR to hold that value and another to hold the value modified for "round down" mode. Inside the loop you would not need to read the MXCSR; you would only need to write it (with LDMXCSR) with the desired value at each point where it needs to change.
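A hand-written sketch of the MXCSR approach using the SSE intrinsics might look like the following. It assumes an x86 target with SSE2, and the function name convert_round_down is invented for this example. It uses _mm_cvtps_epi32, which honours the current MXCSR rounding mode, rather than a C cast, which always truncates:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr, _MM_ROUND_MASK, _MM_ROUND_DOWN
#include <emmintrin.h>  // _mm_cvtps_epi32 (SSE2)

// Hypothetical helper: converts n floats to ints, rounding toward -infinity.
// Assumes an x86 target with SSE2 and inputs that fit in the range of int.
void convert_round_down(const float* in, int* out, std::size_t n) {
    const unsigned int saved = _mm_getcsr();  // read MXCSR once (STMXCSR)
    const unsigned int down = (saved & ~_MM_ROUND_MASK) | _MM_ROUND_DOWN;
    _mm_setcsr(down);                         // write MXCSR (LDMXCSR)

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);
        // cvtps2dq converts using the rounding mode currently in MXCSR.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_cvtps_epi32(v));
    }
    for (; i < n; ++i) {
        // Scalar tail: cvtss2si also follows the MXCSR rounding mode.
        out[i] = _mm_cvtss_si32(_mm_set_ss(in[i]));
    }

    _mm_setcsr(saved);                        // restore the caller's mode
}
```

This is explicit vectorization rather than autovectorization. Compilers are also not obliged to treat MXCSR writes as a dependency of the conversions around them, so it is worth checking the generated assembly.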
Of course none of this does any good helping with autovectorization....
Sam, see if this is the cause of the issue: to convert float to int you should use the function floorf(x), not floor(x). In the global namespace, math functions are not overloaded.
Or, if you want, use the std namespace function std::floor(x) - this one is overloaded.
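To illustrate the difference, here is a small sketch that checks the result types at compile time. The helper name floor_to_int_std is invented for this example; the point is only that floorf and std::floor keep a float argument in single precision:

```cpp
#include <cmath>        // std::floor overloads
#include <math.h>       // floorf in the global namespace
#include <type_traits>

// std::floor is overloaded: a float argument gives a float result.
static_assert(std::is_same<decltype(std::floor(1.5f)), float>::value,
              "std::floor(float) should return float");
// floorf is the single-precision C function.
static_assert(std::is_same<decltype(floorf(1.5f)), float>::value,
              "floorf should return float");
// A double argument selects the double version.
static_assert(std::is_same<decltype(std::floor(1.5)), double>::value,
              "std::floor(double) should return double");

// Hypothetical helper: round down in single precision, then convert to int.
inline int floor_to_int_std(float x) {
    return static_cast<int>(std::floor(x));
}
```

With the C floor(double), a float argument is promoted to double first, which adds float/double conversions to the loop body.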
I checked with icc 13.0.0 and later compiler versions. In the simple test case the floor function was vectorized. We compiled with "icc -c -fargument-noalias -vec_report2". Do you have a test case where the floor function is not vectorized?
Thanks for all the replies.
I've changed a few things since my original post including splitting my loop into a few simpler loops and adding some #pragma simd directives. Where previously I was getting warnings like "remark: vectorization support: call to function floorf cannot be vectorized." it now uses __svml_floorf4() in place of floor() or floorf() and the loop does get vectorized. It seems like this must still incur some overhead compared with setting the rounding mode outside the loop and doing without the floor() altogether but it doesn't seem to be very significant at the moment.
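For anyone finding this later, the simplified loops look roughly like the sketch below. The function name floor_loop and its signature are made up for illustration, and the pragma is guarded so that other compilers simply skip it:

```cpp
#include <math.h>  // floorf

// Illustrative sketch only: one of the simplified loops, with floorf feeding the cast.
void floor_loop(const float* in, int* out, int n) {
#if defined(__INTEL_COMPILER)
#pragma simd  // Intel-specific directive; not seen by other compilers
#endif
    for (int i = 0; i < n; ++i) {
        // floorf rounds toward -infinity; the cast then converts an exact integer value.
        out[i] = static_cast<int>(floorf(in[i]));
    }
}
```

The vectorization report is the way to confirm whether the call was replaced with the SVML routine in a particular build.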