styc · ‎11-16-2012

Attached is a small program implementing the Newton-Raphson iteration for solving y = x * exp(x). ifort 13 does not vectorize the program unless the MIC architecture is targeted. Comparing the Fortran against the equivalent C code written using the elemental function extension, the C code shows a 1.8x speedup when measured on Nehalem. Arguably, icc 13 is not optimizing hard enough, either: a version based on intrinsic functions shows a 2.1x speedup over the Fortran code. Greater gains can be expected on Sandy/Ivy Bridge, where the vectors are wider.

TimP · ‎11-16-2012

I see that the Fortran elemental doesn't have the same effect on optimization here as writing the parallel intrinsics in icc. If I set more aggressive options, I get the message "not inner loop", indicating that the compiler hasn't learned outer loop vectorization for this situation. In effect, in your C code intrinsics, you have explicitly pushed enough work inside the while loop to take advantage of SIMD.

styc · ‎11-16-2012

TimP (Intel) wrote:
I see that the Fortran elemental doesn't have the same effect on optimization here as writing the parallel intrinsics in icc. If I set more aggressive options, I get the message "not inner loop", indicating that the compiler hasn't learned outer loop vectorization for this situation. In effect, in your C code intrinsics, you have explicitly pushed enough work inside the while loop to take advantage of SIMD.

To clarify, I misread the generated assembly code for the MIC architecture. It is not vectorized, either. I did not explicitly make the loop body heavier in the intrinsic code than in the scalar code. The algorithm is exactly the same. This is a case where vectorization is almost always beneficial. Masking adds some small overhead, but you save a lot from vectorized division alone.

ifort 13 does not vectorize Newton-Raphson iteration coded as elemental function