In the following code snippet, version 1 performs measurably faster than version 2:
--- begin of code ---
#define _mm_extract_pd(R, I) (*((double*)(&R) + I))
__m128d art_vect428;

// version 1:
ph[i]     = atan(_mm_extract_pd(art_vect428, 0));
ph[i + 1] = atan(_mm_extract_pd(art_vect428, 1));
// end of v1

// version 2:
_mm_storeu_pd(&ph[i], _mm_atan_pd(art_vect428));
// end of v2
--- end of code ---
Looking at the assembly shows that _mm_storeu_pd is decomposed into two writes (just as in v1), so that can't actually be the reason. Are there any other explanations besides "_mm_atan_pd is slower than two calls to atan"? Other functions (sin, cos) seem to show similar behavior.
/edit: While minimizing the code I noticed that the issue is more complex. It only happens when the unroll pragma is active; if the pragma is commented out, both versions run at about the same speed. The code without SLOW_VERSION defined reaches that speed even with unrolling, but the SLOW_VERSION code is slower. Here is the code (Windows only); look for the SLOW_VERSION define:
Performance variations with unrolling are common and are unlikely to reproduce across different CPU models. A likely reason is whether or not the loop body fits the Loop Stream Detector (LSD). The largest loop body that still fits the LSD is probably achieved when the loop body gets 32-byte code alignment. The worst case may be a loop that doesn't activate the LSD but also isn't unrolled enough to approach full performance. I suspect yours may be one of the models that prefers massive unrolling, while Core i7 style CPUs perform more consistently with unroll by 4. Westmere style CPUs sometimes appear to require even more unrolling, something to do with the hardware register renaming. The decoded instruction cache added in future CPU models is supposed to help avoid problems with unfavorable amounts of unrolling. Needless to say, unrolling makes it difficult to get full performance when the trip count doesn't match the unroll factor. Your code presents more complex issues than are likely to be dealt with effectively here.
I don't think the issue is directly related to unrolling as stated in my "/edit" clause; the behavior is reproducible regardless of "#pragma unroll". Meanwhile I have traced both functions at the assembly level: atan does a lot of computation, while _mm_atan_pd loads some values from static memory. Maybe I've just run into the dreaded cache issue again. Of course this immediately raises the question of whether it's better to use math.h functions instead of svml functions at all.
As far as I know, svml functions are intended primarily to support auto-vectorization. With IPP and MKL vector library functions also supported, the scope for explicitly calling svml functions is limited, and it doesn't get much support priority. Ideally, code is written so that icc can choose between scalar and svml functions itself. There are also further resources if you are interested in making more use of Intel library functions.