Hello,
I'd like to talk about two weird things in the compiler's optimization process.
#1 : sqrtsd seems to be preferred over sqrtpd... (I just got a 30% performance boost by forcing the use of 2 sqrtpd instead of the 4 sqrtsd previously generated in my code). I know sqrt is not often used on consecutive values, so I won't mind if that feature doesn't appear. However, here comes #2.
#2 : in an intrinsics-based function the result must be _mm_store'd to be returned, so I expected something like RVO to appear in the assembly when the returned value is loaded into an xmm register right afterwards. Instead there is a store/load pair to/from an unused local stack variable that could be simplified into a mov between 2 xmm registers. I find that a bit strange, being used to seeing ICC get rid of everything it can — which makes dead code elimination a pain to avoid in performance tests, by the way :)
Note : the code is compiled with the /Ox flag by the latest ICC integrated into MSVC 2013.
Thank you in advance for your answers.
An Intel fan, trying to optimize even what doesn't need to be optimized.
---
---
1) If you mean packing independent sqrt calculations into a parallel instruction, the compiler doesn't do it automatically. You might persuade it by packing them into a short vector or an __m128 when using the target instruction set you mention. The potential gain depends on the specific CPU model, among other things (Harpertown, Ivy Bridge, and Haswell have significantly lower sqrt latencies than others, hence less gain from packing 2 sqrts). Sandy Bridge and Ivy Bridge don't gain as much as Haswell should by packing 4 of them rather than 2 into a single parallel instruction, but such distinctions aren't likely to be made by a compiler, beyond the extent to which you could make it happen by specifying an __m128 when building under AVX. But I could be misreading your meaning.
2) I don't think the situation you describe is clear, but when using intrinsics you can explicitly carry an operand past an _mm_store, so the compiler would not necessarily be wrong to take your code literally if you asked for a store and reload.
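[Editor's note] Tim's suggestion of packing the independent square roots into a vector type can be sketched as follows. This is a minimal, hypothetical example; `sum_sqrt_packed` is an illustrative name, not from the thread, and SSE2 is assumed:

```cpp
#include <immintrin.h>

// Pack the four independent square roots two at a time into __m128d,
// so the compiler can emit sqrtpd instead of four scalar sqrtsd.
inline double sum_sqrt_packed(const double* p) {
    __m128d lo = _mm_sqrt_pd(_mm_loadu_pd(p));      // sqrt(p[0]), sqrt(p[1])
    __m128d hi = _mm_sqrt_pd(_mm_loadu_pd(p + 2));  // sqrt(p[2]), sqrt(p[3])
    __m128d s  = _mm_add_pd(lo, hi);                // two partial sums
    __m128d h  = _mm_unpackhi_pd(s, s);             // move the high lane down
    return _mm_cvtsd_f64(_mm_add_sd(s, h));         // total sits in the low lane
}
```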
---
As @Tim hinted, your data is probably not packed into an __m128d union, forcing the compiler to use the scalar sqrtsd instruction.
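[Editor's note] One way to make the packing explicit is to keep the storage as __m128d pairs. This layout is a hypothetical sketch, not the poster's actual template:

```cpp
#include <immintrin.h>

// Hypothetical layout keeping the four doubles as two __m128d pairs,
// so _mm_sqrt_pd can be applied without the compiler having to prove
// the scalars are adjacent and 16-byte aligned.
struct alignas(16) vec4d_packed {
    __m128d xy;  // holds x (low lane) and y (high lane)
    __m128d zw;  // holds z (low lane) and w (high lane)
};
```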
---
I'll try to be clearer this time. The structure used is like this:
struct __declspec(align(16)) vec4d { // in fact it's highly templated, but this is equivalent to the one instantiated
    union { double x; double r; double v[1]; };
    union { double y; double g; };
    union { double z; double b; };
    union { double w; double a; };
};
1) Differences between the disassemblies:
double sum_sqrt_simple() {
    return std::sqrt(v[0]) + std::sqrt(v[1]) + std::sqrt(v[2]) + std::sqrt(v[3]);
}
generates:
movsd    xmm1,mmword ptr [v4]
movsd    xmm0,mmword ptr [rbp+0A8h]
sqrtsd   xmm1,xmm1
sqrtsd   xmm0,xmm0
movsd    xmm2,mmword ptr [rbp+0B0h]
addsd    xmm1,xmm0
sqrtsd   xmm2,xmm2
movsd    xmm3,mmword ptr [rbp+0B8h]
addsd    xmm1,xmm2
sqrtsd   xmm3,xmm3
addsd    xmm1,xmm3
whereas a faster way (at least on my machine, but that's why I'm asking) would be:
sqrtpd   xmm6,xmmword ptr [v4]
sqrtpd   xmm0,xmmword ptr [rbp+0B0h]
addpd    xmm6,xmm0
movaps   xmm5,xmm6
unpckhpd xmm5,xmm4
addsd    xmm6,xmm5
2) The thing is, by implementing the second way manually like this:
double sum_sqrt_opti() {
    double ret;
    __m128d x = _mm_load_pd(v);
    __m128d y = _mm_load_pd(v + 2);
    x = _mm_sqrt_pd(x);
    y = _mm_sqrt_pd(y);
    x = _mm_add_pd(x, y);
    y = _mm_unpackhi_pd(x, y);
    y = _mm_add_sd(x, y);
    _mm_storel_pd(&ret, y);
    return ret;
}
it generates this at the return point when the call is used as an rvalue in an expression, e.g. std::cout << vec.sum_sqrt_opti(); :
movlpd qword ptr [rbp],xmm0
movsd  xmm6,mmword ptr [rbp]
I hope that's clearer; thanks in advance for your updated answers :)
---
It seems that the compiler chooses the scalar versions of the SSE instructions when your code accesses union/struct members through the dot operator, even on __m128 types. I suppose this is by design; I have observed the same pattern in my own code.
A simple workaround could be using inline assembly or intrinsics.
---
Inline assembly is hard to maintain portably, so intrinsics are the way to go, but I'm still trying to find a way to avoid the load/store pair that survives optimization (as I showed in my second point); it should be handled as part of the RVO. The __vectorcall calling convention seems to allow the whole function to be fed through xmm registers and to return via xmm0 when possible, but I don't see any example where the conversion between float[4] and __m128 is done implicitly.
Someone suggested the union trick, but I prefer not to break any coding rules (strict aliasing), which could lead to undefined behavior under optimization.
If anyone knows the solution I'd be really really thankful.
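[Editor's note] For what it's worth, the standard-blessed way to type-pun without a union or a reinterpret_cast is memcpy, which compilers optimize away for small fixed sizes. A minimal sketch (the helper name `low_of` is illustrative):

```cpp
#include <immintrin.h>
#include <cstring>

// Strict-aliasing-safe extraction of the low element of an __m128d:
// memcpy is well-defined for type punning, and a fixed 8-byte copy is
// typically optimized down to a register move (or to nothing at all).
inline double low_of(__m128d v) {
    double d;
    std::memcpy(&d, &v, sizeof d);  // copies the low 8 bytes (element 0)
    return d;
}
```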
---
This might be a little faster:
sqrtpd   xmm0,xmmword ptr [v4]
movpd    xmm6,xmmword ptr [rbp+0B0h]
sqrtpd   xmm6,xmm6
addpd    xmm6,xmm0
movaps   xmm5,xmm6
unpckhpd xmm5,xmm4
addsd    xmm6,xmm5
Jim Dempsey
---
jimdempseyatthecove wrote:
This might be a little faster:
sqrtpd   xmm0,xmmword ptr [v4]
movpd    xmm6,xmmword ptr [rbp+0B0h]
sqrtpd   xmm6,xmm6
addpd    xmm6,xmm0
movaps   xmm5,xmm6
unpckhpd xmm5,xmm4
addsd    xmm6,xmm5
Jim Dempsey
Why would adding a movpd make it faster?! And if so, why not do it for the "xmmword ptr [v4]" operand too? Thanks for your reply.
---
It seems that using a reinterpret_cast is OK... but is it portable? I'll check when I find some time and compilers to try it with.
return *(reinterpret_cast<double*>(&y));
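[Editor's note] A portable alternative, assuming SSE2 is available, is the dedicated intrinsic `_mm_cvtsd_f64`, which returns the low double of an `__m128d` without any cast or memory round-trip:

```cpp
#include <immintrin.h>

// _mm_cvtsd_f64 hands back the low lane of an __m128d directly; since
// the value is already in an xmm register, it typically compiles away.
inline double low_double(__m128d y) {
    return _mm_cvtsd_f64(y);
}
```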
---
>>>Why adding a movpd would make it faster ?!? >>>
I suppose that performing the sqrt operation on a value which is already in a register could be faster in terms of CPU cycles.
---
<<< I suppose that performing sqrt operation on value which is already in register could be faster in terms of cpu cycles. >>>
A sqrtpd with an xmmword memory operand takes the same number of cycles as, or even one less than, a movpd plus sqrtpd(xmm#1, xmm#2) pair. Is this turning into a game for the best-answer flag?
The main question remains: how can I return an optimizable floating-point variable (kept in an xmm register when the call is inlined) from an __m128 local without resorting to the "punning" or "store" solutions? The _mm_loads do not appear in the disassembly (so they are optimized away), but the _mm_store stays for no reason... can an actual ICC dev answer? Thanks.
---
>>>The sqrtpd with an xmmword memory operand takes the same number of cycles as, or even one less than, a movpd plus sqrtpd(xmm#1, xmm#2) pair.>>>
The sqrtpd with a register-memory operand will probably be decoded into two uops: one to load the variable and one to perform the sqrt. In terms of uops, a reg-reg sqrtpd plus a reg-mem movpd will probably also decode into two uops.