Hello,
I'd like to talk about two weird things in the compiler's optimization process.
#1 : sqrtsd seems to be preferred over sqrtpd... (I just got a 30% performance boost by forcing the use of 2 sqrtpd instead of the 4 sqrtsd previously generated in my code). I know sqrt is not often used on consecutive values, so I won't mind if that feature doesn't appear. However, here comes #2.
#2 : in an intrinsics-based function the result must be _mm_store'd to be returned, so I expected something like RVO to appear in the assembly when the returned value is loaded into an xmm register right afterwards. Instead there is a store/load pair to/from an unused local stack variable that could be simplified into a mov between 2 xmm registers. I find that a bit strange, being used to seeing ICC get rid of everything it can — which makes dead code elimination a pain to avoid in performance tests, by the way :)
Note : the code is compiled with the /Ox flag by the latest ICC integrated into MSVC 2013.
Thank you in advance for your answers.
An Intel fan, trying to optimize even what doesn't need to be optimized.
---
---
1) If you mean packing independent sqrt calculations into a parallel instruction, the compiler doesn't do it automatically. You might persuade it by packing them into a short vector or an __m128 when using the target instruction set you mention. The potential gain depends on the specific CPU model, among other things (Harpertown, Ivy Bridge, and Haswell have significantly lower sqrt latencies than others, hence less gain from packing 2 sqrts). Sandy Bridge and Ivy Bridge don't gain as much as Haswell should by packing 4 of them rather than 2 into a single parallel instruction, but such distinctions aren't likely to be made by a compiler, beyond the extent to which you could make it happen by specifying an __m128 when building under AVX. But I could be misreading your meaning.
2) I don't think the situation you describe is clear, but when using intrinsics you can explicitly carry an operand past an _mm_store, so the compiler would not necessarily be wrong to take your code literally if you asked for a store and reload.
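[Editor's note] Tim's suggestion of packing the independent square roots into a vector type can be sketched as follows. This is a minimal, hypothetical example; `sum_sqrt_packed` is an illustrative name, not from the thread, and SSE2 is assumed:

```cpp
#include <immintrin.h>

// Pack the four independent square roots two at a time into __m128d,
// so the compiler can emit sqrtpd instead of four scalar sqrtsd.
inline double sum_sqrt_packed(const double* p) {
    __m128d lo = _mm_sqrt_pd(_mm_loadu_pd(p));      // sqrt(p[0]), sqrt(p[1])
    __m128d hi = _mm_sqrt_pd(_mm_loadu_pd(p + 2));  // sqrt(p[2]), sqrt(p[3])
    __m128d s  = _mm_add_pd(lo, hi);                // two partial sums
    __m128d h  = _mm_unpackhi_pd(s, s);             // move the high lane down
    return _mm_cvtsd_f64(_mm_add_sd(s, h));         // total sits in the low lane
}
```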
---
As @Tim hinted, your data is probably not packed into an __m128d union, forcing the compiler to use the scalar sqrtsd instruction.
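[Editor's note] One way to make the packing explicit is to keep the storage as __m128d pairs. This layout is a hypothetical sketch, not the poster's actual template:

```cpp
#include <immintrin.h>

// Hypothetical layout keeping the four doubles as two __m128d pairs,
// so _mm_sqrt_pd can be applied without the compiler having to prove
// the scalars are adjacent and 16-byte aligned.
struct alignas(16) vec4d_packed {
    __m128d xy;  // holds x (low lane) and y (high lane)
    __m128d zw;  // holds z (low lane) and w (high lane)
};
```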
---
I'll try to be clearer this time. The structure used is like this:
struct __declspec(align(16)) vec4d { // in fact it's highly templated, but this is equivalent to the one instantiated
    union { double x; double r; double v[1]; };
    union { double y; double g; };
    union { double z; double b; };
    union { double w; double a; };
};
1) Differences between the disassemblies:
double sum_sqrt_simple() {
    return std::sqrt(v[0]) + std::sqrt(v[1]) + std::sqrt(v[2]) + std::sqrt(v[3]);
}
generates:
movsd    xmm1,mmword ptr [v4]
movsd    xmm0,mmword ptr [rbp+0A8h]
sqrtsd   xmm1,xmm1
sqrtsd   xmm0,xmm0
movsd    xmm2,mmword ptr [rbp+0B0h]
addsd    xmm1,xmm0
sqrtsd   xmm2,xmm2
movsd    xmm3,mmword ptr [rbp+0B8h]
addsd    xmm1,xmm2
sqrtsd   xmm3,xmm3
addsd    xmm1,xmm3
whereas a faster way (at least on my machine, but that's why I'm asking) would be:
sqrtpd   xmm6,xmmword ptr [v4]
sqrtpd   xmm0,xmmword ptr [rbp+0B0h]
addpd    xmm6,xmm0
movaps   xmm5,xmm6
unpckhpd xmm5,xmm4
addsd    xmm6,xmm5
2) The thing is, by implementing the second way manually like this:
double sum_sqrt_opti() {
    double ret;
    __m128d x = _mm_load_pd(v);
    __m128d y = _mm_load_pd(v + 2);
    x = _mm_sqrt_pd(x);
    y = _mm_sqrt_pd(y);
    x = _mm_add_pd(x, y);
    y = _mm_unpackhi_pd(x, y);
    y = _mm_add_sd(x, y);
    _mm_storel_pd(&ret, y);
    return ret;
}
it generates this at the return point when the call is used as an rvalue in an expression, e.g. std::cout << vec.sum_sqrt_opti(); :
movlpd qword ptr [rbp],xmm0
movsd  xmm6,mmword ptr [rbp]
I hope that's clearer; thanks in advance for your updated answers :)
---
It seems that the compiler chooses the scalar versions of the SSE instructions when your code accesses union/struct members through the dot operator, even on __m128 types. I suppose this is by design; I have observed the same pattern in my own code.
A simple workaround could be using inline assembly or intrinsics.
---
Inline assembly is hard to maintain portably, so intrinsics are the way to go, but I'm still trying to find a way to avoid the load/store pair that survives optimization (as I showed in my second point); it should be handled as part of the RVO. The __vectorcall calling convention seems to allow the whole function to be fed through xmm registers and to return via xmm0 when possible, but I don't see any example where the conversion between float[4] and __m128 is done implicitly.
Someone suggested the union trick, but I prefer not to break any coding rules (strict aliasing), which could lead to undefined behavior under optimization.
If anyone knows the solution I'd be really really thankful.
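[Editor's note] For what it's worth, the standard-blessed way to type-pun without a union or a reinterpret_cast is memcpy, which compilers optimize away for small fixed sizes. A minimal sketch (the helper name `low_of` is illustrative):

```cpp
#include <immintrin.h>
#include <cstring>

// Strict-aliasing-safe extraction of the low element of an __m128d:
// memcpy is well-defined for type punning, and a fixed 8-byte copy is
// typically optimized down to a register move (or to nothing at all).
inline double low_of(__m128d v) {
    double d;
    std::memcpy(&d, &v, sizeof d);  // copies the low 8 bytes (element 0)
    return d;
}
```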
---
This might be a little faster:
sqrtpd   xmm0,xmmword ptr [v4]
movpd    xmm6,xmmword ptr [rbp+0B0h]
sqrtpd   xmm6,xmm6
addpd    xmm6,xmm0
movaps   xmm5,xmm6
unpckhpd xmm5,xmm4
addsd    xmm6,xmm5
Jim Dempsey
---
jimdempseyatthecove wrote:
This might be a little faster:
sqrtpd   xmm0,xmmword ptr [v4]
movpd    xmm6,xmmword ptr [rbp+0B0h]
sqrtpd   xmm6,xmm6
addpd    xmm6,xmm0
movaps   xmm5,xmm6
unpckhpd xmm5,xmm4
addsd    xmm6,xmm5
Jim Dempsey
Why would adding a movpd make it faster?! And if so, why not do it for the "xmmword ptr [v4]" operand too? Thanks for your reply.
---
It seems that using a reinterpret_cast is OK... but is it portable? I'll check when I find some time and compilers to try it with.
return *(reinterpret_cast<double*>(&y));
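[Editor's note] A portable alternative, assuming SSE2 is available, is the dedicated intrinsic `_mm_cvtsd_f64`, which returns the low double of an `__m128d` without any cast or memory round-trip:

```cpp
#include <immintrin.h>

// _mm_cvtsd_f64 hands back the low lane of an __m128d directly; since
// the value is already in an xmm register, it typically compiles away.
inline double low_double(__m128d y) {
    return _mm_cvtsd_f64(y);
}
```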
---
>>>Why adding a movpd would make it faster ?!? >>>
I suppose that performing the sqrt operation on a value which is already in a register could be faster in terms of CPU cycles.
---
<<< I suppose that performing sqrt operation on value which is already in register could be faster in terms of cpu cycles. >>>
A sqrtpd with an xmmword memory operand takes the same number of cycles as, or even one less than, a movpd plus sqrtpd(xmm#1, xmm#2) pair. Is this turning into a game for the best-answer flag?
The main question remains: how can I return an optimizable floating-point variable (kept in an xmm register when the call is inlined) from an __m128 local without resorting to the "punning" or "store" solutions? The _mm_loads do not appear in the disassembly (so they are optimized away), but the _mm_store stays for no reason... can an actual ICC dev answer? Thanks.
---
>>>The sqrtpd with an xmmword memory operand takes the same number of cycles as, or even one less than, a movpd plus sqrtpd(xmm#1, xmm#2) pair.>>>
The sqrtpd with a register-memory operand will probably be decoded into two uops: one to load the variable and one to perform the sqrt. In terms of uops, a reg-reg sqrtpd plus a reg-mem movpd will probably also decode into two uops.