float min_reduce(__m128 x)

Matthias_Kretz · ‎06-05-2009

I'm looking for an efficient way to implement a min_reduce on an __m128 vector. As far as I've seen, there's no instruction available to do this, so I tried the following:

[cpp]float min_reduce(__m128 a) {
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));   // a = min(a0, a2), min(a1, a3), a2, a3
    a = _mm_min_ss(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 1, 1))); // lane 0 = min(a0, a1, a2, a3); lanes 1-3 unchanged
    float r;
    _mm_store_ss(&r, a);
    return r;
}[/cpp]

It seems to work (at least for the cases I tested). But it looks more complicated to me than necessary. Is there something more efficient, or is this the best I can get already?
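One way to go beyond hand-picked cases is to compare the SSE version against a scalar reference. The sketch below assumes the inputs contain no NaNs (`_mm_min_ps` and `std::min` treat NaNs differently); `min_reduce_scalar` and `agrees_with_scalar` are hypothetical helper names introduced for illustration only.

```cpp
#include <algorithm>   // std::min
#include <xmmintrin.h> // SSE intrinsics

// The reduction from the post; lane 0 is read with _mm_cvtss_f32
// instead of a store to a temporary.
float min_reduce(__m128 a) {
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));
    a = _mm_min_ss(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(a);
}

// Hypothetical scalar reference: plain minimum of four floats.
float min_reduce_scalar(const float v[4]) {
    return std::min(std::min(v[0], v[1]), std::min(v[2], v[3]));
}

// Hypothetical helper: checks every rotation of the input, so the
// minimum is placed in each of the four lanes once.
bool agrees_with_scalar(const float v[4]) {
    for (int r = 0; r < 4; ++r) {
        float p[4];
        for (int i = 0; i < 4; ++i) {
            p[i] = v[(i + r) % 4];
        }
        if (min_reduce(_mm_loadu_ps(p)) != min_reduce_scalar(p)) {
            return false;
        }
    }
    return true;
}
```

Calling `agrees_with_scalar` on a handful of arrays (including negative values and duplicates) exercises each lane position without writing the cases out by hand.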

Matthias_Kretz · ‎06-05-2009

While on that topic: what's the equivalent for a vector of shorts? Here's my idea:

[cpp]short min(__m128i a) {
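    a = _mm_min_epi16(a, _mm_shuffle_epi32  (a, _MM_SHUFFLE(1, 0, 3, 2))); // swap 64-bit halves: word i = min(a[i], a[i+4])
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 0, 3, 2))); // swap word pairs in the low half: words 0 and 1 hold the two remaining candidates
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 1, 1, 1))); // word 0 = overall minimum
    return static_cast<short>(_mm_cvtsi128_si32(a)); // keeps only the low 16 bits (the & 0xffff is implied by the conversion)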
}[/cpp]

That's quite a long dependency chain:
PSHUFD -> PMINSW -> PSHUFLW -> PMINSW -> PSHUFLW -> PMINSW -> MOVD
(only if the compiler creates the necessary MOVs such that they can run in parallel with the PSHUF*).
Any better ideas?

neni · ‎06-05-2009

For FP, what you have is probably the best. For a pre-Penryn target you might want to look at srlq,32 for the 2nd shuffle (pshuflw).
For shorts, if you know your values are always positive and you have an SSE4 target, you can use phminpos.
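A minimal sketch of the phminpos suggestion, assuming an SSE4.1-capable CPU. The instruction (PHMINPOSUW, intrinsic `_mm_minpos_epu16`) compares the words as unsigned values, which is why the inputs need to be non-negative. The second function is not from this thread: it is a commonly used bias trick (XOR with 0x8000 before and after) that extends the idea to signed shorts. The function names and the `SSE41_TARGET` macro are hypothetical; the macro only lets GCC/Clang compile the block without `-msse4.1`.

```cpp
#include <emmintrin.h> // SSE2
#include <smmintrin.h> // SSE4.1: _mm_minpos_epu16

// Hypothetical convenience macro: enables SSE4.1 per function on GCC/Clang.
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

// Minimum of eight shorts, valid only when all inputs are >= 0.
SSE41_TARGET short min_reduce_nonneg_epi16(__m128i a) {
    const __m128i m = _mm_minpos_epu16(a); // word 0 = minimum, word 1 = its index
    return static_cast<short>(_mm_extract_epi16(m, 0));
}

// Minimum of eight signed shorts via the bias trick (an addition, not
// part of the original suggestion): XOR with 0x8000 maps signed order
// onto unsigned order, and a second XOR maps the result back.
SSE41_TARGET short min_reduce_signed_epi16(__m128i a) {
    const __m128i bias = _mm_set1_epi16(-32768); // bit pattern 0x8000 in every word
    __m128i m = _mm_minpos_epu16(_mm_xor_si128(a, bias));
    m = _mm_xor_si128(m, bias);                  // undo the bias in word 0
    // Word 0 is read as 0..65535; the cast reinterprets it as a signed short.
    return static_cast<short>(_mm_extract_epi16(m, 0));
}
```

Either version replaces the three shuffle/min pairs with a single instruction, at the cost of requiring SSE4.1 at run time.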

Matthias_Kretz · ‎06-08-2009

So srlq is faster on older processors, and shufps is faster on newer processors, where you have the 2-cycle penalty for going from a float vector -> int vector -> float vector, right?
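For reference, here is a sketch of what the srlq (PSRLQ, `_mm_srli_epi64`) substitution could look like. Since the suggestion can be read either way, both variants are shown: the float reduction with the shufps replaced, and the short reduction with the second shuffle (pshuflw) replaced. The function names are hypothetical, and the sketch makes no claim about which variant is faster on a given processor.

```cpp
#include <emmintrin.h> // SSE2 (also provides the SSE intrinsics)

// Float variant: a 64-bit logical right shift by 32 moves lane 1 into
// lane 0, standing in for the shufps broadcast of lane 1.
float min_reduce_srlq(__m128 a) {
    a = _mm_min_ps(a, _mm_movehl_ps(a, a)); // lanes 0,1 = min(a0, a2), min(a1, a3)
    const __m128 lane1 =
        _mm_castsi128_ps(_mm_srli_epi64(_mm_castps_si128(a), 32));
    a = _mm_min_ss(a, lane1);               // lane 0 = overall minimum
    return _mm_cvtss_f32(a);
}

// Short variant: the shift by 32 moves words 2,3 into words 0,1,
// standing in for the second shuffle. Words 2 and 3 end up holding
// leftover values, which are never read.
short min_reduce_epi16_srlq(__m128i a) {
    a = _mm_min_epi16(a, _mm_shuffle_epi32(a, _MM_SHUFFLE(1, 0, 3, 2)));
    a = _mm_min_epi16(a, _mm_srli_epi64(a, 32));
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 1, 1, 1)));
    return static_cast<short>(_mm_extract_epi16(a, 0));
}
```

The casts in the float variant generate no instructions; they only tell the compiler to treat the same register as integer data, which is where the domain-crossing penalty mentioned above would come from.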

phminpos I somehow overlooked. Ah, because it's not documented at http://www.intel.com/software/products/compilers/docs/clin/main_cls/mergedprojects/intref_cls/whnjs.htm. Thanks for the pointer.