Intel® ISA Extensions

float min_reduce(__m128 x)

Matthias_Kretz
New Contributor I
I'm looking for an efficient way to implement a min_reduce on an __m128 vector. As far as I've seen, there's no instruction available to do this, so I tried the following:
[cpp]#include <xmmintrin.h>  // SSE intrinsics

float min_reduce(__m128 a) {
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));   // a = min(a0, a2), min(a1, a3), min(a2, a2), min(a3, a3)
    a = _mm_min_ss(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 1, 1))); // lane 0 = min(lane 0, lane 1) of the previous a; lanes 1..3 unchanged
    float r;
    _mm_store_ss(&r, a);   // write lane 0 to r
    return r;
}[/cpp]
It seems to work (at least for the cases I tested), but it looks more complicated than necessary to me. Is there something more efficient, or is this already the best I can get?
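A self-contained sketch of the same reduction, assuming an x86 compiler with SSE enabled (the default on x86-64). The names `min_reduce_ps` and `check_min_reduce_ps` are hypothetical, chosen for illustration; the only change from the code above is that `_mm_cvtss_f32` reads lane 0 directly instead of going through `_mm_store_ss`:

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: _mm_min_ps, _mm_min_ss, _mm_movehl_ps, _mm_shuffle_ps, _mm_cvtss_f32

// Same two steps as the post; the result is read from lane 0 without a store.
inline float min_reduce_ps(__m128 a) {
    // lanes: min(a0,a2), min(a1,a3), a2, a3
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));
    // lane 0: min of the two partial minima; lanes 1..3 no longer matter
    a = _mm_min_ss(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(a);
}

// Puts the smallest value into each of the four lanes in turn and checks
// that the reduction finds it (inputs are assumed to be NaN-free).
inline void check_min_reduce_ps() {
    for (int pos = 0; pos < 4; ++pos) {
        float v[4] = {5.0f, 6.0f, 7.0f, 8.0f};
        v[pos] = -1.0f;
        assert(min_reduce_ps(_mm_loadu_ps(v)) == -1.0f);
    }
}
```

One caveat worth keeping in mind: MINPS/MINSS return the second operand when either input is NaN, so with NaN inputs the result depends on which lane the NaN sits in.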
Matthias_Kretz
New Contributor I
While we're on the topic: what's the equivalent for a vector of shorts? Here's my idea:
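[cpp]#include <emmintrin.h>  // SSE2 intrinsics

short min(__m128i a) {
    a = _mm_min_epi16(a, _mm_shuffle_epi32  (a, _MM_SHUFFLE(1, 0, 3, 2)));  // words 0..3 = min(w[i], w[i+4])
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 0, 3, 2)));  // words 0..1 = min of four words each
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 1, 1, 1)));  // word 0 = min of all eight
    return static_cast<short>(_mm_cvtsi128_si32(a)); // converting to short keeps only word 0 (the implicit & 0xffff)
}[/cpp]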
That's quite a long dependency chain:
PSHUFD -> PMINSW -> PSHUFLW -> PMINSW -> PSHUFLW -> PMINSW -> MOVD
(and that's the best case, where the compiler emits any necessary MOVs so that they can run in parallel with the PSHUF* instructions).
Any better ideas?
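The same SSE2 reduction as a standalone sketch, with a check that places the minimum in each of the eight lanes. `min_reduce_epi16` and `check_min_reduce_epi16` are hypothetical names, and the comments note which instruction of the PSHUFD/PSHUFLW chain each line corresponds to:

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: _mm_min_epi16, _mm_shuffle_epi32, _mm_shufflelo_epi16

// Horizontal minimum of eight signed 16-bit lanes (same steps as the post).
inline short min_reduce_epi16(__m128i a) {
    // PSHUFD: swap the 64-bit halves -> words 0..3 = min(w[i], w[i+4])
    a = _mm_min_epi16(a, _mm_shuffle_epi32(a, _MM_SHUFFLE(1, 0, 3, 2)));
    // PSHUFLW: swap the 32-bit pairs in the low half -> words 0..1 cover four words each
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 0, 3, 2)));
    // PSHUFLW: broadcast word 1 -> word 0 = min of all eight
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 1, 1, 1)));
    // MOVD: low 32 bits; the cast keeps only word 0
    return static_cast<short>(_mm_cvtsi128_si32(a));
}

// Puts the most negative short into each lane in turn and checks the result.
inline void check_min_reduce_epi16() {
    for (int pos = 0; pos < 8; ++pos) {
        short v[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        v[pos] = -32768;
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v));
        assert(min_reduce_epi16(x) == -32768);
    }
}
```

Since PSHUFD and PSHUFLW take separate source and destination registers, this sequence typically compiles without extra register copies; the latency is dominated by the serial shuffle/min pairs.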
neni
New Contributor II
For FP, what you have is probably the best. For a pre-Penryn target, you might want to look at psrlq by 32 (srlq, 32) for the second shuffle (pshuflw).
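A sketch of that suggestion for the float case, assuming SSE2: the second shuffle becomes a 64-bit logical right shift by 32 (`_mm_srli_epi64`, i.e. PSRLQ), which moves lane 1 into lane 0. The function names are hypothetical, and whether this beats SHUFPS depends on the target: the integer shift sits between two floating-point operations, so CPUs with a bypass delay between the integer and floating-point domains may charge for the round trip.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: _mm_srli_epi64 and the cast intrinsics (includes SSE)

// Float min-reduction with PSRLQ by 32 in place of the second shuffle.
// The casts only reinterpret the register; they generate no instructions.
inline float min_reduce_ps_srlq(__m128 a) {
    // lanes: min(a0,a2), min(a1,a3), a2, a3
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));
    // shift each 64-bit half right by 32 bits: lane 0 <- lane 1, lane 1 <- 0
    __m128 hi = _mm_castsi128_ps(_mm_srli_epi64(_mm_castps_si128(a), 32));
    // only lane 0 is used, so the zeroed lanes do not matter
    a = _mm_min_ss(a, hi);
    return _mm_cvtss_f32(a);
}

// Puts the smallest value into each lane in turn (NaN-free inputs assumed).
inline void check_min_reduce_ps_srlq() {
    for (int pos = 0; pos < 4; ++pos) {
        float v[4] = {5.0f, 6.0f, 7.0f, 8.0f};
        v[pos] = -1.0f;
        assert(min_reduce_ps_srlq(_mm_loadu_ps(v)) == -1.0f);
    }
}
```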
For shorts, if you know your values are always positive and you have an SSE4.1 target, you can use phminposuw.
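A sketch of the PHMINPOSUW route via `_mm_minpos_epu16`, assuming GCC or Clang (for the `target` attribute, which lets this compile without `-msse4.1`) and a CPU with SSE4.1 at run time. `min_nonneg_epi16` relies on the values being non-negative, as described above. `min_epi16_sse41` is an extra variation not taken from the thread: XOR-ing each word with 0x8000 maps signed order onto unsigned order, so arbitrary signed shorts can be reduced as well. All names are hypothetical:

```cpp
#include <cassert>
#include <smmintrin.h>  // SSE4.1: _mm_minpos_epu16 (PHMINPOSUW); includes SSE2

// PHMINPOSUW puts the minimum of eight *unsigned* 16-bit lanes in word 0 and
// its index in bits 16..18. For non-negative shorts, signed and unsigned order
// agree, so the result can be used directly.
__attribute__((target("sse4.1")))
inline short min_nonneg_epi16(__m128i a) {
    return static_cast<short>(_mm_extract_epi16(_mm_minpos_epu16(a), 0));
}

// Variation for arbitrary signed shorts: flipping the sign bit turns a signed
// value s into the unsigned value s + 32768, which preserves ordering.
__attribute__((target("sse4.1")))
inline short min_epi16_sse41(__m128i a) {
    const __m128i bias = _mm_set1_epi16(-32768);  // 0x8000 in every lane
    __m128i m = _mm_minpos_epu16(_mm_xor_si128(a, bias));
    // undo the bias: biased value u corresponds to s = u - 32768
    return static_cast<short>(_mm_extract_epi16(m, 0) - 32768);
}

// Puts the most negative short into each lane in turn.
inline void check_min_epi16_sse41() {
    for (int pos = 0; pos < 8; ++pos) {
        short v[8] = {32767, 32767, 32767, 32767, 32767, 32767, 32767, 32767};
        v[pos] = -32768;
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v));
        assert(min_epi16_sse41(x) == -32768);
    }
}
```

The biased variant costs one XOR and one subtraction but replaces the three shuffle/min pairs with a single PHMINPOSUW, which also reports the index of the minimum if that is ever needed.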

Matthias_Kretz
New Contributor I
So srlq is faster on older processors, and shufps is faster on newer processors, where you pay a 2-cycle penalty for going from the float domain to the integer domain and back, right?

I somehow overlooked phminpos. Ah, that's because it's not documented at http://www.intel.com/software/products/compilers/docs/clin/main_cls/mergedprojects/intref_cls/whnjs.htm. Thanks for the pointer.