I'm looking for an efficient way to implement a min_reduce on an __m128 vector. As far as I've seen there's no instruction available to do this so I tried the following:
[cpp]
float min_reduce(__m128 a)
{
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));
    // a = min(a0, a2), min(a1, a3), min(a2, a2), min(a3, a3)
    a = _mm_min_ss(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 1, 1)));
    // a = min(a0, a1), a1, a2, a3
    float r;
    _mm_store_ss(&r, a);
    return r;
}
[/cpp]
It seems to work (at least for the cases I tested). But it looks more complicated to me than necessary. Is there something more efficient, or is this the best I can get already?
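For anyone who wants more than spot checks, here is a minimal sketch that compares the function against a scalar reference for every position the minimum can occupy. The helper names `min_reduce_scalar` and `min_reduce_matches_reference` are made up for this illustration, and `min_reduce` is repeated only so the block compiles on its own.

```cpp
#include <xmmintrin.h>
#include <algorithm>

// The function from the post, repeated so this block is self-contained.
float min_reduce(__m128 a)
{
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));
    a = _mm_min_ss(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 1, 1)));
    float r;
    _mm_store_ss(&r, a);
    return r;
}

// Hypothetical scalar reference: plain minimum of four floats.
float min_reduce_scalar(const float v[4])
{
    return std::min(std::min(v[0], v[1]), std::min(v[2], v[3]));
}

// Hypothetical check: runs all 24 orderings of four distinct values, so the
// minimum visits every lane, and compares SIMD against scalar each time.
bool min_reduce_matches_reference()
{
    float v[4] = {-3.5f, 0.0f, 2.25f, 7.0f}; // sorted, so next_permutation covers all orderings
    do {
        if (min_reduce(_mm_loadu_ps(v)) != min_reduce_scalar(v))
            return false;
    } while (std::next_permutation(v, v + 4));
    return true;
}
```

This only covers ordinary finite values; NaN inputs are not exercised here and behave differently with `_mm_min_ps` than with `std::min`.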
3 Replies
While on that topic: what's the equivalent for a vector of shorts? Here's my idea:
[cpp]
short min(__m128i a)
{
    a = _mm_min_epi16(a, _mm_shuffle_epi32(a, _MM_SHUFFLE(1, 0, 3, 2)));
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 0, 3, 2)));
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtsi128_si32(a); // & 0xffff is implicit
}
[/cpp]
That's quite a long dependency chain:
PSHUFD -> PMINSW -> PSHUFLW -> PMINSW -> PSHUFLW -> PMINSW -> MOVD
(only if the compiler creates the necessary MOVs such that they can run in parallel with the PSHUF*).
Any better ideas?
For FP, what you have is probably the best. On a pre-Penryn target you might want to look at srlq, 32 for the 2nd shuffle (pshuflw).
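As a rough sketch of how I read that suggestion: the second shuffle is replaced by a 64-bit logical right shift of 32 bits (`_mm_srli_epi64`, i.e. PSRLQ), which moves element 1 down into element 0. The name `min_reduce_srlq` is made up for this illustration, and whether this is actually faster depends on the target, so it needs measuring.

```cpp
#include <emmintrin.h>

// Hypothetical variant of min_reduce: same first step, but the second
// shuffle is done in the integer domain with a 64-bit shift (SSE2).
float min_reduce_srlq(__m128 a)
{
    // lanes 0 and 1 now hold min(a0, a2) and min(a1, a3)
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));
    // shift each 64-bit half right by 32 bits: lane 1 moves into lane 0
    __m128 hi = _mm_castsi128_ps(_mm_srli_epi64(_mm_castps_si128(a), 32));
    // lane 0 = min(min(a0, a2), min(a1, a3))
    a = _mm_min_ss(a, hi);
    return _mm_cvtss_f32(a);
}
```

The casts are free in the sense that they emit no instructions, but the value still crosses from the float domain to the integer domain and back.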
For shorts, if you know your values are always positive and you have an SSE4 target, you can use phminpos (PHMINPOSUW).
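A minimal sketch of the phminpos route, using the `_mm_minpos_epu16` intrinsic (SSE4.1). The function names and the `SSE41_TARGET` macro are made up for this illustration. The second function is not something suggested in this thread: it is a common trick, assumed here, of flipping the sign bit so that signed order maps onto unsigned order, which removes the "always positive" restriction.

```cpp
#include <emmintrin.h>
#include <smmintrin.h>
#include <cstdint>

// Hypothetical helper macro: lets GCC/Clang compile these functions for
// SSE4.1 even when the rest of the file is built without -msse4.1.
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

// Minimum of eight unsigned 16-bit values. PHMINPOSUW puts the minimum in
// bits 15:0 of the result and its index in bits 18:16.
SSE41_TARGET std::uint16_t min_reduce_u16(__m128i a)
{
    __m128i r = _mm_minpos_epu16(a);
    return static_cast<std::uint16_t>(_mm_cvtsi128_si32(r)); // keep low 16 bits only
}

// Assumed extension for signed values: XOR with 0x8000 maps signed order
// onto unsigned order, then the bias is undone on the scalar result.
SSE41_TARGET std::int16_t min_reduce_i16(__m128i a)
{
    const __m128i bias = _mm_set1_epi16(-32768); // bit pattern 0x8000 in every lane
    __m128i r = _mm_minpos_epu16(_mm_xor_si128(a, bias));
    std::uint16_t m = static_cast<std::uint16_t>(_mm_cvtsi128_si32(r));
    return static_cast<std::int16_t>(m ^ 0x8000u);
}
```

Compared with the three shuffle/min pairs, this is a single instruction plus a move for the unsigned case, at the cost of requiring SSE4.1 at run time.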
So srlq is faster on older processors and shufps is faster on newer processors where you have the 2 cycle penalty from going from a float vector -> int vector -> float vector, right?
phminpos I somehow overlooked. Ah, because it's not documented at http://www.intel.com/software/products/compilers/docs/clin/main_cls/mergedprojects/intref_cls/whnjs.htm. Thanks for the pointer.