- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear SSE experts:
I have two __m128 variables, T and S, both contain 4 floats. Now I need to find the minimum value in T and the index of the minimum, is it straightforward to do using SSE?
What if I want to only use the floats in T that have positive values in S and do the above horizontal min operation?
thanks in advance
Qianqian
I have two __m128 variables, T and S, both contain 4 floats. Now I need to find the minimum value in T and the index of the minimum, is it straightforward to do using SSE?
What if I want to only use the floats in T that have positive values in S and do the above horizontal min operation?
thanks in advance
Qianqian
Link Copied
14 Replies
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I checked with our compiler engineers. There is intrinsics min/minpos for 16 bits int of _m128, but not for floats.
It might be better to transfer your question to the CPU forum to see they could help you better.
thanks,
Jennifer
It might be better to transfer your question to the CPU forum to see they could help you better.
thanks,
Jennifer
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Let me try to help a bit.
; xmm0 = T
; xmm1 = S
; xmm7 = 0.0f
xorps xmm7, xmm7 ; xmm7 = 0.0f
cmpltps xmm7, xmm1 ; 0.0f < S
; if (S == -1) then xmm7 = 0x00000000
; if (S == 1) then xmm7 = 0xFFFFFFFF
andps xmm0, xmm7
; now in T you have only elements that had positive values in S (i.e. greater than 0.0f) and the rest of the elements are zeroed out.
Unfortunately, there is no horizontal minimum for float values, but if you know the range of values you could scale them (by multiplying with a constant) so they fit in unsigned short and then use PHMINPOSUW to find the minimum and its index.
EDIT:
You could then find horizontal minimum in T by doing something like this:
pshufd xmm2, xmm0, 0x4E ; copy T into xmm2 reordering elements from 3 2 1 0 to 1 0 3 2 order
minps xmm2, xmm0 ; find minimum
pshufd xmm3, xmm2, 0xB1 ; copy result into xmm3 and reorder as 2 3 0 1
minps xmm3, xmm2 ; find minimum (now you have minimum value in all elements)
movss dword ptr [min_val], xmm3 ; this is your horizontal minimum
cmpeqps xmm3, xmm0 ; you will have 0xFFFFFFFF in place of a minimum and 0x00000000 elsewhere
movmskps edx, xmm3 ; extract sign mask (if minimum was in position 2 you will have 0100b in edx)
bsr eax, edx ; convert bit position to index
mov dword ptr [min_ndx], eax ; this is your minimum index
Please note that I haven't tested the above code for correctness -- I wrote it off the top of my head. There is also a doubt whether it would perform any faster than the scalar code. Let me know if it helps.
; xmm0 = T
; xmm1 = S
; xmm7 = 0.0f
xorps xmm7, xmm7 ; xmm7 = 0.0f
cmpltps xmm7, xmm1 ; 0.0f < S
; if (S == -1) then xmm7 = 0x00000000
; if (S == 1) then xmm7 = 0xFFFFFFFF
andps xmm0, xmm7
; now in T you have only elements that had positive values in S (i.e. greater than 0.0f) and the rest of the elements are zeroed out.
Unfortunately, there is no horizontal minimum for float values, but if you know the range of values you could scale them (by multiplying with a constant) so they fit in unsigned short and then use PHMINPOSUW to find the minimum and its index.
EDIT:
You could then find horizontal minimum in T by doing something like this:
pshufd xmm2, xmm0, 0x4E ; copy T into xmm2 reordering elements from 3 2 1 0 to 1 0 3 2 order
minps xmm2, xmm0 ; find minimum
pshufd xmm3, xmm2, 0xB1 ; copy result into xmm3 and reorder as 2 3 0 1
minps xmm3, xmm2 ; find minimum (now you have minimum value in all elements)
movss dword ptr [min_val], xmm3 ; this is your horizontal minimum
cmpeqps xmm3, xmm0 ; you will have 0xFFFFFFFF in place of a minimum and 0x00000000 elsewhere
movmskps edx, xmm3 ; extract sign mask (if minimum was in position 2 you will have 0100b in edx)
bsr eax, edx ; convert bit position to index
mov dword ptr [min_ndx], eax ; this is your minimum index
Please note that I haven't tested the above code for correctness -- I wrote it off the top of my head. There is also a doubt whether it would perform any faster than the scalar code. Let me know if it helps.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Igor,
check this out
xmm0 = T
minps xmm1,xmm0 ; min float of xmm0 into lowest floatof xmm1
(save min value if you want)
pshufd xmm1,xmm1,0 ; propigatelowest floatto all floats
cmpeqps xmm1,xmm0 ; identify location(s) of min
movmskps edx, xmm1 ; sign bits to edx
bsf eax, edx ; eax gets index of first lowest bit set in edx
Jim Dempsey
check this out
xmm0 = T
minps xmm1,xmm0 ; min float of xmm0 into lowest floatof xmm1
(save min value if you want)
pshufd xmm1,xmm1,0 ; propigatelowest floatto all floats
cmpeqps xmm1,xmm0 ; identify location(s) of min
movmskps edx, xmm1 ; sign bits to edx
bsf eax, edx ; eax gets index of first lowest bit set in edx
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim,
Not sure that would be enough, he said:
"What if I want to only use the floats in T that have positive values in S and do the above horizontal min operation?"
Also, MINPS returns minimum values for each pair of src/dst elements, not the horizontal minimum. To get horizontal minimum you need two MINPS and two PSHUFD instructions.
Not sure that would be enough, he said:
"What if I want to only use the floats in T that have positive values in S and do the above horizontal min operation?"
Also, MINPS returns minimum values for each pair of src/dst elements, not the horizontal minimum. To get horizontal minimum you need two MINPS and two PSHUFD instructions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I had some time today to think about this and perform some testing, and I have realized that the code I wrote initially has three problems:
1. It does not select proper elements when you have zero in S
Namely, if you use CMPLTPS on S, you will get only elements greater than zero for your selection mask -- this is incorrect, because zero can also be a positive value.
If you use CMPLEPS on S instead, then you will get elements greater or equal to zero for your selection mask.
Unfortunately, since the CPU treats positive and negative zero as equal in comparison you will also include negative zero in your selection mask.
It is possible to solve this by creating the selection mask by right-shifting arithmetically 31 times (which replicates the sign bit of elements in S), and then performing ANDNPS instead of ANDPS. Of course, this still does not handle SNaN, QNaN, denormals, etc but at least it performs correctly with both positive and negative zero in S.
2. It does not take into account that the remaining values in T might all be greater than (or equal to) zero
Namely, when you mask out elements by ANDNPS, you are replacing them with zero. If the remaining elements are all greater than (or equal to) zero, you will find zero as a minimum which is incorrect. To solve this problem, you need to replace the elements you masked out with FLT_MAX.
3. If there is more than one minimum in T it returns index of the last minimum found instead of first one
Namely, if you did this code in C by iterating through T in a for (...) loop, you would get index of the first minimum found. This can be solved easily by replacing BSR with BSF.
Here is the final code:
- It selects elements from T which have positive value in S (including positive zero)
- It finds horizontal minimum from selected elements in T
- It finds zero-based index of said minimum
- If there is more than one minimum in T it returns the index of the first one found
I hope that this is what the original poster wanted, and that someone can translate it into intrinsics for them.
1. It does not select proper elements when you have zero in S
Namely, if you use CMPLTPS on S, you will get only elements greater than zero for your selection mask -- this is incorrect, because zero can also be a positive value.
If you use CMPLEPS on S instead, then you will get elements greater or equal to zero for your selection mask.
Unfortunately, since the CPU treats positive and negative zero as equal in comparison you will also include negative zero in your selection mask.
It is possible to solve this by creating the selection mask by right-shifting arithmetically 31 times (which replicates the sign bit of elements in S), and then performing ANDNPS instead of ANDPS. Of course, this still does not handle SNaN, QNaN, denormals, etc but at least it performs correctly with both positive and negative zero in S.
2. It does not take into account that the remaining values in T might all be greater than (or equal to) zero
Namely, when you mask out elements by ANDNPS, you are replacing them with zero. If the remaining elements are all greater than (or equal to) zero, you will find zero as a minimum which is incorrect. To solve this problem, you need to replace the elements you masked out with FLT_MAX.
3. If there is more than one minimum in T it returns index of the last minimum found instead of first one
Namely, if you did this code in C by iterating through T in a for (...) loop, you would get index of the first minimum found. This can be solved easily by replacing BSR with BSF.
Here is the final code:
[cpp]#includeSo, what the above assembler code does?
#include
float S[4] = { -1.0f, 1.0f, 0.0f, -0.0f };
float T[4] = { -3.0f, -5.0f, 1.5f, 7.0f };
float M[4] = { FLT_MAX, FLT_MAX, FLT_MAX, FLT_MAX };
__declspec(naked) int index_and_minimum_from_T_for_positive_S(float &min_val, float *pS, float *pT, float *pM)
{
__asm {
mov eax, dword ptr [esp + 8]
movups xmm0, xmmword ptr [eax] ; load S values into XMM0
mov eax, dword ptr [esp + 12]
movups xmm1, xmmword ptr [eax] ; load T values into XMM1
mov eax, dword ptr [esp + 16]
movups xmm3, xmmword ptr [eax] ; load M values into XMM3
psrad xmm0, 31 ; create selection mask by replicating sign bits of S values
andps xmm3, xmm0 ; leave FLT_MAX values in M only in positions where S had negative values
andnps xmm0, xmm1 ; leave values in T only in positions where S had positive values
orps xmm0, xmm3 ; replace masked-out values in T with FLT_MAX in case remaining T values are all >= 0
pshufd xmm1, xmm0, 0x4E ; copy xmm0 to xmm1 and reorder 3210 as 1032
minps xmm1, xmm0 ; find minimum
pshufd xmm2, xmm1, 0xB1 ; copy xmm1 to xmm2 and reorder 3210 as 2301
minps xmm2, xmm1 ; find minimum (now minimum is in all elements of xmm2)
mov eax, dword ptr [esp + 4]
movss dword ptr [eax], xmm2 ; save minimum value
cmpeqps xmm2, xmm0 ; compare T with minimum -- 0xFFFFFFFF in position of minimum, zero elsewhere
movmskps edx, xmm2 ; extract sign bits to edx
bsf eax, edx ; convert sign bit position into element index
ret ; return with minimum index in eax
}
}
int main(int argc, char* argv[])
{
float val;
int ndx;
ndx = index_and_minimum_from_T_for_positive_S(val, S, T, M);
printf("value = %.2f, index = %ldn", val, ndx);
return 0;
}
[/cpp]
- It selects elements from T which have positive value in S (including positive zero)
- It finds horizontal minimum from selected elements in T
- It finds zero-based index of said minimum
- If there is more than one minimum in T it returns the index of the first one found
I hope that this is what the original poster wanted, and that someone can translate it into intrinsics for them.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Excellent work Igor.
Some improvement in memory latencies can be made by rearranging some statements.
Jim Dempsey
Some improvement in memory latencies can be made by rearranging some statements.
[cpp]__declspec(naked) int index_and_minimum_from_T_for_positive_S(float &min_val, float *pS, float *pT, float *pM) { __asm { ; eax, edx and ecx in 32-bit are available w/o requirement for save/restore mov eax, dword ptr [esp + 8] ;; throughput 1 clock, latency (L1) 7 clocks mov edx, dword ptr [esp + 12] mov ecx, dword ptr [esp + 16] ;; stalls here for 4 clocks (remaining latecy on eax) ;; after stall eax ready, edx comes in at +1 clock, ecx comes in +2 clocks movups xmm0, xmmword ptr [eax] ; load S values into XMM0 ;; latency for above move from memory is 12 clocks (architecture dependent) ;; no stall on next two loads (begin load during latency of above load movups xmm1, xmmword ptr [edx] ; load T values into XMM1 movups xmm3, xmmword ptr [ecx] ; load M values into XMM3 ;; stall for remaining 9 clocks latency on load of xmm0 psrad xmm0, 31 ; create selection mask by replicating sign bits of S values andps xmm3, xmm0 ; leave FLT_MAX values in M only in positions where S had negative values andnps xmm0, xmm1 ; leave values in T only in positions where S had positive values orps xmm0, xmm3 ; replace masked-out values in T with FLT_MAX in case remaining T values are all >= 0 pshufd xmm1, xmm0, 0x4E ; copy xmm0 to xmm1 and reorder 3210 as 1032 minps xmm1, xmm0 ; find minimum pshufd xmm2, xmm1, 0xB1 ; copy xmm1 to xmm2 and reorder 3210 as 2301 minps xmm2, xmm1 ; find minimum (now minimum is in all elements of xmm2) mov eax, dword ptr [esp + 4] movss dword ptr [eax], xmm2 ; save minimum value cmpeqps xmm2, xmm0 ; compare T with minimum -- 0xFFFFFFFF in position of minimum, zero elsewhere movmskps edx, xmm2 ; extract sign bits to edx bsf eax, edx ; convert sign bit position into element index ret ; return with minimum index in eax } } [/cpp]
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
After posting - edit/add
And after load of xmm3 you would prefetch the pointer to the result
mov eax, dword ptr [esp + 4]
so it will be ready for use of write of minimum value.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Very nice work Jim and Igor. Thank you so much for sharing your expertise here.
I checked with the compiler engineers again and a Feature Request to adda new or a few new intrinsics to do such work is appropriate.
Thank you again for your time and effort to help others!
Jennifer
I checked with the compiler engineers again and a Feature Request to adda new or a few new intrinsics to do such work is appropriate.
Thank you again for your time and effort to help others!
Jennifer
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim,
Please clarify, do your comments about stalls still apply after reordering or are they just marking places where the stalls were present before you reordered them?
Have you considered rewriting my code as intrinsics and letting the compiler reorder the instructions? I am currently a bit busy to try that myself.
Jennifer,
You are welcome.
Regarding new intrinsics -- I am not sure how common of a task is selecting elements from one register where another one has positive values in the same position.
However, horizontal minimum for packed floats is really missing, and its missing from the x86 instruction set (all that talk about orthogonality... tsk, tsk, tsk... one would think the lesson has been learned already).
Having intrinsics for hypothetical HMINPS and HMINPOSPS instructions would help people who are not very experienced with SIMD programming.
Please clarify, do your comments about stalls still apply after reordering or are they just marking places where the stalls were present before you reordered them?
Have you considered rewriting my code as intrinsics and letting the compiler reorder the instructions? I am currently a bit busy to try that myself.
Jennifer,
You are welcome.
Regarding new intrinsics -- I am not sure how common of a task is selecting elements from one register where another one has positive values in the same position.
However, horizontal minimum for packed floats is really missing, and its missing from the x86 instruction set (all that talk about orthogonality... tsk, tsk, tsk... one would think the lesson has been learned already).
Having intrinsics for hypothetical HMINPS and HMINPOSPS instructions would help people who are not very experienced with SIMD programming.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Further improvement to this code can in theory be accomplished by reducing the number of domain transitions (every int to float transition and vice versa costs one clock). Namely, by replacing first three MOVUPS by MOVDQU, ANDPS with PAND, ANDNPS with PANDN, and ORPS with POR.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>Please clarify, do your comments about stalls still apply after reordering or are they just marking places where the stalls were present before you reordered them?
The following table lists latencies for an arbitrary CPU
This table will/may change depending on CPU

Original code red highlight shows latencies
Each line approximately one clock tick

41 clocks to ... (*** right most column should be [pM] *** )
however, the psrad (next instruction not shown), has a dependency on xmm0
and could be hoisted 6 clocks (immediately following last movups above)
Therefor, estimate for above is 35 clocks.
The CPU is pipelined and can have multiple instructions in flight (and out of order)
The following chart is an approximation of what can happen.

15 clocks to ...
however, the psrad (next instruction not shown), has a dependency on xmm0
and could be hoisted2 clocks (4 clocks immediately following last movups above)
Therefor, estimate for above is13 clocks
I should mentioned that the mov ecx in the improved method could potentially
get stalled depending on if the port assignment and acquisitionoccures prior to
or following the fetch from L1. The block charts in the optimization guide do not
specify these details.
Jim Dempsey
The following table lists latencies for an arbitrary CPU
This table will/may change depending on CPU

Original code red highlight shows latencies
Each line approximately one clock tick

41 clocks to ... (*** right most column should be [pM] *** )
however, the psrad (next instruction not shown), has a dependency on xmm0
and could be hoisted 6 clocks (immediately following last movups above)
Therefor, estimate for above is 35 clocks.
The CPU is pipelined and can have multiple instructions in flight (and out of order)
The following chart is an approximation of what can happen.

15 clocks to ...
however, the psrad (next instruction not shown), has a dependency on xmm0
and could be hoisted2 clocks (4 clocks immediately following last movups above)
Therefor, estimate for above is13 clocks
I should mentioned that the mov ecx in the improved method could potentially
get stalled depending on if the port assignment and acquisitionoccures prior to
or following the fetch from L1. The block charts in the optimization guide do not
specify these details.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
*** I have to revise those charts. ***
The latency information was derived from:
The latency information was derived from:
IA-32 Intel Architecture Optimization Reference Manual
Order Number: 248966-012 June 2005
So it is clearly out of date.
In looking at theJan2011 doc 248966-023
We have different latencies posted.
In particular movaps latencies
CPUID(s) latency/throughput
0F_2H,0F_3H =6/1
06_(..) = 1/0.33
L1 3/1
L2 ~14/2
L3 ~110/12
Clearly the 06 series have corrected a bottleneck
Igor, your code as originally written would experience only a minor improvement by reordering the instructions.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
And here is the revised chart
old code ~16 clocks thru psrad
new code ~9 clocks thru psrad
Jim Dempsey

old code ~16 clocks thru psrad
new code ~9 clocks thru psrad
Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim,
Did you take into account float to int transitions and vice versa? They introduce 1 cycle latency. I suggested a code change to reduce the number of domain transitions. Could you check again?
Did you take into account float to int transitions and vice versa? They introduce 1 cycle latency. I suggested a code change to reduce the number of domain transitions. Could you check again?

Reply
Topic Options
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page