- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I have two __m128 variables, T and S, both contain 4 floats. Now I need to find the minimum value in T and the index of the minimum, is it straightforward to do using SSE?

What if I want to only use the floats in T that have positive values in S and do the above horizontal min operation?

thanks in advance

Qianqian

Link Copied

14 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

It might be better to transfer your question to the CPU forum to see they could help you better.

thanks,

Jennifer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

; xmm0 = T

; xmm1 = S

; xmm7 = 0.0f

xorps xmm7, xmm7 ; xmm7 = 0.0f

cmpltps xmm7, xmm1 ; 0.0f < S

; if (S == -1) then xmm7 = 0x00000000

; if (S == 1) then xmm7 = 0xFFFFFFFF

andps xmm0, xmm7

; now in T you have only elements that had positive values in S (i.e. greater than 0.0f) and the rest of the elements are zeroed out.

Unfortunately, there is no horizontal minimum for float values, but if you know the range of values you could scale them (by multiplying with a constant) so they fit in unsigned short and then use PHMINPOSUW to find the minimum and its index.

EDIT:

You could then find horizontal minimum in T by doing something like this:

pshufd xmm2, xmm0, 0x4E ; copy T into xmm2 reordering elements from 3 2 1 0 to 1 0 3 2 order

minps xmm2, xmm0 ; find minimum

pshufd xmm3, xmm2, 0xB1 ; copy result into xmm3 and reorder as 2 3 0 1

minps xmm3, xmm2 ; find minimum (now you have minimum value in all elements)

movss dword ptr [min_val], xmm3 ; this is your horizontal minimum

cmpeqps xmm3, xmm0 ; you will have 0xFFFFFFFF in place of a minimum and 0x00000000 elsewhere

movmskps edx, xmm3 ; extract sign mask (if minimum was in position 2 you will have 0100b in edx)

bsr eax, edx ; convert bit position to index

mov dword ptr [min_ndx], eax ; this is your minimum index

Please note that I haven't tested the above code for correctness -- I wrote it off the top of my head. There is also a doubt whether it would perform any faster than the scalar code. Let me know if it helps.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

check this out

xmm0 = T

minps xmm1,xmm0 ; min float of xmm0 into lowest floatof xmm1

(save min value if you want)

pshufd xmm1,xmm1,0 ; propigatelowest floatto all floats

cmpeqps xmm1,xmm0 ; identify location(s) of min

movmskps edx, xmm1 ; sign bits to edx

bsf eax, edx ; eax gets index of first lowest bit set in edx

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Not sure that would be enough, he said:

"

*What if I want to only use the floats in T that have positive values in S and do the above horizontal min operation?*"

Also, MINPS returns minimum values for each pair of src/dst elements, not the horizontal minimum. To get horizontal minimum you need two MINPS and two PSHUFD instructions.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

**1. It does not select proper elements when you have zero in S**

Namely, if you use CMPLTPS on S, you will get only elements greater than zero for your selection mask -- this is incorrect, because zero can also be a positive value.

If you use CMPLEPS on S instead, then you will get elements greater or equal to zero for your selection mask.

Unfortunately, since the CPU treats positive and negative zero as equal in comparison you will also include negative zero in your selection mask.

It is possible to solve this by creating the selection mask by right-shifting arithmetically 31 times (which replicates the sign bit of elements in S), and then performing ANDNPS instead of ANDPS. Of course, this still does not handle SNaN, QNaN, denormals, etc but at least it performs correctly with both positive and negative zero in S.

**2. It does not take into account that the remaining values in T might all be greater than (or equal to) zero**

Namely, when you mask out elements by ANDNPS, you are replacing them with zero. If the remaining elements are all greater than (or equal to) zero, you will find zero as a minimum which is incorrect. To solve this problem, you need to replace the elements you masked out with FLT_MAX.

**3. If there is more than one minimum in T it returns index of the last minimum found instead of first one**

Namely, if you did this code in C by iterating through T in a for (...) loop, you would get index of the first minimum found. This can be solved easily by replacing BSR with BSF.

Here is the final code:

[cpp]#includeSo, what the above assembler code does?

#include

float S[4] = { -1.0f, 1.0f, 0.0f, -0.0f };

float T[4] = { -3.0f, -5.0f, 1.5f, 7.0f };

float M[4] = { FLT_MAX, FLT_MAX, FLT_MAX, FLT_MAX };

__declspec(naked) int index_and_minimum_from_T_for_positive_S(float &min_val, float *pS, float *pT, float *pM)

{

__asm {

mov eax, dword ptr [esp + 8]

movups xmm0, xmmword ptr [eax] ; load S values into XMM0

mov eax, dword ptr [esp + 12]

movups xmm1, xmmword ptr [eax] ; load T values into XMM1

mov eax, dword ptr [esp + 16]

movups xmm3, xmmword ptr [eax] ; load M values into XMM3

psrad xmm0, 31 ; create selection mask by replicating sign bits of S values

andps xmm3, xmm0 ; leave FLT_MAX values in M only in positions where S had negative values

andnps xmm0, xmm1 ; leave values in T only in positions where S had positive values

orps xmm0, xmm3 ; replace masked-out values in T with FLT_MAX in case remaining T values are all >= 0

pshufd xmm1, xmm0, 0x4E ; copy xmm0 to xmm1 and reorder 3210 as 1032

minps xmm1, xmm0 ; find minimum

pshufd xmm2, xmm1, 0xB1 ; copy xmm1 to xmm2 and reorder 3210 as 2301

minps xmm2, xmm1 ; find minimum (now minimum is in all elements of xmm2)

mov eax, dword ptr [esp + 4]

movss dword ptr [eax], xmm2 ; save minimum value

cmpeqps xmm2, xmm0 ; compare T with minimum -- 0xFFFFFFFF in position of minimum, zero elsewhere

movmskps edx, xmm2 ; extract sign bits to edx

bsf eax, edx ; convert sign bit position into element index

ret ; return with minimum index in eax

}

}

int main(int argc, char* argv[])

{

float val;

int ndx;

ndx = index_and_minimum_from_T_for_positive_S(val, S, T, M);

printf("value = %.2f, index = %ldn", val, ndx);

return 0;

}

[/cpp]

- It selects elements from T which have positive value in S (including positive zero)

- It finds horizontal minimum from selected elements in T

- It finds zero-based index of said minimum

- If there is more than one minimum in T it returns the index of the first one found

I hope that this is what the original poster wanted, and that someone can translate it into intrinsics for them.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Some improvement in memory latencies can be made by rearranging some statements.

[cpp]__declspec(naked) int index_and_minimum_from_T_for_positive_S(float &min_val, float *pS, float *pT, float *pM) { __asm { ; eax, edx and ecx in 32-bit are available w/o requirement for save/restore mov eax, dword ptr [esp + 8] ;; throughput 1 clock, latency (L1) 7 clocks mov edx, dword ptr [esp + 12] mov ecx, dword ptr [esp + 16] ;; stalls here for 4 clocks (remaining latecy on eax) ;; after stall eax ready, edx comes in at +1 clock, ecx comes in +2 clocks movups xmm0, xmmword ptr [eax] ; load S values into XMM0 ;; latency for above move from memory is 12 clocks (architecture dependent) ;; no stall on next two loads (begin load during latency of above load movups xmm1, xmmword ptr [edx] ; load T values into XMM1 movups xmm3, xmmword ptr [ecx] ; load M values into XMM3 ;; stall for remaining 9 clocks latency on load of xmm0 psrad xmm0, 31 ; create selection mask by replicating sign bits of S values andps xmm3, xmm0 ; leave FLT_MAX values in M only in positions where S had negative values andnps xmm0, xmm1 ; leave values in T only in positions where S had positive values orps xmm0, xmm3 ; replace masked-out values in T with FLT_MAX in case remaining T values are all >= 0 pshufd xmm1, xmm0, 0x4E ; copy xmm0 to xmm1 and reorder 3210 as 1032 minps xmm1, xmm0 ; find minimum pshufd xmm2, xmm1, 0xB1 ; copy xmm1 to xmm2 and reorder 3210 as 2301 minps xmm2, xmm1 ; find minimum (now minimum is in all elements of xmm2) mov eax, dword ptr [esp + 4] movss dword ptr [eax], xmm2 ; save minimum value cmpeqps xmm2, xmm0 ; compare T with minimum -- 0xFFFFFFFF in position of minimum, zero elsewhere movmskps edx, xmm2 ; extract sign bits to edx bsf eax, edx ; convert sign bit position into element index ret ; return with minimum index in eax } } [/cpp]

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

And after load of xmm3 you would prefetch the pointer to the result

mov eax, dword ptr [esp + 4]

so it will be ready for use of write of minimum value.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I checked with the compiler engineers again and a Feature Request to adda new or a few new intrinsics to do such work is appropriate.

Thank you again for your time and effort to help others!

Jennifer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Please clarify, do your comments about stalls still apply after reordering or are they just marking places where the stalls were present before you reordered them?

Have you considered rewriting my code as intrinsics and letting the compiler reorder the instructions? I am currently a bit busy to try that myself.

Jennifer,

You are welcome.

Regarding new intrinsics -- I am not sure how common of a task is selecting elements from one register where another one has positive values in the same position.

However, horizontal minimum for packed floats is really missing, and its missing from the x86 instruction set (all that talk about orthogonality... tsk, tsk, tsk... one would think the lesson has been learned already).

Having intrinsics for hypothetical HMINPS and HMINPOSPS instructions would help people who are not very experienced with SIMD programming.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

The following table lists latencies for an arbitrary CPU

This table will/may change depending on CPU

Original code red highlight shows latencies

Each line approximately one clock tick

41 clocks to ... (*** right most column should be [pM] *** )

however, the psrad (next instruction not shown), has a dependency on xmm0

and could be hoisted 6 clocks (immediately following last movups above)

Therefor, estimate for above is 35 clocks.

The CPU is pipelined and can have multiple instructions in flight (and out of order)

The following chart is an approximation of what can happen.

15 clocks to ...

however, the psrad (next instruction not shown), has a dependency on xmm0

and could be hoisted2 clocks (4 clocks immediately following last movups above)

Therefor, estimate for above is13 clocks

I should mentioned that the mov ecx in the improved method could potentially

get stalled depending on if the port assignment and acquisitionoccures prior to

or following the fetch from L1. The block charts in the optimization guide do not

specify these details.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

The latency information was derived from:

IA-32 Intel Architecture Optimization Reference Manual

Order Number: 248966-012 June 2005

So it is clearly out of date.

In looking at theJan2011 doc 248966-023

We have different latencies posted.

In particular movaps latencies

CPUID(s) latency/throughput

0F_2H,0F_3H =6/1

06_(..) = 1/0.33

L1 3/1

L2 ~14/2

L3 ~110/12

Clearly the 06 series have corrected a bottleneck

Igor, your code as originally written would experience only a minor improvement by reordering the instructions.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

old code ~16 clocks thru psrad

new code ~9 clocks thru psrad

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Did you take into account float to int transitions and vice versa? They introduce 1 cycle latency. I suggested a code change to reduce the number of domain transitions. Could you check again?

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page