horizontal float min using SSE

QFang1 · ‎12-26-2010

Dear SSE experts:

I have two __m128 variables, T and S, both contain 4 floats. Now I need to find the minimum value in T and the index of the minimum, is it straightforward to do using SSE?

What if I want to only use the floats in T that have positive values in S and do the above horizontal min operation?

thanks in advance

Qianqian

JenniferJ · ‎01-11-2011

I checked with our compiler engineers. There is intrinsics min/minpos for 16 bits int of _m128, but not for floats.

It might be better to transfer your question to the CPU forum to see they could help you better.

thanks,
Jennifer

levicki · ‎01-12-2011

Let me try to help a bit.

; xmm0 = T
; xmm1 = S
; xmm7 = 0.0f

xorps xmm7, xmm7 ; xmm7 = 0.0f
cmpltps xmm7, xmm1 ; 0.0f < S

; if (S == -1) then xmm7 = 0x00000000
; if (S == 1) then xmm7 = 0xFFFFFFFF

andps xmm0, xmm7

; now in T you have only elements that had positive values in S (i.e. greater than 0.0f) and the rest of the elements are zeroed out.

Unfortunately, there is no horizontal minimum for float values, but if you know the range of values you could scale them (by multiplying with a constant) so they fit in unsigned short and then use PHMINPOSUW to find the minimum and its index.

EDIT:

You could then find horizontal minimum in T by doing something like this:

pshufd xmm2, xmm0, 0x4E ; copy T into xmm2 reordering elements from 3 2 1 0 to 1 0 3 2 order
minps xmm2, xmm0 ; find minimum
pshufd xmm3, xmm2, 0xB1 ; copy result into xmm3 and reorder as 2 3 0 1
minps xmm3, xmm2 ; find minimum (now you have minimum value in all elements)
movss dword ptr [min_val], xmm3 ; this is your horizontal minimum
cmpeqps xmm3, xmm0 ; you will have 0xFFFFFFFF in place of a minimum and 0x00000000 elsewhere
movmskps edx, xmm3 ; extract sign mask (if minimum was in position 2 you will have 0100b in edx)
bsr eax, edx ; convert bit position to index
mov dword ptr [min_ndx], eax ; this is your minimum index

Please note that I haven't tested the above code for correctness -- I wrote it off the top of my head. There is also a doubt whether it would perform any faster than the scalar code. Let me know if it helps.

jimdempseyatthecove · ‎01-12-2011

Igor,

check this out

xmm0 = T

minps xmm1,xmm0 ; min float of xmm0 into lowest floatof xmm1
(save min value if you want)
pshufd xmm1,xmm1,0 ; propigatelowest floatto all floats
cmpeqps xmm1,xmm0 ; identify location(s) of min
movmskps edx, xmm1 ; sign bits to edx
bsf eax, edx ; eax gets index of first lowest bit set in edx

Jim Dempsey

levicki · ‎01-12-2011

Jim,

Not sure that would be enough, he said:

"What if I want to only use the floats in T that have positive values in S and do the above horizontal min operation?"

Also, MINPS returns minimum values for each pair of src/dst elements, not the horizontal minimum. To get horizontal minimum you need two MINPS and two PSHUFD instructions.

levicki · ‎01-13-2011

I had some time today to think about this and perform some testing, and I have realized that the code I wrote initially has three problems:

1. It does not select proper elements when you have zero in S

Namely, if you use CMPLTPS on S, you will get only elements greater than zero for your selection mask -- this is incorrect, because zero can also be a positive value.

If you use CMPLEPS on S instead, then you will get elements greater or equal to zero for your selection mask.

Unfortunately, since the CPU treats positive and negative zero as equal in comparison you will also include negative zero in your selection mask.

It is possible to solve this by creating the selection mask by right-shifting arithmetically 31 times (which replicates the sign bit of elements in S), and then performing ANDNPS instead of ANDPS. Of course, this still does not handle SNaN, QNaN, denormals, etc but at least it performs correctly with both positive and negative zero in S.

2. It does not take into account that the remaining values in T might all be greater than (or equal to) zero

Namely, when you mask out elements by ANDNPS, you are replacing them with zero. If the remaining elements are all greater than (or equal to) zero, you will find zero as a minimum which is incorrect. To solve this problem, you need to replace the elements you masked out with FLT_MAX.

3. If there is more than one minimum in T it returns index of the last minimum found instead of first one

Namely, if you did this code in C by iterating through T in a for (...) loop, you would get index of the first minimum found. This can be solved easily by replacing BSR with BSF.

Here is the final code:

[cpp]#include 
#include 

float	S[4] = { -1.0f,  1.0f,  0.0f, -0.0f };
float	T[4] = { -3.0f, -5.0f,  1.5f,  7.0f };
float	M[4] = { FLT_MAX, FLT_MAX, FLT_MAX, FLT_MAX };

__declspec(naked) int index_and_minimum_from_T_for_positive_S(float &min_val, float *pS, float *pT, float *pM)
{
	__asm	{
		mov		eax, dword ptr [esp + 8]
		movups		xmm0, xmmword ptr [eax] ; load S values into XMM0
		mov		eax, dword ptr [esp + 12]
		movups		xmm1, xmmword ptr [eax] ; load T values into XMM1
		mov		eax, dword ptr [esp + 16]
		movups		xmm3, xmmword ptr [eax] ; load M values into XMM3
		psrad		xmm0, 31 ; create selection mask by replicating sign bits of S values
		andps		xmm3, xmm0 ; leave FLT_MAX values in M only in positions where S had negative values
		andnps		xmm0, xmm1 ; leave values in T only in positions where S had positive values
		orps		xmm0, xmm3 ; replace masked-out values in T with FLT_MAX in case remaining T values are all >= 0
		pshufd		xmm1, xmm0, 0x4E ; copy xmm0 to xmm1 and reorder 3210 as 1032
		minps		xmm1, xmm0 ; find minimum
		pshufd		xmm2, xmm1, 0xB1 ; copy xmm1 to xmm2 and reorder 3210 as 2301
		minps		xmm2, xmm1 ; find minimum (now minimum is in all elements of xmm2)
		mov		eax, dword ptr [esp + 4]
		movss		dword ptr [eax], xmm2 ; save minimum value
		cmpeqps		xmm2, xmm0 ; compare T with minimum -- 0xFFFFFFFF in position of minimum, zero elsewhere
		movmskps	edx, xmm2 ; extract sign bits to edx
		bsf		eax, edx ; convert sign bit position into element index
		ret ; return with minimum index in eax
	}
}

int main(int argc, char* argv[])
{
	float	val;
	int	ndx;
	
	ndx = index_and_minimum_from_T_for_positive_S(val, S, T, M);

	printf("value = %.2f, index = %ldn", val, ndx);

	return 0;
}
[/cpp]

So, what the above assembler code does?

- It selects elements from T which have positive value in S (including positive zero)
- It finds horizontal minimum from selected elements in T
- It finds zero-based index of said minimum
- If there is more than one minimum in T it returns the index of the first one found

I hope that this is what the original poster wanted, and that someone can translate it into intrinsics for them.

jimdempseyatthecove · ‎01-13-2011

Excellent work Igor.

Some improvement in memory latencies can be made by rearranging some statements.

[cpp]__declspec(naked) int index_and_minimum_from_T_for_positive_S(float &min_val, float *pS, float *pT, float *pM)
{
  __asm	{ ; eax, edx and ecx in 32-bit are available w/o requirement for save/restore
    mov eax, dword ptr [esp + 8]  ;; throughput 1 clock, latency (L1) 7 clocks
    mov edx, dword ptr [esp + 12] 
    mov ecx, dword ptr [esp + 16]
    ;; stalls here for 4 clocks (remaining latecy on eax)
    ;; after stall eax ready, edx comes in at +1 clock, ecx comes in +2 clocks
    movups xmm0, xmmword ptr [eax] ; load S values into XMM0
    ;; latency for above move from memory is 12 clocks (architecture dependent)
    ;; no stall on next two loads (begin load during latency of above load
    movups xmm1, xmmword ptr [edx] ; load T values into XMM1
    movups xmm3, xmmword ptr [ecx] ; load M values into XMM3
    ;; stall for remaining 9 clocks latency on load of xmm0
    psrad xmm0, 31 ; create selection mask by replicating sign bits of S values
    andps xmm3, xmm0 ; leave FLT_MAX values in M only in positions where S had negative values
    andnps xmm0, xmm1 ; leave values in T only in positions where S had positive values
    orps xmm0, xmm3 ; replace masked-out values in T with FLT_MAX in case remaining T values are all >= 0
    pshufd xmm1, xmm0, 0x4E ; copy xmm0 to xmm1 and reorder 3210 as 1032
    minps xmm1, xmm0 ; find minimum
    pshufd xmm2, xmm1, 0xB1 ; copy xmm1 to xmm2 and reorder 3210 as 2301
    minps xmm2, xmm1 ; find minimum (now minimum is in all elements of xmm2)
    mov eax, dword ptr [esp + 4]
    movss dword ptr [eax], xmm2 ; save minimum value
    cmpeqps xmm2, xmm0 ; compare T with minimum -- 0xFFFFFFFF in position of minimum, zero elsewhere
    movmskps edx, xmm2 ; extract sign bits to edx
    bsf eax, edx ; convert sign bit position into element index
    ret ; return with minimum index in eax
   }
}
[/cpp]

Jim Dempsey

jimdempseyatthecove · ‎01-13-2011

After posting - edit/add

And after load of xmm3 you would prefetch the pointer to the result
mov eax, dword ptr [esp + 4]
so it will be ready for use of write of minimum value.

Jim Dempsey

JenniferJ · ‎01-13-2011

Very nice work Jim and Igor. Thank you so much for sharing your expertise here.

I checked with the compiler engineers again and a Feature Request to adda new or a few new intrinsics to do such work is appropriate.

Thank you again for your time and effort to help others!

Jennifer

levicki · ‎01-13-2011

Jim,

Please clarify, do your comments about stalls still apply after reordering or are they just marking places where the stalls were present before you reordered them?

Have you considered rewriting my code as intrinsics and letting the compiler reorder the instructions? I am currently a bit busy to try that myself.

Jennifer,

You are welcome.

Regarding new intrinsics -- I am not sure how common of a task is selecting elements from one register where another one has positive values in the same position.

However, horizontal minimum for packed floats is really missing, and its missing from the x86 instruction set (all that talk about orthogonality... tsk, tsk, tsk... one would think the lesson has been learned already).

Having intrinsics for hypothetical HMINPS and HMINPOSPS instructions would help people who are not very experienced with SIMD programming.

levicki · ‎01-19-2011

Further improvement to this code can in theory be accomplished by reducing the number of domain transitions (every int to float transition and vice versa costs one clock). Namely, by replacing first three MOVUPS by MOVDQU, ANDPS with PAND, ANDNPS with PANDN, and ORPS with POR.

jimdempseyatthecove · ‎01-19-2011

>>Please clarify, do your comments about stalls still apply after reordering or are they just marking places where the stalls were present before you reordered them?

The following table lists latencies for an arbitrary CPU
This table will/may change depending on CPU

Original code red highlight shows latencies
Each line approximately one clock tick

41 clocks to ... (*** right most column should be [pM] *** )
however, the psrad (next instruction not shown), has a dependency on xmm0
and could be hoisted 6 clocks (immediately following last movups above)
Therefor, estimate for above is 35 clocks.

The CPU is pipelined and can have multiple instructions in flight (and out of order)
The following chart is an approximation of what can happen.

15 clocks to ...
however, the psrad (next instruction not shown), has a dependency on xmm0
and could be hoisted2 clocks (4 clocks immediately following last movups above)
Therefor, estimate for above is13 clocks

I should mentioned that the mov ecx in the improved method could potentially
get stalled depending on if the port assignment and acquisitionoccures prior to
or following the fetch from L1. The block charts in the optimization guide do not
specify these details.

Jim Dempsey

jimdempseyatthecove · ‎01-19-2011

*** I have to revise those charts. ***

The latency information was derived from:

IA-32 Intel Architecture Optimization Reference Manual

Order Number: 248966-012 June 2005

So it is clearly out of date.

In looking at theJan2011 doc 248966-023

We have different latencies posted.

In particular movaps latencies
CPUID(s) latency/throughput
0F_2H,0F_3H =6/1
06_(..) = 1/0.33

L1 3/1
L2 ~14/2
L3 ~110/12

Clearly the 06 series have corrected a bottleneck

Igor, your code as originally written would experience only a minor improvement by reordering the instructions.

Jim Dempsey

jimdempseyatthecove · ‎01-19-2011

And here is the revised chart

old code ~16 clocks thru psrad
new code ~9 clocks thru psrad

Jim Dempsey

levicki · ‎01-19-2011

Jim,

Did you take into account float to int transitions and vice versa? They introduce 1 cycle latency. I suggested a code change to reduce the number of domain transitions. Could you check again?