Solved: branching on a double quadword

David_DiLaura1 · ‎03-23-2010

I'm working on computational geometry code that is based entirely on quadrangles. This has made the code highly vectorizable. At one point I'm testing a certain parameter of all 4 sides to see if they are all < 0. I'm doing a vector compare, so all four values get tested with one xmm compare instruction.

If all four parameters ARE < 0 then the compare instruction completely fills the destination with ones -- the destination being 4 aligned 4-byte logicals. I can use the result of this comparison by testing each of the logicals. But . . . is there a way to detect ones across the entire double quad word all at once? That is, one logical check, rather than four?

I'm currently using an early-out arrangement: I check the first 4-byte logical, and if it indicates that the first parameter is < 0, then I go on to check the 2nd, and so on. A nested set of 4 if's. But I'm interested to know if a "vector check" across the entire double quad word (if it can be done) would in general be faster.

David

Paul_Curtis · ‎03-23-2010

Define a conformal 16-byte integer EQUIVALENCED with your double-quad-word set of logicals, then IAND this with a predefined conformal bitmask (integer) such that the result will be 0 or 1 depending on whether the test is met. This is intrinsically more efficient than the 4 sequential tests you describe, but only if you can load the dqw to be tested from your (array of?) 4-byte logicals in one step.
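
One possible reading of this suggestion, sketched in C rather than Fortran: treat the four 4-byte logicals as 32-bit integers, AND them together, and compare the result against an all-ones mask. The helper name `all_true` and the assumption that each flag is exactly 0x00000000 or 0xFFFFFFFF (as an SSE compare produces) are illustrative, not taken from the post.

```c
#include <stdint.h>

/* Hypothetical helper: flags[] holds the four 32-bit compare results,
   each assumed to be 0x00000000 (false) or 0xFFFFFFFF (true). */
static int all_true(const uint32_t flags[4])
{
    /* AND the lanes together: a single zero lane clears the result. */
    uint32_t combined = flags[0] & flags[1] & flags[2] & flags[3];
    return combined == UINT32_MAX;   /* one branch instead of four */
}
```

In Fortran the same idea would use EQUIVALENCE for the overlay and IAND for the combination. Whether this beats four early-out tests depends on how often the first test already fails.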

David_DiLaura1 · ‎03-23-2010

Paul,

Are there 16-byte integers?

Most of our customers and I are not on a 64-bit OS (XP, a few Vistas, and a fair number of Win7-32).

David

TimP · ‎03-23-2010

Operations on an xmm register are the same in 32- or 64-bit mode.
I'm not sure from your description why you wouldn't use something like if(any(logicalarray)), leaving it up to the compiler to decide whether to use masking operations, or, as hinted in previous responses, use masking intrinsics explicitly.

jimdempseyatthecove · ‎03-24-2010

Try something along the lines of this:

movaps xmm0,[somewhere] ; xmm0 = read of 4 floats

pxor xmm1,xmm1 ; xmm1 = 4 0.0's
cmpgeps xmm1,xmm0 ; xmm1 = .lt.0 indicators
pasdbw xmm2,xmm1 ; xmm2 (lsw) = sum of absolute differences in bytes
; 0 when all flags 0 or all flags FFFFFFFF (-1)
; +n when flags differ
paddd xmm2,xmm1 ; xmm2(lsdw) -1 only when all flags were FFFFFFFF
; .ge. 0 when not
movss temp,xmm2 ;
test temp,0
...

Jim Dempsey

jimdempseyatthecove · ‎03-24-2010

Why can't these stupid HTML editors preserve formatting without requiring the user to jump through hoops!
Grrr

David_DiLaura1 · ‎03-24-2010

Jim,

I've learned to pay attention to your suggestions on this forum -- and that's my excuse for asking for clarifications. I understand the logic of your suggestion, but:

1. What is pasdbw?

2. How are the xmm registers manipulated directly in Fortran? I know assembler, but that doesn't do me much good these days, other than to know machine instructions and to be able to fully use what VTune reports about code. We can't poke our own assembler code into Fortran, can we? (I understand that can be done with C compilers).

What I was hoping to do is find source code that 'forces' the compiler to use the movmskps instruction: pack 4 sign bits from a vector of 4 floats into a single 4-bit mask in a 32-bit register. That way I can test a single 4-byte int.
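
For reference, this is roughly what the movmskps approach looks like when written with C SSE intrinsics. It is only a sketch: the function name `all_negative` is made up for illustration, and it assumes an x86 compiler that provides `<xmmintrin.h>`. It does not show how to get a Fortran compiler to emit the same instruction.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical helper: returns 1 when all four floats are < 0. */
static int all_negative(const float v[4])
{
    __m128 x    = _mm_loadu_ps(v);                    /* unaligned load of 4 floats */
    __m128 mask = _mm_cmplt_ps(x, _mm_setzero_ps());  /* each lane all ones where v < 0 */
    return _mm_movemask_ps(mask) == 0xF;              /* movmskps: 4 sign bits -> 4-bit int */
}
```

Because the compare runs before the mask is extracted, -0.0 is not counted as negative here. Applying `_mm_movemask_ps` directly to the loaded floats would test the raw sign bits instead.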

What I'm doing now is holding the 4 results of the logical compare of the 4 floats in a 4-byte logical array. That logical array is equivalenced to an 8-byte int array. That lets me look at two 8-byte ints to see if they're both -1. If so, all four of the original logicals were -1. The compiler needs to load an additional copy of the xmm register and then logically shift it to get to the 2nd 8-byte element in order to see if it is -1. Kludge.
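
The overlay described above can be sketched in C as follows. The name `all_true_64` is hypothetical, `memcpy` stands in for the Fortran EQUIVALENCE, and each flag is assumed to be either all zeros or all ones.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: view four 32-bit flags as two 64-bit integers. */
static int all_true_64(const uint32_t flags[4])
{
    uint64_t halves[2];
    memcpy(halves, flags, sizeof halves);   /* same 16 bytes, two 8-byte views */
    /* Both halves must be all ones (-1 as a signed integer). */
    return halves[0] == UINT64_MAX && halves[1] == UINT64_MAX;
}
```

On a 32-bit target each 64-bit compare is itself split into two 32-bit compares, so this may not save anything over testing the four flags directly.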

The ANY function has been suggested. Bad idea, it seems to me. Many months of looking at VTune analyses has shown me that the ANY function (at least for the 4-spans I'm using) is never efficient. I've never once seen the compiler vectorize an ANY function.

David

jimdempseyatthecove · ‎03-24-2010

Sorry for the typo.

The correct mnemonic is PSADBW (compute sum of absolute differences). This can be used to (sort of) perform a horizontal add under some circumstances. The code snippet I gave was incorrect, and after looking at the code again, it was also incomplete. The idea was to use this instruction to count the bytes == 0xFF.

The code though bloats up to (untested code):

movaps  xmm0,[somewhere]	; xmm0 = read of 4 floats
pxor    xmm1,xmm1	; xmm1 = 4 0.0's
cmpltps xmm0,xmm1	; xmm0 = xmm0.lt.xmm1 indicators, 0 or FFFFFFFF per dword (xmm1 still all 0's)
psadbw	xmm0,xmm1	; xmm0 (lsw of low half) = sum of absolute differences in bytes 0:7 of xmm0-xmm1
			; xmm0 (lsw of high half) = sum of absolute differences in bytes 8:15 of xmm0-xmm1
			; each FF byte contributes 255, so words of xmm0 = {0,0,0,2040,0,0,0,2040} indicates all 4 floats were < 0
			; now add the two halves
pshufd	xmm1,xmm0,0EEh	; xmm1 low half = copy of xmm0 high half
paddd	xmm0,xmm1	; low dword of xmm0 = 4080 (16*255) indicates all 4 floats were < 0
movd	eax,xmm0	; move low dword of xmm0 into eax
cmp	eax,4080	; equal only when all 4 floats were < 0
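
The PSADBW idea can also be tried from C with SSE2 intrinsics, which makes it easy to check. PSADBW against zero sums the bytes of each 8-byte half, so sixteen 0xFF bytes add up to 16*255 = 4080. This is a sketch under those assumptions, with a made-up function name, not a tested replacement for the assembler above.

```c
#include <emmintrin.h>   /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Hypothetical helper: returns 1 when all four floats are < 0. */
static int all_negative_sad(const float v[4])
{
    __m128  x    = _mm_loadu_ps(v);
    /* Compare mask reinterpreted as integer bytes: 0x00 or 0xFF each. */
    __m128i mask = _mm_castps_si128(_mm_cmplt_ps(x, _mm_setzero_ps()));
    /* psadbw vs. zero: each 64-bit lane receives the sum of its 8 bytes. */
    __m128i sums = _mm_sad_epu8(mask, _mm_setzero_si128());
    /* Bring the high lane down and add it to the low lane. */
    __m128i hi   = _mm_srli_si128(sums, 8);
    int total    = _mm_cvtsi128_si32(_mm_add_epi32(sums, hi));
    return total == 16 * 255;   /* every one of the 16 bytes was 0xFF */
}
```

This takes several more instructions than a single movmskps, so it is mainly of interest as a horizontal-add trick.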

>>How are the xmm registers manipulated directly in Fortran?

Your original post said you were using SSE instructions - so I assume assembler (which you can mix with FORTRAN).

If you want to stick with FORTRAN then do as one of the earlier suggestions - use the logical functions on a union with the floats (real(4)'s) e.g.

! ARRAY I(1:4) OVERLAYS ARRAY OF 4 REAL(4)'S
IF(RSHFT(I(1),31)+RSHFT(I(2),31)+RSHFT(I(3),31)+RSHFT(I(4),31) .EQ. 4) THEN...

Check the code out with optimizations enabled, this may do a good enough job.

Note, there is one "minor" error in the above. Floating point numbers have +0.0 and -0.0 (did you know this?). The above will include -0.0 in the set of .lt.0. You will have to decide if this is correct or not.
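
A small C sketch makes the -0.0 difference concrete. Both helper names are hypothetical, and the code assumes 32-bit IEEE 754 floats. The first function mirrors the shift-and-add test on the sign bits. The second uses an ordinary comparison.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sum of the four sign bits (shift-and-add test). */
static int sign_bit_count(const float v[4])
{
    uint32_t bits[4];
    memcpy(bits, v, sizeof bits);   /* assumes 32-bit IEEE 754 floats */
    return (int)((bits[0] >> 31) + (bits[1] >> 31) +
                 (bits[2] >> 31) + (bits[3] >> 31));
}

/* Hypothetical helper: comparison-based count; -0.0f is not < 0. */
static int less_than_zero_count(const float v[4])
{
    return (v[0] < 0.0f) + (v[1] < 0.0f) + (v[2] < 0.0f) + (v[3] < 0.0f);
}
```

For the input {-1.0, -0.0, -2.0, -3.0} the sign-bit version reports 4 while the comparison version reports 3, so the two tests disagree exactly when a -0.0 is present.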

Jim Dempsey

jimdempseyatthecove · ‎03-24-2010

Grrr. The HTML "code" viewer is now trashing formatting. Here is the code sample pasted in using "text" paste, not "code" paste:

movaps xmm0,[somewhere] ; xmm0 = read of 4 floats
psrad xmm0,31 ; xmm0 = shift of sign bits through entire dword (0 or FFFFFFFF), so no compare is needed
pxor xmm1,xmm1 ; xmm1 = all 0's
psadbw xmm0,xmm1 ; xmm0 (lsw of low half) = sum of absolute differences in bytes 0:7 of xmm0-xmm1
; xmm0 (lsw of high half) = sum of absolute differences in bytes 8:15 of xmm0-xmm1
; each FF byte contributes 255, so words of xmm0 = {0,0,0,2040,0,0,0,2040} indicates all 4 sign bits were set
; now add the two halves
pshufd xmm1,xmm0,0EEh ; xmm1 low half = copy of xmm0 high half
paddd xmm0,xmm1 ; low dword of xmm0 = 4080 (16*255) indicates all 4 sign bits were set
movd eax,xmm0 ; move low dword of xmm0 into eax
cmp eax,4080 ; equal only when all 4 sign bits were set (this includes -0.0)

Jim