Intel® Fortran Compiler

branching on a double quadword

David_DiLaura1
New Contributor I
I'm working on computational geometry code that is based entirely on quadrangles, which has made the code highly vectorizable. At one point I test a certain parameter of all 4 sides to see if they are all < 0. I'm doing a vector compare, so all four values get tested with one xmm compare instruction.
If all four parameters ARE < 0, the compare instruction completely fills the destination with ones -- the destination being 4 aligned 4-byte logicals. I can use the result of this comparison by testing each of the logicals. But . . . is there a way to test for ones across the entire double quadword all at once? That is, one logical check rather than four?
I'm currently using an early-out arrangement: I check the first 4-byte logical, and if it indicates that the first parameter is < 0, I go on to check the 2nd, and so on. A nested set of 4 if's. But I'm interested to know whether a "vector check" across the entire double quadword (if it can be done) would in general be faster.
David
Paul_Curtis
Valued Contributor I
Define a conformal 16-byte integer EQUIVALENCED with your double-quad-word set of logicals, then IAND this with a predefined conformal bitmask (integer) such that the result will be 0 or 1 depending on whether the test is met. This is intrinsically more efficient than the 4 sequential tests you describe, but only if you can load the dqw to be tested from your (array of?) 4-byte logicals in one step.
David_DiLaura1
New Contributor I
Paul,
Are there 16-byte integers?
Neither I nor most of our customers are on a 64-bit OS (XP, a few Vista machines, and a fair number of Win7-32).
David
TimP
Honored Contributor III
Operations on an xmm register are the same in 32- or 64-bit mode.
I'm not sure from your description why you wouldn't use something like if(any(logicalarray)), leaving it up to the compiler to decide whether to use masking operations, or, as hinted in previous responses, use masking intrinsics explicitly.
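For the "all four are < 0" test in the original question, the matching reduction would be ALL rather than ANY. A minimal sketch, assuming the four side parameters sit in a REAL(4) array; the module and function names are made up for illustration, and whether the compiler turns this into a packed compare plus a single mask test is something to confirm in the assembly listing:

```fortran
module quad_all_negative
  implicit none
contains
  ! True only when every one of the four side parameters is strictly below zero.
  ! (Hypothetical helper name; not from the original post.)
  pure logical function all_negative(p)
    real(4), intent(in) :: p(4)
    all_negative = all(p < 0.0)
  end function all_negative
end module quad_all_negative
```

Written this way, the early-out nesting disappears from the source and the choice between four scalar tests and one masked test is left to the optimizer.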
jimdempseyatthecove
Honored Contributor III
Try something along the lines of this:

movaps xmm0,[somewhere] ; xmm0 = read of 4 floats

pxor xmm1,xmm1 ; xmm1 = 4 0.0's
cmpgeps xmm1,xmm0 ; xmm1 = .lt.0 indicators
pasdbw xmm2,xmm1 ; xmm2 (lsw) = sum of absolute differences in bytes
; 0 when all flags 0 or all flags FFFFFFFF (-1)
; +n when flags differ
paddd xmm2,xmm1 ; xmm2(lsdw) -1 only when all flags were FFFFFFFF
; .ge. 0 when not
movss temp,xmm2 ;
test temp,0
...


Jim Dempsey

jimdempseyatthecove
Honored Contributor III
Why can't these stupid HTML editors preserve formatting without requiring the user to jump through hoops!
Grrr
David_DiLaura1
New Contributor I
Jim,
I've learned to pay attention to your suggestions on this forum -- and that's my excuse for asking for clarifications. I understand the logic of your suggestion, but:
1. What is pasdbw?
2. How are the xmm registers manipulated directly in Fortran? I know assembler, but that doesn't do me much good these days, other than to know machine instructions and to be able to fully use what VTune reports about code. We can't poke our own assembler code into Fortran, can we? (I understand that can be done with C compilers.)
What I was hoping to do is find source code that 'forces' the compiler to use the movmskps instruction: pack 4 sign bits from a vector of 4 floats into a single 4-bit mask in a 32-bit register. That way I can test a single 4-byte int.
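To make the target concrete, here is a scalar sketch of what MOVMSKPS computes, written in standard Fortran. The names are hypothetical, TRANSFER stands in for an EQUIVALENCE overlay, and nothing here forces the compiler to emit movmskps; it only shows the mask that a single integer compare against 15 would test:

```fortran
module quad_sign_mask
  implicit none
contains
  ! Packs the four sign bits into bits 0..3 of one integer, as MOVMSKPS would.
  ! (Hypothetical helper name; not from the original post.)
  pure integer(4) function sign_mask(x)
    real(4), intent(in) :: x(4)
    integer(4) :: bits(4)
    bits = ishft(transfer(x, bits), -31)        ! logical shift: 1 where the sign bit is set
    sign_mask = sum(ishft(bits, [0, 1, 2, 3]))  ! element k lands in bit k-1
  end function sign_mask
end module quad_sign_mask
```

With this mask, "all four sign bits set" is the single test `sign_mask(x) == 15`.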
What I'm doing now is holding the 4 results of the logical compare of the 4 floats in a 4-byte logical array. That logical array is equivalenced to an 8-byte int array. That lets me look at two 8-byte ints to see if they're both -1. If so, all four of the original logicals were -1. The compiler needs to load an additional copy of the xmm register and then logically shift it to get to the 2nd 8-byte element in order to see if it is -1. Kludge.
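The overlay described above can be sketched as follows. This is an illustration, not the production code: the names are hypothetical, the flags are modeled as INTEGER(4) words holding 0 or -1 (the all-ones value the compare instruction writes), and TRANSFER stands in for the EQUIVALENCE.

```fortran
module quad_flag_overlay
  implicit none
contains
  ! True when all four 4-byte flag words are all ones (-1).
  ! (Hypothetical helper name; not from the original post.)
  pure logical function all_flags_set(flags)
    integer(4), intent(in) :: flags(4)
    integer(8) :: halves(2)
    halves = transfer(flags, halves)   ! view the 16 bytes as two 8-byte integers
    ! AND-ing the halves leaves -1 only if both halves were already all ones.
    all_flags_set = (iand(halves(1), halves(2)) == -1_8)
  end function all_flags_set
end module quad_flag_overlay
```

Combining the two halves with IAND leaves one comparison instead of two, although the second half still has to be extracted from the register.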
The ANY function has been suggested. Bad idea, it seems to me. Many months of looking at VTune analyses has shown me that the ANY function (at least for the 4-spans I'm using) is never efficient. I've never once seen the compiler vectorize an ANY function.
David
jimdempseyatthecove
Honored Contributor III
Sorry for the typo.

The correct mnemonic is PSADBW (compute sum of absolute differences). This can be used to (sort of) perform a horizontal add under some circumstances. The code snippet I gave was incorrect, and after looking at it again, it was also incomplete. The idea was to use this instruction to count the bytes == 0xFF.

The code though bloats up to (untested code):

[plain]movaps xmm0,[somewhere] ; xmm0 = read of 4 floats
pxor xmm1,xmm1 ; xmm1 = 4 0.0's
cmpltps xmm0,xmm1 ; xmm0 = xmm0.lt.xmm1 indicators (xmm1 still all 0's)
pabsb xmm0,xmm0 ; (SSSE3) each 0xFF indicator byte becomes 1 so the sums below count bytes
psadbw xmm0,xmm1 ; xmm0 (lsw of low half) = sum of absolute differences in bytes 0:7 of xmm0-xmm1
; xmm0 (lsw of high half) = sum of absolute differences in bytes 8:15 of xmm0-xmm1
; words of xmm0 = {0,0,0,8,0,0,0,8} indicates all 4 floats were < 0
; now check for double 8's
pshufd xmm0,xmm0,0D8h ; shuffle dwords 11,01,10,00: dwords {0,0,8,8} indicates all 4 floats were < 0
psadbw xmm0,xmm1 ; low half xmm0 words {0,0,0,16} indicates all 4 floats were < 0
pextrw eax,xmm0,0 ; extract low word of xmm0 into eax
cmp eax,16 ; ZF set when all 4 floats were < 0
[/plain]

>>How are the xmm registers manipulated directly in Fortran?

Your original post said you were using SSE instructions - so I assume assembler (which you can mix with FORTRAN).

If you want to stick with FORTRAN then do as one of the earlier suggestions - use the logical functions on a union with the floats (real(4)'s) e.g.

! ARRAY I(1:4) OVERLAYS ARRAY OF 4 REAL(4)'S
IF (RSHFT(I(1),31)+RSHFT(I(2),31)+RSHFT(I(3),31)+RSHFT(I(4),31) .EQ. 4) THEN ...

Check the generated code with optimizations enabled; this may do a good enough job.

Note, there is one "minor" error in the above. Floating point numbers have +0.0 and -0.0 (did you know this?). The above will include -0.0 in the set of .lt.0. You will have to decide if this is correct or not.
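A minimal sketch of the sign-bit test above in standard Fortran, with hypothetical names. TRANSFER stands in for the overlay, and ISHFT with a negative count is used as a logical right shift, so each term is 0 or 1:

```fortran
module quad_sign_bits
  implicit none
contains
  ! Counts the sign bits of four REAL(4) values; 4 means every sign bit is set.
  ! (Hypothetical helper name; not from the original post.)
  pure integer function sign_bit_count(x)
    real(4), intent(in) :: x(4)
    integer(4) :: i(4)
    i = transfer(x, i)                     ! reinterpret the bits of the four floats
    sign_bit_count = sum(ishft(i, -31))    ! logical shift: each term is 0 or 1
  end function sign_bit_count
end module quad_sign_bits
```

Because only the sign bit is examined, an input of -0.0 contributes 1 to the count, which is the caveat noted above.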

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
Grrr. The HTML "code" viewer is now trashing formatting. Here is the code sample pasted in using "text" paste, not "code" paste:

movaps xmm0,[somewhere] ; xmm0 = read of 4 floats
psrad xmm0,31 ; xmm0 = shift of sign bits through entire dword (all ones where the sign bit was set, -0.0 included)
pxor xmm1,xmm1 ; xmm1 = all 0's
pabsb xmm0,xmm0 ; (SSSE3) each 0xFF byte becomes 1 so the sums below count bytes
psadbw xmm0,xmm1 ; xmm0 (lsw of low half) = sum of absolute differences in bytes 0:7 of xmm0-xmm1
; xmm0 (lsw of high half) = sum of absolute differences in bytes 8:15 of xmm0-xmm1
; words of xmm0 = {0,0,0,8,0,0,0,8} indicates all 4 floats were < 0
; now check for double 8's
pshufd xmm0,xmm0,0D8h ; shuffle dwords 11,01,10,00: dwords {0,0,8,8} indicates all 4 floats were < 0
psadbw xmm0,xmm1 ; low half xmm0 words {0,0,0,16} indicates all 4 floats were < 0
pextrw eax,xmm0,0 ; extract low word of xmm0 into eax
cmp eax,16 ; ZF set when all 4 sign bits were set


Jim
