Solved: Fastest way to AND across packed doubles?

christian_convey · ‎01-19-2012

Anyone have a suggestion for what SIMD instructions are best for this on an AVX chip?

I've got two 4-element vectors, A and B, of packed doubles.

For corresponding elements of A and B, I want to know whether or not A is numerically less than B.

But then, I want to know if any of those four comparisons had an answer of "true". So essentially I want to do a logical "and" across the register containing the results of those four comparisons. Is there an efficient way to do this?

sirrida · ‎01-19-2012

What about this?:

vcmppd ymm1,ymm1,ymm2,5 ; not less than
vmovmskpd eax,ymm1
test eax,eax
jz ...

The 5 means "not less than". If none of these tests succeed, eax will become zero which means that all are less.
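For reference, the same compare-then-movemask idea can be sketched with intrinsics. This is only a minimal sketch: the helper name all_less4 is made up for illustration, the unaligned loads are an assumption about how the data arrives, and the target attribute is a GCC/Clang-specific way to enable AVX for one function without compiler flags.

```c
#include <immintrin.h>

/* Returns 1 if a[i] < b[i] holds in all four lanes, 0 otherwise.
   all_less4 is a hypothetical helper name, not something from the thread. */
__attribute__((target("avx")))   /* GCC/Clang-specific; assumption */
static int all_less4(const double *a, const double *b)
{
    __m256d va  = _mm256_loadu_pd(a);                  /* unaligned loads, for simplicity */
    __m256d vb  = _mm256_loadu_pd(b);
    __m256d nlt = _mm256_cmp_pd(va, vb, _CMP_NLT_US);  /* predicate 5: not less than */
    int mask    = _mm256_movemask_pd(nlt);             /* vmovmskpd: one sign bit per lane */
    return mask == 0;                                  /* no lane is "not less", so all are less */
}
```

As with the assembly, a NaN in either input makes that lane "not less than", so the function returns 0 in that case.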

An alternative could be this:

vcmppd ymm1,ymm1,ymm2,5 ; not less than
vxorpd ymm0,ymm0,ymm0
vptest ymm0,ymm1
jc ...

I cannot test which is faster.
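The vptest variant can also be written with intrinsics. The sketch below assumes that _mm256_testc_si256 is the right match for "vptest, then branch on carry": it returns the carry flag, which is set when the second operand has no bits set outside the first. The helper name all_less4_ptest and the GCC/Clang target attribute are assumptions made for illustration.

```c
#include <immintrin.h>

/* Returns 1 if a[i] < b[i] holds in all four lanes, 0 otherwise.
   all_less4_ptest is a hypothetical helper name. */
__attribute__((target("avx")))   /* GCC/Clang-specific; assumption */
static int all_less4_ptest(const double *a, const double *b)
{
    __m256d nlt  = _mm256_cmp_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b),
                                 _CMP_NLT_US);         /* predicate 5: not less than */
    __m256i zero = _mm256_setzero_si256();             /* plays the role of the zeroed ymm0 */
    /* vptest: CF = 1 iff (~zero & nlt) == 0, i.e. the compare result is all zero bits */
    return _mm256_testc_si256(zero, _mm256_castpd_si256(nlt));
}
```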

Aftermath:
The solution from bronxzv below seems to be the best:
vcmppd ymm1,ymm1,ymm2,1 ; less than
vptest ymm1,ymm1
jnz ...

christian_convey · ‎01-19-2012

I'm new to SIMD programming, but what you wrote looks promising. So is the basic idea as follows?

The "vcmppd" operation will populate every bit in the destination register with a 1 or 0, and that includes the sign bit. Then "vmovmskpd" gathers the sign bits from all two or four packed elements, which is how we get the results of all four comparisons into a single scalar register. Then we just test that register for all zeroes?

Also, do you happen to know if I can use Intel C++ intrinsics to pull off what you've written? I'm already deviating from simple C++ by using intrinsics. If possible I'd like to avoid another non-C++ construct: assembly.

Thanks again for your help.

bronxzv · ‎01-19-2012

I'd suggest simply using VTESTPD ymm1,ymm1 (then branching on the Z flag to test for all 0s) after VCMPPD; it's generally faster than VMOVMSKPD + TEST for testing for all 0s
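A minimal intrinsics sketch of that VCMPPD + VTESTPD sequence, answering "is any element of A less than the matching element of B?". The helper name any_less4, the unaligned loads, and the GCC/Clang target attribute are assumptions for illustration only.

```c
#include <immintrin.h>

/* Returns 1 if a[i] < b[i] holds in at least one lane, 0 otherwise.
   any_less4 is a hypothetical helper name. */
__attribute__((target("avx")))   /* GCC/Clang-specific; assumption */
static int any_less4(const double *a, const double *b)
{
    __m256d lt = _mm256_cmp_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b),
                               _CMP_LT_OS);            /* predicate 1: less than */
    /* vtestpd: ZF = 1 iff the sign bit of every lane of (lt & lt) is clear */
    return !_mm256_testz_pd(lt, lt);
}
```

With predicate 1 a NaN in either input makes that lane compare false, so NaN lanes never count as "less".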

bronxzv · ‎01-19-2012

for the example at hand, have a look at these intrinsics:
_mm256_cmp_pd
_mm256_testz_pd

I'd suggest using intrinsics instead of assembly since it's more future-proof: for example, by using _mm256_mul_pd & _mm256_add_pd in your code you'll automatically be able to use FMA instructions in AVX2 (with the Intel compiler) without a single line of code change, whereas in ASM you'll have to rewrite everything
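To make the mul/add point concrete, here is a small sketch of the kind of code meant. The helper name mul_add4 is made up, and the GCC/Clang target attribute is an assumption. Whether a compiler actually fuses the pair into one FMA instruction depends on the compiler, the target flags, and its floating-point contraction settings, so check the generated assembly rather than taking it for granted.

```c
#include <immintrin.h>

/* out[i] = a[i] * b[i] + c[i] for four doubles.
   mul_add4 is a hypothetical helper name. */
__attribute__((target("avx")))   /* GCC/Clang-specific; assumption */
static void mul_add4(const double *a, const double *b,
                     const double *c, double *out)
{
    __m256d prod = _mm256_mul_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b));
    __m256d sum  = _mm256_add_pd(prod, _mm256_loadu_pd(c));
    _mm256_storeu_pd(out, sum);   /* separate mul and add; a compiler may contract them */
}
```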

TimP · ‎01-19-2012

Doesn't cilk+ have syntax for this (assuming you have a bias against Fortran)? According to the description, you want any(a(:) < b(:)). If you have only a single operation like this it's not even worth the effort to find out what it takes to get parallel instructions, let alone to ask for advice.

bronxzv · ‎01-19-2012

I can't speak for the original poster but it's quite common to see such code in critical loops (well at least in my field) so it's arguably interesting to optimize