
Anyone have a suggestion for what SIMD instructions are best for this on an AVX chip?

I've got two 4-element vectors, A and B, of packed doubles.

For corresponding elements of A and B, I want to know whether or not A[i] is numerically less than B[i].

But then, I want to know if any of those four comparisons had an answer of "true". So essentially I want to do a logical "or" across the register containing the results of those four comparisons. Is there an efficient way to do this?




vcmppd ymm1,ymm1,ymm2,5 ; not less than

vmovmskpd eax,ymm1

test eax,eax

jz ...

The immediate 5 selects the "not less than" predicate. If none of the four comparisons succeeds, eax will be zero, which means that every element of A is less than the corresponding element of B.
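For readers working in C++ rather than assembly, the compare-and-movemask sequence above could be sketched with intrinsics roughly as follows. This is only a sketch under a few assumptions: the function name `all_less` and the pointer-based interface are made up for illustration, unaligned loads are used so any `double[4]` works, and the `__attribute__((target("avx")))` annotation is GCC/Clang-specific (with other compilers you would enable AVX through a compiler option instead).

```cpp
#include <immintrin.h>
#include <cassert>

// Hypothetical helper: returns true when a[i] < b[i] holds for all four elements.
// Mirrors: vcmppd (predicate 5, "not less than"), vmovmskpd, test eax,eax.
__attribute__((target("avx")))
bool all_less(const double* a, const double* b) {
    __m256d va = _mm256_loadu_pd(a);   // unaligned load of four doubles
    __m256d vb = _mm256_loadu_pd(b);
    // Each element becomes all ones where a[i] is NOT less than b[i], else all zeros.
    __m256d not_less = _mm256_cmp_pd(va, vb, _CMP_NLT_US);
    // Collect the four sign bits into the low four bits of an int.
    int mask = _mm256_movemask_pd(not_less);
    // A zero mask means no element failed the "less than" test.
    return mask == 0;
}
```

Note that with the "not less than" predicate a NaN in either input counts as "not less", so `all_less` returns false in that case.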

An alternative could be this:

vcmppd ymm1,ymm1,ymm2,5 ; not less than

vxorpd ymm0,ymm0,ymm0

vptest ymm0,ymm1

jc ...

I cannot test which is faster.

Aftermath:

The solution from bronxzv below seems to be the best:

vcmppd ymm1,ymm1,ymm2,1 ; less than

vptest ymm1,ymm1

jnz ...
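An intrinsics version of this last sequence might look like the sketch below. The name `any_less_ptest` is hypothetical, the loads are unaligned for convenience, and the `target("avx")` attribute assumes GCC or Clang; `_mm256_testz_si256` is the intrinsic that maps to vptest on 256-bit registers.

```cpp
#include <immintrin.h>
#include <cassert>

// Hypothetical helper: returns true when a[i] < b[i] for at least one element.
// Mirrors: vcmppd (predicate 1, "less than"), vptest ymm1,ymm1, jnz.
__attribute__((target("avx")))
bool any_less_ptest(const double* a, const double* b) {
    __m256d lt = _mm256_cmp_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b), _CMP_LT_OS);
    // Reinterpret the comparison mask as integer bits (no instruction is generated).
    __m256i bits = _mm256_castpd_si256(lt);
    // testz returns 1 when (bits AND bits) is all zero, i.e. no comparison was true.
    return _mm256_testz_si256(bits, bits) == 0;
}
```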


I'm new to SIMD programming, but what you wrote looks promising. So is the basic idea as follows?

The "vcmppd" operation will populate every bit of each element in the destination register with a 1 or 0, and that includes the sign bit. Then "vmovmskpd" gathers the sign bits from all two or four packed elements, which is how we get the results of all four comparisons into a single scalar register. Then we just test that register for all zeroes?

Also, do you happen to know if I can use Intel C++ intrinsics to pull off what you've written? I'm already deviating from simple C++ by using intrinsics. If possible I'd like to avoid another non-C++ construct: assembly.

Thanks again for your help.


_mm256_cmp_pd

_mm256_testz_pd

I suggest using intrinsics instead of assembly since they are more future-proof. For example, by using _mm256_mul_pd and _mm256_add_pd in your code, you will automatically be able to use the FMA instructions that arrive with AVX2 (with the Intel compiler) without a single line of code change; in ASM you would have to rewrite everything.
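To make the two intrinsics named above concrete, here is a minimal sketch of how they might be combined for the "is any element of A less than B" test. The function name `any_less` and the pointer interface are assumptions for illustration, and the `target("avx")` attribute is GCC/Clang syntax; with the Intel compiler you would typically enable AVX with a compiler option instead.

```cpp
#include <immintrin.h>
#include <cassert>

// Hypothetical helper: returns true when a[i] < b[i] for at least one of the four elements.
__attribute__((target("avx")))
bool any_less(const double* a, const double* b) {
    __m256d va = _mm256_loadu_pd(a);   // unaligned loads, so any double[4] works
    __m256d vb = _mm256_loadu_pd(b);
    // All ones in each element where a[i] < b[i], all zeros elsewhere.
    __m256d lt = _mm256_cmp_pd(va, vb, _CMP_LT_OS);
    // _mm256_testz_pd looks only at the sign bit of each element and returns 1
    // when all of them are zero; that is enough for a comparison mask.
    return _mm256_testz_pd(lt, lt) == 0;
}
```

Because a comparison result is either all ones or all zeros per element, checking just the sign bits loses no information here.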


**If you have only a single operation like this it's not even worth the effort to find out what it takes to get parallel instructions, let alone to ask for advice.**
