Anyone have a suggestion for what SIMD instructions are best for this on an AVX chip?
I've got two 4-element vectors, A and B, of packed doubles.
For corresponding elements of A and B, I want to know whether or not A is numerically less than B.
But then, I want to know if any of those four comparisons had an answer of "true". So essentially I want to do a logical "or" across the register containing the results of those four comparisons. Is there an efficient way to do this?
vcmppd ymm1,ymm1,ymm2,5 ; not less than
vmovmskpd eax,ymm1
test eax,eax
jz ...
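The immediate 5 selects the "not less than" predicate. If none of the four comparisons succeeds, eax is zero, which means that every element of the first operand is less than the corresponding element of the second.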
An alternative could be this:
vcmppd ymm1,ymm1,ymm2,5 ; not less than
vxorpd ymm0,ymm0,ymm0
vptest ymm0,ymm1
jc ...
I cannot test which is faster.
Aftermath:
The solution from bronxzv below seems to be the best:
vcmppd ymm1,ymm1,ymm2,1 ; less than
vptest ymm1,ymm1
jnz ...
I'm new to SIMD programming, but what you wrote looks promising. So is the basic idea as follows?
The "vcmppd" operation fills each element of the destination register with all 1 bits or all 0 bits, and that includes the sign bit. Then "vmovmskpd" gathers the sign bits from all two or four packed elements, which is how we get the results of all four comparisons into a single scalar register. Then we just test that register for all zeroes?
Also, do you happen to know if I can use Intel C++ intrinsics to pull off what you've written? I'm already deviating from simple C++ by using intrinsics. If possible I'd like to avoid another non-C++ construct: assembly.
Thanks again for your help.
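The mechanism described in this post can be sketched with intrinsics. This is a minimal illustration, not code from the thread: the function name less_than_mask, the array-based interface, and the GCC/Clang target attribute are assumptions made for the example. It uses the quiet predicate _CMP_LT_OQ, whereas the assembly above uses immediate 1 (the signaling variant _CMP_LT_OS); both yield "false" for lanes that contain a NaN.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. The target attribute enables AVX for this
   function only; with other compilers, remove it and build with AVX on. */
__attribute__((target("avx")))
int less_than_mask(const double a[4], const double b[4])
{
    __m256d va = _mm256_loadu_pd(a);   /* unaligned loads of 4 doubles */
    __m256d vb = _mm256_loadu_pd(b);

    /* Each 64-bit lane becomes all ones where a[i] < b[i], else all zeros. */
    __m256d lt = _mm256_cmp_pd(va, vb, _CMP_LT_OQ);

    /* Gather the sign bit of each lane into bits 0..3 of an int
       (this is what vmovmskpd does). */
    return _mm256_movemask_pd(lt);
}
```

A nonzero return value means at least one element of a compared less than its counterpart in b; a return value of 0xF means all four did.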
_mm256_cmp_pd
_mm256_testz_pd
I suggest using intrinsics instead of assembly, since they are more future-proof. For example, if you use _mm256_mul_pd and _mm256_add_pd in your code, you will automatically be able to use the FMA instructions available on AVX2-era chips (with the Intel compiler) without changing a single line of code; in ASM you would have to rewrite everything.
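One way the two intrinsics named above could be combined is sketched here. The function name any_less, the array-based interface, and the GCC/Clang target attribute are assumptions for illustration and do not come from the thread. _mm256_testz_pd returns 1 when the sign bits of (x AND y) are all zero, so testing the comparison result against itself and negating gives "any lane is less than".

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. The target attribute enables AVX for this
   function only; with other compilers, remove it and build with AVX on. */
__attribute__((target("avx")))
int any_less(const double a[4], const double b[4])
{
    /* Lanes where a[i] < b[i] become all ones, the others all zeros. */
    __m256d lt = _mm256_cmp_pd(_mm256_loadu_pd(a),
                               _mm256_loadu_pd(b),
                               _CMP_LT_OQ);

    /* testz yields 1 only if no lane of (lt AND lt) has its sign bit set,
       i.e. no comparison was true; negate to get "at least one was true". */
    return !_mm256_testz_pd(lt, lt);
}
```

The function returns 1 if at least one element of a is less than the corresponding element of b, and 0 otherwise.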