Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Comparing two vectors of floats returns a vector of floats

nemequ
New Contributor I
1,244 Views

It should return a vector of integers.  The GCC documentation is pretty clear:

Vector comparison is supported with standard comparison operators: ==, !=, <, <=, >, >=. Comparison operands can be vector expressions of integer-type or real-type. Comparison between integer-type vectors and real-type vectors are not supported. The result of the comparison is a vector of the same width and number of elements as the comparison operands with a signed integral element type.

Note the last sentence.

GCC and clang both return the expected result, but ICC has other ideas and I end up with 1 << 31 instead of -1 for true.

Luckily the workaround is pretty straightforward: an explicit cast to the correct type.

0 Kudos
8 Replies
jimdempseyatthecove
Honored Contributor III
1,244 Views

This is a bug.

Nice observation by the way.

Have you observed that behavior using your local compiler as opposed to what is on godbolt.org?

The code listed there does not make sense:

cmp:
        cmpeqps   xmm0, xmm1                                    #7.15
        cvttss2si eax, DWORD PTR .L_2il0floatpacket.0[rip]      #7.15
        movd      xmm2, eax                                     #7.15
        pshufd    xmm3, xmm2, 0                                 #7.15
        pand      xmm0, xmm3                                    #7.15
        ret                                                     #7.15
.L_2il0floatpacket.0:
        .long   0xffffffff

The first instruction, cmpeqps, produces the desired result.

The second instruction is perplexing in that it is taking the binary bit pattern 0xffffffff and treating that as a float and converting that to an int (it could have kept this as a constant). Note, the float representation is a -QuietNaN and note that it is (may be) undefined as to what to do when you convert a -QuietNaN to integer. The fact that you end up with 0x80000000 implies that the hardware is converting a -QuietNaN to -0 (aka 0x80000000).
The third instruction then takes the converted -QuietNaN to int and places it into xmm2 lane 0
The fourth instruction then duplicates lane 0 to lanes 1, 2 and 3 (all lanes are converted -QuietNaNs 0x80000000)
The fifth instruction then integer ANDs with the original results producing 0x80000000 or 0x00000000 in respective lanes

Applying an explicit cast and getting the desired values is accidental to say the least.

Jim Dempsey

0 Kudos
nemequ
New Contributor I
1,244 Views

Yes, I initially found the issue while working on SIMDe.

clang actually emits a diagnostic (-Wvector-conversion) if you assign the result of a comparison of two vectors of floats to a vector floats.  I thought that was a good warning, so I cleaned up the underlying issue and my ICC CI build started failing.

Here is the test I put together locally:

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <inttypes.h>

typedef float vecf __attribute__((__vector_size__(16)));
typedef int32_t veci __attribute__((__vector_size__(16)));

veci cmp(vecf a, vecf b) {
    return a == b;
}

int main(void) {
  vecf a = { 1.0f, 2.0f, 3.0f, 4.0f };
  vecf b = { 1.0f, 1.0f, 3.0f, 3.0f };
  veci r;

  r = cmp(a, b);

  for (int i = 0 ; i < (sizeof(r) / sizeof(r[0])) ; i++) {
    printf("[%d] %" PRId32 "\n", i, r);
    assert(r == ((i & 1) ? INT32_C(0) : ~INT32_C(0)));
  }

  printf("Success\n");

  return 0;
}

It works with clang and gcc, but not ICC.

Yeah, the assembly really threw me.  At first I was thinking the cvttss2si was a cvttps2dq which, though wrong, at least sort of made sense.  But then when I saw the pshfd and looked back at the cvttss2si I realized something was very wrong.

I'm thinking the compiler saw that the return value was an int32_t, so it did a cvttss2si.  Then it realized that it's actually a vector of int32_t, so it moved the int32_t back into a new vector before returning.

Anyways, my project is fixed for now.  I just wanted to report it so Intel can fix it, and hopefully that will save someone else some confusion :)

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,244 Views

Be careful about making assumptions. CLang (OpenCL) verses GCC verses ICL verses other vendor verses the result of an Intel SIMD vector compare verses nVidia, ... as to what constitutes the result of a vector compare.

OpenCL states signedness (- == true), GCC states !false with false==0, ICC (tends to) follow GCC. Note that while the Intel (AMD?) SIMD instruction cmpeqps produces 0xFFFFFFFF for true and 0 for false, and thus satisfies:

   -==true (implies !true == false)
as well as
  0==false (!false == true)

That your assert code is in error as it assumes ~0==true. A proper assert should test only the sign bit. Running on a different vector capable CPU may only conform to the signed-ness for the true/false.

*** BTW In the Fortran world, the least significant bit is the true/false and the SIMD test result is valid there too.

Jim Dempsey

 

 

0 Kudos
nemequ
New Contributor I
1,244 Views

jimdempseyatthecove (Blackbelt) wrote:
Be careful about making assumptions.

Words to live by.  Yes, you're absolutely right.

jimdempseyatthecove (Blackbelt) wrote:
CLang (OpenCL) verses GCC verses ICL verses other vendor verses the result of an Intel SIMD vector compare verses nVidia, ... as to what constitutes the result of a vector compare.

OpenCL states signedness (- == true), GCC states !false with false==0, ICC (tends to) follow GCC. Note that while the Intel (AMD?) SIMD instruction cmpeqps produces 0xFFFFFFFF for true and 0 for false, and thus satisfies:

I didn't realize that about OpenCL, that's very good to know, thank you.  I believe TI's compilers try to stick close to the OpenCL specification for their SIMD stuff, so there is a good chance that will come up when I get around to them.

jimdempseyatthecove (Blackbelt) wrote:

   -==true (implies !true == false)
as well as
  0==false (!false == true)

That your assert code is in error as it assumes ~0==true. A proper assert should test only the sign bit. Running on a different vector capable CPU may only conform to the signed-ness for the true/false.

I'll definitely keep that in mind, but in my case I think it's probably necessary verify that all bits are set for true and unset for false, even if it means that on some architectures I have to add a second operation to extend the sign across the entire lane since there is an excellent chance that the result will be used as a mask in another operation.

For what it's worth NEON, AltiVec, and WASM all use 0 and ~0, so at least my most important targets shouldn't require any extra operations.  Even if I come across a platform which only sets the sign bit, a right shift is usually quite fast, so extending it shouldn't be that big of a deal.

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,244 Views

>>so at least my most important targets shouldn't require any extra operations

If your current CPUs adopt Intel SIMD format, then you may likely have an additional operation (inserted by the compiler). While cmpeqps produces same lane-width bit fields for results, the newer ..._mask_... use a bitmask. So a compiler generated conversion between mask formats may be generated... and potentially require having a specific input bit setting requirement (whatever that is). My guess would be one of the

_mm..._mask_test_epi.._mask(a,b)

instructions (which can use the same register for a and b, and which becomes a 0/ non-zero test.

But this is only a guess.

Jim Dempsey

0 Kudos
Viet_H_Intel
Moderator
1,244 Views

I've reported this issue to our Developer. Case is CMPLRIL0-32449

0 Kudos
nemequ
New Contributor I
1,244 Views

jimdempseyatthecove (Blackbelt) wrote:
If your current CPUs adopt Intel SIMD format, then you may likely have an additional operation (inserted by the compiler).

I'm not sure if you mean that there are likely extra instructions generated by the compiler to get from <0/≥0 to 0/~0, or that in the future I'll need extra ops to get from 0/~0 (or <0/≥0) to AVX512-style… if the former I know that's not the case for at least NEON (FCMEQ), AltiVec/VMX (xvcmpeqsp, PDF, page 666), WASM, and MSA (FCEQ, PDF, page 158).

jimdempseyatthecove (Blackbelt) wrote:
While cmpeqps produces same lane-width bit fields for results, the newer ..._mask_... use a bitmask. So a compiler generated conversion between mask formats may be generated... and potentially require having a specific input bit setting requirement (whatever that is). My guess would be one of the

_mm..._mask_test_epi.._mask(a,b)

instructions (which can use the same register for a and b, and which becomes a 0/ non-zero test.

Thanks, _mm*_mask_test_epi*_mask looks good for going from an AVX512-style mask to a vector of 0/~0.  I was going to have to figure that once we start implementing AVX512 (which is likely soon, we're on AVX2 now).

Unfortunately I'm also going to need a portable version, and possibly architecture-specific versions if they're faster than the portable version, so I can implement AVX512 on machines without it (including Intel CPUs).  I'm thinking maybe a lookup table with nibbles for keys coupled with some widening operations may be the best option there, but TBH I haven't thought about it much yet, but if you have any ideas I'd be extremely interested.

Luckily going the other way shouldn't really be necessary since all the other architectures only have 128-bit or 64-bit vectors, so they can and should largely just use MMX/SSE*/AVX*.

Viet Hoang (Intel) wrote:
I've reported this issue to our Developer. Case is CMPLRIL0-32449

Excellent, thank you.

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,244 Views

I am saying that the CLang/C++/(??) for array comparison operators has defined the use of the same width vector cell to contain the true/false of the comparison. I am additionally saying that the earlier versions of the "masked" SIMD instructions used same SIMD width vector cell, however, the newer instructions use a bitmask. Therefore, when the external language is using the wide width cell but the CPU instruction requires a bitmask, that the compiler will (ought to) insert the additional instruction. One could write the comparison function to return the __mmask.. type (or squished bitmask) however the language specification is lacking this. (you can write your own function/template).

>> I'm also going to need a portable version

Then you stick with the language requirements, I suggest you use

false = (value < 0) ! requires value be of a signed kind of same width as comparands
true = !false

But I also suggest you test this on all platforms and compilers

Jim Dempsey

0 Kudos
Reply