Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

_mm_extract_ps returns int (for a long long time)



This issue has looked like bad design or a bug to many programmers for years, but the problem is still there.

Why does _mm_extract_ps return an int? The intrinsics follow a naming convention in which the _ps and _epi32 suffixes mark float and int types respectively. We have _mm_extract_epi32, which maps to the pextrd instruction and returns an int. Yet _mm_extract_ps, which maps to extractps, also returns an int. Why? Will somebody fix it some day?

I want to write code like

template <int i> float get() const noexcept { return _mm_extract_ps(xmm_, i); }

and not like

template <int i> float get() const noexcept {
    int v = _mm_extract_ps(xmm_, i);
    float f;
    memcpy(&f, &v, sizeof(v)); // the standard-recommended, portable way to type-pun in C++
    return f;
}
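
To illustrate why the memcpy step matters: returning the int result directly from a function declared to return float compiles, but the implicit conversion is numeric and does not reinterpret the bits. The sketch below shows the difference without any intrinsics. The helper names bits_to_float, int_to_float_value and check_bits_vs_value are made up for this example, and it assumes a 32-bit IEEE-754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret 32 raw bits as a float (no numeric conversion).
inline float bits_to_float(std::int32_t bits) noexcept {
    static_assert(sizeof(float) == sizeof(std::int32_t), "assumes a 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: what an implicit int -> float conversion does instead.
inline float int_to_float_value(std::int32_t bits) noexcept {
    return static_cast<float>(bits);
}

// Small self-check, assuming IEEE-754 single precision.
inline void check_bits_vs_value() {
    const std::int32_t one_bits = 0x3F800000; // bit pattern of 1.0f
    assert(bits_to_float(one_bits) == 1.0f);      // bits reinterpreted
    assert(int_to_float_value(one_bits) != 1.0f); // value converted: 1065353216.0f
}
```

So with the current int return type, a one-line get() that returns the result of _mm_extract_ps directly would yield the converted integer value rather than the float stored in the lane.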

P.S. Can somebody also explain why we need both the extractps and pextrd assembly instructions when technically they do the same thing? I don't think they change any flags or perform any checks. As far as I can see, there is no difference from

int _mm_extract_ps(__m128 xmm, int i) { return _mm_extract_epi32(_mm_castps_si128(xmm), i); }

Best regards, Vyacheslav

2 Replies

Assuming you're writing 64-bit code, floats are stored in xmm registers anyway.

So what you really want is a vector register shuffle that moves the floating-point value into the bottom of the vector register, and then to use that register in scalar mode.

See doug65536's answer here;

So something like;

template <int i> float get() const noexcept { return _mm_cvtss_f32(_mm_shuffle_ps(xmm_, xmm_, _MM_SHUFFLE(0, 0, 0, i))); }
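
As a standalone sketch of the shuffle approach, here is a free-function version that needs only SSE (xmmintrin.h), so it does not depend on SSE4.1 being enabled. The names get_lane_shuffle and check_shuffle are hypothetical and are not part of any Intel header.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: _mm_shuffle_ps, _mm_cvtss_f32, _mm_setr_ps

// Hypothetical helper: move lane i into lane 0, then read lane 0 as a scalar float.
template <int i>
inline float get_lane_shuffle(__m128 v) noexcept {
    static_assert(i >= 0 && i < 4, "lane index out of range");
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, i)));
}

// Small self-check: _mm_setr_ps stores its arguments in lane order 0..3.
inline void check_shuffle() {
    const __m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    assert(get_lane_shuffle<0>(v) == 1.0f);
    assert(get_lane_shuffle<3>(v) == 4.0f);
}
```

The result is already a float, so no memcpy or other type-punning step is needed.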



Sorry, but please no such assumptions. I need to use SIMD code on x86 and x64 across compilers and platforms (Windows, Linux, macOS).

Thank you for the link anyway. I found _MM_EXTRACT_FLOAT as the official solution, which is pretty interesting and fun. To me it looks like bad design. I still wonder what the reason for this solution is.

I don't think that using port 5 is a good idea anyway. Maybe a shift solution is simpler and faster for the CPU to perform:

template<int i> [[nodiscard]] float __vectorcall _mm_get_ps(__m128 v) {
    // Shift the register right by i * 4 bytes so lane i lands in lane 0, then read lane 0.
    return _mm_cvtss_f32(_mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(v), i * 4)));
}
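
For comparison, a minimal sketch that puts the shift variant next to the shuffle variant and checks that both return the same lane values. It uses only SSE2 (emmintrin.h) and leaves out __vectorcall so that it also builds with GCC and Clang. The names get_lane_shift, get_lane_shuffle, lanes_agree and check_lanes are hypothetical. It says nothing about which variant is faster; that depends on the microarchitecture and would have to be measured.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2: _mm_srli_si128 and the cast intrinsics

// Hypothetical helper: byte-shift the register right by i * 4 so lane i becomes lane 0.
template <int i>
inline float get_lane_shift(__m128 v) noexcept {
    static_assert(i >= 0 && i < 4, "lane index out of range");
    return _mm_cvtss_f32(_mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(v), i * 4)));
}

// Hypothetical helper: shuffle lane i into lane 0 instead.
template <int i>
inline float get_lane_shuffle(__m128 v) noexcept {
    static_assert(i >= 0 && i < 4, "lane index out of range");
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, i)));
}

// True if both variants read the same value from every lane (inputs must not be NaN).
inline bool lanes_agree(__m128 v) noexcept {
    return get_lane_shift<0>(v) == get_lane_shuffle<0>(v)
        && get_lane_shift<1>(v) == get_lane_shuffle<1>(v)
        && get_lane_shift<2>(v) == get_lane_shuffle<2>(v)
        && get_lane_shift<3>(v) == get_lane_shuffle<3>(v);
}

inline void check_lanes() {
    assert(lanes_agree(_mm_setr_ps(1.5f, -2.0f, 3.25f, 4.0f)));
}
```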
