Software Archive
Read-only legacy content

Conversion from float64 to int64 - intrinsics

Przemyslaw_K_
Beginner

Hi,

I am looking for a way to convert between an 8 x float64 vector and an 8 x int64 vector. I was looking at the intrinsics (in the 14.0 reference and in zmmintrin.h) and found only this:

__m512i __cdecl _mm512_cvtfxpnt_roundpd_epu32lo(__m512d v2, int rc);

but what I'm looking for is something more like:

__m512i __cdecl _mm512_cvtfxpnt_roundpd_epu64(__m512d v2, int rc);

Is there any trick for doing this kind of conversion?

Thanks,
Przemek.
 

Kevin_D_Intel
Employee

One of our expert developers shared the following code to perform such a conversion, relying on compiler vectorization:

__m512i _vec512_cvtpd_epi64(__m512 d) {
    __m512i res;
    int i;
    for (i=0;i<8;i++) {
        ((__int64*)&res)[i] = (__int64)(((double*)&d)[i]);
    }
    return res;
}

Hope this helps.

Patrick_S_
New Contributor I

I think there is a typo. The input parameter must be a __m512d type.

How is it possible to vectorize that code if there is no corresponding vector conversion from float64 to int64?

I tried it myself with the following code:

#include <iostream>
#include <immintrin.h>

inline __m512i _vec512_cvtpd_epi64( __m512d d ) {

        __m512i res;

        #pragma ivdep
        for ( int i = 0; i < 8; ++i ) {

                ( (__int64*)&res )[i] = (__int64)( ((double*)&d)[i] );
        }

        return res;
}


int main( int argc, char **argv ) {

        double  *array_d = ( double*)_mm_malloc( 16 * sizeof( double), 64 );
        __int64 *array_i = (__int64*)_mm_malloc( 16 * sizeof(__int64), 64 );

        for ( int i = 0; i < 16; ++i ) {

                array_d[i] = 12.123456789 + i;
                array_i[i] = 0;
        }

        __m512d d_;
        __m512i i_;

        asm("start__");
        for ( int i = 0; i < 2; ++i ) {

                d_ = _mm512_load_pd( array_d + 8*i );

                asm("start_cnvt");
                i_ = _vec512_cvtpd_epi64( d_ );
                asm("end_cnvt");

                _mm512_store_epi64( array_i + 8*i, i_ );
        }
        asm("end__");

        for ( int i = 0; i < 16; ++i ) {

                std::cout << array_i[i] << " ";
        }
        std::cout << std::endl;

        _mm_free( array_d );
        _mm_free( array_i );

        return 0;
}

 

I compiled it with -O3. The compiler generates the following assembly code:

# Begin ASM
# Begin ASM
        start_cnvt
# End ASM                                                       #36.0
# End ASM
                                # LOE rbx r12 r13 r15 r14b
..B1.22:                        # Preds ..B1.23 Latency 153
        vmovaps   64(%rsp), %zmm0                               #37.8 c1
        movl      $21845, %eax                                  #37.8 c1
        vbroadcastsd .L_2il0floatpacket.8(%rip), %zmm2          #37.8 c5
        vpandq    .L_2il0floatpacket.7(%rip){1to8}, %zmm0, %zmm14 #37.8 c9
        vpandnq   %zmm14, %zmm2, %zmm17                         #37.8 c13
        kmov      %eax, %k6                                     #37.8 c13
        vpxord    %zmm19, %zmm19, %zmm19                        #37.8 c17
        movl      $43690, %eax                                  #37.8 c17
        vmovaps   %zmm19, %zmm3                                 #37.8 c21
        kmov      %eax, %k1                                     #37.8 c21
        vmovdqa32 %zmm17{cdab}, %zmm3{%k6}                      #37.8 c25
        vpbroadcastq .L_2il0floatpacket.10(%rip), %zmm5         #37.8 c29
        vpsrld    $20, %zmm3, %zmm4                             #37.8 c33
        vpsubd    %zmm5, %zmm4, %zmm6                           #37.8 c37
        vpandq    .L_2il0floatpacket.8(%rip){1to8}, %zmm0, %zmm1 #37.8 c41
        vpshufd   $160, %zmm6, %zmm11                           #37.8 c45
        vporq     .L_2il0floatpacket.9(%rip){1to8}, %zmm1, %zmm9 #37.8 c49
        vpsubrd   .L_2il0floatpacket.12(%rip){1to16}, %zmm11, %zmm10 #37.8 c53
        vpcmpltd  .L_2il0floatpacket.12(%rip){1to16}, %zmm11, %k4 #37.8 c57
        vpcmpd    $4, %zmm19, %zmm11, %k3                       #37.8 c61
        vbroadcastsd .L_2il0floatpacket.11(%rip), %zmm16        #37.8 c65
        vpsrlvd   %zmm10, %zmm9, %zmm12                         #37.8 c69
        kand      %k4, %k3                                      #37.8 c69
        vpsllvd   %zmm11, %zmm9, %zmm13                         #37.8 c73
        kand      %k1, %k3                                      #37.8 c73
        vaddpd    {rz}, %zmm16, %zmm14, %zmm15                  #37.8 c77
        vmovaps   %zmm19, %zmm7                                 #37.8 c81
        vmovdqa32 %zmm9{cdab}, %zmm7{%k1}                       #37.8 c85
        vpsubd    .L_2il0floatpacket.12(%rip){1to16}, %zmm11, %zmm8 #37.8 c89
        vpord     %zmm12{cdab}, %zmm13, %zmm13{%k3}             #37.8 c93
        vpsubd    %zmm16, %zmm15, %zmm18                        #37.8 c97
        vcmpltpd  .L_2il0floatpacket.11(%rip){1to8}, %zmm17, %k5 #37.8 c101
        vpsllvd   %zmm8, %zmm7, %zmm20                          #37.8 c105
        vmovdqa32 %zmm13, %zmm20{%k4}                           #37.8 c109
        vmovdqa64 %zmm18, %zmm20{%k5}                           #37.8 c113
        vcmpltpd  %zmm19, %zmm0, %k2                            #37.8 c117
        vpsubsetbd %zmm20, %k7, %zmm19{%k6}                     #37.8 c121
        nop                                                     #37.8 c125
        kmov      %k7, %edx                                     #37.8 c129
        addl      %edx, %edx                                    #37.8 c133
        kmov      %edx, %k3                                     #37.8 c137
        vpsbbd    %zmm20, %k3, %zmm19{%k1}                      #37.8 c141
        vmovapd   %zmm19, %zmm20{%k2}                           #37.8 c145
        nop                                                     #37.8 c149
        vmovaps   %zmm20, (%rsp)                                #37.8 c153
                                # LOE rbx r12 r13 r15 r14b
..B1.21:                        # Preds ..B1.22 Latency 0
# Begin ASM
# Begin ASM
        end_cnvt
# End ASM                                                       #38.0
# End ASM

This doesn't look very efficient to me compared to, e.g., a conversion from float to int32.

Przemyslaw_K_
Beginner

Hi,

thanks for the tip. I've checked it and I'm a little bit disappointed with the result (in terms of performance)... I was hoping that there might be some piece of documentation missing about int64 vector intrinsics. The main reason I don't want to rely on compiler optimizations is that I'm doing instruction-level performance measurements; otherwise I'd see no good reason for using intrinsics...

I've made some examples (see attachments). Both of them were created with:

icc -mmic -O2 -S XXX.cpp
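
The attachments aren't reproduced here; the following is only a rough sketch of the kind of harness involved (the actual test.cpp/test2.cpp contents are an assumption on my part). It brackets the conversion with timestamp counters, which is what the rdtsc byte pairs (.byte 15 / .byte 49) in the listings below come from:

#include <cstdio>
#include <immintrin.h>

int main() {
    __m512d d = _mm512_set1_pd(12.123456789);

    unsigned long long t0 = __rdtsc();                  // start timestamp
    // either the emulated double->int64 helper or, as in test2.cpp:
    __m512i v = _mm512_cvtfxpnt_roundpd_epu32lo(d, _MM_FROUND_TO_NEAREST_INT); // rc value assumed; the listing shows $0
    unsigned long long t1 = __rdtsc();                  // end timestamp

    // store and print so the conversion isn't optimized away
    __attribute__((aligned(64))) int out[16];
    _mm512_store_epi32(out, v);
    printf("%d, cycles: %llu\n", out[0], t1 - t0);
    return 0;
}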


This is the assembly I'm getting when using that piece of code in test.cpp (are the "# Preds" labels correct; e.g., shouldn't ..B1.1 be ..B1.7?):

                                # LOE rbx r12 r14
..B1.7:                         # Preds ..B1.1 Latency 163
        stmxcsr   (%rsp)                                        #21.1 c1
        movl      $il0_peep_printf_format_0, %edi               #35.3 c2
        orl       $32832, (%rsp)                                #21.1 c2
        ldmxcsr   (%rsp)                                        #21.1 c6
        vbroadcastsd .L_2il0floatpacket.2(%rip), %zmm0          #22.16 c7
        vpxord    %zmm19, %zmm19, %zmm19                        #30.8 c11
        .byte     15                                            #28.17 c11
        .byte     49                                            #28.17
        vmovaps   %zmm19, %zmm3                                 #30.8 c15
        movq      %rax, %r13                                    #28.17 c15
        vbroadcastsd .L_2il0floatpacket.4(%rip), %zmm2          #30.8 c19
        vpandq    .L_2il0floatpacket.3(%rip){1to8}, %zmm0, %zmm14 #30.8 c23
        vpandnq   %zmm14, %zmm2, %zmm17                         #30.8 c27
        movl      $21845, %eax                                  #30.8 c27
        vpbroadcastq .L_2il0floatpacket.6(%rip), %zmm5          #30.8 c31
        vpandq    .L_2il0floatpacket.4(%rip){1to8}, %zmm0, %zmm1 #30.8 c35
        vporq     .L_2il0floatpacket.5(%rip){1to8}, %zmm1, %zmm9 #30.8 c39
        vbroadcastsd .L_2il0floatpacket.7(%rip), %zmm16         #30.8 c43
        vaddpd    {rz}, %zmm16, %zmm14, %zmm15                  #30.8 c47
        kmov      %eax, %k6                                     #30.8 c47
        vmovdqa32 %zmm17{cdab}, %zmm3{%k6}                      #30.8 c51
        movl      $43690, %eax                                  #30.8 c51
        vpsrld    $20, %zmm3, %zmm4                             #30.8 c55
        kmov      %eax, %k1                                     #30.8 c55
        vpsubd    %zmm5, %zmm4, %zmm6                           #30.8 c59
        vmovaps   %zmm19, %zmm7                                 #30.8 c63
        vpshufd   $160, %zmm6, %zmm11                           #30.8 c67
        vpsubrd   .L_2il0floatpacket.8(%rip){1to16}, %zmm11, %zmm10 #30.8 c71
        vpcmpltd  .L_2il0floatpacket.8(%rip){1to16}, %zmm11, %k4 #30.8 c75
        vpcmpd    $4, %zmm19, %zmm11, %k3                       #30.8 c79
        vpsrlvd   %zmm10, %zmm9, %zmm12                         #30.8 c83
        vpsllvd   %zmm11, %zmm9, %zmm13                         #30.8 c87
        kand      %k4, %k3                                      #30.8 c87
        vmovdqa32 %zmm9{cdab}, %zmm7{%k1}                       #30.8 c91
        kand      %k1, %k3                                      #30.8 c91
        vpsubd    .L_2il0floatpacket.8(%rip){1to16}, %zmm11, %zmm8 #30.8 c95
        vpord     %zmm12{cdab}, %zmm13, %zmm13{%k3}             #30.8 c99
        vpsubd    %zmm16, %zmm15, %zmm18                        #30.8 c103
        vcmpltpd  .L_2il0floatpacket.7(%rip){1to8}, %zmm17, %k5 #30.8 c107
        vpsllvd   %zmm8, %zmm7, %zmm1                           #30.8 c111
        .byte     15                                            #32.16 c111
        .byte     49                                            #32.16
        vmovdqa32 %zmm13, %zmm1{%k4}                            #30.8 c115
        movq      %rdx, %r15                                    #32.16 c115
        vmovdqa64 %zmm18, %zmm1{%k5}                            #30.8 c119
        vcmpltpd  %zmm19, %zmm0, %k2                            #30.8 c123
        vpsubsetbd %zmm1, %k7, %zmm19{%k6}                      #30.8 c127
        nop                                                     #30.8 c131
        kmov      %k7, %ecx                                     #30.8 c135
        addl      %ecx, %ecx                                    #30.8 c139
        kmov      %ecx, %k3                                     #30.8 c143
        vpsbbd    %zmm1, %k3, %zmm19{%k1}                       #30.8 c147
        vmovapd   %zmm19, %zmm1{%k2}                            #30.8 c151
        nop                                                     #35.3 c155
        vmovaps   %zmm1, 64(%rsp)                               #35.3 c159
        call      puts                                          #35.3 c163
                                # LOE rbx r12 r13 r14 r15
..B1.2:                         # Preds ..B1.7 Latency 9



For comparison, this is the code for _mm512_cvtfxpnt_roundpd_epu32lo in test2.cpp:

                                # LOE r12 r13 r14
..B1.7:                         # Preds ..B1.1 Latency 35
        stmxcsr   64(%rsp)                                      #21.1 c1
        movl      $il0_peep_printf_format_0, %edi               #35.3 c2
        orl       $32832, 64(%rsp)                              #21.1 c2
        ldmxcsr   64(%rsp)                                      #21.1 c6
        vbroadcastsd .L_2il0floatpacket.0(%rip), %zmm0          #22.16 c7
        .byte     15                                            #28.17 c11
        .byte     49                                            #28.17
        vcvtfxpntpd2udq $0, %zmm0, %zmm1                        #30.8 c15
        movq      %rax, %rbx                                    #28.17 c15
        nop                                                     #30.8 c19
        vmovaps   %zmm1, (%rsp)                                 #30.8 c23
        movb      %al, %al                                      #32.16 c23
        .byte     15                                            #32.16 c27
        .byte     49                                            #32.16
        movq      %rdx, %r15                                    #32.16 c31
        call      puts                                          #35.3 c35
                                # LOE rbx r12 r13 r14 r15
..B1.2:                         # Preds ..B1.7 Latency 9

 

Thanks,

Przemek.



 

jimdempseyatthecove
Honored Contributor III

Przemek,

I will outline the scheme, and let you select the _mm512_... intrinsics.

If the input vector d can contain negative numbers, first produce a mask of the negative elements of the incoming vector d.

Using XOR (abs, or some other fast means), remove the sign bit from the input vector d (the sign bit is the leftmost bit of each 64-bit element). At this point d holds the absolute values of the input vector.

Then, for either sign of input (now abs(d)), perform a vector-wide add of (double)((__int64)1 << 51). The result now has right-justified integer values combined with a non-zero exponent.

Next, remove the exponent with a vector-wide AND mask of (((__int64)1 << 51) - 1). You now have a vector of int64s. These are truncated; if you want rounded values, add (or subtract) 0.5 as a separate step prior to adding (double)((__int64)1 << 51). Alternatively, after removing the sign bit you can add (double)((__int64)1 << 50), then add +1, then divide by 2. More steps.

If the input vector could potentially contain negative numbers, then use the mask obtained earlier to perform the negates/merge.

In the case where the input contains only positive numbers, this should reduce to one ADD and one AND (with appropriate casts).

In the case where the input may contain negative numbers, it will be a bit more complicated to produce properly signed results; however, it will be completely vectorized and should be faster than the element-by-element conversion, which goes through RAM/cache rather than staying in registers.
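
A minimal intrinsics sketch of the positive-input case, under a few labeled assumptions rather than Jim's exact constants: it uses the conventional 2^52 magic constant and a 52-bit mantissa mask (instead of the 1 << 51 values above), AVX-512-style intrinsic names whose KNC equivalents should be checked in zmmintrin.h, and it only claims correctness for inputs in [0, 2^51]:

#include <immintrin.h>

/* Sketch: 8 non-negative doubles -> int64 via the magic add.
   With the default round-to-nearest mode the result is rounded, not
   truncated; an add with {rz} rounding would give truncation instead. */
static inline __m512i cvtpd_epi64_nonneg(__m512d d)
{
    const __m512d kshift = _mm512_set1_pd(4503599627370496.0);   /* 2^52 */
    const __m512i kmant  = _mm512_set1_epi64((1LL << 52) - 1);   /* low 52 bits */
    __m512d biased = _mm512_add_pd(d, kshift);  /* integer value now sits in the mantissa bits */
    return _mm512_and_epi64(_mm512_castpd_si512(biased), kmant);
}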

Jim Dempsey

Kevin_D_Intel
Employee

Further feedback from the Developer is:

The assembly code for the double->int64 conversion shown in the forum is the expected instruction sequence.

The difference between float->int32 and double->int64 conversion is that vector float->int32 has hardware support, so the compiler generates a single instruction for it. There is no single vector instruction for double->int64 conversion on KNC, however, so the compiler generates a rather long instruction sequence to emulate it.

The "# Preds" labels appear correct. They identify the basic blocks from which control can reach the current basic block. The only reference I can find is this one, which is dated but still useful: http://www.intel.in/content/dam/www/public/us/en/documents/white-papers/ia-32-64-assembly-lang-paper.pdf (see pages 16-17).

jimdempseyatthecove
Honored Contributor III

After a re-think, here is pseudo-code.

always positive inputs, truncated

out = and(in+Kshift, maskMantissa)

always positive inputs, rounded

out = and((in+.5) + Kshift, maskMantissa)

perform (in+.5) as a separate _mm512 operation

signed inputs, truncated

signs = and(in, maskSigns)

out = or(and(xor(in,signs)+Kshift, maskMantissa), signs)

signed input rounded

signs = and(in, maskSigns)

out = or(and((xor(in,signs)+.5)+Kshift, maskMantissa), signs)

Where:

Kshift is a vector of doubles that, when added to the input value x (the absolute value, if necessary), causes the lsb of the integer portion of the number to shift down to the lsb of the 64-bit double. *** There may be an issue where in+Kshift rounds the number; you will have to decide whether you want rounding or truncation. If the +Kshift rounds AND you want rounded numbers, then you can omit the +0.5 step, as this is done for you in hardware.

maskMantissa is a bit mask with 51 one-bits in the low-order positions of the 64-bit integer.

maskSigns is a bit mask with only the msb (sign bit) set in the 64-bit int (and in the double).

Assuming in+Kshift produces the correctly right-shifted number (with the rounding behavior you want), the signed, rounded case should reduce to:

signs = and(in, maskSigns)

out = or(and(xor(in,signs)+Kshift, maskMantissa), signs)

or 5 instructions to convert the 8 doubles to signed int64

YMMV
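
For the negative lanes, a hedged alternative to the or/xor merge above is a masked negate of the converted magnitudes (my assumption, not Jim's exact formula). The intrinsic names below are AVX-512 style; KNC in particular has no native 64-bit integer subtract, so _mm512_mask_sub_epi64 would itself need emulation there:

#include <immintrin.h>

/* Sketch: signed double -> int64 for |d| < 2^51.
   Convert |d| with the magic add, then negate the lanes whose input
   was negative. */
static inline __m512i cvtpd_epi64_signed(__m512d d)
{
    const __m512d kshift = _mm512_set1_pd(4503599627370496.0);    /* 2^52 */
    const __m512i kmant  = _mm512_set1_epi64((1LL << 52) - 1);
    __mmask8 neg  = _mm512_cmplt_pd_mask(d, _mm512_setzero_pd()); /* negative lanes */
    __m512d  absd = _mm512_abs_pd(d);                             /* strip the sign bit */
    __m512i  mag  = _mm512_and_epi64(
                        _mm512_castpd_si512(_mm512_add_pd(absd, kshift)), kmant);
    /* 0 - mag in the negative lanes, mag elsewhere */
    return _mm512_mask_sub_epi64(mag, neg, _mm512_setzero_si512(), mag);
}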

Jim Dempsey

 

 

jimdempseyatthecove
Honored Contributor III

Ignore the signed case for now; what I showed won't work. Sorry for any inconvenience this may have caused you.

Jim Dempsey
