Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

How to use ‘__svml_sincosf16’ and ‘__svml_sincosf16_mask’ from user space

Diego_Caballero
Beginner

I noticed that there is no user-level intrinsic ‘_mm512_sincos_ps’ or ‘_mm512_mask_sincos_ps’ defined in zmmintrin.h.

However, I have just found out that the Intel compiler emits ‘__svml_sincosf16’ and ‘__svml_sincosf16_mask’ when it autovectorises code and finds ‘cos’ and ‘sin’ operations on the same value.

I have been doing some tests, and if I define my own ‘_mm512_sincos_ps’ at user-space level, the Intel compiler recognises it and translates it into the appropriate ‘__svml_sincosf16’. However, the result of my code is incorrect, maybe because the parameters of my function are not the ones expected.

Could anyone please tell me why ‘_mm512_sincos_ps’ has not been defined in zmmintrin.h and what the expected parameters are, so that I can define it appropriately?

 Thank you in advance.

Best regards.

(Using icc 14.0.1)

13 Replies
SergeyKostrov
Valued Contributor II
Guys, I would simply like to note that if you have issues with Intel intrinsic functions, it is better to post a description of the problem on the ISA forum. We're on the Intel C++ compiler forum at the moment. However, the question is very interesting and I'll take a look. Thanks in advance for posting in the right IDZ forums.
SergeyKostrov
Valued Contributor II
>>...I have been doing some tests and if I define my own ‘_mm512_sincos_ps’ at user-space level, Intel compiler recognises it and
>>translates it into the appropriate ‘__svml_sincosf16’. However, the result of my code is incorrect, maybe because the parameters of
>>my function are not the expected...

Could you post a complete test case in order to reproduce the problem? Please also upload zmmintrin.h for verification (I could have a different release / update).
SergeyKostrov
Valued Contributor II
>>...it is better to post a description of the problem / issue on ISA forum...

Here is the web-link: http://software.intel.com/en-us/forums/intel-isa-extensions
SergeyKostrov
Valued Contributor II
Why wouldn't you combine the _mm512_sin_ps and _mm512_cos_ps intrinsic functions in a macro, or in a naked function like _mm512_sincos_ps (inputs are two arguments)? The same applies to the second function you need (inputs are six arguments). Another solution is to implement what you need using inline assembler.
Bernard
Valued Contributor I

Hi Diego

If you plan to use inline assembly to implement your own trigonometric function, I can share my code with you. My implementation uses SSE, but you can rewrite it to use AVX. I did not implement argument reduction.

 

Diego_Caballero
Beginner

Thank you for your reply!

I agree. If someone could move the post to the ISA forum, that would be great. I would rather not duplicate the post.

I need to generate something similar to what the Intel compiler does, so I should use the '__svml_sincosf16' function. Using sin and cos separately would have a significant impact on performance.

Let me show you this example:

[cpp]

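#include <math.h>
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atof */

#pragma omp declare simd
float __attribute__((noinline)) sin_cos(float a, float *cos)
{
    float sin;

    sin = sinf(a);
    *cos = cosf(a);

    return sin;
}

int main(int argc, char *argv[])
{
    float cos[16];
    float sin[16];
    float input;
    int i;

    /* The starting angle is taken from the command line */
    if (argc < 2)
        return 1;
    input = atof(argv[1]);

    /* Each iteration writes its own element of sin[] and cos[] */
#pragma omp simd
    for (i=0; i<16; i++)
    {
        sin[i] = sin_cos(input + i, &cos[i]);
    }

    for (i=0; i<16; i++)
    {
        printf("%d: sin=%f, cos=%f\n", i, sin[i], cos[i]);
    }

    return 0;
}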

[/cpp]

I compile it with: icc -fopenmp -O3 sincos.c -S -mmic

(note that I use MIC, not AVX2).

In the generated code, the main function calls the following function:

[cpp]

# -- Begin  _ZGVMN16vv_sin_cos.U
# mark_begin;
# Threads 4
        .align    16,0x90
    .globl _ZGVMN16vv_sin_cos.U
_ZGVMN16vv_sin_cos.U:
# parameter 1: %zmm0
# parameter 2: %zmm1
# parameter 3: %zmm2
..B2.1:                         # Preds ..B2.0 Latency 17
        pushq     %rbp                                          #5.1
        movq      %rsp, %rbp                                    #5.1
        andq      $-64, %rsp                                    #5.1
        subq      $320, %rsp                                    #5.1 c1
        vmovaps   %zmm23, 64(%rsp)                              #5.1 c5
        vmovaps   %zmm1, %zmm23                                 #5.1 c9
        vmovaps   %zmm20, 128(%rsp)                             #5.1 c9
        vmovaps   %zmm2, %zmm20                                 #5.1 c13
        call      __svml_sincosf16                              #8.11 c17
..B2.10:                        # Preds ..B2.1 Latency 29
        vmovaps   %zmm1, %zmm3                                  #8.11 c1
        movl      $255, %eax                                    #9.6 c1
        kmov      %eax, %k1                                     #9.6 c5
        movl      $43690, %eax                                  #9.6 c5
        kmov      %eax, %k2                                     #9.6 c9
        movl      $21845, %eax                                  #9.6 c9
        kmov      %k5, %ecx     ...

[/cpp]

I just wanted to know how I could call this '__svml_sincosf16' function from user space.

As I said, if you declare a function "_mm512_sincos_ps", the Intel compiler translates it to this "__svml_sincosf16". But the generated code is not correct:

[cpp]

#include <immintrin.h>
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atof */

/* User-supplied declaration; zmmintrin.h does not provide this intrinsic */
extern __m512  __ICL_INTRINCC _mm512_sincos_ps(__m512*, __m512);

__m512 __attribute__((noinline)) sin_cos(__m512 a, __m512 *cos)
{
    __m512 sin;

    sin = _mm512_sincos_ps(cos, a);

    return sin;
}

int main(int argc, char *argv[])
{
    float __attribute__((aligned(64))) cos[16];
    float __attribute__((aligned(64))) sin[16];
    __m512 * vsin = (__m512 *) &sin;
    __m512 * vcos = (__m512 *) &cos;

    if (argc < 2)
        return 1;

    /* Lanes hold input+0 .. input+15 */
    __m512 input = _mm512_add_ps(_mm512_set1_ps(atof(argv[1])), _mm512_set_ps(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));

    _mm512_store_ps(vsin, sin_cos(input, vcos));

    int i;
    for (i=0; i<16; i++)
    {
        printf("%d: sin=%f, cos=%f\n", i, sin[i], cos[i]);
    }

    return 0;
}

[/cpp]

[cpp]

sin_cos:
# parameter 1: %zmm0
# parameter 2: %rdi
..B2.1:                         # Preds ..B2.0 Latency 1
..___tag_value_sin_cos.19:                                      #6.1
        jmp       __svml_sincosf16                              #9.11 c1
        .align    16,0x90
..___tag_value_sin_cos.21:                                      #
                                # LOE
# mark_end;

[/cpp]

Any idea?

SergeyKostrov
Valued Contributor II
>>...As I said, if you declare a function "_mm512_sincos_ps", Intel Compiler translates it to this "__svml_sincosf16".
>>But the generated code is not correct...

If this is a bug, or an undocumented feature of the Intel C++ compiler, then I wouldn't expect a quick fix (it could actually take many weeks, if not months), and you should go ahead with a workaround based on a call to the __svml_sincosf16 function. Another workaround could be based on the MKL vectorized functions; take a look at mkl_vml_functions.h:

...
/* Sine & cosine: r1 = sin(a), r2=cos(a) */
_MKL_API( void,VSSINCOS,(const MKL_INT *n, const float a[], float r1[], float r2[]) )
_MKL_API( void,VDSINCOS,(const MKL_INT *n, const double a[], double r1[], double r2[]) )
_mkl_api( void,vssincos,(const MKL_INT *n, const float a[], float r1[], float r2[]) )
_mkl_api( void,vdsincos,(const MKL_INT *n, const double a[], double r1[], double r2[]) )
_Mkl_Api( void,vsSinCos,(const MKL_INT n, const float a[], float r1[], float r2[]) )
_Mkl_Api( void,vdSinCos,(const MKL_INT n, const double a[], double r1[], double r2[]) )
_MKL_API( void,VMSSINCOS,(const MKL_INT *n, const float a[], float r1[], float r2[], MKL_INT64 *mode) )
_MKL_API( void,VMDSINCOS,(const MKL_INT *n, const double a[], double r1[], double r2[], MKL_INT64 *mode) )
_mkl_api( void,vmssincos,(const MKL_INT *n, const float a[], float r1[], float r2[], MKL_INT64 *mode) )
_mkl_api( void,vmdsincos,(const MKL_INT *n, const double a[], double r1[], double r2[], MKL_INT64 *mode) )
_Mkl_Api (void,vmsSinCos,(const MKL_INT n, const float a[], float r1[], float r2[], MKL_INT64 mode) )
_Mkl_Api (void,vmdSinCos,(const MKL_INT n, const double a[], double r1[], double r2[], MKL_INT64 mode) )
...
SergeyKostrov
Valued Contributor II
>>...As I said, if you declare a function "_mm512_sincos_ps", Intel Compiler translates it to this "__svml_sincosf16".
>>But the generated code is not correct...

Does it crash the test application?
Diego_Caballero
Beginner

It should crash but the cosine is not computed.

SergeyKostrov
Valued Contributor II
I'll try to investigate this week...
SergeyKostrov
Valued Contributor II
I see that immintrin.h has two _mm256_sincos_xx intrinsic functions:

...
extern __m256 __ICL_INTRINCC _mm256_sincos_ps(__m256 *, __m256);
extern __m256d __ICL_INTRINCC _mm256_sincos_pd(__m256d *, __m256d);
...

and so far I can't explain why zmmintrin.h does not have 512-bit versions of these intrinsic functions. Any comments from Intel Software Engineers?
SergeyKostrov
Valued Contributor II
I would also try to use these IPP functions as a workaround:

...
IPPAPI( IppStatus, ippsSinCos_32f_A11, (const Ipp32f a[],Ipp32f r1[],Ipp32f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_32f_A21, (const Ipp32f a[],Ipp32f r1[],Ipp32f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_32f_A24, (const Ipp32f a[],Ipp32f r1[],Ipp32f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_64f_A26, (const Ipp64f a[],Ipp64f r1[],Ipp64f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_64f_A50, (const Ipp64f a[],Ipp64f r1[],Ipp64f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_64f_A53, (const Ipp64f a[],Ipp64f r1[],Ipp64f r2[],Ipp32s n))
...

because I have no answer so far.
QIAOMIN_Q_
New Contributor I

You may be able to call the functions __svml_sincosf16 and __svml_sincosf16_mask directly.
But it will be difficult to do from a program written in C, since these functions have a special interface.
They return their results in two registers, which is very difficult (or impossible) to express in a C program.
The synopsis of these functions is something like the following:
    (sin_res, cos_res) = __svml_sincosf16(source)
    (sin_res, cos_res) = __svml_sincosf16_mask(sin_dest, cos_dest, mask, source)

Maybe you should try using assembly code in your C program, the way the compiler does, so that the generated code is correct.

 

Best, Qiao
