I noticed there is no user-level intrinsic ‘_mm512_sincos_ps’ or ‘_mm512_mask_sincos_ps’ defined in zmmintrin.h.
However, I have just found out that the Intel compiler emits ‘__svml_sincosf16’ and ‘__svml_sincosf16_mask’ when it autovectorises code and finds ‘cos’ and ‘sin’ operations on the same value.
I have been doing some tests, and if I declare my own ‘_mm512_sincos_ps’ in user code, the Intel compiler recognises it and translates it into a call to ‘__svml_sincosf16’. However, the result of my code is incorrect, maybe because the parameters of my declaration are not the ones the compiler expects.
Could anyone please tell me why ‘_mm512_sincos_ps’ has not been defined in zmmintrin.h, and what the expected parameters are so that I can declare it appropriately?
Thank you in advance.
Best regards.
(Using icc 14.0.1)
Thank you for your reply!
I agree. If someone could move the post to the ISA forum, that would be great. I would rather not duplicate the post.
I need to generate something similar to what the Intel compiler does, so I should use the '__svml_sincosf16' function. Using sin and cos separately would have a significant impact on performance.
Let me show you this example:
[cpp]
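#include <math.h>
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atof */
/* Returns sin(a) and stores cos(a) through the pointer. */
#pragma omp declare simd
float __attribute__((noinline)) sin_cos(float a, float *cos)
{
    float sin;
    sin = sinf(a);
    *cos = cosf(a);
    return sin;
}
int main(int argc, char *argv[])
{
    float cos[16];
    float sin[16];
    float input;
    int i;
    if (argc < 2) return 1;  /* the input value is required */
    input = atof(argv[1]);
    #pragma omp simd
    for (i=0; i<16; i++)
    {
        sin[i] = sin_cos(input + i, &cos[i]);
    }
    for (i=0; i<16; i++)
    {
        printf("%d: sin=%f, cos=%f\n", i, sin[i], cos[i]);
    }
    return 0;
}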
[/cpp]
I compile it with: icc -fopenmp -O3 sincos.c -S -mmic
(note that I use MIC, not AVX2).
In the generated code, the main function calls the following function:
[cpp]
# -- Begin _ZGVMN16vv_sin_cos.U
# mark_begin;
# Threads 4
.align 16,0x90
.globl _ZGVMN16vv_sin_cos.U
_ZGVMN16vv_sin_cos.U:
# parameter 1: %zmm0
# parameter 2: %zmm1
# parameter 3: %zmm2
..B2.1: # Preds ..B2.0 Latency 17
pushq %rbp #5.1
movq %rsp, %rbp #5.1
andq $-64, %rsp #5.1
subq $320, %rsp #5.1 c1
vmovaps %zmm23, 64(%rsp) #5.1 c5
vmovaps %zmm1, %zmm23 #5.1 c9
vmovaps %zmm20, 128(%rsp) #5.1 c9
vmovaps %zmm2, %zmm20 #5.1 c13
call __svml_sincosf16 #8.11 c17
..B2.10: # Preds ..B2.1 Latency 29
vmovaps %zmm1, %zmm3 #8.11 c1
movl $255, %eax #9.6 c1
kmov %eax, %k1 #9.6 c5
movl $43690, %eax #9.6 c5
kmov %eax, %k2 #9.6 c9
movl $21845, %eax #9.6 c9
kmov %k5, %ecx ...
[/cpp]
I just wanted to know how I could call this '__svml_sincosf16' function from user space.
As I said, if you declare a function "_mm512_sincos_ps", the Intel compiler translates it to "__svml_sincosf16". But the generated code is not correct:
[cpp]
#include <immintrin.h>
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atof */
/* User-side declaration; the (cos pointer, input) parameter order is my guess. */
extern __m512 __ICL_INTRINCC _mm512_sincos_ps(__m512*, __m512);
__m512 __attribute__((noinline)) sin_cos(__m512 a, __m512 *cos)
{
    __m512 sin;
    sin = _mm512_sincos_ps(cos, a);
    return sin;
}
int main(int argc, char *argv[])
{
    float __attribute__((aligned(64))) cos[16];
    float __attribute__((aligned(64))) sin[16];
    __m512 * vsin = (__m512 *) sin;
    __m512 * vcos = (__m512 *) cos;
    int i;
    if (argc < 2) return 1;  /* the input value is required */
    /* Lane i holds input + i; _mm512_set_ps lists lanes from highest to lowest. */
    __m512 input = _mm512_add_ps(_mm512_set1_ps(atof(argv[1])), _mm512_set_ps(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));
    _mm512_store_ps(vsin, sin_cos(input, vcos));
    for (i=0; i<16; i++)
    {
        printf("%d: sin=%f, cos=%f\n", i, sin[i], cos[i]);
    }
    return 0;
}
[/cpp]
[cpp]
sin_cos:
# parameter 1: %zmm0
# parameter 2: %rdi
..B2.1: # Preds ..B2.0 Latency 1
..___tag_value_sin_cos.19: #6.1
jmp __svml_sincosf16 #9.11 c1
.align 16,0x90
..___tag_value_sin_cos.21: #
# LOE
# mark_end;
[/cpp]
Any idea?
I would expect it to crash, but it does not; the cosine is simply not computed.
The user could perhaps call the functions __svml_sincosf16 and __svml_sincosf16_mask directly.
But that will be difficult to do from a program written in C, since these functions have a special interface.
They return their results in two registers, which is very difficult (or impossible) to express in a C program.
Synopsis of these functions is something like the following:
(sin_res, cos_res) = __svml_sincosf16(source)
(sin_res, cos_res) = __svml_sincosf16_mask(sin_dest, cos_dest, mask, source)
Maybe you should try using assembly code in your C program, calling the function the way the compiler does, so that the generated code is correct.
Best, Qiao