How to convert __m512d to __m512 using AVX512 on KNL

Zekun_Y_ · ‎08-25-2016

Hi everyone,

I need to convert __m512d to __m512 in my current project to improve performance, since single precision lets me process twice as many numbers per vector.

I'm not very familiar with the AVX-512 extensions, so I'm not sure whether my code is the most efficient way to do this.

My code is as below:

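#include <immintrin.h>

inline void
pd2ps(__m512d *a1,__m512d *a2,__m512 *b){
	__m256 t1,t2;
	__m512 tb1,tb2;
	/* The rounding argument must be a compile-time constant, so pass the flags directly. */
	t1 = _mm512_cvt_roundpd_ps(*a1,_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
	t2 = _mm512_cvt_roundpd_ps(*a2,_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
	/* Widen to 512 bits; the upper 256 bits are undefined and are not read below. */
	tb1 = _mm512_castps256_ps512(t1);
	tb2 = _mm512_castps256_ps512(t2);
	/* 0x44 picks 128-bit lanes 0,1 of tb1 followed by lanes 0,1 of tb2. */
	*b = _mm512_shuffle_f32x4(tb1,tb2,0x44);
}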

Is there any better method to convert __m512d to __m512?

I hope you can give me some advice and share a more efficient implementation.

Thank you.

jimdempseyatthecove · ‎08-25-2016

Please note that the compiler's optimizer may do a better job with plain C/C++ code, for example:

for(int i=0;i<N; ++i)
  ArrayOf_floats[i] = (float)ArrayOf_doubles[i];

Build both implementations in release mode, then look at the disassembly in the debugger, or use VTune to examine the performance as well as the generated code (Assembly view).

*** Make sure you prevent the compiler from optimizing away unused results.
*** Run both in a loop several times, discarding the first iteration (to avoid cache load biases).

Jim Dempsey
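
The two measurement cautions above can be sketched as a small timing harness. This is only an illustration, not code from the thread: the names convert_scalar, time_conversion and g_sink are hypothetical, and clock() (processor time, coarse resolution) is an assumption chosen for portability. On a real KNL run you would likely want a finer timer and much larger arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Plain loop; the compiler is free to vectorize it. */
static void convert_scalar(const double *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (float)src[i];
}

/* Written after every run so the converted data counts as "used". */
static volatile float g_sink;

typedef void (*convert_fn)(const double *, float *, size_t);

/* Average seconds per run over `reps` timed runs.
   One untimed warm-up run is done first and discarded. */
static double time_conversion(convert_fn fn, const double *src, float *dst,
                              size_t n, int reps)
{
    assert(n > 0 && reps > 0);

    fn(src, dst, n);                  /* warm-up run, not timed */
    g_sink = dst[n - 1];

    clock_t start = clock();
    for (int r = 0; r < reps; ++r) {
        fn(src, dst, n);
        g_sink = dst[(size_t)r % n];  /* keep the result observable */
    }
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Any other implementation with the same signature can be passed as fn, so both variants are timed under identical conditions, e.g. time_conversion(convert_scalar, src, dst, n, 100).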

Zekun_Y_ · ‎08-25-2016

Thank you Jim, I will test both approaches and post the results later.

Best regards,

Zekun

Zekun_Y_ · ‎08-25-2016

Hi Jim,

This is the assembly code for the approach you just mentioned, compiled with the icc options -xMIC-AVX512 -O2. It looks much the same as the AVX512 code I wrote.

The KNL server is not available right now, so I will post the performance results later.

        vcvtpd2ps (%r14,%r12,8), %ymm0                          #34.17 c1
        vcvtpd2ps 64(%r14,%r12,8), %ymm1                        #34.17 c1
        vmovaps   %zmm0, %zmm2{%k2}{z}                          #34.17 c3
        vshuff32x4 $68, %zmm1, %zmm1, %zmm2{%k1}                #34.17 c5
        vmovntps  %zmm2, (%r13,%r12,4)                          #34.3 c7

Best regards,

Zekun

jimdempseyatthecove · ‎08-25-2016

It is not the same. Widen your scope (view more of the disassembly).

Note that only one pointer is used (as opposed to a1 and a2), and that the offset 64 is used to load the second vector. If you widen your scope, you may find that the compiler unrolled the loop and kept using increasing offsets (128, 192, 256, ...).

Jim
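
To make the single-pointer pattern concrete, here is a rough sketch of an intrinsics loop that walks the source array through one base pointer, loading the second vector at an offset of 8 doubles (64 bytes), much like the two vcvtpd2ps loads in the listing. The function name pd2ps_array is hypothetical, and so is the __AVX512F__ guard with its scalar fallback, which is there only so the sketch also builds on machines without AVX-512. It reuses the 0x44 shuffle from the original post and has not been measured on KNL.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Convert n doubles to floats; n is assumed to be a multiple of 16. */
static void pd2ps_array(const double *src, float *dst, size_t n)
{
    assert(n % 16 == 0);
#if defined(__AVX512F__)
    for (size_t i = 0; i < n; i += 16) {
        /* Two 8-double loads from the same base pointer, 64 bytes apart. */
        __m256 lo = _mm512_cvtpd_ps(_mm512_loadu_pd(src + i));
        __m256 hi = _mm512_cvtpd_ps(_mm512_loadu_pd(src + i + 8));

        /* Widen to 512 bits; the undefined upper halves are not selected. */
        __m512 wlo = _mm512_castps256_ps512(lo);
        __m512 whi = _mm512_castps256_ps512(hi);

        /* 0x44: lanes 0,1 of wlo, then lanes 0,1 of whi. */
        _mm512_storeu_ps(dst + i, _mm512_shuffle_f32x4(wlo, whi, 0x44));
    }
#else
    /* Fallback when AVX-512F is not enabled at compile time. */
    for (size_t i = 0; i < n; ++i)
        dst[i] = (float)src[i];
#endif
}
```

The unaligned loads and stores are a conservative choice; if both arrays are known to be 64-byte aligned, the aligned variants could be used instead.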