Hi everyone,
I need to convert __m512d to __m512 in my current project to get better performance, since single precision lets me handle twice as many numbers per vector.
I'm not very familiar with the AVX-512 extensions, so I'm not sure whether my code is the most efficient way to do this.
My code is below:
inline void pd2ps(__m512d *a1, __m512d *a2, __m512 *b)
{
    __m256 t1, t2;
    __m512 tb1, tb2;
    const int rounding = _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC;
    t1 = _mm512_cvt_roundpd_ps(*a1, rounding);
    t2 = _mm512_cvt_roundpd_ps(*a2, rounding);
    tb1 = _mm512_castps256_ps512(t1);
    tb2 = _mm512_castps256_ps512(t2);
    /* 0x44: low 256 bits of tb1, then low 256 bits of tb2 */
    *b = _mm512_shuffle_f32x4(tb1, tb2, 0x44);
}
Is there a better way to convert __m512d to __m512?
I would appreciate any advice or a more efficient implementation.
Thank you.
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Please note that the compiler may do a better job if you let it optimize plain C/C++ code:
for(int i=0;i<N; ++i) ArrayOf_floats[i] = ArrayOf_doubles[i];
Build the code in release mode, then look at the disassembly in the debugger, or use VTune on both implementations and examine the performance as well as the generated code (Assembly view).
*** Make sure the compiler cannot eliminate unused results.
*** Run both in a loop several times and discard the first iteration (to avoid cache-load bias).
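A minimal sketch of the measurement advice above, in plain C. The names `convert_scalar`, `time_best_of` and `bench_sink` are hypothetical and not from the thread, and `clock()` is used only for portability; VTune or a higher-resolution timer would give better numbers. Each run's result is read into a volatile variable so it counts as used, and the first run is thrown away.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Plain conversion loop; the compiler is free to vectorize it.
   restrict tells it that src and dst do not overlap. */
void convert_scalar(const double *restrict src, float *restrict dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (float)src[i];
}

/* Written after every run so the converted data counts as used. */
volatile float bench_sink;

/* Calls fn `runs` times, skips the first (cold-cache) run, and
   returns the fastest remaining time in seconds. */
double time_best_of(void (*fn)(const double *, float *, size_t),
                    const double *src, float *dst, size_t n, int runs)
{
    assert(runs >= 2 && n > 0);
    double best = -1.0;
    for (int r = 0; r < runs; ++r) {
        clock_t t0 = clock();
        fn(src, dst, n);
        clock_t t1 = clock();
        bench_sink = dst[n - 1];      /* consume one result */
        if (r == 0)
            continue;                 /* discard the warm-up iteration */
        double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || seconds < best)
            best = seconds;
    }
    return best;
}
```

To compare two implementations, give the intrinsics version the same signature and pass each one to `time_best_of` with the same buffers and a large enough `n` for the timer resolution.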
Jim Dempsey
Thank you Jim, I will test both approaches and post the results later.
Best regards,
Zekun
Hi Jim,
The assembly below is what icc generates for the loop you suggested, built with -xMIC-AVX512 -O2. It looks much the same as the AVX-512 code I wrote.
The KNL server is not available right now, so I will post the performance results later.
vcvtpd2ps (%r14,%r12,8), %ymm0                  #34.17 c1
vcvtpd2ps 64(%r14,%r12,8), %ymm1                #34.17 c1
vmovaps %zmm0, %zmm2{%k2}{z}                    #34.17 c3
vshuff32x4 $68, %zmm1, %zmm1, %zmm2{%k1}        #34.17 c5
vmovntps %zmm2, (%r13,%r12,4)                   #34.3 c7
Best regards,
Zekun
It is not the same. Widen your scope (view more of the disassembly).
Note that only one pointer is used (as opposed to a and b), and that the offset 64 is used to fetch the 2nd variable. If you widen your scope, you may find that the compiler unrolled the loop and kept using increasing offsets (128, 192, 256, ...).
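To make the single-pointer point concrete, here is a rough sketch of a hand-written loop with the same addressing pattern: one base pointer, with the second load 8 doubles (64 bytes) further on. The function name `pd2ps_array` is hypothetical, the unaligned loads and stores are an assumption about the data, and the halves are joined with `_mm512_insertf64x4` rather than the shuffle from the original post. The `#else` branch is only a portable fallback so the sketch builds without AVX-512 support.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Converts n doubles to floats, 16 per iteration.
   Assumes n is a multiple of 16 (no remainder loop in this sketch). */
void pd2ps_array(const double *src, float *dst, size_t n)
{
    assert(n % 16 == 0);
#if defined(__AVX512F__)
    for (size_t i = 0; i < n; i += 16) {
        /* Same base pointer for both loads; the second is 64 bytes on. */
        __m256 lo = _mm512_cvtpd_ps(_mm512_loadu_pd(src + i));
        __m256 hi = _mm512_cvtpd_ps(_mm512_loadu_pd(src + i + 8));
        /* Put lo in the low 256 bits and hi in the high 256 bits,
           using AVX512F-only intrinsics. */
        __m512d joined = _mm512_insertf64x4(
            _mm512_castps_pd(_mm512_castps256_ps512(lo)),
            _mm256_castps_pd(hi), 1);
        _mm512_storeu_ps(dst + i, _mm512_castpd_ps(joined));
    }
#else
    /* Fallback: plain loop, left to the compiler to vectorize. */
    for (size_t i = 0; i < n; ++i)
        dst[i] = (float)src[i];
#endif
}
```

Whether this beats the compiler-generated loop is something only a measurement on the target machine can answer.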
Jim