Dot(*,*) does not translate to DPPS

Paul_S_ · ‎01-15-2014

Hi all,

I'm curious why the dot(*,*) function does not translate into the DPPS instruction for float4 data types. Instead it translates into a VMULPS followed by two VHADDPS. (Compiled with Intel(R) OpenCL(TM) Offline Compiler Command-Line Client, version 1.0.2 with AVX enabled).
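For readers following along, here is a minimal scalar sketch in C of what the VMULPS plus two VHADDPS sequence computes. It models the instructions in plain C rather than using intrinsics. The names `vec4`, `model_mulps`, `model_haddps`, `dot4_mul_hadd` and `dot4_self_check` are hypothetical and are not part of the OpenCL compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-lane type standing in for a 128-bit XMM register. */
typedef struct { float lane[4]; } vec4;

/* Lane-wise multiply, modelling VMULPS on 128-bit operands. */
static vec4 model_mulps(vec4 a, vec4 b) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Horizontal add, modelling VHADDPS on 128-bit operands:
   result = { a0+a1, a2+a3, b0+b1, b2+b3 }. */
static vec4 model_haddps(vec4 a, vec4 b) {
    vec4 r = {{ a.lane[0] + a.lane[1], a.lane[2] + a.lane[3],
                b.lane[0] + b.lane[1], b.lane[2] + b.lane[3] }};
    return r;
}

/* One multiply followed by two horizontal adds. */
static float dot4_mul_hadd(vec4 a, vec4 b) {
    vec4 p = model_mulps(a, b);   /* { p0, p1, p2, p3 }            */
    vec4 s = model_haddps(p, p);  /* { p0+p1, p2+p3, p0+p1, p2+p3 } */
    vec4 t = model_haddps(s, s);  /* every lane: (p0+p1)+(p2+p3)    */
    return t.lane[0];
}

/* Small sanity check on exactly representable inputs. */
static void dot4_self_check(void) {
    vec4 a = {{ 1.0f, 2.0f, 3.0f, 4.0f }};
    vec4 b = {{ 5.0f, 6.0f, 7.0f, 8.0f }};
    assert(dot4_mul_hadd(a, b) == 70.0f);  /* 5 + 12 + 21 + 32 */
}
```

The first horizontal add pairs up neighbouring products and the second adds the two partial sums, which is why two VHADDPS instructions are needed after the multiply.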

Thanks,
Paul

Raghupathi_M_Intel · ‎01-22-2014

I am guessing it's for performance reasons. dpps has a longer latency than vmulps and vhaddps.

Thanks,
Raghu

Paul_S_ · ‎01-22-2014

Thanks for the reply. If dpps has a longer latency than one vmulps and two vhaddps, I'm curious why one should ever use dpps.

Best,
Paul

Raghupathi_M_Intel · ‎01-22-2014

The recommendation is not to use it :-)

Paul_S_ · ‎01-22-2014

That's kind of odd. Why would you have such an instruction in that case? By the way, the following code snippet is translated into a dpps instruction:

    void dot(float * a, float*__restrict__ b, float *__restrict__ c){
        *c = a[0] * b[0];
        *c += a[1] * b[1];
        *c += a[2] * b[2];
        *c += a[3] * b[3];
    }

    ::L__routine_start__Z3dotPfS_S(void):
    dot(float*, float*, float*):
        vmovups xmm0, XMMWORD PTR [rsi]     #9.3
        vmovups xmm1, XMMWORD PTR [rdi]     #9.3
        vdpps   xmm2, xmm0, xmm1, 241       #9.3
        vmovss  DWORD PTR [rdx], xmm2       #9.3
        ret

(Using icc 13.0.1 with flags set to: -O3 -masm=intel -mavx)

Thanks,
Paul
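As a side note on the immediate 241 (0xF1) in the vdpps line above: the high four bits select which lane products enter the sum, and the low four bits select which destination lanes receive it, with the others zeroed. The following is a minimal scalar sketch of that behaviour in C, assuming 128-bit operands. `vec4`, `model_dpps` and `dpps_self_check` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-lane type standing in for a 128-bit XMM register. */
typedef struct { float lane[4]; } vec4;

/* Scalar model of DPPS on 128-bit operands.
   imm8 bits 7..4: include the product of lane 0..3 in the sum.
   imm8 bits 3..0: write the sum to destination lane 0..3, else 0. */
static vec4 model_dpps(vec4 a, vec4 b, uint8_t imm8) {
    float p[4];
    for (int i = 0; i < 4; ++i)
        p[i] = (imm8 & (0x10u << i)) ? a.lane[i] * b.lane[i] : 0.0f;

    /* Pairwise summation order: (p0 + p1) + (p2 + p3). */
    float sum = (p[0] + p[1]) + (p[2] + p[3]);

    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    return r;
}

/* Small sanity check on exactly representable inputs. */
static void dpps_self_check(void) {
    vec4 a = {{ 1.0f, 2.0f, 3.0f, 4.0f }};
    vec4 b = {{ 5.0f, 6.0f, 7.0f, 8.0f }};
    vec4 r = model_dpps(a, b, 241);  /* 0xF1: all products, lane 0 only */
    assert(r.lane[0] == 70.0f);
    assert(r.lane[1] == 0.0f);
}
```

With 241, all four products are summed and only lane 0 is written, which matches the vmovss store of the low lane that follows in the listing.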
