Using AVX opcodes slows down my procedure

DZiro · 06-08-2018

Strangely, the AVX version of this code is 20 times slower than the XMM version.
CPU: i7-6950X, 3.0 GHz.
Running on Win10/x64. Any idea how to run this code properly?
The IDA Pro disassembly shows that the code is right. No errors.

procedure ScanLineVec256(X0, X1: Integer; P: TVertexD);
asm
  .NOFRAME

  sub X1, X0
  inc X1

  movq2dq xmm3, mm6
  movd xmm11, r12d
  shufps xmm11, xmm11, 00b

  //vmovdqu ymm10, [R8]
  db $C4, $41, $7E, $6F, $10

 @X:

  add r10, 4
  add r11, 4

  //vaddpd ymm10, ymm10, ymm13
  db $C4, $41, $2D, $58, $D5

  //vandpd ymm0, ymm10, ymm15
  db $C4, $C1, $2D, $54, $C7
  //vxorpd ymm0, ymm0, ymm15
  db $C4, $C1, $7D, $57, $C7
  //vptest ymm0, ymm0
  db $C4, $E2, $7D, $17, $C0

  jz @Inside
  dec X1
  jnz @X
  ret

 @Inside:

  //vmovdqa ymm1, ymm12
  db $C5, $7D, $7F, $E1

  //vmulpd ymm1, ymm1, ymm10
  db $C4, $C1, $75, $59, $CA

  //vmovdqa ymm4, ymm1
  db $C5, $FD, $7F, $CC

  //vmulpd ymm1, ymm1, ymm14
  db $C4, $C1, $75, $59, $CE

  //Extract the upper 128 bits of ymm1 (elements 2 and 3) into xmm2
  //vextractf128 xmm2, ymm1, 01b
  db $C4, $E3, $7D, $19, $CA, $01
  //Sum elements 0, 1 and 2: (X+Y)+Z
  addsd xmm2, xmm1
  psrldq xmm1, 8
  addsd xmm1, xmm2

  movq xmm0, R13
  divsd xmm0, xmm1
  cvtsd2ss xmm0, xmm0

  comiss xmm0, dword ptr [r10]
  jb @Below
  dec X1
  jnz @X
  ret

 @Below:
  movd dword ptr [r10], xmm0
  shufps xmm0, xmm0, 00b

  //vcvtpd2ps xmm4, ymm4
  db $C5, $FD, $5A, $E4

  movd dword ptr [r11], xmm0

  dec X1
  jnz @X

end;

McCalpinJohn · 07-13-2018

It seems like life would be a lot easier if you had an assembler that understood AVX instructions?

I notice that the code contains a mix of 128-bit and 256-bit operations. The operations that you have directly encoded in binary clearly use the AVX prefixes, but some of the 128-bit operations (e.g., "movq2dq xmm3, mm6" on line 08 of the listing) might be assembled into SSE code. If there is a mixture of SSE and AVX code, then you need to follow the instructions in Section 12.3 "Mixing AVX Code with SSE Code" of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-040, April 2018; today I found the manual at https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf).
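
One commonly described way to avoid a mixture like this is to give the 128-bit instructions VEX encodings as well. The following is only a minimal sketch of that idea, not a rewrite of the routine above. It assumes the Win64 calling convention (the two Double arguments arrive in xmm0 and xmm1, the result is returned in xmm0) and an assembler that lacks AVX mnemonics, so the instruction is hand-encoded in the same db style. AddVex is a hypothetical name used for illustration only.

```pascal
// Hypothetical helper, for illustration only.
// Win64: A arrives in xmm0, B in xmm1, the Double result is returned in xmm0.
function AddVex(A, B: Double): Double;
asm
  .NOFRAME
  // vaddsd xmm0, xmm0, xmm1   (VEX-encoded counterpart of: addsd xmm0, xmm1)
  db $C5, $FB, $58, $C1
end;
```

For comparison, the legacy SSE form "addsd xmm0, xmm1" is encoded as F2 0F 58 C1. Note that a VEX-encoded 128-bit instruction zeroes bits 128 to 255 of its destination register, so it should not write to a register whose upper half is still needed.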

DZiro · 07-13-2018

Thanks for the reply. I have already solved my problem: the VZEROUPPER instruction made my code run correctly.
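
For readers whose assembler also lacks the mnemonic, here is a minimal sketch of how VZEROUPPER can be emitted in the same db style as the code above. The thread does not say where the instruction was placed in ScanLineVec256, so no particular placement is implied here. ZeroUpperYmm is a hypothetical name used for illustration only. C5 F8 77 is the encoding of VZEROUPPER.

```pascal
// Hypothetical helper, for illustration only.
procedure ZeroUpperYmm;
asm
  .NOFRAME
  // vzeroupper: zeroes bits 128 and above of every YMM register
  db $C5, $F8, $77
end;
```

Because the instruction clears the upper halves of all YMM registers, any 256-bit values that are still needed afterwards (for example ymm10 and ymm12 to ymm15 in the loop above) would have to be reloaded.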