Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

Using AVX opcodes slows down my procedure

DZiro
Beginner
592 Views

Strangely, the AVX version of this code is 20 times slower than the XMM version.
CPU: i7-6950X, 3.0 GHz.
Running on Win10/x64. Any idea how to make this code run properly?
The IDA Pro disassembly shows that the code is encoded correctly. No errors.

procedure ScanLineVec256(X0, X1: Integer; P: TVertexD);
asm
  .NOFRAME

  sub X1, X0
  inc X1

  movq2dq xmm3, mm6
  movd xmm11, r12d
  shufps xmm11, xmm11, 00b

  //vmovdqu ymm10, [R8]
  db $C4, $41, $7E, $6F, $10

 @X:

  add r10, 4
  add r11, 4

  //vaddpd ymm10, ymm10, ymm13
  db $C4, $41, $2D, $58, $D5

  //vandpd ymm0, ymm10, ymm15
  db $C4, $C1, $2D, $54, $C7
  //vxorpd ymm0, ymm0, ymm15
  db $C4, $C1, $7D, $57, $C7
  //vptest ymm0, ymm0
  db $C4, $E2, $7D, $17, $C0

  jz @Inside
  dec X1
  jnz @X
  ret

 @Inside:

  //vmovdqa ymm1, ymm12
  db $C5, $7D, $7F, $E1

  //vmulpd ymm1, ymm1, ymm10
  db $C4, $C1, $75, $59, $CA

  //vmovdqa ymm4, ymm1
  db $C5, $FD, $7F, $CC

  //vmulpd ymm1, ymm1, ymm14
  db $C4, $C1, $75, $59, $CE

  //Extract (X+Y)
  //vextractf128 xmm2, ymm1, 01b
  db $C4, $E3, $7D, $19, $CA, $01
  //(X+Y)+Z
  addsd xmm2, xmm1
  psrldq xmm1, 8
  addsd xmm1, xmm2

  movq xmm0, R13
  divsd xmm0, xmm1
  cvtsd2ss xmm0, xmm0

  comiss xmm0, dword ptr [r10]
  jb @Below
  dec X1
  jnz @X
  ret

 @Below:
  movd dword ptr [r10], xmm0
  shufps xmm0, xmm0, 00b

  //vcvtpd2ps xmm4, ymm4
  db $C5, $FD, $5A, $E4

  movd dword ptr [r11], xmm0

  dec X1
  jnz @X

end;

 

2 Replies
McCalpinJohn
Honored Contributor III

It seems like life would be a lot easier if you had an assembler that understood AVX instructions?

I notice that the code contains a mix of 128-bit and 256-bit operations. The operations that you have directly encoded in binary clearly use the AVX (VEX) prefixes, but some of the 128-bit operations (e.g., "movq2dq xmm3,mm6" in line 08) might be assembled as legacy SSE code. If there is a mixture of SSE and AVX code, then you need to follow the instructions in Section 12.3 "Mixing AVX Code with SSE Code" from the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-040, April 2018; today I found the manual at https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf).
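
To illustrate the point about prefixes, here is a minimal Python sketch (not from the thread) that checks whether an instruction's bytes start with a VEX prefix and reads the vector-length bit. It assumes 64-bit code, where a leading C4 or C5 byte is always a VEX prefix, and no other prefix bytes in front of it. The names `decode_vex` and `POSTED` are hypothetical, and the `addsd` bytes are the standard legacy SSE encoding, not something the thread shows the Delphi assembler emitting.

```python
# Hypothetical helper: classify one instruction's raw bytes by its VEX prefix.
# Assumes 64-bit mode and that the VEX prefix, if any, is the first byte.

def decode_vex(code):
    """Return VEX details for one instruction, or None if it has no VEX prefix."""
    if len(code) >= 3 and code[0] == 0xC5:
        payload, opcode = code[1], code[2]   # 2-byte VEX: R vvvv L pp
    elif len(code) >= 4 and code[0] == 0xC4:
        payload, opcode = code[2], code[3]   # 3-byte VEX: W vvvv L pp
    else:
        return None
    return {
        "bits": 256 if payload & 0x04 else 128,  # L bit selects vector length
        "pp": payload & 0x03,                    # implied 66/F3/F2 prefix
        "opcode": opcode,
    }

# A few of the byte sequences posted in the question.
POSTED = {
    "vmovdqu ymm10, [r8]":        [0xC4, 0x41, 0x7E, 0x6F, 0x10],
    "vaddpd ymm10, ymm10, ymm13": [0xC4, 0x41, 0x2D, 0x58, 0xD5],
    "vptest ymm0, ymm0":          [0xC4, 0xE2, 0x7D, 0x17, 0xC0],
    "vmovdqa ymm1, ymm12":        [0xC5, 0x7D, 0x7F, 0xE1],
    "vcvtpd2ps xmm4, ymm4":       [0xC5, 0xFD, 0x5A, 0xE4],
}

# Assumed legacy SSE encoding of "addsd xmm2, xmm1" (F2 0F 58 /r), for contrast.
LEGACY_ADDSD = [0xF2, 0x0F, 0x58, 0xD1]

for name, code in POSTED.items():
    info = decode_vex(code)
    print(f"{name}: {info['bits']}-bit VEX")

print("addsd xmm2, xmm1:", "VEX" if decode_vex(LEGACY_ADDSD) else "no VEX prefix")
```

Every hand-encoded sequence decodes as 256-bit VEX, while an instruction written as a plain SSE mnemonic would carry no VEX prefix if the assembler emits the legacy form. That is the mixture the manual section warns about.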

DZiro
Beginner

Thanks for the reply. I have already solved my problem: adding the VZEROUPPER instruction makes my code run at the expected speed.
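
For readers wondering why VZEROUPPER makes such a difference, here is a toy Python model. It is an assumption, simplified from the optimization manual's description of cores older than Skylake (the i7-6950X is one), and it is not a measurement. It tracks whether the upper YMM halves are clean, dirty, or saved, and counts the costly save/restore transitions. `count_transitions` and the instruction kinds are hypothetical names.

```python
# Toy model (assumption): state of the upper YMM halves on a pre-Skylake core.
CLEAN, DIRTY, SAVED = "clean", "dirty", "saved"

def count_transitions(trace, state=CLEAN):
    """Count costly SSE/AVX transitions in a sequence of instruction kinds.

    Each item is one of 'sse', 'avx128', 'avx256', 'vzeroupper'.
    """
    penalties = 0
    for kind in trace:
        if kind == "vzeroupper":
            state = CLEAN                 # upper halves zeroed, nothing to save
        elif kind == "avx256":
            if state == SAVED:
                penalties += 1            # saved upper halves must be restored
            state = DIRTY
        elif kind == "avx128":
            if state == SAVED:
                penalties += 1            # any VEX instruction forces a restore
                state = DIRTY
        elif kind == "sse":
            if state == DIRTY:
                penalties += 1            # dirty upper halves must be saved
                state = SAVED
        else:
            raise ValueError(f"unknown instruction kind: {kind}")
    return penalties

# Alternating 256-bit AVX and legacy SSE, as a loop that mixes them would.
mixed = count_transitions(["avx256", "sse"] * 4)                 # 7 transitions
# The same work with VZEROUPPER before each switch to legacy SSE.
fixed = count_transitions(["avx256", "vzeroupper", "sse"] * 4)   # 0 transitions

print(mixed, fixed)
```

Note that VZEROUPPER zeroes the upper 128 bits of every YMM register, so it can only go where those upper halves are no longer needed. The thread does not show where it was placed in the procedure above.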
