Strangely, the AVX version of this code is about 20 times slower than the XMM version.
CPU: i7-6950X, 3.0 GHz.
I am running Win10/x64. Any idea how to make this code run properly?
The IDA Pro disassembly shows that the code is encoded correctly. No errors.
procedure ScanLineVec256(X0, X1: Integer; P: TVertexD);
asm
  .NOFRAME
  sub X1, X0
  inc X1
  movq2dq xmm3, mm6
  movd xmm11, r12d
  shufps xmm11, xmm11, 00b
  //vmovdqu ymm10, [R8]
  db $C4, $41, $7E, $6F, $10
@X:
  add r10, 4
  add r11, 4
  //vaddpd ymm10, ymm10, ymm13
  db $C4, $41, $2D, $58, $D5
  //vandpd ymm0, ymm10, ymm15
  db $C4, $C1, $2D, $54, $C7
  //vxorpd ymm0, ymm0, ymm15
  db $C4, $C1, $7D, $57, $C7
  //vptest ymm0, ymm0
  db $C4, $E2, $7D, $17, $C0
  jz @Inside
  dec X1
  jnz @X
  ret
@Inside:
  //vmovdqa ymm1, ymm12
  db $C5, $7D, $7F, $E1
  //vmulpd ymm1, ymm1, ymm10
  db $C4, $C1, $75, $59, $CA
  //vmovdqa ymm4, ymm1
  db $C5, $FD, $7F, $CC
  //vmulpd ymm1, ymm1, ymm14
  db $C4, $C1, $75, $59, $CE
  //Extract (X+Y)
  //vextractf128 xmm2, ymm1, 01b
  db $C4, $E3, $7D, $19, $CA, $01
  //(X+Y)+Z
  addsd xmm2, xmm1
  psrldq xmm1, 8
  addsd xmm1, xmm2
  movq xmm0, R13
  divsd xmm0, xmm1
  cvtsd2ss xmm0, xmm0
  comiss xmm0, dword ptr [r10]
  jb @Below
  dec X1
  jnz @X
  ret
@Below:
  movd dword ptr [r10], xmm0
  shufps xmm0, xmm0, 00b
  //vcvtpd2ps xmm4, ymm4
  db $C5, $FD, $5A, $E4
  movd dword ptr [r11], xmm0
  dec X1
  jnz @X
end;
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
It seems like life would be a lot easier if you had an assembler that understood AVX instructions?
I notice that the code contains a mix of 128-bit and 256-bit operations. The operations that you have directly encoded in binary clearly use the AVX (VEX) prefixes, but the 128-bit operations written as mnemonics (e.g., "movq2dq xmm3, mm6" near the top of the routine) might be assembled into legacy SSE encodings. If there is a mixture of SSE and AVX code, then you need to follow the instructions in Section 12.3, "Mixing AVX Code with SSE Code", of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-040, April 2018; at the time of writing, the manual was available at https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf).
Thanks for the reply. I have already solved my problem: adding a VZEROUPPER instruction made the code run properly.
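For readers with the same problem, here is a minimal sketch of how VZEROUPPER could be emitted from a Delphi x64 assembler that does not know the AVX mnemonics, using the same `db` approach as the routine above. The helper name `ClearUpperYmm` is hypothetical, and the thread does not say where the instruction was actually placed in the original code, so the placement described afterwards is an assumption.

```pascal
// Hypothetical helper (name not from the thread): executes VZEROUPPER so that
// legacy SSE code running afterwards does not pay the AVX/SSE transition penalty.
procedure ClearUpperYmm;
asm
  .NOFRAME
  // vzeroupper (VEX.128.0F.WIG 77)
  db $C5, $F8, $77
end;
```

One plausible use is to call `ClearUpperYmm` right after a routine that used 256-bit instructions, or to place the same `db $C5, $F8, $77` line directly before each `ret` of that routine. Note that VZEROUPPER clears bits 255:128 of every YMM register, so a 256-bit value that is still needed later has to be stored or reloaded around it.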