Hi!
http://www.anandtech.com/showdoc.aspx?i=3073&p=3
How about the matrix multiplication with Sandy Bridge? How many instructions does it need to do that?
Henri.
Hi. I've quickly browsed the programming reference, but I'm a little confused about the VEX prefix and how it is encoded compared to the "normal" SSEx instructions. From chapter 4: "Elimination of escape opcode byte (0FH), SIMD Prefix byte (66H, F2H, F3H) ..." What I'm trying to figure out is how the instruction bytes will look if I'm looking at disassembly or debugging information of the AVX instructions. An example with one or two of the new instructions would be appreciated.
Knut Johnsen, Norway.
Hi Henri,
I couldn't quite figure out what that other site was trying to do - it looks like just a fragment of the whole thing, and the baseline has unnecessary copies. Here's my first attempt at an AVX version. I coded up C = A*B and looped over C and B so we can look at the throughput.
The number of instructions doesn't really matter here (at least on Sandy Bridge) - it appears to be limited by the number of data rearrangements, or multiplies. There are half as many multiplies in the AVX version, but the extra broadcasts should reduce our performance scaling somewhat. (I'll be back in the States soon, so I can run it through the performance simulator and post the results here.)
Another neat thing about the AVX version is the extra state - you can imagine wanting to reuse the column broadcast on A and save all those computations - like if one had to compute C = A*B and D = A*E.
// AVX throughput test of 4x4 MMULs
void MMUL4x4_AVX()
{
__asm {
mov ecx, 1024
lea rax, a
lea rbx, b
lea rdx, c
loop_a:
vmovaps ymm0, [rax]            // a13 a12 a11 a10 | a03 a02 a01 a00
vpermilps ymm1, ymm0, 0x00     // a10 a10 a10 a10 | a00 a00 a00 a00
vpermilps ymm2, ymm0, 0x55     // a11 a11 a11 a11 | a01 a01 a01 a01
vpermilps ymm3, ymm0, 0xAA     // a12 a12 a12 a12 | a02 a02 a02 a02
vpermilps ymm4, ymm0, 0xFF     // a13 a13 a13 a13 | a03 a03 a03 a03
vmovaps ymm0, [rax+32]         // a33 a32 a31 a30 | a23 a22 a21 a20
vpermilps ymm5, ymm0, 0x00     // a30 a30 a30 a30 | a20 a20 a20 a20
vpermilps ymm6, ymm0, 0x55     // a31 a31 a31 a31 | a21 a21 a21 a21
vpermilps ymm7, ymm0, 0xAA     // a32 a32 a32 a32 | a22 a22 a22 a22
vpermilps ymm8, ymm0, 0xFF     // a33 a33 a33 a33 | a23 a23 a23 a23
vbroadcastf128 ymm9, [rbx]     // b03 b02 b01 b00 | b03 b02 b01 b00
vbroadcastf128 ymm10, [rbx+16] // b13 b12 b11 b10 | b13 b12 b11 b10
vbroadcastf128 ymm11, [rbx+32] // b23 b22 b21 b20 | b23 b22 b21 b20
vbroadcastf128 ymm12, [rbx+48] // b33 b32 b31 b30 | b33 b32 b31 b30
vmulps ymm1, ymm1, ymm9
vmulps ymm2, ymm2, ymm10
vmulps ymm3, ymm3, ymm11
vmulps ymm4, ymm4, ymm12
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm4
vaddps ymm1, ymm1, ymm3
vmulps ymm5, ymm5, ymm9
vmulps ymm6, ymm6, ymm10
vmulps ymm7, ymm7, ymm11
vmulps ymm8, ymm8, ymm12
vaddps ymm5, ymm5, ymm6
vaddps ymm7, ymm7, ymm8
vaddps ymm5, ymm5, ymm7
vmovaps [rdx], ymm1
vmovaps [rdx+32], ymm5
add rbx, 64
add rdx, 64
sub ecx, 1
jg loop_a
}
}
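(For anyone who would rather read this in intrinsics than inline assembly: below is a minimal single-multiply sketch of the same broadcast-and-permute scheme. The function and argument names are mine, not from the post above, and it assumes a, b and c point to 32-byte-aligned, row-major 4x4 float matrices.)
#include <immintrin.h>
// Sketch of one C = A*B using AVX intrinsics; the asm above amortizes
// the loads and permutes of A across a loop, this does a single multiply.
void mmul4x4_avx_intrin(const float *a, const float *b, float *c)
{
    __m256 a01 = _mm256_load_ps(a);     // rows 0 and 1 of A
    __m256 a23 = _mm256_load_ps(a + 8); // rows 2 and 3 of A
    // Broadcast each 16-byte row of B into both 128-bit lanes
    // (this compiles to vbroadcastf128).
    __m256 b0 = _mm256_broadcast_ps((const __m128 *)(b));
    __m256 b1 = _mm256_broadcast_ps((const __m128 *)(b + 4));
    __m256 b2 = _mm256_broadcast_ps((const __m128 *)(b + 8));
    __m256 b3 = _mm256_broadcast_ps((const __m128 *)(b + 12));
    // Row i of C = sum over k of A[i][k] * (row k of B);
    // _mm256_permute_ps is vpermilps, broadcasting one A element per lane.
    __m256 r01 = _mm256_mul_ps(_mm256_permute_ps(a01, 0x00), b0);
    r01 = _mm256_add_ps(r01, _mm256_mul_ps(_mm256_permute_ps(a01, 0x55), b1));
    r01 = _mm256_add_ps(r01, _mm256_mul_ps(_mm256_permute_ps(a01, 0xAA), b2));
    r01 = _mm256_add_ps(r01, _mm256_mul_ps(_mm256_permute_ps(a01, 0xFF), b3));
    __m256 r23 = _mm256_mul_ps(_mm256_permute_ps(a23, 0x00), b0);
    r23 = _mm256_add_ps(r23, _mm256_mul_ps(_mm256_permute_ps(a23, 0x55), b1));
    r23 = _mm256_add_ps(r23, _mm256_mul_ps(_mm256_permute_ps(a23, 0xAA), b2));
    r23 = _mm256_add_ps(r23, _mm256_mul_ps(_mm256_permute_ps(a23, 0xFF), b3));
    _mm256_store_ps(c, r01);     // rows 0 and 1 of C
    _mm256_store_ps(c + 8, r23); // rows 2 and 3 of C
}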
// Baseline for comparison (can you beat this on SNB?)
void MMUL4x4_SSE()
{
__asm {
mov ecx, 1024
lea rax, a
lea rbx, b
lea rdx, c
loop_a:
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03
movaps xmm5, [rbx]//b03 b02 b01 b00
movaps xmm6, [rbx+16]//b13 b12 b11 b10
movaps xmm7, [rbx+32]//b23 b22 b21 b20
movaps xmm8, [rbx+48]//b33 b32 b31 b30
mulps xmm1, xmm5//a00b03 a00b02 a00b01 a00b00
mulps xmm2, xmm6//a01b13 a01b12 a01b11 a01b10
mulps xmm3, xmm7//a02b23 a02b22 a02b21 a02b20
mulps xmm4, xmm8//a03b33 a03b32 a03b31 a03b30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx], xmm1
movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
shufps xmm2, xmm0, 0x55 // a11 a11 a11 a11
shufps xmm3, xmm0, 0xAA // a12 a12 a12 a12
shufps xmm4, xmm0, 0xFF // a13 a13 a13 a13
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+16], xmm1
movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+32], xmm1
movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+48], xmm1
add rbx, 64
add rdx, 64
sub ecx, 1
jg loop_a
}
}
Hello again! Some more info about this was posted at the IDF website some time after my last entry! You can see the PDF here: https://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf
But the follow-up question is: how can you know (looking at the bytes in the code segment) whether an instruction starting with C4H / C5H is VEX-prefixed or is the LES / LDS instruction?
Knut
The disassembler in the Linux binutils 2.18.50.0.6 or above:
http://www.kernel.org/pub/linux/devel/binutils/
supports AVX:
[hjl@gnu-6 avx-2]$ objdump -Mintel -dw x.o
x.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 48 83 ec 28 sub rsp,0x28
8: c5 fc 29 45 a0 vmovaps YMMWORD PTR [rbp-0x60],ymm0
d: c5 fc 29 4d 80 vmovaps YMMWORD PTR [rbp-0x80],ymm1
12: c5 fc 29 95 60 ff ff ff vmovaps YMMWORD PTR [rbp-0xa0],ymm2
1a: c5 fc 28 45 80 vmovaps ymm0,YMMWORD PTR [rbp-0x80]
1f: c5 fc 29 45 e0 vmovaps YMMWORD PTR [rbp-0x20],ymm0
24: c5 fc 28 85 60 ff ff ff vmovaps ymm0,YMMWORD PTR [rbp-0xa0]
2c: c5 fc 29 45 c0 vmovaps YMMWORD PTR [rbp-0x40],ymm0
31: c5 fc 28 4d c0 vmovaps ymm1,YMMWORD PTR [rbp-0x40]
36: c5 fc 28 45 e0 vmovaps ymm0,YMMWORD PTR [rbp-0x20]
3b: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
3f: c9 leave
40: c3 ret
LES/LDS cannot be encoded in 64-bit mode, so that should make it easy to tell.
In 32-bit modes, both LDS and LES require a modR/M byte. A VEX-encoded instruction in 32-bit mode would have bits 7 and 6 of the equivalent modR/M byte equal to 11B (corresponding to a reserved form of modR/M encoding for LDS/LES, or an illegal form of LDS/LES). You can infer this from the definition of VEX.R and VEX.vvvv in Figure 4-2 of the spec.
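As a quick sketch of that rule in code (my own illustration, not from the spec; code is assumed to point at the candidate prefix byte):
#include <stdint.h>
/* Returns nonzero if the byte sequence at code starts a VEX prefix.
   In 64-bit mode a C4H/C5H byte is always VEX; in 32-bit mode it is
   VEX only when the top two bits of the next byte are 11B, which would
   be a register-form modR/M byte - invalid for LDS/LES. */
int is_vex_prefix(const uint8_t *code, int mode_64bit)
{
    if (code[0] != 0xC4 && code[0] != 0xC5)
        return 0;
    if (mode_64bit)
        return 1;
    return (code[1] & 0xC0) == 0xC0; /* mod == 11B -> VEX */
}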
Here's the performance data I promised. For the two versions below (small bug fix from the snippet above), looking at throughput for something like an inlined C=A*B, I get 19.3 cycles per 4x4 matrix multiply for the SSE2 version and 13.8 cycles per matrix multiply for the Intel AVX version, or 1.4X. That's for everything hitting in the first-level cache. (Disclaimers apply: it's a pre-silicon simulator and the product isn't out yet, so treat this with some skepticism.)
In this case both the AVX and SSE2 versions' performance is limited by the shuffles (the broadcasts and perms below are all shuffle operations) - they all execute on the same port (along with the branch at the end and some fraction of the loop counter updates). And in this code I only do about 64 iterations of the loop, so there is some small overhead in the benchmark. So if you unroll, performance of both versions increases slightly. Maybe more importantly, if you can reuse any of those shuffles, for example if you had to code up
C= A*B
F= A*E
you would get larger gains. In this case, our simulator shows the AVX version at 23.4 cycles (per two 4x4 matrix multiplies) while the SSE2 baseline is at 36.9, so 1.6X.
-----
This is the Intel AVX version of a simple inlined 4x4 matrix multiply. Per call, it does 64 iterations of
C= A*B
void MMUL4x4_AVX()
{
__asm {
mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c
loop_a:
vbroadcastf128 ymm9, [rbx]     // b03 b02 b01 b00 | b03 b02 b01 b00
vbroadcastf128 ymm10, [rbx+16] // b13 b12 b11 b10 | b13 b12 b11 b10
vbroadcastf128 ymm11, [rbx+32] // b23 b22 b21 b20 | b23 b22 b21 b20
vbroadcastf128 ymm12, [rbx+48] // b33 b32 b31 b30 | b33 b32 b31 b30
vmovaps ymm0, [rax]            // a13 a12 a11 a10 | a03 a02 a01 a00
vpermilps ymm1, ymm0, 0x00     // a10 a10 a10 a10 | a00 a00 a00 a00
vpermilps ymm2, ymm0, 0x55     // a11 a11 a11 a11 | a01 a01 a01 a01
vpermilps ymm3, ymm0, 0xAA     // a12 a12 a12 a12 | a02 a02 a02 a02
vpermilps ymm4, ymm0, 0xFF     // a13 a13 a13 a13 | a03 a03 a03 a03
vmovaps ymm0, [rax+32]         // a33 a32 a31 a30 | a23 a22 a21 a20
vpermilps ymm5, ymm0, 0x00     // a30 a30 a30 a30 | a20 a20 a20 a20
vpermilps ymm6, ymm0, 0x55     // a31 a31 a31 a31 | a21 a21 a21 a21
vpermilps ymm7, ymm0, 0xAA     // a32 a32 a32 a32 | a22 a22 a22 a22
vpermilps ymm8, ymm0, 0xFF     // a33 a33 a33 a33 | a23 a23 a23 a23
vmulps ymm1, ymm1, ymm9
vmulps ymm2, ymm2, ymm10
vmulps ymm3, ymm3, ymm11
vmulps ymm4, ymm4, ymm12
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm4
vaddps ymm1, ymm1, ymm3
vmulps ymm5, ymm5, ymm9
vmulps ymm6, ymm6, ymm10
vmulps ymm7, ymm7, ymm11
vmulps ymm8, ymm8, ymm12
vaddps ymm5, ymm5, ymm6
vaddps ymm7, ymm7, ymm8
vaddps ymm5, ymm5, ymm7
vmovaps [rdx], ymm1
vmovaps [rdx+32], ymm5
add rbx, 64
add rdx, 64
sub ecx, 1
jg loop_a
}
}
This is the Intel SSE2 version of a simple inlined 4x4 matrix multiply. Per call, it does 64 iterations of
C= A*B
void MMUL4x4_SSE()
{
__asm {
; each iteration does one matrix mul (16 elements)
mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c
loop_a:
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03
movaps xmm5, [rbx]      // b03 b02 b01 b00
movaps xmm6, [rbx+16]   // b13 b12 b11 b10
movaps xmm7, [rbx+32]   // b23 b22 b21 b20
movaps xmm8, [rbx+48]   // b33 b32 b31 b30
mulps xmm1, xmm5        // a00b03 a00b02 a00b01 a00b00
mulps xmm2, xmm6        // a01b13 a01b12 a01b11 a01b10
mulps xmm3, xmm7        // a02b23 a02b22 a02b21 a02b20
mulps xmm4, xmm8        // a03b33 a03b32 a03b31 a03b30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx], xmm1
movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+16], xmm1
movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+32], xmm1
movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+48], xmm1
add rbx, 64
add rdx, 64
sub ecx, 1
jg loop_a
}
}
This is the Intel AVX version assuming you want to reuse some of the reformatting associated with the left-hand-side matrix. Per call, it does 64 iterations of
C= A*B
F= A*E
void MMUL4x4_AVX_2()
{
__asm {
mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c
lea rsi, e
lea rdi, f
loop_a:
vmovaps ymm0, [rax]            // a13 a12 a11 a10 | a03 a02 a01 a00
vpermilps ymm1, ymm0, 0x00     // a10 a10 a10 a10 | a00 a00 a00 a00
vpermilps ymm2, ymm0, 0x55     // a11 a11 a11 a11 | a01 a01 a01 a01
vpermilps ymm3, ymm0, 0xAA     // a12 a12 a12 a12 | a02 a02 a02 a02
vpermilps ymm4, ymm0, 0xFF     // a13 a13 a13 a13 | a03 a03 a03 a03
vmovaps ymm0, [rax+32]         // a33 a32 a31 a30 | a23 a22 a21 a20
vpermilps ymm5, ymm0, 0x00     // a30 a30 a30 a30 | a20 a20 a20 a20
vpermilps ymm6, ymm0, 0x55     // a31 a31 a31 a31 | a21 a21 a21 a21
vpermilps ymm7, ymm0, 0xAA     // a32 a32 a32 a32 | a22 a22 a22 a22
vpermilps ymm8, ymm0, 0xFF     // a33 a33 a33 a33 | a23 a23 a23 a23
vbroadcastf128 ymm9, [rbx]     // b03 b02 b01 b00 | b03 b02 b01 b00
vbroadcastf128 ymm10, [rbx+16] // b13 b12 b11 b10 | b13 b12 b11 b10
vbroadcastf128 ymm11, [rbx+32] // b23 b22 b21 b20 | b23 b22 b21 b20
vbroadcastf128 ymm12, [rbx+48] // b33 b32 b31 b30 | b33 b32 b31 b30
vmulps ymm0, ymm1, ymm9
vmulps ymm13, ymm2, ymm10
vaddps ymm0, ymm0, ymm13
vmulps ymm13, ymm3, ymm11
vmulps ymm14, ymm4, ymm12
vaddps ymm13, ymm13, ymm14
vaddps ymm0, ymm0, ymm13
vmovaps [rdx], ymm0
vmulps ymm0, ymm5, ymm9
vmulps ymm13, ymm6, ymm10
vaddps ymm0, ymm0, ymm13
vmulps ymm13, ymm7, ymm11
vmulps ymm14, ymm8, ymm12
vaddps ymm13, ymm13, ymm14
vaddps ymm0, ymm0, ymm13
vmovaps [rdx+32], ymm0
vbroadcastf128 ymm9, [rsi]     // e03 e02 e01 e00 | e03 e02 e01 e00
vbroadcastf128 ymm10, [rsi+16] // e13 e12 e11 e10 | e13 e12 e11 e10
vbroadcastf128 ymm11, [rsi+32] // e23 e22 e21 e20 | e23 e22 e21 e20
vbroadcastf128 ymm12, [rsi+48] // e33 e32 e31 e30 | e33 e32 e31 e30
vmulps ymm1, ymm1, ymm9
vmulps ymm2, ymm2, ymm10
vmulps ymm3, ymm3, ymm11
vmulps ymm4, ymm4, ymm12
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm4
vaddps ymm1, ymm1, ymm3
vmulps ymm5, ymm5, ymm9
vmulps ymm6, ymm6, ymm10
vmulps ymm7, ymm7, ymm11
vmulps ymm8, ymm8, ymm12
vaddps ymm5, ymm5, ymm6
vaddps ymm7, ymm7, ymm8
vaddps ymm5, ymm5, ymm7
vmovaps [rdi], ymm1
vmovaps [rdi+32], ymm5
add rbx, 64
add rdx, 64
add rsi, 64
add rdi, 64
sub ecx, 1
jg loop_a
}
}
This is the Intel SSE2 baseline version assuming you want to reuse some of the reformatting associated with the left-hand-side matrix. Per call, it does 64 iterations of
C= A*B
F= A*E
void MMUL4x4_SSE_2()
{
__asm {
mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c
lea rsi, e
lea rdi, f
loop_a:
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03
movaps xmm5, [rbx]      // b03 b02 b01 b00
movaps xmm6, [rbx+16]   // b13 b12 b11 b10
movaps xmm7, [rbx+32]   // b23 b22 b21 b20
movaps xmm8, [rbx+48]   // b33 b32 b31 b30
mulps xmm1, xmm5        // a00b03 a00b02 a00b01 a00b00
mulps xmm2, xmm6        // a01b13 a01b12 a01b11 a01b10
mulps xmm3, xmm7        // a02b23 a02b22 a02b21 a02b20
mulps xmm4, xmm8        // a03b33 a03b32 a03b31 a03b30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx], xmm1
movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+16], xmm1
movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+32], xmm1
movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+48], xmm1
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03
movaps xmm5, [rsi]      // e03 e02 e01 e00
movaps xmm6, [rsi+16]   // e13 e12 e11 e10
movaps xmm7, [rsi+32]   // e23 e22 e21 e20
movaps xmm8, [rsi+48]   // e33 e32 e31 e30
mulps xmm1, xmm5        // a00e03 a00e02 a00e01 a00e00
mulps xmm2, xmm6        // a01e13 a01e12 a01e11 a01e10
mulps xmm3, xmm7        // a02e23 a02e22 a02e21 a02e20
mulps xmm4, xmm8        // a03e33 a03e32 a03e31 a03e30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi], xmm1
movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi+16], xmm1
movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi+32], xmm1
movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi+48], xmm1
add rbx, 64
add rdx, 64
add rsi, 64
add rdi, 64
sub ecx, 1
jg loop_a
}
}
Hi Knut,
Please see the response from our engineering team below:
An example from the XED tool discussed at IDF (https://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf) is included below. For example, take a look at the first few bytes of VCMPPS: C5 FC C2. The first byte is C4 or C5 (in 64-bit mode this is all that's required to know you're dealing with a VEX prefix); the second byte, FC, is the payload; and C2 is the CMPPS opcode (same as before; subsequent bytes are also unchanged).
xed-i _mm256_cmpunord_ps.opt.vec.exe > dis
SYM subb:
XDIS 400a86: PUSH BASE 55 push rbp
XDIS 400a87: DATAXFER BASE 4889E5 mov rbp, rsp
XDIS 400a8a: LOGICAL BASE 4883E4E0 and rsp, 0xe0
XDIS 400a8e: DATAXFER BASE B8FFFFFFFF mov eax, 0xffffffff
XDIS 400a93: DATAXFER BASE 89051F381000 mov dword ptr[rip+0x10381f], eax
XDIS 400a99: DATAXFER BASE 890525381000 mov dword ptr[rip+0x103825], eax
XDIS 400a9f: AVX AVX C5FC100511381000 vmovups ymm0, ymmword ptr[rip+0x103811]
XDIS 400aa7: DATAXFER BASE 89053F381000 mov dword ptr[rip+0x10383f], eax
XDIS 400aad: DATAXFER BASE 890541381000 mov dword ptr[rip+0x103841], eax
XDIS 400ab3: AVX AVX C5FCC20D1C38100003 vcmpps ymm1, ymm0, ymmword ptr[rip+0x10381c], 0x3
XDIS 400abc: AVX AVX C5FC110D34381000 vmovups ymmword ptr[rip+0x103834], ymm1
XDIS 400ac4: LOGICAL BASE 33C0 xor eax, eax
XDIS 400ac6: AVX AVX C5FA1080B8425000 vmovss xmm0, dword ptr[rax+0x5042b8]
XDIS 400ace: LOGICAL BASE 33D2 xor edx, edx
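To tie this back to the byte layout: in the two-byte (C5H) form, that FC payload byte packs R, vvvv, L and pp. Below is a small sketch that unpacks it (field layout per Figure 4-2 of the spec; the helper itself is just an illustration, not part of XED):
#include <stdint.h>
#include <stdio.h>
/* Unpack the payload byte of a two-byte (C5H) VEX prefix.
   bit 7    = R    (stored inverted)
   bits 6:3 = vvvv (stored inverted; stored 1111B, i.e. 0 after
                    un-inversion here, means vvvv is unused)
   bit 2    = L    (0: 128-bit, 1: 256-bit)
   bits 1:0 = pp   (00: none, 01: 66H, 10: F3H, 11: F2H)            */
static void dump_vex2(uint8_t p)
{
    unsigned r    = ((p >> 7) & 1) ^ 1;     /* un-invert stored R    */
    unsigned vvvv = ((p >> 3) & 0xF) ^ 0xF; /* un-invert stored vvvv */
    unsigned l    = (p >> 2) & 1;
    unsigned pp   = p & 3;
    printf("R=%u vvvv=%u L=%u pp=%u (%u-bit vector)\n",
           r, vvvv, l, pp, l ? 256u : 128u);
}
int main(void)
{
    dump_vex2(0xFC); /* the payload byte from the C5 FC ... lines above */
    return 0;
}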