Intel® ISA Extensions

Welcome to the Intel(R) AVX Forum!

AaronTersteeg
Employee
Please take a moment to read the papers and download the guide from the Intel AVX web site. If you have any questions about Intel AVX, AES, or SSE4.2, please ask them here and we will do our best to get you the information.
urvabara
Beginner

Hi!

http://www.anandtech.com/showdoc.aspx?i=3073&p=3

What about matrix multiplication on Sandy Bridge? How many instructions does it need?

Henri.

knujohn4
New Contributor I

Hi. I've quickly browsed the programming reference, but I'm a little confused about the VEX prefix and how it is encoded compared to the "normal" SSEx instructions. From chapter 4: "Elimination of escape opcode byte (0FH), SIMD Prefix byte (66H, F2H, F3H) ..." What I'm trying to figure out is how the instruction bytes will look when I'm looking at disassembly or debugging information for the AVX instructions. An example with one or two of the new instructions would be appreciated.

Knut Johnsen, Norway.

Mark_B_Intel1
Employee

Hi Henri,

I couldn't quite figure out what that other site was trying to do - it looks like just a fragment of the whole thing, and the baseline has unnecessary copies. Here's my first attempt at an AVX version. I coded up C = A*B and looped over C and B so we can look at the throughput.

The number of instructions doesn't really matter here (at least on Sandy Bridge) - the loop appears to be limited by the number of data rearrangements, or by the multiplies. There are half as many multiplies in the AVX version, but the extra broadcasts should reduce our performance scaling somewhat. (I'll be back in the States soon, so I can run it through the performance simulator and post the results here.)

Another neat thing about the AVX version is the extra state - you can imagine wanting to reuse the column broadcasts of A and save all those computations - for example, if one had to compute C = A*B and D = A*E.
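For readers who prefer intrinsics to inline asm, the same C = A*B data flow can be sketched roughly as follows (illustrative only; it assumes row-major, 32-byte-aligned 4x4 float matrices, and the function name is just a placeholder, not part of the benchmark code below):

#include <immintrin.h>

/* One 4x4 single-precision C = A*B, two rows of C per 256-bit register. */
static void mmul4x4_avx_sketch(const float *a, const float *b, float *c)
{
    __m256 a01 = _mm256_load_ps(a);     // rows 0 and 1 of A
    __m256 a23 = _mm256_load_ps(a + 8); // rows 2 and 3 of A

    // Each row of B duplicated into both 128-bit lanes (vbroadcastf128).
    __m256 b0 = _mm256_broadcast_ps((const __m128 *)(b));
    __m256 b1 = _mm256_broadcast_ps((const __m128 *)(b + 4));
    __m256 b2 = _mm256_broadcast_ps((const __m128 *)(b + 8));
    __m256 b3 = _mm256_broadcast_ps((const __m128 *)(b + 12));

    // c[i][:] = a[i][0]*B[0][:] + a[i][1]*B[1][:] + a[i][2]*B[2][:] + a[i][3]*B[3][:],
    // with each a[i][k] produced by an in-lane broadcast (vpermilps).
    __m256 c01 = _mm256_add_ps(
        _mm256_add_ps(_mm256_mul_ps(_mm256_permute_ps(a01, 0x00), b0),
                      _mm256_mul_ps(_mm256_permute_ps(a01, 0x55), b1)),
        _mm256_add_ps(_mm256_mul_ps(_mm256_permute_ps(a01, 0xAA), b2),
                      _mm256_mul_ps(_mm256_permute_ps(a01, 0xFF), b3)));
    __m256 c23 = _mm256_add_ps(
        _mm256_add_ps(_mm256_mul_ps(_mm256_permute_ps(a23, 0x00), b0),
                      _mm256_mul_ps(_mm256_permute_ps(a23, 0x55), b1)),
        _mm256_add_ps(_mm256_mul_ps(_mm256_permute_ps(a23, 0xAA), b2),
                      _mm256_mul_ps(_mm256_permute_ps(a23, 0xFF), b3)));

    _mm256_store_ps(c, c01);
    _mm256_store_ps(c + 8, c23);
}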

// AVX throughput test of 4x4 MMULs

void MMUL4x4_AVX()
{
__asm {

mov ecx, 1024
lea rax, a
lea rbx, b
lea rdx, c

loop_a:
vmovaps ymm0, [rax]           // a13 a12 a11 a10 | a03 a02 a01 a00
vpermilps ymm1, ymm0, 0x00    // a10 a10 a10 a10 | a00 a00 a00 a00
vpermilps ymm2, ymm0, 0x55    // a11 a11 a11 a11 | a01 a01 a01 a01
vpermilps ymm3, ymm0, 0xAA    // a12 a12 a12 a12 | a02 a02 a02 a02
vpermilps ymm4, ymm0, 0xFF    // a13 a13 a13 a13 | a03 a03 a03 a03

vmovaps ymm0, [rax+32]        // a33 a32 a31 a30 | a23 a22 a21 a20
vpermilps ymm5, ymm0, 0x00    // a30 a30 a30 a30 | a20 a20 a20 a20
vpermilps ymm6, ymm0, 0x55    // a31 a31 a31 a31 | a21 a21 a21 a21
vpermilps ymm7, ymm0, 0xAA    // a32 a32 a32 a32 | a22 a22 a22 a22
vpermilps ymm8, ymm0, 0xFF    // a33 a33 a33 a33 | a23 a23 a23 a23

vbroadcastf128 ymm9, [rbx]      // b03 b02 b01 b00 | b03 b02 b01 b00
vbroadcastf128 ymm10, [rbx+16]  // b13 b12 b11 b10 | b13 b12 b11 b10
vbroadcastf128 ymm11, [rbx+32]  // b23 b22 b21 b20 | b23 b22 b21 b20
vbroadcastf128 ymm12, [rbx+48]  // b33 b32 b31 b30 | b33 b32 b31 b30

vmulps ymm1, ymm1, ymm9
vmulps ymm2, ymm2, ymm10
vmulps ymm3, ymm3, ymm11
vmulps ymm4, ymm4, ymm12
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm4
vaddps ymm1, ymm1, ymm3

vmulps ymm5, ymm5, ymm9
vmulps ymm6, ymm6, ymm10
vmulps ymm7, ymm7, ymm11
vmulps ymm8, ymm8, ymm12
vaddps ymm5, ymm5, ymm6
vaddps ymm7, ymm7, ymm8
vaddps ymm5, ymm5, ymm7

vmovaps [rdx], ymm1
vmovaps [rdx+32], ymm5

add rbx, 64
add rdx, 64

sub ecx, 1
jg loop_a
}
}


// Baseline for comparison (can you beat this on SNB?)

void MMUL4x4_SSE()
{
__asm {

mov ecx, 1024
lea rax, a
lea rbx, b
lea rdx, c

loop_a:
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03

movaps xmm5, [rbx]      // b03 b02 b01 b00
movaps xmm6, [rbx+16]   // b13 b12 b11 b10
movaps xmm7, [rbx+32]   // b23 b22 b21 b20
movaps xmm8, [rbx+48]   // b33 b32 b31 b30

mulps xmm1, xmm5        // a00b03 a00b02 a00b01 a00b00
mulps xmm2, xmm6        // a01b13 a01b12 a01b11 a01b10
mulps xmm3, xmm7        // a02b23 a02b22 a02b21 a02b20
mulps xmm4, xmm8        // a03b33 a03b32 a03b31 a03b30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx], xmm1

movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+16], xmm1

movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+32], xmm1

movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+48], xmm1

add rbx, 64
add rdx, 64

sub ecx, 1
jg loop_a
}

}

knujohn4
New Contributor I

Hello again! Some more info about this was posted at the IDF website some time after my last entry! You can see the PDF here: https://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf

But the follow-up question will be: looking at the bytes in the code segment, how can you tell whether C4H / C5H is a VEX prefix or an LES / LDS instruction?

Knut

HONGJIU_L_Intel
Employee

The disassembler in the Linux binutils 2.18.50.0.6 or above:

http://www.kernel.org/pub/linux/devel/binutils/

supports AVX:

[hjl@gnu-6 avx-2]$ objdump -Mintel -dw x.o

x.o: file format elf64-x86-64


Disassembly of section .text:

0000000000000000 :
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 48 83 ec 28 sub rsp,0x28
8: c5 fc 29 45 a0 vmovaps YMMWORD PTR [rbp-0x60],ymm0
d: c5 fc 29 4d 80 vmovaps YMMWORD PTR [rbp-0x80],ymm1
12: c5 fc 29 95 60 ff ff ff vmovaps YMMWORD PTR [rbp-0xa0],ymm2
1a: c5 fc 28 45 80 vmovaps ymm0,YMMWORD PTR [rbp-0x80]
1f: c5 fc 29 45 e0 vmovaps YMMWORD PTR [rbp-0x20],ymm0
24: c5 fc 28 85 60 ff ff ff vmovaps ymm0,YMMWORD PTR [rbp-0xa0]
2c: c5 fc 29 45 c0 vmovaps YMMWORD PTR [rbp-0x40],ymm0
31: c5 fc 28 4d c0 vmovaps ymm1,YMMWORD PTR [rbp-0x40]
36: c5 fc 28 45 e0 vmovaps ymm0,YMMWORD PTR [rbp-0x20]
3b: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
3f: c9 leave
40: c3 ret

SHIH_K_Intel
Employee

LES/LDS cannot be encoded in 64-bit mode; that should make it easy to tell.

In 32-bit modes, both LDS and LES require a modR/M byte. A VEX-encoded instruction in 32-bit mode would have bits 7 and 6 of the equivalent modR/M byte equal to 11B (corresponding to a reserved form of modR/M encoding for LDS/LES, or an illegal form of LDS/LES). You can infer this from the definition of VEX.R and VEX.vvvv in Figure 4-2 of the spec.
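In code, that check could look roughly like this (an illustrative sketch only; the helper name is a placeholder, not decoder logic taken from the spec):

#include <stdbool.h>
#include <stdint.h>

/* Decide whether a C4h/C5h byte starts a VEX prefix or an LDS/LES instruction. */
static bool starts_vex_prefix(const uint8_t *p, bool mode_64bit)
{
    if (p[0] != 0xC4 && p[0] != 0xC5)
        return false;          /* not a candidate at all                    */
    if (mode_64bit)
        return true;           /* LES/LDS cannot be encoded in 64-bit mode  */
    /* In 32-bit mode, mod == 11B in the would-be modR/M byte is an invalid
       form of LDS/LES, so the byte pair must be a VEX prefix.              */
    return (p[1] & 0xC0) == 0xC0;
}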

Mark_B_Intel1
Employee

Here's the performance data I promised. For the two versions below (with a small bug fix from the snippet above), looking at throughput for something like an inlined C=A*B, I get 19.3 cycles per 4x4 matrix multiply for the SSE2 version and 13.8 cycles per matrix multiply for the Intel AVX version, or 1.4X. That's for everything hitting in the first-level cache. (Disclaimers apply: it's a pre-silicon simulator and the product isn't out yet, so treat this with some skepticism.)

In this case both the AVX and SSE2 versions' performance is limited by the shuffles (the broadcasts and perms below are all shuffle operations) - they all execute on the same port (along with the branch at the end and some fraction of the loop counter updates). And in this code I only do about 64 iterations of the loop, so there is some small overhead in the benchmark. So if you unroll, performance of both versions increases slightly. Maybe more importantly, if you can reuse any of those shuffles, for example if you had to code up

C= A*B

F= A*E

You would get larger gains. In this case, our simulator shows the AVX version at 23.4 cycles (per two 4x4 matrix multiplies) while the SSE2 baseline is at 36.9, so 1.6X.

-----

This is the Intel AVX version of a simple inlined 4x4 matrix multiply. Per call, it does 64 iterations of

C= A*B
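The functions below assume module-scope, 32-byte-aligned float arrays along these lines (the names match the lea operands; the exact sizes are an assumption based on the 64 iterations and the 64-byte strides):

__declspec(align(32)) float a[16];      // one 4x4 matrix A, reused every iteration
__declspec(align(32)) float b[64 * 16]; // 64 consecutive 4x4 matrices B
__declspec(align(32)) float c[64 * 16]; // 64 consecutive 4x4 results C
// the reuse variants further down also expect e[64 * 16] and f[64 * 16]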

void MMUL4x4_AVX()
{
__asm {

mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c

loop_a:
vbroadcastf128 ymm9, [rbx]      // b03 b02 b01 b00 | b03 b02 b01 b00
vbroadcastf128 ymm10, [rbx+16]  // b13 b12 b11 b10 | b13 b12 b11 b10
vbroadcastf128 ymm11, [rbx+32]  // b23 b22 b21 b20 | b23 b22 b21 b20
vbroadcastf128 ymm12, [rbx+48]  // b33 b32 b31 b30 | b33 b32 b31 b30

vmovaps ymm0, [rax]             // a13 a12 a11 a10 | a03 a02 a01 a00
vpermilps ymm1, ymm0, 0x00      // a10 a10 a10 a10 | a00 a00 a00 a00
vpermilps ymm2, ymm0, 0x55      // a11 a11 a11 a11 | a01 a01 a01 a01
vpermilps ymm3, ymm0, 0xAA      // a12 a12 a12 a12 | a02 a02 a02 a02
vpermilps ymm4, ymm0, 0xFF      // a13 a13 a13 a13 | a03 a03 a03 a03

vmovaps ymm0, [rax+32]          // a33 a32 a31 a30 | a23 a22 a21 a20
vpermilps ymm5, ymm0, 0x00      // a30 a30 a30 a30 | a20 a20 a20 a20
vpermilps ymm6, ymm0, 0x55      // a31 a31 a31 a31 | a21 a21 a21 a21
vpermilps ymm7, ymm0, 0xAA      // a32 a32 a32 a32 | a22 a22 a22 a22
vpermilps ymm8, ymm0, 0xFF      // a33 a33 a33 a33 | a23 a23 a23 a23

vmulps ymm1, ymm1, ymm9
vmulps ymm2, ymm2, ymm10
vmulps ymm3, ymm3, ymm11
vmulps ymm4, ymm4, ymm12
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm4
vaddps ymm1, ymm1, ymm3

vmulps ymm5, ymm5, ymm9
vmulps ymm6, ymm6, ymm10
vmulps ymm7, ymm7, ymm11
vmulps ymm8, ymm8, ymm12
vaddps ymm5, ymm5, ymm6
vaddps ymm7, ymm7, ymm8
vaddps ymm5, ymm5, ymm7

vmovaps [rdx], ymm1
vmovaps [rdx+32], ymm5

add rbx, 64
add rdx, 64

sub ecx, 1
jg loop_a
}
}

This is the Intel SSE2 version of a simple inlined 4x4 matrix multiply. Per call, it does 64 iterations of

C= A*B

void MMUL4x4_SSE()
{
__asm {

; each iteration does one matrix mul (16 elements)
mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c

loop_a:
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03

movaps xmm5, [rbx]      // b03 b02 b01 b00
movaps xmm6, [rbx+16]   // b13 b12 b11 b10
movaps xmm7, [rbx+32]   // b23 b22 b21 b20
movaps xmm8, [rbx+48]   // b33 b32 b31 b30

mulps xmm1, xmm5        // a00b03 a00b02 a00b01 a00b00
mulps xmm2, xmm6        // a01b13 a01b12 a01b11 a01b10
mulps xmm3, xmm7        // a02b23 a02b22 a02b21 a02b20
mulps xmm4, xmm8        // a03b33 a03b32 a03b31 a03b30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx], xmm1

movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+16], xmm1

movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+32], xmm1

movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+48], xmm1

add rbx, 64
add rdx, 64

sub ecx, 1
jg loop_a
}
}

This is the Intel AVX version assuming you want to reuse some of the reformatting associated with the left-hand-side matrix. Per call, it does 64 iterations of

C= A*B

F= A*E

void MMUL4x4_AVX_2()
{
__asm {

mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c
lea rsi, e
lea rdi, f

loop_a:
vmovaps ymm0, [rax]             // a13 a12 a11 a10 | a03 a02 a01 a00
vpermilps ymm1, ymm0, 0x00      // a10 a10 a10 a10 | a00 a00 a00 a00
vpermilps ymm2, ymm0, 0x55      // a11 a11 a11 a11 | a01 a01 a01 a01
vpermilps ymm3, ymm0, 0xAA      // a12 a12 a12 a12 | a02 a02 a02 a02
vpermilps ymm4, ymm0, 0xFF      // a13 a13 a13 a13 | a03 a03 a03 a03

vmovaps ymm0, [rax+32]          // a33 a32 a31 a30 | a23 a22 a21 a20
vpermilps ymm5, ymm0, 0x00      // a30 a30 a30 a30 | a20 a20 a20 a20
vpermilps ymm6, ymm0, 0x55      // a31 a31 a31 a31 | a21 a21 a21 a21
vpermilps ymm7, ymm0, 0xAA      // a32 a32 a32 a32 | a22 a22 a22 a22
vpermilps ymm8, ymm0, 0xFF      // a33 a33 a33 a33 | a23 a23 a23 a23

vbroadcastf128 ymm9, [rbx]      // b03 b02 b01 b00 | b03 b02 b01 b00
vbroadcastf128 ymm10, [rbx+16]  // b13 b12 b11 b10 | b13 b12 b11 b10
vbroadcastf128 ymm11, [rbx+32]  // b23 b22 b21 b20 | b23 b22 b21 b20
vbroadcastf128 ymm12, [rbx+48]  // b33 b32 b31 b30 | b33 b32 b31 b30

vmulps ymm0, ymm1, ymm9
vmulps ymm13, ymm2, ymm10
vaddps ymm0, ymm0, ymm13
vmulps ymm13, ymm3, ymm11
vmulps ymm14, ymm4, ymm12
vaddps ymm13, ymm13, ymm14
vaddps ymm0, ymm0, ymm13
vmovaps [rdx], ymm0

vmulps ymm0, ymm5, ymm9
vmulps ymm13, ymm6, ymm10
vaddps ymm0, ymm0, ymm13
vmulps ymm13, ymm7, ymm11
vmulps ymm14, ymm8, ymm12
vaddps ymm13, ymm13, ymm14
vaddps ymm0, ymm0, ymm13
vmovaps [rdx+32], ymm0

vbroadcastf128 ymm9, [rsi]      // e03 e02 e01 e00 | e03 e02 e01 e00
vbroadcastf128 ymm10, [rsi+16]  // e13 e12 e11 e10 | e13 e12 e11 e10
vbroadcastf128 ymm11, [rsi+32]  // e23 e22 e21 e20 | e23 e22 e21 e20
vbroadcastf128 ymm12, [rsi+48]  // e33 e32 e31 e30 | e33 e32 e31 e30

vmulps ymm1, ymm1, ymm9
vmulps ymm2, ymm2, ymm10
vmulps ymm3, ymm3, ymm11
vmulps ymm4, ymm4, ymm12
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm4
vaddps ymm1, ymm1, ymm3

vmulps ymm5, ymm5, ymm9
vmulps ymm6, ymm6, ymm10
vmulps ymm7, ymm7, ymm11
vmulps ymm8, ymm8, ymm12
vaddps ymm5, ymm5, ymm6
vaddps ymm7, ymm7, ymm8
vaddps ymm5, ymm5, ymm7

vmovaps [rdi], ymm1
vmovaps [rdi+32], ymm5

add rbx, 64
add rdx, 64
add rsi, 64
add rdi, 64

sub ecx, 1
jg loop_a
}
}

This is the Intel SSE2 baseline version assuming you want to reuse some of the reformatting associated with the left-hand-side matrix. Per call, it does 64 iterations of

C= A*B

F= A*E

void MMUL4x4_SSE_2()
{
__asm {

mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c
lea rsi, e
lea rdi, f

loop_a:
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03

movaps xmm5, [rbx]      // b03 b02 b01 b00
movaps xmm6, [rbx+16]   // b13 b12 b11 b10
movaps xmm7, [rbx+32]   // b23 b22 b21 b20
movaps xmm8, [rbx+48]   // b33 b32 b31 b30

mulps xmm1, xmm5        // a00b03 a00b02 a00b01 a00b00
mulps xmm2, xmm6        // a01b13 a01b12 a01b11 a01b10
mulps xmm3, xmm7        // a02b23 a02b22 a02b21 a02b20
mulps xmm4, xmm8        // a03b33 a03b32 a03b31 a03b30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx], xmm1

movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+16], xmm1

movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+32], xmm1

movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+48], xmm1

movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03

movaps xmm5, [rsi]      // e03 e02 e01 e00
movaps xmm6, [rsi+16]   // e13 e12 e11 e10
movaps xmm7, [rsi+32]   // e23 e22 e21 e20
movaps xmm8, [rsi+48]   // e33 e32 e31 e30

mulps xmm1, xmm5        // a00e03 a00e02 a00e01 a00e00
mulps xmm2, xmm6        // a01e13 a01e12 a01e11 a01e10
mulps xmm3, xmm7        // a02e23 a02e22 a02e21 a02e20
mulps xmm4, xmm8        // a03e33 a03e32 a03e31 a03e30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi], xmm1

movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi+16], xmm1

movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi+32], xmm1

movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33

mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi+48], xmm1

add rbx, 64
add rdx, 64
add rsi, 64
add rdi, 64

sub ecx, 1
jg loop_a
}
}

Quoc-Thai_L_Intel

Hi Knut,

Please see the response from our engineering below:

An example from the XED tool discussed at IDF (https://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf) is included below. For example, take a look at the first few bytes of VCMPPS, C5 FC C2: the first byte is C4 or C5 (in 64-bit mode this is all that's required to know you're dealing with a VEX prefix); the second byte, FC, is the payload; and C2 is the CMPPS opcode (same as before; subsequent bytes are also unchanged).
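As a rough illustration of how that payload byte breaks down, the fields of the FC byte in the 2-byte (C5) form can be pulled apart like this (sketch only; the helper is a placeholder and handles only the C5 form):

#include <stdint.h>
#include <stdio.h>

/* Fields of the single payload byte that follows a C5 VEX escape. */
static void decode_vex2_payload(uint8_t p)
{
    unsigned R    = ((p >> 7) & 1u) ^ 1u;   /* stored inverted (like REX.R)         */
    unsigned vvvv = (~(p >> 3)) & 0xFu;     /* extra source register, stored inverted */
    unsigned L    = (p >> 2) & 1u;          /* 0 = 128-bit, 1 = 256-bit             */
    unsigned pp   = p & 3u;                 /* 00 = none, 01 = 66, 10 = F3, 11 = F2 */
    printf("R=%u vvvv=%u L=%u pp=%u\n", R, vvvv, L, pp);
}

/* decode_vex2_payload(0xFC) prints "R=0 vvvv=0 L=1 pp=0": a 256-bit operation,
   no implied SIMD prefix, and (x/y)mm0 as the extra source where one is used. */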

xed -i _mm256_cmpunord_ps.opt.vec.exe > dis

SYM subb:

XDIS 400a86: PUSH BASE 55 push rbp

XDIS 400a87: DATAXFER BASE 4889E5 mov rbp, rsp

XDIS 400a8a: LOGICAL BASE 4883E4E0 and rsp, 0xe0

XDIS 400a8e: DATAXFER BASE B8FFFFFFFF mov eax, 0xffffffff

XDIS 400a93: DATAXFER BASE 89051F381000 mov dword ptr[rip+0x10381f], eax

XDIS 400a99: DATAXFER BASE 890525381000 mov dword ptr[rip+0x103825], eax

XDIS 400a9f: AVX AVX C5FC100511381000 vmovups ymm0, ymmword ptr[rip+0x103811]

XDIS 400aa7: DATAXFER BASE 89053F381000 mov dword ptr[rip+0x10383f], eax

XDIS 400aad: DATAXFER BASE 890541381000 mov dword ptr[rip+0x103841], eax

XDIS 400ab3: AVX AVX C5FCC20D1C38100003 vcmpps ymm1, ymm0, ymmword ptr[rip+0x10381c], 0x3

XDIS 400abc: AVX AVX C5FC110D34381000 vmovups ymmword ptr[rip+0x103834], ymm1

XDIS 400ac4: LOGICAL BASE 33C0 xor eax, eax

XDIS 400ac6: AVX AVX C5FA1080B8425000 vmovss xmm0, dword ptr[rax+0x5042b8]

XDIS 400ace: LOGICAL BASE 33D2 xor edx, edx
