FMA manipulation of register’s content for XMM, YMM and ZMM register sets

Mile_M_ · 01-21-2014

hello, there wasn’t a typical introduction thread, so since it’s my first post i thought i’d introduce myself. my name is mile (yes, like the unit of measurement) and i’m a student. i’m a noob in this area.

i’m writing a paper for school, and before posting my question(s) here i researched online for an answer to the best of my abilities, but i didn’t manage to find one. after browsing the forum i decided to post a new topic instead of going off topic in another one.

during my research some things became clearer to me, but i still couldn’t find clearly defined answers. i’ll probably have some trivial questions and silly assumptions, so please correct me if i’m wrong or if i’m missing something.

i’ve spent a lot of time drawing the diagrams and writing, so if someone can help i would appreciate it greatly! i’m doing mentored studies where we don’t have lectures; instead the professor gives us a topic which we study on our own. unfortunately i can’t bother my professor with this type of question.

OK, so for anyone willing to help, let’s continue. i’ll present my questions as we go along.

+++++++++++++++++++++++++++++++

an article i’ve found about SSE extensions explained how a MUL operation manipulates the content of the 4 register elements (bits 0-31, 32-63, 64-95, 96-127) for the 128-bit wide XMM register set.
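to make the element boundaries concrete, here is a small python sketch that derives them from the register width and the element width. it is only a bookkeeping model (the function name `lane_bit_ranges` is made up for this post), not anything taken from Intel’s documentation.

```python
def lane_bit_ranges(register_bits, element_bits):
    """Return (low_bit, high_bit) for each element, lowest element first."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return [(i * element_bits, (i + 1) * element_bits - 1)
            for i in range(register_bits // element_bits)]


# 128-bit XMM register holding single-precision (32-bit) elements
print(lane_bit_ranges(128, 32))  # [(0, 31), (32, 63), (64, 95), (96, 127)]
```

the same helper can be called with other widths, for example `lane_bit_ranges(128, 64)` for double precision.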

MULPS (packed single precision >> ps) operation takes all elements from the XMM0 register, multiplies each of them by the element with the same index in XMM1 and stores the multiplication result(s) in XMM1’s element(s).
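as a rough illustration of the packed case, the sketch below models a register as a python list of 4 floats (lowest element first) and multiplies element by element. the values are made up, and the model ignores everything about real hardware (rounding modes, exceptions) except the per-element pairing.

```python
def mulps(dst, src):
    """Model of a packed single-precision multiply: dst[i] = dst[i] * src[i]."""
    if len(dst) != 4 or len(src) != 4:
        raise ValueError("an XMM register holds 4 single-precision elements")
    return [d * s for d, s in zip(dst, src)]


xmm0 = [1.0, 2.0, 3.0, 4.0]  # made-up values, lowest element first
xmm1 = [5.0, 6.0, 7.0, 8.0]

xmm1 = mulps(xmm1, xmm0)     # result lands in XMM1, as described above
print(xmm1)                  # [5.0, 12.0, 21.0, 32.0]
```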

MULSS (scalar single precision >> ss) operation takes only the first (least significant) element from XMM0, multiplies it by the first element of XMM1, puts the result back in the first element of XMM1 and copies the values of the other 3 elements (32-63, 64-95, 96-127) from XMM0 to XMM1.

-the width of each element is 32 bits because we are talking about single precision (32 bits)

-the register that will store the result will be the one holding the Source 2 operand (b) >> XMM1

-because it is scalar we multiply only the first (lowest) element

-this is the XMM register set because the width of each register is 128 bits

what i would like to know is how each type of vFMAdd operation (ss, sd, ps, pd) will manipulate the content of the registers for the 2 available register sets (XMM/128 bit and YMM/256 bit). i will divide this question into several over the course of this post.

for the purpose of simplicity we’ll use only the vFMAdd231<xx> (xx >> ss, sd, ps, pd) operand ordering.

Source 1 = xmm0, Source 2 = xmm1, Source 3 = xmm2

Source 1 is always the destination register

Source 3 can be memory (mem256) or register (xmm/ymm).

POSSIBLE PERMUTATIONS

+++

vfmadd231 xmm0 , xmm1, xmm2 (WE’LL USE THIS AS DEFAULT OPERAND ORDERING)

a=xmm1, b=xmm2, c=xmm0

xmm0 = (xmm1 * xmm2) + xmm0 (src 1 >> xmm0 is destination >> Old value of C is overwritten)

vfmadd213 xmm0 , xmm1, xmm2 (WE WON’T USE THIS ORDERING)

a=xmm1, b=xmm0, c=xmm2

xmm0 = (xmm1 * xmm0) + xmm2 (xmm0 is overwritten >> old value of B is overwritten)

vfmadd132 xmm0 , xmm1, xmm2 (WE WON’T USE THIS ORDERING)

a=xmm0, b=xmm2, c=xmm1

xmm0 = (xmm0 * xmm2) + xmm1 (xmm0 is overwritten >> old value of A is overwritten)
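the three orderings can be written down as a tiny python model, one value per register. this is just the arithmetic from the formulas above with made-up inputs; a real FMA rounds only once, while `a * b + c` in python rounds twice, so the sketch sticks to small whole numbers where that makes no difference.

```python
def fmadd231(src1, src2, src3):
    """src1 = (src2 * src3) + src1"""
    return src2 * src3 + src1


def fmadd213(src1, src2, src3):
    """src1 = (src2 * src1) + src3"""
    return src2 * src1 + src3


def fmadd132(src1, src2, src3):
    """src1 = (src1 * src3) + src2"""
    return src1 * src3 + src2


# made-up contents of the lowest element of xmm0, xmm1, xmm2
xmm0, xmm1, xmm2 = 2.0, 3.0, 4.0

print(fmadd231(xmm0, xmm1, xmm2))  # 3*4 + 2 = 14.0
print(fmadd213(xmm0, xmm1, xmm2))  # 3*2 + 4 = 10.0
print(fmadd132(xmm0, xmm1, xmm2))  # 2*4 + 3 = 11.0
```

in every case the returned value is what would be written back to xmm0 (Source 1).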

below i’ve presented the scenarios i’m puzzled about or in need of verification.

1. fmadd231_ss xmm0, xmm1, xmm2 >> (src 1 = xmm0, src 2 = xmm1, src 3 = xmm2)

abc=231 >> a=src 2=xmm1, b=src 3=xmm2, c=src 1=xmm0

src 1 = (a*b)+c = (src 2 * src3) + src 1 = xmm0 = (xmm1 * xmm2) + xmm0

src 1 = xmm0 = c = destination >> Old value of C is overwritten

-the width of each element is 32 bits because we are talking about single precision (32 bits)

-because it is scalar we are fused-multiply-adding only the first (lowest) element

-because we use 231 ordering the register that will store the result will be XMM0

-this is the XMM register set because the width of each register is 128 bits

QUESTION 1.1. the first element will have the value 8 (1*5+3=8), but what happens to the other 3 elements? will they take the values of the 3 elements from XMM1, from XMM2, or will the values already in XMM0 remain?

mul_ss copies the other 3 values from the source register, but what happens with fma_ss, where there is both multiplication and addition? my guess is the elements will take the values from the elements of register XMM2 (b). is this correct or not?
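to spell out what is being asked, this python sketch lists the three candidate outcomes side by side. it does not claim which one the hardware implements; element 0 uses the values from the example (1*5+3) and the upper three elements are placeholders made up so the candidates are easy to tell apart.

```python
def fmadd231_ss_candidates(xmm0, xmm1, xmm2):
    """Return the three candidate results for the scalar 231 form."""
    low = xmm1[0] * xmm2[0] + xmm0[0]  # (a * b) + c on element 0 only
    return {
        "upper elements kept from xmm0": [low] + xmm0[1:],
        "upper elements copied from xmm1": [low] + xmm1[1:],
        "upper elements copied from xmm2": [low] + xmm2[1:],
    }


# element 0 follows the example; elements 1..3 are made-up placeholders
xmm0 = [3.0, 30.0, 31.0, 32.0]  # c, also the destination
xmm1 = [1.0, 10.0, 11.0, 12.0]  # a
xmm2 = [5.0, 50.0, 51.0, 52.0]  # b

for label, value in fmadd231_ss_candidates(xmm0, xmm1, xmm2).items():
    print(label, value)
```

in all three candidates element 0 is 8.0; only elements 1 to 3 differ, and that difference is exactly what QUESTION 1.1 is about.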

QUESTION 1.2. fmadd_ss works with the XMM register set, but can it work with the YMM register set?

in case this is possible, what will happen? the 256-bit YMM register will be divided into eight 32-bit elements, of which only the first will be subject to the FMA calculation while the rest will take their values from >>need answer from QUESTION 1.1<<. is this right?

NOTE: for packed operations it is pretty straightforward because all elements are processed, so it shouldn’t matter whether they are all elements of XMM, of YMM or of ZMM (future AVX-512 extensions). however, the scalar ones aren’t clearly defined in Intel’s reference guide.

most likely i’m not aware of some well-known behavior of the basic operations, so i would appreciate it if someone could shed some light on this.

2. fmadd231_sd xmm0, xmm1, xmm2 >> (src 1 = xmm0, src 2 = xmm1, src 3 = xmm2)

abc=231 >> a=src 2=xmm1, b=src 3=xmm2, c=src 1=xmm0

src 1 = (a*b)+c = (src 2 * src3) + src 1 = xmm0 = (xmm1 * xmm2) + xmm0

src 1 = xmm0 = c = destination >> Old value of C is overwritten

-the width of each element is 64 bits because we are talking about double precision (64 bits)

-because it is scalar we are fused-multiply-adding only the first (lowest) element

-because we use 231 ordering the register that will store the result will be XMM0

-this is the XMM register set because the width of each register is 128 bits

QUESTION 2.1. at this point i still don’t know what value the other element would take, and i am still wondering whether this version of the FMA operation (fmadd_sd) can work on the YMM register set.

if it is possible, the YMM register will be divided into four 64-bit elements, of which only the first will be subject to the FMA calculation while the others will take their values from >>need answer from QUESTION 1.1<<. correct?

QUESTION 2.2. what if we use fmadd_sd on the ZMM set? will it be 8 elements, each 64 bits long?

3. fmadd231_ps xmm0, xmm1, xmm2 >> (src 1 = xmm0, src 2 = xmm1, src 3 = xmm2)

abc=231 >> a=src 2=xmm1, b=src 3=xmm2, c=src 1=xmm0

src 1 = (a*b)+c = (src 2 * src3) + src 1 = xmm0 = (xmm1 * xmm2) + xmm0

src 1 = xmm0 = c = destination >> Old value of C is overwritten

-the width of each element is 32 bits because we are talking about single precision (32 bits)

-because it is packed we are fused-multiply-adding all elements

-because we use 231 ordering the register that will store the result will be XMM0

-this is the XMM register set because the width of each register is 128 bits

over here i’m pretty sure i got the diagram right. it is the 128-bit wide XMM register set and all the elements are subject to the FMA calculation. however, i still have a question.

QUESTION 3.1. how will the diagram look if we are using the YMM register set? in that case the 256-bit register will be divided into eight 32-bit elements and all of them will be subject to the FMA calculation, correct?
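the packed picture i have in mind can be sketched in python like this: the number of elements is the register width divided by the element width, and the 231 formula is applied to every element. the helper name `vfmadd231_packed` and all values are made up for illustration, and python’s `b * c + a` rounds twice where a real FMA rounds once, so the inputs are kept to small whole numbers.

```python
def vfmadd231_packed(dst, src2, src3):
    """Element-wise model of the packed 231 form: dst[i] = src2[i] * src3[i] + dst[i]."""
    if not (len(dst) == len(src2) == len(src3)):
        raise ValueError("all three registers must hold the same number of elements")
    return [b * c + a for a, b, c in zip(dst, src2, src3)]


lanes = 256 // 32  # YMM width / single-precision width = 8 elements

ymm0 = [float(i) for i in range(lanes)]      # c: 0, 1, ..., 7
ymm1 = [2.0] * lanes                         # a: all 2
ymm2 = [float(i + 1) for i in range(lanes)]  # b: 1, 2, ..., 8

ymm0 = vfmadd231_packed(ymm0, ymm1, ymm2)
print(ymm0)  # [2.0, 5.0, 8.0, 11.0, 14.0, 17.0, 20.0, 23.0]
```

changing `lanes` to `256 // 64` would give the four-element double-precision layout under the same assumptions.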

4. fmadd231_pd xmm0, xmm1, xmm2 >> (src 1 = xmm0, src 2 = xmm1, src 3 = xmm2)

abc=231 >> a=src 2=xmm1, b=src 3=xmm2, c=src 1=xmm0

src 1 = (a*b)+c = (src 2 * src3) + src 1 = xmm0 = (xmm1 * xmm2) + xmm0

src 1 = xmm0 = c = destination >> Old value of C is overwritten

-the width of each element is 64 bits because we are talking about double precision (64 bits)

-because it is packed we are fused-multiply-adding all elements

-because we use 231 ordering the register that will store the result will be XMM0

-this is the XMM register set because the width of each register is 128 bits

QUESTION 4.1. how will the diagram look if we are using the YMM register set? in that case the 256-bit register will be divided into four 64-bit elements and all of them will be subject to the FMA calculation, correct? will it look like this?

-the width of each element is 64 bits because we are talking about double precision (64 bits)

-because it is packed we are fused-multiply-adding all elements

-because we use 231 ordering the register that will store the result will be YMM0

-this is the YMM register set because the width of each register is 256 bits

QUESTION 5: how do the FMA operations work with integer numbers? what is the difference?

QUESTION 6: is it possible to use an illegal operation, as in past ISA extensions? for example, what if we try to use vfmadd123<xx>, which doesn’t exist? let’s calculate first:

vfmadd123 xmm0 , xmm1, xmm2

a=xmm0, b=xmm1, c=xmm2

xmm0 = (xmm0 * xmm1) + xmm2 (xmm0 is overwritten >> old value of A is overwritten)

so hypothetically this instruction would yield the same result as vfmadd213<xx> (xmm0 * xmm1 is the same product as xmm1 * xmm0) and is therefore redundant and unnecessary, but the question is whether the compiler will accept it and process it.
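a quick python check of which real ordering the made-up 123 form coincides with, taking the formula from the calculation above (xmm0 * xmm1 + xmm2). `fmadd123_hypothetical` is not a real instruction, and the sample values are arbitrary small whole numbers so that rounding plays no role.

```python
def fmadd231(src1, src2, src3):
    return src2 * src3 + src1


def fmadd213(src1, src2, src3):
    return src2 * src1 + src3


def fmadd132(src1, src2, src3):
    return src1 * src3 + src2


def fmadd123_hypothetical(src1, src2, src3):
    # not a real instruction: the digits 1-2-3 read as (src1 * src2) + src3
    return src1 * src2 + src3


REAL_ORDERINGS = {"231": fmadd231, "213": fmadd213, "132": fmadd132}


def matching_orderings(samples):
    """Names of the real orderings that agree with the 123 form on every sample."""
    return [name for name, op in REAL_ORDERINGS.items()
            if all(op(*s) == fmadd123_hypothetical(*s) for s in samples)]


samples = [(2.0, 3.0, 4.0), (5.0, 7.0, 11.0), (1.0, 6.0, 9.0)]
print(matching_orderings(samples))  # ['213']
```

the match with 213 is simply because multiplication is commutative, so swapping the two multiplied operands changes nothing.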

sorry for any grammar mistakes. i’ll fix them in case something loses its meaning.

sorry for the non-uniform use of parentheses and capitalization of some characters (a, b & c VS A, B & C). i’ve used them in some places to increase readability when i was getting confused!

i’ve posted this after checking it a couple of times, but i’ve been working for too long and i have to sleep now. tomorrow i’ll proofread this post again, but in the meantime hopefully someone will post useful information.

thanks. mile
