Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

question about intrinsics

levicki
Valued Contributor I
679 Views

Can someone from Intel explain to me why the syntax of the ROUNDPS and ROUNDSS intrinsics is different?

	__m128	p0, p1;

	// this is expected
	p1 = _mm_round_ps(p0, 3);

	// but why do we need another parameter here?
	p1 = _mm_round_ss(p1, p0, 3);
0 Kudos
12 Replies
David_K_Intel1
Employee

Your question is understandable, because these intrinsics are poorly documented in our Users' Guide. I will try to ensure that this problem is addressed in a future release of the compiler.

To illustrate the behavior of these intrinsics, think of the inputs and outputs as arrays of 4 floats. Then we have the following:

	r = _mm_round_ps(a, 3);

	r[0] = round(a[0]);
	r[1] = round(a[1]);
	r[2] = round(a[2]);
	r[3] = round(a[3]);

	r = _mm_round_ss(a, b, 3);

	r[0] = round(b[0]);
	r[1] = a[1];
	r[2] = a[2];
	r[3] = a[3];

So, the purpose of the extra operand is to provide the upper 3 float elements. We implement the intrinsic this way to provide the full generality of the underlying ROUNDSS instruction, which passes through the upper elements from the result register. In many cases, you will not care about the values in the upper elements. In those situations, I would recommend using _mm_round_ss(b, b, 3) or possibly _mm_round_ss(_mm_setzero_ps(), b, 3);

David Kreitzer

IA32 and Intel 64 Code Generation

Intel Compiler Lab

levicki
Valued Contributor I

David I strongly disagree and object:

  1. It is redundant, because if you want the original instruction behavior you must write r twice: r = _mm_round_ss(r, b, 3).
  2. Intel Architecture Manual says the following for ROUNDSS: "The upper three single-precision floating-point values in the destination are retained". _mm_round_ss() intrinsic thus deviates from the behavior of the underlying instruction because writing r = _mm_round_ss(a, b, 3) overwrites the destination. That is bound to cause confusion and in my opinion it isn't doing us developers any favor.

Last time I checked you couldn't write ROUNDSS XMM2, XMM1, XMM0, 3 in assembler unless you wrote a macro. I know there are some intrinsics which are artificial and composed of more than one instruction, like _mm_set1_ps() for example; that is fine with me, but (and I cannot stress this enough) those which represent actual instructions should map 1:1.

There are always some wrong, narrow-minded people sitting in important places and making such insensible decisions. I am sick of it and I often ask myself "why bother, why try to help, to change and improve anything when they are bound to ruin it in the next revision?"

David_K_Intel1
Employee

Igor,

Let me respond by asking a question. You are claiming that _mm_round_ss should take only one __m128 argument. What value do you think 'a' should have after the following assignment statement?

a = _mm_add_ps(b, _mm_round_ss(c, 3));

David Kreitzer

levicki
Valued Contributor I

David, this is what would happen:

	a[0] = b[0] + round(c[0]);
	a[1] = b[1] + c[1];
	a[2] = b[2] + c[2];
	a[3] = b[3] + c[3];

Now please be so kind to explain the following:

  1. What exactly are you trying to accomplish numerically?
  2. How does the second __m128 argument help you to get the desired result?
  3. How would you go about writing the same code in assembler in the absence of intrinsics?

Now I am really curious.

David_K_Intel1
Employee

I have one more follow-up question that hopefully will motivate the need for the second __m128 argument. Your previous response defined the _mm_round_ss semantics like this:

	__m128 _mm_round_ss(__m128 a, imm i);

	r[0] = round(a[0]);
	r[1] = a[1];
	r[2] = a[2];
	r[3] = a[3];

Now, suppose you wanted to implement the following assembly code sequence using intrinsics:

	movaps	xmm0, d
	roundss	xmm0, c, 3
	addps	xmm0, b
	movaps	a, xmm0

How would you do it using the _mm_round_ss semantics that you defined?

David Kreitzer

levicki
Valued Contributor I
Your previous response defined the _mm_round_ss semantics like this...

I made a mistake in that post because your expression didn't have an explicit destination. This is what I should have written in the previous response you are mentioning:

	x[3:0] = 0; // temporary implicit destination
	x[0] = round(c[0]);
	a[0] = b[0] + x[0];
	a[1] = b[1] + x[1];
	a[2] = b[2] + x[2];
	a[3] = b[3] + x[3];

So I am actually saying it should work like this:

	r = _mm_round_ss(a, 3);

	r[0] = round(a[0]);
	r[1] = r[1]; // r[3:1] are unchanged
	r[2] = r[2];
	r[3] = r[3];

ROUNDSS instruction preserves the upper three floats of the destination operand, not the source operand (a is the source operand).

Your assembler code could be written like this using the syntax I suggested:

	__m128 v0;

	v0 = _mm_load_ps(d);
	v0 = _mm_round_ss(c, 3);
	v0 = _mm_add_ps(v0, b);
	_mm_store_ps(a, v0);

You have three cases to consider:

  • explicit assignment to another __m128 variable like v0 = _mm_round_ss(c, 3); then you have v0[0] = round(c[0]); and v0[3:1] unchanged, the same as if you wrote ROUNDSS xmm1, xmm0, 3 (where xmm1 = v0 and xmm0 = c).
  • explicit assignment to the same __m128 variable like in c = _mm_round_ss(c, 3); it behaves as if you did ROUNDSS xmm0, xmm0, 3 (where xmm0 = c).
  • implicit destination like in v0 = _mm_add_ps(_mm_round_ss(c, 3), b); this would use a temporary variable preinitialized to zero as the implicit destination.

What I would like to know is: how does the current form of _mm_round_ss() support a memory operand for the source value?

David_K_Intel1
Employee

But that intrinsic implementation does not match my assembly sequence. Using the semantics that you suggested for _mm_round_ss, after

	v0 = _mm_round_ss(c, 3);

	v0[0] = round(c[0]);
	v0[1] = c[1];
	v0[2] = c[2];
	v0[3] = c[3];

The previous value of v0 (the result of the _mm_load_ps intrinsic) is overwritten, so d[1], d[2], and d[3] are lost.

David Kreitzer

levicki
Valued Contributor I

David, I edited the post to clarify. See above.

David_K_Intel1
Employee

Igor,

I can tell that you are an assembly programmer. :-) The intrinsic functions are designed to behave like normal C functions, but what you described is something significantly different. In your proposal, the value returned by _mm_round_ss(c, 3) depends on the context in which it is called! That is an extremely undesirable property.

The problem is that you are thinking of ROUNDSS as having one input operand and one output operand, but that is not the case. In fact, it has two input operands and one output operand, where one input operand is required to go in the same register as the output operand. It is just like other two-operand instructions such as ADDPS and ADD. The machine actually implements it that way. It reads both operand registers and then writes the destination register. It does not read only the source operand and then do a partial write to the destination, even though the instruction description might lead you to believe that. The fact that the machine reads the destination register often leads to interesting performance problems. I've attached a section of text that I wrote for a technical journal article that is pending publication. It explains the problem.

The intrinsic functions are designed to abstract the functionality of the instruction set architecture. The fact that some instructions clobber one of their operands is hidden in the intrinsic API. For example, you can write "x = _mm_add_ps(y, z)" even though ADDPS clobbers one of its input operands. The same is true of the memory instruction forms, which you asked about. It is the job of the compiler to identify opportunities to use the memory instruction forms. The programmer should not have to worry about it. And the compiler usually does a pretty good job. For example, try the following:

	__m128 x, y, z;

	void fn()
	{
		x = _mm_round_ss(y, z, 3);
	}

David Kreitzer

In addition to vector instructions, the Streaming SIMD Extensions provide instructions for scalar floating-point operations. These scalar instructions use the same register set as the vector instructions and typically operate only on the least significant vector element. The upper elements are either preserved in the destination operand or set to zero, depending on the instruction. In cases where the upper elements are preserved, false data dependences can occur when the result of an instruction does not otherwise depend on the destination operand. These false data dependences can lengthen the critical path and severely degrade performance, especially when the false dependences are carried around the back edge of a loop. Consider, for instance, the following loop with independent loop iterations.

	float *floats;
	double *doubles; /* FALSE-DEP EXAMPLE */
	...
	for (i = 0; i < N; i++) {
		doubles[i] = (double)floats[i] / x;
	}

The compiler could vectorize this loop, but it may choose not to due to the unknown alignment of the float and double arrays. As shown in the first column of Table V, a generic scalar implementation uses a scalar single to double conversion (cvtss2sd) followed by a scalar division (divsd). The former instruction preserves the upper 64-bit double element in its destination operand, so it has a data dependence on the destination register. In this case, the dependence is false, because these upper bits are never really used. Nonetheless, the cvtss2sd must wait for the result of the divsd operation from the previous loop iteration to be available. The Intel compiler avoids such false dependences. As shown in the second column of Table V, the compiler accomplishes this by using a vector conversion (cvtps2pd). The result of this instruction depends solely on its input operand, not its destination operand. The optimized version runs about 50% faster than the generic implementation when running on an Intel Core 2 Duo processor.

Table V. False-Dependence Removal

Generic Implementation:

	L:	cvtss2sd xmm2, [_floats+rax*4]
		divsd    xmm2, xmm0
		movsd    [_doubles+rax*8], xmm2
		add      rax, 1
		cmp      rax, 1024
		jl       L

Core Microarchitecture Implementation:

	L:	movss    xmm1, [_floats+rax*4]
		cvtps2pd xmm2, xmm1
		divsd    xmm2, xmm0
		movsd    [_doubles+rax*8], xmm2
		add      rax, 1
		cmp      rax, 1024
		jl       L

levicki
Valued Contributor I

True, I use assembler a bit more than I use C/C++.

The problem is that you are thinking of ROUNDSS as having one input operand and one output operand, but that is not the case.

Well, in my opinion a hidden input operand is not the same as a real input operand. The difference between the two is that the real input operand can be reused as a source for multiple instructions, while the hidden one can be used only until it gets partially (or, even worse, fully) overwritten.

It does not read only the source operand and then do a partial write to the destination, even though the instruction description might lead you to believe that.

I am well aware that your company's documentation is misleading when it comes to that. I am also aware that Intel Architecture does not support three-operand instructions because of a microcode limitation. You can only fake the third operand by making it implicit, be it the destination or one of the sources.

As for the memory operand I would personally prefer different intrinsic syntax and here is why:

__m128 somefunc0(float *a)
{
	return _mm_round_ss(_mm_setzero_ps(), _mm_load_ss(a), 3);
}

In the above example you need to specify two additional intrinsics to get the equivalent of the following assembler code:

	mov	eax, dword ptr [esp + 4]
	movss	xmm0, dword ptr [eax]
	roundss xmm0, xmm0, 3
	ret
...or if you prefer...
	mov	eax, dword ptr [esp + 4]
	xorps	xmm0, xmm0
	roundss xmm0, dword ptr [eax], 3
	ret

To someone with an assembler background that seems counter-intuitive. Even though I am aware that adding those two intrinsics doesn't actually add two instructions, it just feels awkward. But that is just me, I guess.

It is the job of the compiler to identify opportunities to use the memory instruction forms. The programmer should not have to worry about it.

It would also be nice if you could write _mm_round_ss(0, a, 3) instead of _mm_round_ss(_mm_setzero_ps(), _mm_load_ss(a), 3) and let the compiler worry about it. I mean, if you are already abstracting things and deviating from assembler syntax, why stop at changing the number of operands? Why not allow numeric constants and memory operands smaller than __m128 as parameters?

My main gripe with those intrinsics is the lack of consistency.

David_K_Intel1
Employee

Igor,

My apologies for the tardy response. Our discussion was interrupted by my vacation.


Well in my opinion hidden input operand is not the same as real input operand. The difference between those two is that the real input operand can be reused as a source for multiple instructions, while hidden can be used just until it gets partially (or even worse fully) overwritten.


I disagree with this. There is no difference between what you call the "hidden input operand" and the "real input operand". The first operand to _mm_round_ss can be reused as a source for multiple instructions just like any other source operand, e.g.

	x = _mm_round_ss(a, b, 3);
	// x will be { round(b[0]), a[1], a[2], a[3] }

	y = _mm_add_ps(a, b);
	// y will be { a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] }

The compiler will detect that there are multiple uses of a and insert the necessary copy.


As for the memory operand I would personally prefer different intrinsic syntax and here is why:

__m128 somefunc0(float *a)
{
	return _mm_round_ss(_mm_setzero_ps(), _mm_load_ss(a), 3);
}


The request for separate memory forms of the scalar intrinsics is reasonable and would eliminate the need for the _mm_load_ss (though not the _mm_setzero_ps). The reason that approach was not taken is that it would effectively double the number of scalar intrinsics. They would all need reg-reg and reg-mem versions. By providing _mm_load_ss, we provide the full power of the instruction set without adding an excessive number of intrinsics.

If you feel strongly about this, you can always define your own reg-mem forms of the intrinsics using macros, e.g.

	#define _mm_roundmem_ss(a, p, imm) \
		_mm_round_ss((a), _mm_load_ss(p), (imm))


My main gripe with those intrinsics seems to be the lack of consistency


I can personally attest to the fact that consistency is one of the primary goals when defining new intrinsics. I will also admit that the intrinsics aren't perfect in this regard. But I do not think _mm_round_ss is an example of an inconsistency. Other intrinsics like _mm_sqrt_sd, _mm_cvtsd_ss, _mm_cvtss_sd, etc. all behave the same, correct way.

A few related intrinsics that ARE unfortunately inconsistent include _mm_sqrt_ss, _mm_rcp_ss, and _mm_rsqrt_ss. These all take a single input operand and define their result like this:

	r = _mm_sqrt_ss(a);

	r[0] = sqrt(a[0]);
	r[1] = a[1];
	r[2] = a[2];
	r[3] = a[3];

That restricts the programmer and the compiler to instruction forms like "sqrtss xmm0, xmm0".

David Kreitzer

levicki
Valued Contributor I

David, no problem.

There is no difference between what you call the "hidden input operand" and the "real input operand".

I was contrasting the _mm_round_ss() intrinsic with the assembler instruction ROUNDSS. There is a difference between the two: you cannot write ROUNDSS xmm1, xmm0 and reuse xmm1 in the next instruction, because its input value gets overwritten with the result. In assembler the programmer must explicitly preserve the hidden input operand (the destination), while with the intrinsic that is abstracted away.

That is exactly what I dislike because in my opinion it is inconsistent with assembler.

If intrinsics are meant to abstract things then they are doing a poor job, because you can't write _mm_round_ss(0, a, 3); if they are meant to mimic assembler as closely as possible then they are again doing a poor job, because they allow constructs which are not possible in assembler by use of a hidden destination.

Of course, now that the AVX info has gone public (I already read the manual and learned about the real three-operand non-destructive syntax) I understand why those intrinsics are designed this way, so this argument doesn't make much sense anymore with 2010 and Sandy Bridge in mind.
