Intel® ISA Extensions

Missing instruction in SSE: PSLLDQ with _bit_ shift amount?

geofflangdale
Beginner
Is there a good reason that the PS{L,R}LDQ instructions take a 'byte shift' argument rather than a bit shift argument?

As far as I can tell, there is no easy way to shift a 128-bit value in an XMM register, either left or right, by a bit offset. A multi-instruction sequence using byte shifts, bit shifts in the other direction, and ORs is possible, but it is considerably less efficient.
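
For reference, here is one way that workaround looks with SSE2 intrinsics (a minimal sketch, assuming a shift count between 1 and 63; the function name is mine):

#include <emmintrin.h>  /* SSE2 */

/* Shift the full 128-bit value in v left by n bits (1..63) using only
   per-lane bit shifts, a byte shift, and an OR. */
static __m128i shl128_bits(__m128i v, int n)
{
    __m128i hi    = _mm_sll_epi64(v, _mm_cvtsi32_si128(n));      /* PSLLQ: shift each 64-bit lane  */
    __m128i carry = _mm_srl_epi64(v, _mm_cvtsi32_si128(64 - n)); /* bits that cross the lane border */
    carry = _mm_slli_si128(carry, 8);                            /* PSLLDQ: move low-lane carry up */
    return _mm_or_si128(hi, carry);                              /* recombine                      */
}

Even this special case (count below 64) takes half a dozen instructions, which is the kind of inefficiency I mean.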

It seems like a minor change to make PS{L,R}LDQ and possibly PALIGNR considerably more functional - these are currently underutilizing the imm8 field at any rate. Maybe for SSE6? :-)

Intel_Software_Netw1

Hi Geoff,

One of our engineers provided this response, along with a request for clarification:

The question is correct: it is hard to do SIMD bit shifts, rather than byte-wise shifts, with the current instruction set. Unfortunately, it is not a "minor change" to introduce an instruction to do such bitwise shifts. There is much more to the change than simply fitting the shift distance into the immediate byte -- the hardware needed to actually accomplish the bit shift is the limiting issue.

If you have a use case as to why the operation is useful, along with the application that would benefit from the operation, that would be interesting to hear. In general, we try to design new instructions to serve specific needs, rather than to just supply "missing" instructions. From a practical point of view, there are many such "missing" instructions -- the more interesting question is how useful that missing instruction is for a real application.

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Contact us

geofflangdale
Beginner
Unfortunately, I'm not at liberty to post the details of our use case on a public forum. I would be glad to take this to email: please ask your contact to email me at geoff.langdale AT GMAIL DOT COM and I will explain the application area and the justification for the bit shift.

levicki
Valued Contributor I

Lexi, while we are at it, is there any chance that we will finally get gather/scatter SIMD instructions, at least for 32-bit int/float datatypes? Those would be useful for so many things, because they would reduce the pressure on GPRs for address calculation.

I would make them as follows:

GMOVPS	xmmreg, xmmreg/mem128, reg32, imm8 ; gather
SMOVPS	xmmreg/mem128, xmmreg, reg32, imm8 ; scatter

xmmreg	for scatter contains four 32-bit floats to be written out
	for gather it receives values which are read from memory

xmmreg/mem128	contains four offsets from base pointer

reg32	GPR containing base address pointer

imm8	shuffle value like for SHUFPS

Fictional code example:

	lea	esi, dataset			; base of the first row
	lea	edx, [esi + rowsize]		; base of the second row
	lea	eax, destination		; output buffer
	lea	edi, offset_table		; table of per-element offsets
	mov	ecx, dword ptr [count]		; number of 4-element groups
loop:
	movdqa	xmm1, xmmword ptr [edi]		; offsets for the first row
	movdqa	xmm3, xmmword ptr [edi + 16]	; offsets for the second row
	gmovps	xmm0, xmm1, esi, 0xDD		; gather four floats from the first row
	gmovps	xmm2, xmm3, edx, 0x88		; gather four floats from the second row
	movaps	xmm6, xmmword ptr [eax]		; current contents of the destination
	subps	xmm2, xmm0			; delta between the two rows
	mulps	xmm2, xmm7			; scale by the weights in xmm7 (preloaded)
	addps	xmm2, xmm0			; interpolated values
	addps	xmm2, xmm6			; accumulate into the destination
	movaps	xmmword ptr [eax], xmm2		; store the result
	add	esi, 16
	add	edx, 16
	add	eax, 16
	add	edi, 32
	sub	ecx, 1
	jnz	loop

NOTE: I know that the above loop could be written much better (perhaps using a single register as an index, etc.), but it is just an example off the top of my head. Without an instruction such as gmovps, one has to perform anywhere between four and eight loads, two shuffles, and a bunch of GPR pointer math to get a vector from scattered data. I cannot be 100% sure that it would be faster; hopefully someone at Intel can test it in a simulator.
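
For comparison, here is roughly what that manual gather looks like with SSE intrinsics today (a minimal sketch; the function name and the use of element indices rather than byte offsets are my own assumptions):

#include <xmmintrin.h>  /* SSE */

/* Gather four floats from base + idx[i] with scalar loads and shuffles --
   the work a single gmovps would replace. */
static __m128 gather4_ps(const float *base, const int idx[4])
{
    __m128 a  = _mm_load_ss(base + idx[0]);
    __m128 b  = _mm_load_ss(base + idx[1]);
    __m128 c  = _mm_load_ss(base + idx[2]);
    __m128 d  = _mm_load_ss(base + idx[3]);
    __m128 ab = _mm_unpacklo_ps(a, b);   /* a b _ _ */
    __m128 cd = _mm_unpacklo_ps(c, d);   /* c d _ _ */
    return _mm_movelh_ps(ab, cd);        /* a b c d */
}

That is four scalar loads plus three shuffles per vector, before any of the GPR pointer math.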

Another thing I always wanted to have is FRACPS, which could help your compiler with some non-vectorizable loops and is generally very useful.

	fracps		xmm1, xmm0

This would simply do the following as one operation:

	movaps		xmm1, xmm0
	cvttps2dq	xmm0, xmm0
	cvtdq2ps	xmm0, xmm0
	subps		xmm1, xmm0 ; xmm1 has fractional part

You could even add an imm8 parameter and use roundps instead of cvttps2dq.
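
With today's intrinsics the proposed fracps has to be spelled out along these lines (a minimal sketch; frac_ps is a hypothetical helper name):

#include <emmintrin.h>  /* SSE2 */

/* Fractional part via truncate-to-integer and back, exactly as in the
   cvttps2dq / cvtdq2ps / subps sequence above. */
static __m128 frac_ps(__m128 x)
{
    __m128 ipart = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));  /* cvttps2dq + cvtdq2ps */
    return _mm_sub_ps(x, ipart);                           /* x - trunc(x)         */
}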

One step beyond that would be fipsdq (frac-int-ps-dq).

	fipsdq		xmm2, xmm1, xmm0 (could be implicit source)

	movaps		xmm1, xmm0
	cvttps2dq	xmm2, xmm0 ; xmm2 has integer part
	cvtdq2ps	xmm0, xmm2
	subps		xmm1, xmm0 ; xmm1 has fractional part
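
And the same thing for the proposed fipsdq, written with current SSE2 intrinsics (a sketch; the struct and names are mine):

#include <emmintrin.h>  /* SSE2 */

struct int_frac { __m128i ipart; __m128 frac; };

/* Split x into its integer part (as packed int32) and its fractional part,
   mirroring the cvttps2dq / cvtdq2ps / subps sequence above. */
static struct int_frac split_ps(__m128 x)
{
    struct int_frac r;
    r.ipart = _mm_cvttps_epi32(x);                      /* cvttps2dq: integer part */
    r.frac  = _mm_sub_ps(x, _mm_cvtepi32_ps(r.ipart));  /* cvtdq2ps + subps: frac  */
    return r;
}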

In my opinion, those two instructions I just proposed would have a good chance of working much faster than the above code, since they probably wouldn't require full float->int->float conversions.

This would bring a considerable speedup for interpolation, where you need to separate the integer and fractional parts for indexing and multiplication. Of course, in both versions it would be nice if xmm0 (the source) didn't get trashed, so it could be reused. Again, the code is just an example, not exactly what it should be.

So Lexi, could you please pass this to the proper department so they can consider it for some future SIMD instruction set?

happyIntelCamper
Beginner
Quoting - Igor Levicki

Yes, I would have to agree that a gather/scatter instruction would allow many codes that currently do not vectorize to do so easily.