Idea for a new SIMD instruction

levicki · 01-05-2007

This SSE instruction should gather 32-bit integer or single-precision float data into a destination XMM register from four different memory locations. Pointers to those locations could be stored either in memory or in another XMM register as 32-bit integers.

In 64-bit mode it could use the RSI register as a base address, and the values from the XMM register or from memory could then be used as 32-bit offsets from that base.
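To make the intended semantics concrete, here is a rough scalar model in C of what such a gather would do. It is only a sketch: the names `vec4x32` and `gather4x32` are made up for illustration, and it assumes the offsets are byte offsets from the base address, which is how the base-plus-offset form above reads.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an XMM register holding four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec4x32;

/* Model of the proposed gather: lane i is loaded from base + offset[i].
   Assumption: offsets are signed byte offsets, not element indices. */
static vec4x32 gather4x32(const void *base, const int32_t offset[4])
{
    vec4x32 r;
    for (int i = 0; i < 4; ++i) {
        /* memcpy keeps the load free of alignment and aliasing problems */
        memcpy(&r.lane[i], (const unsigned char *)base + offset[i],
               sizeof r.lane[i]);
    }
    return r;
}
```

The lanes are kept as raw 32-bit values so the same model covers both the integer and the float case; the caller decides how to interpret them. Nothing stops two offsets from being equal or adjacent, which is the overlapping case mentioned below.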

Such an instruction would be most useful for interpolation. In most cases it would have to gather adjacent or even overlapping values from memory, so various internal optimizations could be possible.

Could someone pass this idea to the CPU development team?

TimP · 01-07-2007

We have proposed a gather instruction from time to time, and it comes up regularly for reconsideration. Many of us don't care, unless it can be used to overcome performance obstacles. If it does get accepted, it may not be announced until it becomes available in a compiler which supports production hardware.
A gather instruction which does nothing to improve performance does not have the appeal of certain past additions to the SSE instruction set, which were used primarily to prevent code generated with new compiler options from running on older hardware.

levicki · 01-08-2007

A gather instruction would help automatic vectorization of algorithms such as interpolation, raytracing and physics processing.

Surely it would help more than some of the recently added instructions (movddup comes to mind as a completely useless thing that can otherwise be accomplished with a shuffle).

What irritates me the most in your answer, however, is the part where you say "Many of us don't care" so arrogantly, as if you were the voice of God. Guess what? I don't care if any of you (whoever you might be) care or not! I love optimized code and I just adore useful instructions, not stupid and redundant ones.

And surely it would be more convenient to write:

	mov		esi, dword ptr [pix] ; base
	gmovps		xmm0, xmmword ptr [ip] ; gather using the four 32-bit offsets at ip

Instead of:

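	mov		esi, dword ptr [pix] ; base
	mov		eax, dword ptr [ip] ; offset 0
	movd		xmm0, [esi + eax] ; xmm0 = a0, 0, 0, 0
	mov		edx, dword ptr [ip + 4] ; offset 1
	movd		xmm1, [esi + edx] ; xmm1 = a1, 0, 0, 0
	unpcklps	xmm0, xmm1 ; xmm0 = a0, a1, 0, 0
	mov		eax, dword ptr [ip + 8] ; offset 2
	movd		xmm2, [esi + eax] ; xmm2 = a2, 0, 0, 0
	mov		edx, dword ptr [ip + 12] ; offset 3
	movd		xmm3, [esi + edx] ; xmm3 = a3, 0, 0, 0
	unpcklps	xmm2, xmm3 ; xmm2 = a2, a3, 0, 0
	movlhps		xmm0, xmm2 ; xmm0 = a0, a1, a2, a3 (movhps takes only a memory operand)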

?
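For readers who work in C rather than assembly, the long sequence above maps roughly onto the SSE intrinsics below. This is a sketch only: `gather_ps` is a made-up helper name, the offsets are assumed to be byte offsets that are multiples of 4, and a plain scalar fallback is included so the snippet still builds on targets without SSE.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>

/* Hypothetical helper: emulates the proposed gather with four scalar loads
   and three shuffles, mirroring the assembly sequence. */
static void gather_ps(float out[4], const void *base, const int32_t off[4])
{
    const char *b = (const char *)base;
    __m128 a0 = _mm_load_ss((const float *)(b + off[0])); /* a0, 0, 0, 0 */
    __m128 a1 = _mm_load_ss((const float *)(b + off[1])); /* a1, 0, 0, 0 */
    __m128 a2 = _mm_load_ss((const float *)(b + off[2])); /* a2, 0, 0, 0 */
    __m128 a3 = _mm_load_ss((const float *)(b + off[3])); /* a3, 0, 0, 0 */
    __m128 lo = _mm_unpacklo_ps(a0, a1);                  /* a0, a1, 0, 0 */
    __m128 hi = _mm_unpacklo_ps(a2, a3);                  /* a2, a3, 0, 0 */
    _mm_storeu_ps(out, _mm_movelh_ps(lo, hi));            /* a0, a1, a2, a3 */
}
#else
/* Portable fallback with the same result, one lane at a time. */
static void gather_ps(float out[4], const void *base, const int32_t off[4])
{
    for (int i = 0; i < 4; ++i)
        memcpy(&out[i], (const char *)base + off[i], sizeof out[i]);
}
#endif
```

Either way it takes seven operations to fill one register, which is the overhead a single gather instruction would hide.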

Intel_C_Intel · 01-08-2007

Dear Igor,

Thank you for your suggestion. Please rest assured that efficient gather and scatter instructions were already on the wish-list of many Intel engineers, since these would increase the scope of (automatic) vectorization substantially. Topics of discussion that always arise are instruction orthogonality (32-bit float data and 32-bit indices work well, but what about all other combinations of the data width vs. the index width?) and efficient micro-architectural implementations. The CPU development teams are aware of the usefulness of these instructions and hopefully a satisfactory implementation will eventually find its way to our desktops!

Aart Bik

http://www.aartbik.com/

levicki · 01-22-2007

Hello Aart,

I have thought it through thoroughly and I was hoping that it could be added to the list of Penryn SSE4 extensions, or at least to Nehalem if Penryn is already "sealed" (I heard it is complete). It would be really useful for all sorts of things.

As for orthogonality, here is what I had in mind:

gmovps:
1. 32-bit floats, 32-bit pointers in 32-bit mode
2. 32-bit floats, 32-bit pointer offsets from RSI or RDI in 64-bit mode. That gives 2GB offsets, which is more than enough, and even more than makes sense for such an operation because of data locality.

gmovpd:
1. 64-bit floats, 32-bit pointers in 32-bit mode (stored in the 0th and 2nd DWORD)
2. 64-bit floats, 64-bit pointers in 64-bit mode

Both instruction forms could have either an additional immediate or a general purpose register operand for shuffling (like pshufd) in case you need to change the order of the floats/doubles.
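As a rough illustration of the shuffle operand and of the 64-bit form, here is a scalar C model. Everything in it is an assumption made for the sake of the example: the function names are invented, the offsets are treated as byte offsets, and the immediate is decoded the way pshufd decodes its selector (two bits per destination lane).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of gmovps with a pshufd-style immediate:
   destination lane i receives gathered element (imm8 >> 2*i) & 3. */
static void gmovps_model(float out[4], const void *base,
                         const int32_t off[4], unsigned imm8)
{
    float tmp[4];
    for (int i = 0; i < 4; ++i)
        memcpy(&tmp[i], (const char *)base + off[i], sizeof tmp[i]);
    for (int i = 0; i < 4; ++i)
        out[i] = tmp[(imm8 >> (2 * i)) & 3u];
}

/* Hypothetical model of gmovpd in 64-bit mode: two doubles,
   each addressed with its own 64-bit byte offset. */
static void gmovpd_model(double out[2], const void *base,
                         const int64_t off[2])
{
    for (int i = 0; i < 2; ++i)
        memcpy(&out[i], (const char *)base + off[i], sizeof out[i]);
}
```

With this encoding an immediate of 0xE4 leaves the gathered order unchanged and 0x1B reverses it, the same as for pshufd.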

Of course, the names are just a suggestion. I believe this would be extremely easy to implement, and I really wonder why it is not already part of the instruction set instead of some of the redundant SSE3 instructions.