This SSE instruction should gather 32-bit integer or single-precision float data into a destination XMM register from four different memory locations. Pointers to those locations could be stored either in memory or in another XMM register as 32-bit integers.
In 64-bit mode it could use the RSI register as a base address, and the values from the XMM register or from memory could then be used as 32-bit offsets from the base address in RSI.
Such an instruction would be most useful for interpolation, and in most cases it would have to gather adjacent or even overlapping values from memory, so various optimizations could be possible internally.
Could someone pass this idea to the CPU development team?
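To make the requested behavior concrete, here is a minimal scalar sketch in C of what such a gather might compute. The names `vec4f` and `gather4_ps` are hypothetical, and treating the indices as signed byte offsets from a base pointer is an assumption based on the RSI form described above, not a defined instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 128-bit XMM register holding four floats. */
typedef struct { float lane[4]; } vec4f;

/* Scalar model of the proposed gather (assumed semantics):
   lane i receives the 32-bit value found at base + offset[i],
   where offset[i] is a signed byte offset. */
static vec4f gather4_ps(const void *base, const int32_t offset[4])
{
    vec4f r;
    assert(base != NULL);
    for (int i = 0; i < 4; ++i) {
        /* memcpy keeps the read well-defined for any byte offset. */
        memcpy(&r.lane[i], (const char *)base + offset[i], sizeof(float));
    }
    return r;
}
```

Offsets may repeat or point at adjacent elements, which is the overlapping access pattern mentioned for interpolation.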
A gather instruction which does nothing to improve performance does not have the appeal of certain past additions to the SSE instruction set, which were used primarily to prevent code generated with new compiler options from running on older hardware.
A gather instruction would help automatic vectorization for algorithms such as interpolation, ray tracing and physics processing.
Surely it would help more than some of the recently added instructions (movddup comes to mind as a completely useless thing that can otherwise be accomplished with a shuffle).
What irritates me the most in your answer, however, is the part where you say "Many of us don't care" so arrogantly, as if you were the voice of God. Guess what? I don't care if any of you (whoever you might be) care or not! I love optimized code and I just adore useful instructions, not stupid and redundant ones.
And surely it would be more convenient to write:
    mov     esi, dword ptr [pix]        ; base
    gmovps  xmm0, xmmword ptr [ip]      ; ip
Instead of:
    mov     esi, dword ptr [pix]        ; base
    mov     eax, dword ptr [ip]         ; offset 0
    movd    xmm0, [esi + eax]
    mov     edx, dword ptr [ip + 4]     ; offset 1
    movd    xmm1, [esi + edx]
    unpcklps xmm0, xmm1                 ; xmm0 = e0, e1
    mov     eax, dword ptr [ip + 8]     ; offset 2
    movd    xmm2, [esi + eax]
    mov     edx, dword ptr [ip + 12]    ; offset 3
    movd    xmm3, [esi + edx]
    unpcklps xmm2, xmm3                 ; xmm2 = e2, e3
    movlhps xmm0, xmm2                  ; xmm0 = e0, e1, e2, e3 (register form; movhps takes a memory operand)
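For readers who work with compiler intrinsics rather than assembly, the emulation sequence above can be sketched roughly as follows. The function name `gather4_ps_sse` is hypothetical, and the sketch assumes the byte offsets are multiples of 4 so that each element is a properly aligned float.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Emulates a four-element float gather with plain SSE:
   four scalar loads, two unpacks, and one move of the low half to the high half.
   Assumption: every byte_off[i] is a multiple of sizeof(float). */
static __m128 gather4_ps_sse(const float *base, const int32_t byte_off[4])
{
    const char *p = (const char *)base;
    assert(base != NULL);

    /* Each scalar load fills lane 0 and zeroes lanes 1..3. */
    __m128 x0 = _mm_load_ss((const float *)(p + byte_off[0]));
    __m128 x1 = _mm_load_ss((const float *)(p + byte_off[1]));
    __m128 x2 = _mm_load_ss((const float *)(p + byte_off[2]));
    __m128 x3 = _mm_load_ss((const float *)(p + byte_off[3]));

    __m128 lo = _mm_unpacklo_ps(x0, x1);   /* e0, e1, 0, 0 */
    __m128 hi = _mm_unpacklo_ps(x2, x3);   /* e2, e3, 0, 0 */
    return _mm_movelh_ps(lo, hi);          /* e0, e1, e2, e3 */
}
```

The seven-instruction data path here is what a single gather instruction would replace in source code, although the number of memory accesses stays the same unless the hardware can merge them.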
Dear Igor,
Thank you for your suggestion. Please rest assured that efficient gather and scatter instructions were already on the wish-list of many Intel engineers, since these would increase the scope of (automatic) vectorization substantially. Topics of discussion that always arise are instruction orthogonality (32-bit float data and 32-bit indices work well, but what about all other combinations of the data width vs. the index width?) and efficient micro-architectural implementations. The CPU development teams are aware of the usefulness of these instructions and hopefully a satisfactory implementation will eventually find its way to our desktops!
Aart Bik
I thought it out thoroughly and I was hoping that it could be added to the list of Penryn SSE4 extensions, or at least to Nehalem if Penryn is already "sealed" (I heard it is complete). It would really be useful for all sorts of things.
As for orthogonality here is what I had in mind:
gmovps:
1. 32-bit floats, 32-bit pointers in 32-bit mode
2. 32-bit floats, 32-bit pointer offsets from RSI or RDI in 64-bit mode. This gives 2 GB offsets, which is more than enough, and even more than makes sense for such an operation because of data locality.
gmovpd:
1. 64-bit floats, 32-bit pointers in 32-bit mode in 2nd and 0th DWORD
2. 64-bit floats, 64-bit pointers in 64-bit mode
Both instruction forms could have either an additional immediate or general purpose register operand for shuffling (like pshufd) in case you need to change order of floats/doubles.
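As a rough illustration of the shuffle operand, the following C sketch models a gather whose result lanes are reordered by an 8-bit immediate. It assumes the immediate is encoded the way pshufd encodes it (two bits per destination lane selecting a source element); the proposal does not spell out an encoding, and the name `gather4_ps_shuffled` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a gather with a pshufd-style immediate (assumed encoding):
   bits [2*i+1 : 2*i] of imm8 choose which gathered element lands in dst[i]. */
static void gather4_ps_shuffled(float dst[4], const void *base,
                                const int32_t byte_off[4], uint8_t imm8)
{
    assert(dst != NULL && base != NULL);
    for (int i = 0; i < 4; ++i) {
        int src = (imm8 >> (2 * i)) & 3;   /* source element for lane i */
        memcpy(&dst[i], (const char *)base + byte_off[src], sizeof(float));
    }
}
```

Under this encoding, an immediate of 0xE4 leaves the gathered order unchanged and 0x1B reverses it.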
Of course, the names are just a suggestion. I believe that would be extremely easy to implement and I really wonder why it is not already part of the instruction set instead of some of the redundant SSE3 instructions.