As far as I can tell, there is no easy way to shift a 128-bit value in an XMM register, either left or right, by a bit offset. A multi-instruction sequence using byte shifts, bit shifts in the other direction, and ORs is possible, but it is considerably less efficient.
It seems like a minor change would make PS{L,R}LDQ and possibly PALIGNR considerably more functional; they currently underutilize the imm8 field at any rate. Maybe for SSE6? :-)
Hi Geoff,
One of our engineers provided this response, along with a request for clarification:
The question is correct that it is hard to do SIMD bit shifts, rather than byte-wise shifts, with the current instruction set. Unfortunately, it is not a "minor change" to introduce an instruction that performs such bitwise shifts. There is much more to the change than simply fitting the shift distance into the immediate byte -- the hardware to actually accomplish the bit shift is the limiting issue.
If you have a use case showing why the operation is useful, along with the application that would benefit from it, that would be interesting to hear. In general, we try to design new instructions to serve specific needs, rather than to just supply "missing" instructions. From a practical point of view, there are many such "missing" instructions -- the more interesting question is how useful a given missing instruction is for a real application.
==
Lexi S.
Intel Software Network Support
Lexi, while we are at it, is there any chance that we will finally get gather/scatter SIMD instructions, at least for 32-bit int/float datatypes? They would be useful for so many things, because they would reduce the pressure on GPRs for address calculation.
I would make them as follows:
GMOVPS xmmreg, xmmreg/mem128, reg32, imm8   ; gather
SMOVPS xmmreg/mem128, xmmreg, reg32, imm8   ; scatter

xmmreg         for scatter, contains four 32-bit floats to be written out;
               for gather, it receives the values read from memory
xmmreg/mem128  contains four offsets from the base pointer
reg32          GPR containing the base address pointer
imm8           shuffle value, as for SHUFPS

Fictive code example:

        lea     esi, dataset
        lea     edx, [esi + rowsize]
        lea     eax, destination
        lea     edi, offset_table
        mov     ecx, dword ptr [count]
loop:
        movdqa  xmm1, xmmword ptr [edi]
        movdqa  xmm3, xmmword ptr [edi + 16]
        gmovps  xmm0, xmm1, esi, 0xDD
        gmovps  xmm2, xmm3, edx, 0x88
        movaps  xmm6, xmmword ptr [eax]
        subps   xmm2, xmm0
        mulps   xmm2, xmm7
        addps   xmm2, xmm0
        addps   xmm2, xmm6
        movaps  xmmword ptr [eax], xmm0
        add     esi, 16
        add     edx, 16
        add     eax, 16
        add     edi, 32
        sub     ecx, 1
        jnz     loop
NOTE: I know the above loop could be written much better (perhaps using a single register as an index, etc.), but it is just an example off the top of my head. Without an instruction such as gmovps, one has to perform anywhere from four to eight loads, two shuffles, and a bunch of GPR pointer math to get a vector from scattered data. I cannot be 100% sure it would be faster; hopefully someone at Intel can test it in a simulator.
Another thing I have always wanted is FRACPS, which could help your compiler with some non-vectorizable loops and is generally very useful.
fracps xmm1, xmm0

This would simply do the following as one operation:

        movaps    xmm1, xmm0
        cvttps2dq xmm0, xmm0
        cvtdq2ps  xmm0, xmm0
        subps     xmm1, xmm0    ; xmm1 has fractional part

You could even add an imm8 parameter and use roundps instead of cvttps2dq. One step beyond that would be fipsdq (frac-int-ps-dq):

fipsdq xmm2, xmm1, xmm0   ; xmm0 could be an implicit source

        movaps    xmm1, xmm0
        cvttps2dq xmm2, xmm0    ; xmm2 has integer part
        cvtdq2ps  xmm0, xmm2
        subps     xmm1, xmm0    ; xmm1 has fractional part
In my opinion, the two instructions I just proposed would stand a good chance of running much faster than the code above, since they probably wouldn't require full float->int->float conversions.
This would bring a considerable speedup for interpolation, where you need to separate the integer and fractional parts for indexing and multiplication. Of course, in both versions it would be nice if xmm0 (the source) weren't trashed, so it could be reused. Again, the code is just an example, not exactly what it should be.
So Lexi, could you please pass this along to the proper department so they can consider it for a future SIMD extension?
Yes, I would have to agree: a gather/scatter instruction would allow many codes that currently do not vectorize to do so easily.