Hi all,
I'm looking for the smartest (i.e. fastest) way to insert a DWORD into an AVX register.
Here is what I found so far:
AVX vinsertps doesn't work because it clears the upper 128 bits, and the immediate value can't address the upper 128 bits anyway.
AVX vpinsrd doesn't work for the same reason and, truly sad unless the docs are wrong, hasn't been promoted to 256 bits in AVX2, even though the immediate value has room to encode the insert position in 256-bit vectors as well.
There are lots of multi-instruction workarounds I could think of, but I had hoped that the Intel engineers have a smart trick for this basic operation that I overlooked.
Thanks,
Elmar
Hi Sergey,
Thanks, but vpbroadcastd fills the entire vector. I want to insert a single dword at a given location (like vpinsrd), and I want to do that fast, without consuming an extra temporary register. For example, combining a vpbroadcastd with a vpblendd is a workaround that needs an extra register.
CU,
Elmar
@Sergey Kostrov: These are multi-instruction constructs, which basically come down to broadcasts or moves plus shuffles. And the OP seems to want to inject a single dword into an existing register filled with data.
I don't think there is a single instruction for this. But depending on the surrounding code and actual requirements you could use vshufps on the two registers (the one with the old contents and the other with the loaded dword). The downside is that the lower and upper halves of ymm are shuffled the same way, so you'll have to insert two values this way. This can be mitigated by copying the half to be preserved from the original data register to the dword register first (see vperm2f128). But you save a mask register that would be needed in case of blend.
Elmar wrote:
> AVX vinsertps doesn't work because it clears the upper 128 bits and the immediate value can't address the upper 128 bits anyway
Simply use vinsertps followed by vinsertf128; this is the fastest available option AFAIK. I use it for my AVX legacy generic gather path, detailed here for example: http://software.intel.com/en-us/comment/reply/285867/1740679
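The vinsertps plus vinsertf128 suggestion could be sketched with intrinsics roughly as follows. This is only an illustration, not the code from the linked gather path: the helper names `insert_in_lane` and `insert_dword_lane` are made up, the `target("avx")` attribute assumes GCC or Clang, and the vextractf128 used to fetch the upper lane is my assumption about how to apply the idea to a register that already holds data.

```c
#include <immintrin.h>

/* Hypothetical helper: vinsertps needs an immediate, so a runtime position
   is dispatched through a switch. imm8 bits 5:4 select the destination slot. */
__attribute__((target("avx")))
static __m128 insert_in_lane(__m128 lane, __m128 s, int pos)
{
    switch (pos & 3) {
    case 0:  return _mm_insert_ps(lane, s, 0x00);
    case 1:  return _mm_insert_ps(lane, s, 0x10);
    case 2:  return _mm_insert_ps(lane, s, 0x20);
    default: return _mm_insert_ps(lane, s, 0x30);
    }
}

/* Hypothetical wrapper: replaces element idx (0..7) of v with x.
   Works on arrays so callers do not need AVX enabled themselves. */
__attribute__((target("avx")))
void insert_dword_lane(float v[8], float x, int idx)
{
    __m256 old = _mm256_loadu_ps(v);
    __m128 s = _mm_set_ss(x);              /* x in element 0 of an xmm */
    __m256 r;
    if ((idx & 7) < 4) {
        /* Lower lane: insert in the xmm view, then write the lane back. */
        __m128 lo = insert_in_lane(_mm256_castps256_ps128(old), s, idx);
        r = _mm256_insertf128_ps(old, lo, 0);
    } else {
        /* Upper lane: extract it, insert, then write the lane back. */
        __m128 hi = insert_in_lane(_mm256_extractf128_ps(old, 1), s, idx);
        r = _mm256_insertf128_ps(old, hi, 1);
    }
    _mm256_storeu_ps(v, r);
}
```

In real code the position is usually a compile-time constant, so the switch and the branch would disappear and only the two or three vector instructions remain.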
andysem wrote:
> I don't think there is a single instruction for this. But depending on the surrounding code and actual requirements you could use vshufps on the two registers (the one with the old contents and the other with the loaded dword). The downside is that the lower and upper halves of ymm are shuffled the same way, so you'll have to insert two values this way. This can be mitigated by copying the half to be preserved from the original data register to the dword register first (see vperm2f128). But you save a mask register that would be needed in case of blend.
But vshufps can only insert a QWORD, since two adjacent DWORDs must come from the same operand, no?
To insert a DWORD, I currently use vinsertps or vpermilps to place the DWORD at the right spot in an unused register, and then vblendps to move the DWORD into the target register (note that vblendps takes an immediate blend factor, not a mask register). If the DWORD crosses a lane, I need a third instruction for the cross-lane shuffle.
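A minimal sketch of the place-then-blend idea, assuming GCC or Clang (for the `target("avx")` attribute) and a hypothetical helper name `insert_dword_blend`. Instead of hand-picking vinsertps or vpermilps, it lets the compiler broadcast the value into a temporary with `_mm256_set1_ps`, so the exact placing instructions may differ from the ones described above. The blend itself uses an immediate, as noted.

```c
#include <immintrin.h>

/* Hypothetical helper: replaces element idx (0..7) of v with x.
   The temporary register holds x in every slot; the immediate of
   vblendps then picks exactly one slot from it. */
__attribute__((target("avx")))
void insert_dword_blend(float v[8], float x, int idx)
{
    __m256 old = _mm256_loadu_ps(v);
    __m256 tmp = _mm256_set1_ps(x);   /* costs one extra register */
    __m256 r;
    switch (idx & 7) {                /* blend mask must be an immediate */
    case 0:  r = _mm256_blend_ps(old, tmp, 0x01); break;
    case 1:  r = _mm256_blend_ps(old, tmp, 0x02); break;
    case 2:  r = _mm256_blend_ps(old, tmp, 0x04); break;
    case 3:  r = _mm256_blend_ps(old, tmp, 0x08); break;
    case 4:  r = _mm256_blend_ps(old, tmp, 0x10); break;
    case 5:  r = _mm256_blend_ps(old, tmp, 0x20); break;
    case 6:  r = _mm256_blend_ps(old, tmp, 0x40); break;
    default: r = _mm256_blend_ps(old, tmp, 0x80); break;
    }
    _mm256_storeu_ps(v, r);
}
```

With a compile-time index the switch collapses to a single blend, and the remaining cost is whatever the compiler emits to get the value into the temporary.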
I had hoped that Intel engineers would immediately fire the optimal solution at me (in terms of false dependencies, latency etc.), but it seems that they are busy (hopefully cleaning up the AVX2 manual #319433-014, because that's full of bugs ;-))...
Thanks,
Elmar
> But vshufps can only insert a QWORD, since two adjacent DWORDS must come from the same operand,no?
You're right, sorry for the confusion. It seems, inserts and blends are the way to go.