Hello,
VPERM2F128 and VSHUFPS together can implement the same functionality as a VBROADCAST from register. Do you know of applications or algorithms where that functionality, if somehow improved in performance, would provide significant benefits?
- Thai
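For readers who want to see why the two-instruction sequence amounts to a broadcast, here is a minimal sketch that models the lane behavior of the two instructions in portable C++. It is not intrinsics code and not necessarily the exact sequence Thai had in mind. The type Ymm and the helpers shufps, perm2f128 and broadcast_from_register are hypothetical names made up for illustration, and the immediates (0x00 for both) are one possible choice.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of a 256-bit register holding 8 floats (element 0 = lowest).
using Ymm = std::array<float, 8>;

// Model of VSHUFPS ymm, ymm, ymm, imm8: the same 4-element shuffle is applied
// independently inside each 128-bit lane (elements 0-3 and 4-7).
Ymm shufps(const Ymm& a, const Ymm& b, std::uint8_t imm) {
    Ymm r{};
    for (int lane = 0; lane < 8; lane += 4) {
        r[lane + 0] = a[lane + ((imm >> 0) & 3)];
        r[lane + 1] = a[lane + ((imm >> 2) & 3)];
        r[lane + 2] = b[lane + ((imm >> 4) & 3)];
        r[lane + 3] = b[lane + ((imm >> 6) & 3)];
    }
    return r;
}

// Model of VPERM2F128 ymm, ymm, ymm, imm8: each 128-bit half of the result is
// picked from one of the four source halves, or zeroed if bit 3 of its control is set.
Ymm perm2f128(const Ymm& a, const Ymm& b, std::uint8_t imm) {
    Ymm r{};
    auto pick = [&](int ctl, int out) {
        for (int i = 0; i < 4; ++i) {
            float v = 0.0f;
            if (!(ctl & 8)) {
                const Ymm& src = (ctl & 2) ? b : a;
                v = src[((ctl & 1) ? 4 : 0) + i];
            }
            r[out + i] = v;
        }
    };
    pick(imm & 0xF, 0);  // low half of the result
    pick(imm >> 4, 4);   // high half of the result
    return r;
}

// Two-instruction emulation of a broadcast of element 0 from a register.
Ymm broadcast_from_register(const Ymm& x) {
    Ymm t = shufps(x, x, 0x00);    // each lane now holds four copies of its own element 0
    return perm2f128(t, t, 0x00);  // copy the low lane into both halves
}
```

After the shuffle, the high lane still holds copies of element 4 rather than element 0, which is why the second instruction is needed.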
MADqle:
Do you know of applications or algorithms where that functionality, if somehow improved in performance, would provide significant benefits?
Yes indeed. Operations between a vector and a scalar are very common in vectorized code. For example, vector = vector + scalar or vector = vector * scalar. Such operations need to broadcast the scalar, which is more likely to be in a register than in memory.
Example:
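F32vec8 operator * (const F32vec8 a, const float b) {
// All 8 elements of a multiplied by b.
// Illustration only: assumes a arrives in YMM0, b in XMM1, and the result is returned in YMM0.
__asm {
VBROADCASTSS YMM1, XMM1   // broadcast b from a register to all 8 elements
VMULPS YMM0, YMM0, YMM1   // AVX arithmetic instructions take three operands
}
}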
Example:
double a[4], b[4], c, d;
int i;
...
for (i=0; i<4; i++) a[i] = a[i] * b[i] + c + d;
A vectorizing compiler would first compute c+d, broadcast this into a YMM register, then do a vector multiply and a vector add. Everything could be in register variables if there are enough registers.
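The transformation described above can be sketched in portable C++ as follows. This is only an illustration of the order of operations, not actual compiler output. Vec4d, broadcast, mul, add and update are hypothetical names standing in for a YMM register and the corresponding vector instructions. Note that computing c+d first reassociates the original expression, which is evaluated as (a[i]*b[i] + c) + d, so a compiler would normally do this only under relaxed floating-point rules.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for a YMM register holding 4 doubles.
using Vec4d = std::array<double, 4>;

// Stand-in for a broadcast from a scalar register to all 4 elements.
Vec4d broadcast(double s) { return {s, s, s, s}; }

// Stand-in for a vector multiply (element-wise).
Vec4d mul(const Vec4d& x, const Vec4d& y) {
    Vec4d r{};
    for (int i = 0; i < 4; ++i) r[i] = x[i] * y[i];
    return r;
}

// Stand-in for a vector add (element-wise).
Vec4d add(const Vec4d& x, const Vec4d& y) {
    Vec4d r{};
    for (int i = 0; i < 4; ++i) r[i] = x[i] + y[i];
    return r;
}

// Vectorized form of: for (i=0; i<4; i++) a[i] = a[i] * b[i] + c + d;
void update(Vec4d& a, const Vec4d& b, double c, double d) {
    const double s = c + d;          // scalar add, done once, result in a register
    const Vec4d vs = broadcast(s);   // the broadcast from register under discussion
    a = add(mul(a, b), vs);          // one vector multiply, one vector add
}
```

The scalar sum never has to go through memory here, which is the case the register form of the broadcast is meant to serve.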
Just adding my 2 cents.
It's true that your suggested syntax would be more intuitive from a programming perspective, whereas the alternative sequence Thai suggested would require one extra instruction.
Taking a heuristic perspective relative to OOO (out-of-order) microarchitecture, it seems likely that the performance difference between the two approaches will be negligible. If the scalar data source is used in multiple iterations (such as the example you show), I suspect the cost of "scalar add + broadcast" or "scalar add + shuffle + permf128" will be effectively hidden by a good OOO.
So, from an ROI perspective, it's likely to be an extra investment of Si and architecture with no performance benefit.
MADsjkuo:
If the scalar data source is used in multiple iterations (such as the example you show), I suspect the cost of "scalar add + broadcast" or "scalar add + shuffle + permf128" will be effectively hidden by a good OOO.
Only if the loop is large. The extra instruction incurs a cost if in a dependency chain or if instruction decoding is the bottleneck.
If the instruction is not sufficiently useful for a register source, then it is not sufficiently useful for a memory source either. What's the difference?
MADsjkuo:
So, from an ROI perspective, it's likely to be an extra investment of Si and architecture with no performance benefit.
I would expect it to be cheaper in terms of silicon to follow the same pattern as almost all other instructions, rather than making an unnecessary limitation for one specific instruction.
It is an extra complication to programmers and to compiler makers that they have to use different instructions depending on whether the scalar is a register variable or a memory variable.
Only if the loop is large. The extra instruction incurs a cost if in a dependency chain or if instruction decoding is the bottleneck.
That's one situation where the cost of an extra instruction would be amortized. But the mental picture I was referring to is more along these lines:
A dependency chain is likely to start with some kind of load operation. Except for special cases such as pointer chasing or store forwarding, when a dependency chain starts with a load from a memory address that can be disambiguated, some portion of the chain's cost is hidden. Loads frequently make up about 30% of the instruction mix (roughly 1 in 3), which implies that a long, tight, monolithic register-only dependency chain would be highly unusual. Loads that hit the cache tend to have the beneficial effect of helping the OOO hide the cost of dependency fragments.
Furthermore, if the register result is to be used across multiple iterations of the loop, and that register result is part of some other dependency chain, subsequent iterations where the register results were re-used are not really in a continuous dependency chain. Re-use is another situation that should help the OOO as well.
So I tend to think that even in a short loop where the register result you want to broadcast is part of a dependency chain, the cost is incurred in only one iteration of the loop.
