AVX VBROADCAST instructions. Why is register operand not allowed?

AFog0 · ‎07-14-2008

The VBROADCAST instructions allow only memory operands, not register operands, as source, according to "Intel Advanced Vector Extensions Programming Reference". I think that register operands would be no less useful. Is there a reason for this limitation?

Quoc-Thai_L_Intel · ‎08-01-2008

Hello,

VPERM2F128 and VSHUFPS together can implement the same functionality as a VBROADCAST from register. Do you know of applications or algorithms where that functionality, if somehow improved in performance, would provide significant benefits?

- Thai
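
For readers who want to see why this two-instruction sequence gives a broadcast, here is a minimal sketch in plain C++ that models the lane behavior of VSHUFPS and VPERM2F128 on an array of 8 floats. It is a scalar model of the documented semantics, not intrinsics and not a statement about how the hardware implements them. The names `Vec8`, `model_vshufps`, `model_vperm2f128` and `broadcast_from_register` are hypothetical, and the immediate 0x00 used for both steps is just one choice that broadcasts element 0.

```cpp
#include <array>
#include <cstdint>

using Vec8 = std::array<float, 8>;  // stand-in for one YMM register of floats

// Model of VSHUFPS on 256 bits: the same selection is applied inside each
// 128-bit lane; elements 0-1 of a lane come from a, elements 2-3 from b.
Vec8 model_vshufps(const Vec8& a, const Vec8& b, std::uint8_t imm) {
    Vec8 r{};
    for (int lane = 0; lane < 8; lane += 4) {
        r[lane + 0] = a[lane + ((imm >> 0) & 3)];
        r[lane + 1] = a[lane + ((imm >> 2) & 3)];
        r[lane + 2] = b[lane + ((imm >> 4) & 3)];
        r[lane + 3] = b[lane + ((imm >> 6) & 3)];
    }
    return r;
}

// Model of VPERM2F128: each 128-bit half of the result is a whole lane picked
// from a or b (control bits 1:0), or zero if control bit 3 is set.
Vec8 model_vperm2f128(const Vec8& a, const Vec8& b, std::uint8_t imm) {
    Vec8 r{};  // value-initialized, so a zeroed half needs no extra work
    for (int half = 0; half < 2; ++half) {
        unsigned ctrl = (imm >> (4 * half)) & 0xF;
        if (ctrl & 8) continue;                 // zero this half
        const Vec8& src = (ctrl & 2) ? b : a;   // which source register
        int base = (ctrl & 1) ? 4 : 0;          // which lane of that source
        for (int i = 0; i < 4; ++i) r[4 * half + i] = src[base + i];
    }
    return r;
}

// Broadcast element 0 of x to all 8 positions using the two modeled steps.
Vec8 broadcast_from_register(const Vec8& x) {
    Vec8 t = model_vshufps(x, x, 0x00);   // low lane is now all x[0]
    return model_vperm2f128(t, t, 0x00);  // copy the low lane into both halves
}
```

VSHUFPS alone is not enough because it only shuffles within each 128-bit lane; the VPERM2F128 step is what carries the low lane over to the high lane.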

AFog0 · ‎08-01-2008

MADqle:
Do you know of applications or algorithms where that functionality, if somehow improved in performance, would provide significant benefits?

Yes indeed. Operations between a vector and a scalar are very common in vectorized code. For example, vector = vector + scalar or vector = vector * scalar. Such operations need to broadcast the scalar, which is more likely to be in a register than in memory.

Example:

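F32vec8 operator * (const F32vec8 a, const float b) {
   // All 8 elements of a multiplied by b
   // (assumes a is in YMM0 and b is in the low element of XMM1)
   __asm {
      VBROADCASTSS YMM1, XMM1    // the proposed register-source form
      VMULPS YMM0, YMM0, YMM1    // AVX arithmetic takes three operands
   }
}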
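
For comparison, here is a hedged sketch of what the same operation looks like with the memory-source form described in the programming reference. It uses the `_mm256_broadcast_ss` intrinsic, which takes a pointer, so the scalar has to be addressable even if it arrived in a register. The helper name `mul_by_scalar` is hypothetical, and the scalar fallback is only there so that the block also builds when AVX is not enabled at compile time.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Multiply the 8 floats at v by the scalar b, in place.
// Hypothetical helper, for illustration only.
void mul_by_scalar(float* v, float b) {
#if defined(__AVX__)
    // The broadcast intrinsic takes an address, matching the
    // memory-only source operand discussed in this thread.
    __m256 vb = _mm256_broadcast_ss(&b);
    __m256 va = _mm256_loadu_ps(v);               // unaligned load of 8 floats
    _mm256_storeu_ps(v, _mm256_mul_ps(va, vb));   // v[i] = v[i] * b
#else
    // Portable fallback with the same result when AVX is not enabled.
    for (std::size_t i = 0; i < 8; ++i) v[i] *= b;
#endif
}
```

Whether the compiler really spills `b` to memory for the broadcast or finds another sequence is up to the compiler; the sketch only shows the shape the source code takes.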

AFog0 · ‎08-02-2008

Maybe my point is more clear if we talk about automatic vectorization.
Example:

double a[4], b[4], c, d;
int i;
...
for (i=0; i<4; i++) a[i] = a[i] * b[i] + c + d;

A vectorizing compiler would first compute c+d, broadcast this into a YMM register, then do a vector multiply and a vector add. Everything could be in register variables if there are enough registers.
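
The transformation described here can be sketched in plain C++ as follows. This is only an illustration of the order of operations, not actual compiler output: `mul_add_scalars` and `s_vec` are hypothetical names, and the 4-element array `s_vec` stands in for the YMM register that would hold the broadcast value. Note that computing c+d first reassociates the floating-point additions, which a compiler would normally do only when relaxed floating-point rules are allowed.

```cpp
// Hypothetical helper mirroring the loop above: a[i] = a[i] * b[i] + c + d.
void mul_add_scalars(double a[4], const double b[4], double c, double d) {
    const double s = c + d;                // scalar add, done once
    const double s_vec[4] = {s, s, s, s};  // stand-in for the broadcast
    for (int i = 0; i < 4; ++i)
        a[i] = a[i] * b[i] + s_vec[i];     // one vector multiply, one vector add
}
```

If c and d are already in registers, the only step that has no direct single-instruction register form is filling `s_vec`, which is the point of the question.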

SHIH_K_Intel · ‎08-02-2008

Just adding my 2 cents.

It's true that your suggested syntax would be more intuitive from a programming perspective, whereas the alternate sequence Thai suggested would require one extra instruction.

Taking a heuristic perspective relative to OOO microarchitecture, it seems likely that the two approaches will differ negligibly in performance. If the scalar data source is used in multiple iterations (such as the example you show), I suspect the cost of "scalar add + broadcast" or "scalar add + shuffle + permf128" will be effectively hidden by a good OOO.

So, from an ROI perspective, it's likely to be an extra investment of Si and architecture with no performance benefit.

AFog0 · ‎08-06-2008

MADsjkuo:

If the scalar data source is used in multiple iterations (such as the example you show), I suspect the cost of "scalar add + broadcast" or "scalar add + shuffle + permf128" will be effectively hidden by a good OOO.

Only if the loop is large. The extra instruction incurs a cost if in a dependency chain or if instruction decoding is the bottleneck.

If the instruction is not sufficiently useful for a register source, then it is not sufficiently useful for a memory source either. What's the difference?

MADsjkuo:

So, from an ROI perspective, it's likely to be an extra investment of Si and architecture with no performance benefit.

I would expect it to be cheaper in terms of silicon to follow the same pattern as almost all other instructions, rather than making an unnecessary limitation for one specific instruction.

It is an extra complication to programmers and to compiler makers that they have to use different instructions depending on whether the scalar is a register variable or a memory variable.

SHIH_K_Intel · ‎08-06-2008

Only if the loop is large. The extra instruction incurs a cost if in a dependency chain or if instruction decoding is the bottleneck.

That's one situation where the cost of an extra instruction would be amortized. But the mental picture I was referring to is more along these lines:

A dependency chain is likely to start with some kind of load operation. Except for special cases such as pointer chasing or store forwarding, when a dependency chain starts with a load from a memory address that can be disambiguated, some portion of the cost of most dependency chains is hidden to some degree. Loads frequently make up about 30% of the instruction mix (roughly 1 in 3), which implies that a long, tight, monolithic register-only dependency chain would be highly unusual. Loads that hit the cache tend to have the beneficial effect of helping OOO hide the cost of dependency fragments.

Furthermore, if the register result is to be used across multiple iterations of the loop, and that register result is part of some other dependency chain, subsequent iterations where the register results were re-used are not really in a continuous dependency chain. Re-use is another situation that should help the OOO as well.

So, I tend to think that even in a short loop where the register result you want to broadcast is part of one dependency chain, the cost occurs only for one iteration of the loop.