I have yet to find the optimal way. There are two and a half solutions, and none of them is flawless.
1. vbroadcasti128 ymm0, xmmword ptr [...]
The second operand has to be a 128-bit memory operand. That's perfect for loading global constants while saving 16 bytes, but not if you already have the source in an xmm register. Why is there no register-to-register form? Even the intrinsic takes a value type, which the compiler has to save first, to reload, crazy.
2. cast xmm0 to ymm0, vinserti128 ymm0, ymm0, xmm0, 1
That looks like the obvious choice, but when you think about it, there's a RAW dependency: whatever you were doing with the register has to finish before the insert can execute. Another option is to use a second register: vpxor it with itself first, then insert into it twice, once per lane. Not sure if it's worth it.
3. cast xmm0 to ymm0, vperm2i128 ymm0, ymm0, ymm0, 0
Unless the hardware has some smart check for whether this instruction overwrites both lanes, this one is dependent as well, same problem as with vinserti128.
On second thought, insert/permute uses the same register as the source, so it has to wait for the result anyway. Which one would you recommend? Can they execute in parallel on different ports?
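For reference, here is a rough sketch of how the three options above look as C intrinsics, with a small check that they all produce the same 256-bit value. It assumes GCC or Clang on x86-64 (the target attribute and __builtin_cpu_supports are compiler-specific), and the function names are made up for illustration. Note that the intrinsic only expresses intent: the compiler is free to pick a different instruction than the one the name suggests.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang. Enables AVX2 per function, so no -mavx2 flag is needed. */
#define AVX2_FN __attribute__((target("avx2")))

/* Option 1: the broadcast intrinsic (documented as vbroadcasti128). */
AVX2_FN static __m256i bcast_broadcast(__m128i x)
{
    return _mm256_broadcastsi128_si256(x);
}

/* Option 2: cast xmm to ymm, then insert the same value into the upper lane. */
AVX2_FN static __m256i bcast_insert(__m128i x)
{
    __m256i lo = _mm256_castsi128_si256(x);   /* upper lane undefined here */
    return _mm256_inserti128_si256(lo, x, 1); /* upper lane = x */
}

/* Option 3: cast xmm to ymm, then select the low lane for both halves. */
AVX2_FN static __m256i bcast_permute(__m128i x)
{
    __m256i lo = _mm256_castsi128_si256(x);
    return _mm256_permute2x128_si256(lo, lo, 0x00); /* imm 0: low lane twice */
}

/* Runs all three variants on the same input and compares the bytes. */
AVX2_FN static int compare_variants(void)
{
    const uint8_t src[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
                              8, 9, 10, 11, 12, 13, 14, 15 };
    uint8_t expect[32], a[32], b[32], c[32];

    memcpy(expect, src, 16);       /* low lane  */
    memcpy(expect + 16, src, 16);  /* high lane */

    __m128i x = _mm_loadu_si128((const __m128i *)src);
    _mm256_storeu_si256((__m256i *)a, bcast_broadcast(x));
    _mm256_storeu_si256((__m256i *)b, bcast_insert(x));
    _mm256_storeu_si256((__m256i *)c, bcast_permute(x));

    return memcmp(a, expect, 32) == 0 &&
           memcmp(b, expect, 32) == 0 &&
           memcmp(c, expect, 32) == 0;
}

/* Returns 1 if all variants agree, 0 if not, -1 if the CPU lacks AVX2. */
int check_broadcast_variants(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;
    return compare_variants();
}
```

This only shows that the three forms are functionally equivalent. It says nothing about latency or port usage, so the disassembly and timings still have to be checked on the actual target.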
"Even the intrinsic takes a value type, which the compiler has to save first, to reload, crazy."
I just compiled a very simple example using _mm256_broadcastsi128_si256 with the latest Intel compiler, and the output looks better than what you describe (which compiler are you using, btw?), but the result is quite strange (*):
vinserti128 ymm2, ymm1, xmm1, 1
It looks like a sensible choice, since ymm2 is completely overwritten by this instruction (no partial register update issue).
* It's strange because, according to the Intrinsics Guide (v2.8.1), this intrinsic should generate the vbroadcasti128 instruction.