With 4 packed float (__m128), I can use the SSE intrinsic
__m128 H = _mm_shuffle_ps(X,X,_MM_SHUFFLE(3,3,3,3));
to set all elements of H to element 3 of X, counting from 0 (is this the fastest way?)
Now, I want to do the same with AVX and 4 packed double (__m256d). I naively coded
__m256d H = _mm256_shuffle_pd(X,X,_MM_SHUFFLE(3,3,3,3));
but this doesn't do the right thing! Instead it sets H={X[1],X[1],X[3],X[3]}.
So, how to do it right?
PS. using Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
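The observed result follows from how the 256-bit shuffle decodes its immediate: only the low four bits are used, one per output element, and each bit chooses between the two doubles of the same 128-bit lane, so no value ever crosses a lane boundary. Below is a minimal scalar sketch of that selection rule in portable C. The name `shuffle_pd_model` is made up for illustration and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the selection rule of _mm256_shuffle_pd(a, b, imm).
 * Hypothetical helper for illustration only; it uses no intrinsics.
 * Bit i of imm controls output element i:
 *   dst[0] = imm bit 0 ? a[1] : a[0]
 *   dst[1] = imm bit 1 ? b[1] : b[0]
 *   dst[2] = imm bit 2 ? a[3] : a[2]
 *   dst[3] = imm bit 3 ? b[3] : b[2]
 * Bits 4..7 of imm are ignored. */
void shuffle_pd_model(const double a[4], const double b[4],
                      unsigned imm, double dst[4])
{
    assert(a && b && dst);
    for (size_t lane = 0; lane < 2; ++lane) {
        size_t base = 2 * lane;  /* first element of this 128-bit lane */
        dst[base]     = a[base + ((imm >> base) & 1u)];
        dst[base + 1] = b[base + ((imm >> (base + 1)) & 1u)];
    }
}
```

Calling it with X for both inputs and `imm = 0xFF` (the value of `_MM_SHUFFLE(3,3,3,3)`) yields {X[1], X[1], X[3], X[3]}, which matches the behaviour described in the question.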
2 Replies
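I think you need three instructions with AVX. Let
H0 = _mm256_shuffle_pd(X, X, 0x0); H1 = _mm256_shuffle_pd(X, X, 0xF);
so that H0 = {X[0],X[0],X[2],X[2]} and H1 = {X[1],X[1],X[3],X[3]}. Then
_mm256_permute2f128_pd(H1, H0, mask); with mask set to 0x00, 0x11, 0x22 or 0x33 broadcasts X[1], X[3], X[0] or X[2] respectively.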
Or, if you have the 4 DP values in memory, you can use the 128-bit AVX form of vmovddup to load one DP value into both halves of a __m128d, use the cast intrinsic (_mm256_castpd128_pd256) to promote it to a __m256d, and feed that to vperm2f128 with mask 0x0. That is two instructions if the data is in memory. If the data is in a register, one extra instruction may still be needed, making it three.
With AVX2, one instruction will do.
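The memory path and the AVX2 path from this reply can be sketched as follows. This is a minimal sketch under stated assumptions, not a tuned implementation: it assumes GCC or Clang on x86 (the `target` attribute is used so that no `-mavx`/`-mavx2` flag is required), and the function names are hypothetical. The AVX2 variant uses `_mm256_permute4x64_pd` (vpermpd), which is presumably the single instruction meant. The Xeon E5-2670 from the question is a Sandy Bridge part, which supports AVX but not AVX2, so only the first function would run there.

```c
#include <assert.h>
#include <immintrin.h>

/* AVX, value in memory: duplicating 128-bit load, cast, then copy the
 * low lane into both lanes. Function name is hypothetical. */
__attribute__((target("avx")))
void broadcast_from_memory_avx(const double *p, double out[4])
{
    assert(p && out);
    __m128d lo = _mm_loaddup_pd(p);          /* {*p, *p} (vmovddup)      */
    __m256d v  = _mm256_castpd128_pd256(lo); /* upper lane is undefined  */
    /* imm 0x00: both result lanes come from the low lane of v, so the
     * undefined upper lane is never read. */
    __m256d H  = _mm256_permute2f128_pd(v, v, 0x00);
    _mm256_storeu_pd(out, H);
}

/* AVX2, value in a register: one cross-lane permute. Each 2-bit field
 * of the immediate picks a source element; 0xFF picks element 3 four
 * times. Function name is hypothetical. */
__attribute__((target("avx2")))
void broadcast_elem3_avx2(const double x[4], double out[4])
{
    assert(x && out);
    __m256d X = _mm256_loadu_pd(x);
    __m256d H = _mm256_permute4x64_pd(X, 0xFF); /* _MM_SHUFFLE(3,3,3,3) */
    _mm256_storeu_pd(out, H);
}
```

For example, `broadcast_from_memory_avx(&x[3], h)` fills `h` with four copies of `x[3]`. When the value really does sit in memory, `_mm256_broadcast_sd` (not mentioned in the thread) is an AVX alternative that does the whole load-and-broadcast in one instruction.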
Shouldn't two instructions do it instead of three? If you want to broadcast one value to all four elements, only one shuffle is needed to broadcast the value within its own lane. Then you copy this lane to the other lane using _mm256_permute2f128_pd as suggested. (For copying the lower to the upper lane, _mm256_broadcast_pd might be a simpler alternative, although it takes its 128-bit source from memory.)
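A minimal sketch of that two-instruction sequence for the element asked about in the question, X[3], with the value already in a register. It assumes GCC or Clang on an AVX-capable x86 CPU (the `target` attribute avoids needing `-mavx`), and the function name is hypothetical. The loads and stores are only there to make the example callable from plain C.

```c
#include <assert.h>
#include <immintrin.h>

/* Broadcast X[3] to all four elements using plain AVX:
 * one in-lane shuffle followed by one cross-lane permute.
 * Function name is hypothetical. */
__attribute__((target("avx")))
void broadcast_elem3_avx(const double x[4], double out[4])
{
    assert(x && out);
    __m256d X = _mm256_loadu_pd(x);
    /* Step 1, in-lane: imm 0xF selects the odd element of each 128-bit
     * lane, giving {X[1], X[1], X[3], X[3]}. */
    __m256d t = _mm256_shuffle_pd(X, X, 0xF);
    /* Step 2, cross-lane: imm 0x11 copies the upper lane of t into both
     * lanes, giving {X[3], X[3], X[3], X[3]}. */
    __m256d H = _mm256_permute2f128_pd(t, t, 0x11);
    _mm256_storeu_pd(out, H);
}
```

The other elements follow the same pattern: a shuffle immediate of 0x0 or 0xF picks the even or odd element within each lane, and a permute immediate of 0x00 or 0x11 picks the lower or upper lane.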
