Solved: Quote:King Crimson wrote:

King_Crimson · 12-11-2014

Dear forum,

Here is the simple pseudo-code I'm trying to implement:
1. Load 8 float64 numbers from addr to create vec_a.
2. Load the next, adjacent 8 float64 numbers from (addr+64) to create vec_b.
3. Perform an operation on the first element of vec_a and the first element of vec_b that returns a scalar f = f(vec_a[0], vec_b[0]) (f is several subtractions and divisions).
4. Perform the element-by-element operation vec_a[i] + f * vec_b[i], i=1...7.
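For reference, the four steps can be sketched in plain scalar C, which is handy for checking a vectorized version against. This is only a sketch: the names f_scalar and update_block are hypothetical, and the formula inside f_scalar is an assumed placeholder, since the post only says that f consists of several subtractions and divisions. The loop covers all 8 lanes, as a full-width fused multiply-add would.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for f: the real formula is not given in the post,
   so this particular expression is an assumption. */
static double f_scalar(double a0, double b0)
{
    return (a0 - b0) / b0;
}

/* Steps 1-4 for one pair of adjacent 8-element blocks starting at addr. */
static void update_block(const double *addr, double *out)
{
    assert(addr != NULL && out != NULL);

    const double *vec_a = addr;      /* step 1: first 8 doubles */
    const double *vec_b = addr + 8;  /* step 2: next 8 doubles, i.e. addr + 64 bytes */

    /* step 3: scalar computed from the first element of each block */
    double f = f_scalar(vec_a[0], vec_b[0]);

    /* step 4: element-wise vec_a + f * vec_b over all 8 lanes */
    for (size_t i = 0; i < 8; ++i)
        out[i] = vec_a[i] + f * vec_b[i];
}
```

Calling update_block(addr, out) with addr pointing at 16 consecutive doubles fills out with the 8 results.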

My implementation of the above is:
1. _mm512_load_pd
2. another _mm512_load_pd
3. Use _mm512_store_pd to copy both vectors to two statically allocated arrays and perform the scalar arithmetic f = f(vec_a_on_stack[0], vec_b_on_stack[0]), then use _mm512_set1_pd to broadcast the scalar float64 into a vector.
4. _mm512_fmadd_pd

The performance gain seems marginal. Could you please advise on step 3? Is there an elegant way to avoid copying the registers to local arrays and instead work directly on the registers? I notice from the compiler manual that swizzle can be used for broadcast, but it does not seem able to broadcast a single element to all 8 float64 slots.
I appreciate any advice.

Alastair_M_ · 12-15-2014

Hi King Crimson,

Have you looked at _mm512_extload_pd()? You should be able to set the broadcast enum to _MM_BROADCAST1x8, and I think it will achieve what you want, if I understand correctly. I think that your next _mm512_load_pd() on the same address will hit the cache.

Alternatively, I think you can use _mm512_permutevar_epi32 to broadcast the first 64-bit element to all other lanes, like this:

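#include <immintrin.h>

/* 32-bit lane indices: lanes 0 and 1 hold the two halves of the first double. */
const __m512i broadcast = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
const __m512d test = {0.1,0.2,0.3,0.4,0.5,0.6,0.7};

int main(void)
{
        /* Permute as 32-bit integers, then cast back to double. */
        __m512d a = _mm512_castsi512_pd(_mm512_permutevar_epi32(broadcast,_mm512_castpd_si512(test)));
        (void)a; /* a is only computed for illustration */
        return 0;
}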
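To see why the index pattern {0,1,0,1,...} broadcasts the first double, the lane selection can be modeled in portable C. This is only a sketch of the selection rule dst[i] = src[idx[i]] over sixteen 32-bit lanes, not the intrinsic itself; the names permute32_model and broadcast_first_double are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Plain-C model of a 16-lane 32-bit permute: dst[i] = src[idx[i] & 15].
   It mirrors the lane selection only and uses no vector registers. */
static void permute32_model(const uint32_t idx[16], const uint32_t src[16],
                            uint32_t dst[16])
{
    for (int i = 0; i < 16; ++i)
        dst[i] = src[idx[i] & 15u];
}

/* Copy the first double of v into all 8 output slots via the 0,1,0,1,... pattern. */
static void broadcast_first_double(const double v[8], double out[8])
{
    static const uint32_t idx[16] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
    uint32_t src[16], dst[16];

    assert(v != NULL && out != NULL);

    memcpy(src, v, sizeof src);      /* view the 8 doubles as 16 32-bit lanes */
    permute32_model(idx, src, dst);  /* each output pair is source lanes 0 and 1 */
    memcpy(out, dst, sizeof dst);    /* view the 16 lanes as 8 doubles again */
}
```

Each 64-bit element spans two adjacent 32-bit lanes, so repeating the pair (0, 1) eight times reproduces both halves of the first double in every slot.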

Best regards,

Alastair

question on intrinsics