Dear forum,

Here is some simple pseudo-code I'm trying to implement:

1. load 8 float64 numbers from addr to create vec_a
2. load the next, adjacent 8 float64 numbers from (addr+64) to create vec_b
3. perform an operation on the first element of vec_a and the first element of vec_b and return a scalar f=f(vec_a[0], vec_b[0]) (f is several subtraction and division operations)
4. perform the element-by-element operation vec_a[i] + f * vec_b[i], i=1...7

My implementation of the above is:

1. _mm512_load_pd
2. another _mm512_load_pd
3. use _mm512_store_pd to copy both vectors to two statically allocated arrays, and perform the scalar arithmetic f=f(vec_a_on_stack[0], vec_b_on_stack[0]). Then use _mm512_set1_pd to broadcast the scalar float64 to form a vector
4. _mm512_fmadd_pd

The performance gain seems marginal. Could you please advise on step 3? Is there an elegant way to avoid copying the registers to local arrays and instead work directly on the registers? I notice from the compiler manual that swizzle can be used for broadcast, but it does not seem able to broadcast a single element to all 8 float64 slots.

I appreciate any advice.
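As a reference point, the four steps above can be sketched in plain C. This is only a scalar model for checking a vectorized version against, not the poster's code. The names `scalar_f` and `update_reference` are hypothetical, and the formula inside `scalar_f` is an assumption: the post only says that f consists of several subtractions and divisions. The sketch updates all 8 lanes, which is what a full-width `_mm512_fmadd_pd` would do, even though the post lists i=1...7.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for f. The post does not give the formula,
   so this particular expression is an assumption. */
double scalar_f(double a0, double b0)
{
    assert(b0 != 0.0);               /* this placeholder divides by b0 */
    return (a0 - b0) / b0;
}

/* Steps 1-4 in scalar form. p points at 16 contiguous doubles:
   p[0..7] plays the role of vec_a, p[8..15] of vec_b (addr + 64 bytes). */
void update_reference(const double *p, double *out)
{
    const double *vec_a = p;         /* step 1 */
    const double *vec_b = p + 8;     /* step 2 */
    const double f = scalar_f(vec_a[0], vec_b[0]);   /* step 3 */

    for (size_t i = 0; i < 8; ++i)   /* step 4, all 8 lanes */
        out[i] = vec_a[i] + f * vec_b[i];
}
```

Calling `update_reference(p, out)` on the same 16 doubles as the intrinsic version gives values to compare lane by lane.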

Hi King Crimson,

Have you looked at _mm512_extload_pd()? You should be able to set the broadcast enum to _MM_BROADCAST_1X8, and I think it will achieve what you want, if I understand correctly. I think that your next _mm512_load_pd() on the same address will hit the cache.

Alternatively, I think you can use _mm512_permutevar_epi32 to broadcast the first 64-bit element to all other lanes, like this:

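    #include <immintrin.h>

    /* Index vector: 32-bit words 0 and 1 (the first double) repeated for each 64-bit lane.
       Brace initialization of vector types is compiler-specific. */
    const __m512i broadcast = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
    const __m512d test = {0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8};

    int main()
    {
        /* Reinterpret as 32-bit words, permute, reinterpret back as doubles. */
        __m512d a = _mm512_castsi512_pd(_mm512_permutevar_epi32(broadcast, _mm512_castpd_si512(test)));
    }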
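To see why the index pattern 0,1,0,1,... broadcasts the first double, here is a rough plain-C model of a 16-word permute. It runs without any vector hardware. The function name `permute_words_model` is hypothetical, and masking each index to its low 4 bits is an assumption about how out-of-range indices are treated.

```c
#include <stdint.h>
#include <string.h>

/* Model of a 16 x 32-bit variable permute: output word i is input word idx[i].
   The 8 doubles are viewed as 16 32-bit words, so words 0 and 1 together
   make up the first double in memory order. */
void permute_words_model(const int32_t idx[16], const double src[8], double dst[8])
{
    uint32_t in[16], out[16];

    memcpy(in, src, sizeof in);          /* view the doubles as words */
    for (int i = 0; i < 16; ++i)
        out[i] = in[idx[i] & 15];        /* assumed: only the low 4 index bits matter */
    memcpy(dst, out, sizeof out);        /* view the words as doubles again */
}
```

With the index array set to 0,1 repeated eight times, every output double is assembled from words 0 and 1, so each lane ends up holding a copy of `src[0]`.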

Best regards,

Alastair
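For comparison, here is a minimal sketch of step 3 done without an explicit store to local arrays. It assumes a compiler and CPU with AVX-512F, which is not necessarily what this thread targets: the mentions of swizzle and _mm512_extload_pd suggest the Xeon Phi coprocessor instruction set, where some of these intrinsics may not be available. The names `scalar_f_reg` and `update_in_registers` are hypothetical, and the formula for f is a placeholder because the post does not give it. The unaligned loads are a choice made for the sketch; the original post uses the aligned _mm512_load_pd. When AVX-512F is not enabled, the function falls back to scalar code so that it still builds.

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical placeholder for f; the real formula is not in the post. */
double scalar_f_reg(double a0, double b0)
{
    return (a0 - b0) / b0;
}

/* p points at 16 contiguous doubles: vec_a = p[0..7], vec_b = p[8..15]. */
void update_in_registers(const double *p, double *out)
{
#if defined(__AVX512F__)
    __m512d va = _mm512_loadu_pd(p);
    __m512d vb = _mm512_loadu_pd(p + 8);

    /* Read lane 0 from each register instead of storing all 8 lanes. */
    double a0 = _mm_cvtsd_f64(_mm512_castpd512_pd128(va));
    double b0 = _mm_cvtsd_f64(_mm512_castpd512_pd128(vb));

    /* Broadcast the scalar result, then compute vf * vb + va in one FMA. */
    __m512d vf = _mm512_set1_pd(scalar_f_reg(a0, b0));
    _mm512_storeu_pd(out, _mm512_fmadd_pd(vf, vb, va));
#else
    /* Scalar fallback with the same result layout. */
    const double f = scalar_f_reg(p[0], p[8]);
    for (size_t i = 0; i < 8; ++i)
        out[i] = p[i] + f * p[8 + i];
#endif
}
```

Whether this is faster than the store-based version depends on the compiler and the hardware, so it is worth inspecting the generated assembly and measuring both.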
