
## question on intrinsics

Dear forum,

Here is the simple pseudocode I am trying to implement:
3. perform an operation on the first element of vec_a and the first element of vec_b that returns a scalar f=f(vec_a[0], vec_b[0]) (f consists of several subtraction and division operations)
4. perform the element-by-element operation vec_a[i] + f * vec_b[i], i=1...7

My implementation of the above is:
3. use _mm512_store_pd to copy both vectors to two statically allocated arrays and perform the scalar arithmetic f=f(vec_a_on_stack[0], vec_b_on_stack[0]), then use _mm512_set1_pd to broadcast the scalar float64 into a vector

The performance gain seems marginal. Could you please advise on step 3? Is there an elegant way to avoid copying the registers to local arrays and work directly on the registers instead? I notice from the compiler manual that swizzle can be used for broadcast, but it does not seem able to broadcast a single element to all 8 float64 slots.

1 Solution

Hi King Crimson,

Have you looked at _mm512_extload_pd()? You should be able to set the broadcast enum to _MM_BROADCAST_1X8, which I think will achieve what you want, if I understand correctly. I think your next _mm512_load_pd() on the same address will then hit the cache.
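To make the broadcast-load idea concrete, here is a minimal sketch. It rests on several assumptions: it targets AVX-512F with GCC or Clang rather than the coprocessor instruction set that _mm512_extload_pd belongs to, it uses _mm512_set1_pd on a dereferenced pointer as the stand-in for the 1-to-8 broadcast load, and f is a hypothetical placeholder, (a0 - b0) / b0, since the thread does not give the real formula. The function names are made up for illustration. It also updates all eight lanes, whereas the question mentions i=1...7.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define BCAST_LOAD_X86 1
#endif

/* Scalar reference. f is a hypothetical placeholder: (a0 - b0) / b0. */
static void update_bcast_load_scalar(const double *a, const double *b, double *out)
{
    const double f = (a[0] - b[0]) / b[0];
    for (int i = 0; i < 8; ++i)
        out[i] = a[i] + f * b[i];
}

#ifdef BCAST_LOAD_X86
/* AVX-512F path; the target attribute avoids needing -mavx512f globally. */
static __attribute__((target("avx512f")))
void update_bcast_load_avx512(const double *a, const double *b, double *out)
{
    const __m512d va = _mm512_loadu_pd(a);
    const __m512d vb = _mm512_loadu_pd(b);
    /* Element 0 of each array replicated into all 8 lanes straight from memory. */
    const __m512d a0 = _mm512_set1_pd(a[0]);
    const __m512d b0 = _mm512_set1_pd(b[0]);
    /* f is evaluated redundantly in every lane, so no scalar round trip is needed. */
    const __m512d f = _mm512_div_pd(_mm512_sub_pd(a0, b0), b0);
    /* out = f * vb + va */
    _mm512_storeu_pd(out, _mm512_fmadd_pd(f, vb, va));
}
#endif

/* Picks the vector path only when the CPU reports AVX-512F support. */
void update_bcast_load(const double *a, const double *b, double *out)
{
#ifdef BCAST_LOAD_X86
    if (__builtin_cpu_supports("avx512f")) {
        update_bcast_load_avx512(a, b, out);
        return;
    }
#endif
    update_bcast_load_scalar(a, b, out);
}
```

Calling update_bcast_load(a, b, out) with three arrays of eight doubles fills out with a[i] + f * b[i]. Because f is computed in vector form, the only extra memory traffic is the two broadcast loads, which read addresses the full loads have just touched.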

Alternatively, I think you can use _mm512_permutevar_epi32 to broadcast the first 64-bit element to all other lanes, like this:

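#include <immintrin.h>

/* Seven initializers are given, so the eighth lane is zero-initialized. */
const __m512d test = {0.1,0.2,0.3,0.4,0.5,0.6,0.7};

int main()
{
    /* One way to do the broadcast described above (a sketch, not tested on the coprocessor). */
    /* 32-bit indices 0,1 repeated: both halves of the first 64-bit element for every lane. */
    const __m512i idx = _mm512_set_epi32(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0);
    /* Reinterpret as 32-bit lanes, permute, reinterpret back: every lane holds test[0]. */
    const __m512d bcast = _mm512_castsi512_pd(
        _mm512_permutevar_epi32(idx, _mm512_castpd_si512(test)));
    (void)bcast;
    return 0;
}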
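For comparison, the in-register broadcast can be combined with steps 3 and 4 from the question as sketched here. This is an illustration under stated assumptions, not the method from the thread: it targets AVX-512F with GCC or Clang, it swaps the permute for _mm512_broadcastsd_pd, which replicates lane 0 directly, and f is again a hypothetical placeholder, (a0 - b0) / b0. The function names are invented for the example, and all eight lanes are updated.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define IN_REG_X86 1
#endif

/* Scalar reference. f is a hypothetical placeholder: (a0 - b0) / b0. */
static void update_in_register_scalar(const double *a, const double *b, double *out)
{
    const double f = (a[0] - b[0]) / b[0];
    for (int i = 0; i < 8; ++i)
        out[i] = a[i] + f * b[i];
}

#ifdef IN_REG_X86
/* AVX-512F path; the target attribute avoids needing -mavx512f globally. */
static __attribute__((target("avx512f")))
void update_in_register_avx512(const double *a, const double *b, double *out)
{
    const __m512d va = _mm512_loadu_pd(a);
    const __m512d vb = _mm512_loadu_pd(b);
    /* Lane 0 of each register replicated to all 8 lanes, register to register. */
    const __m512d a0 = _mm512_broadcastsd_pd(_mm512_castpd512_pd128(va));
    const __m512d b0 = _mm512_broadcastsd_pd(_mm512_castpd512_pd128(vb));
    /* f lives in every lane of a vector; nothing is stored to a local array. */
    const __m512d f = _mm512_div_pd(_mm512_sub_pd(a0, b0), b0);
    /* out = f * vb + va */
    _mm512_storeu_pd(out, _mm512_fmadd_pd(f, vb, va));
}
#endif

/* Picks the vector path only when the CPU reports AVX-512F support. */
void update_in_register(const double *a, const double *b, double *out)
{
#ifdef IN_REG_X86
    if (__builtin_cpu_supports("avx512f")) {
        update_in_register_avx512(a, b, out);
        return;
    }
#endif
    update_in_register_scalar(a, b, out);
}
```

The loads and the store exist only to give the sketch a testable interface. In a real kernel, va and vb would already be in registers, and the broadcast, the evaluation of f, and the fused multiply-add would all stay there.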

Best regards,

Alastair
