
Performance of Intrinsics masked alignr

Patrick_S_
New Contributor I
1,097 Views

Hey all, is there a performance difference between the masked intrinsic _mm512_mask_alignr_epi32 and the unmasked intrinsic _mm512_alignr_epi32? Or do they have the same latency and throughput? Thanks, Patrick

0 Kudos
6 Replies
Patrick_S_
New Contributor I
1,097 Views

Sorry for the bad layout, I forgot the line breaks.

0 Kudos
Loc_N_Intel
Employee
1,097 Views

Hi Patrick,

I couldn't find any information on a performance difference between _mm512_alignr_epi32 and _mm512_mask_alignr_epi32 in the compiler reference either. Let me investigate this and get back to you. Thank you.

0 Kudos
Loc_N_Intel
Employee
1,097 Views

Hi Patrick,

I wrote a simple program to check for a performance difference between the two intrinsics above. They don't seem to differ in performance at all. Hope this helps. Thank you.

0 Kudos
Patrick_S_
New Contributor I
1,097 Views

Hi Loc,

Thanks for the effort, that's good to know. Does this mean that all other masked instructions have the same performance as their non-masked counterparts? According to the Intel Intrinsics Guide, that seems to be the case for all arithmetic instructions.

I'm using the alignr instruction very often for performing transpose operations like this:

[cpp]
// Input:
// a: 0.0 0.3 0.6 1.0 | 1.3 1.6 2.0 2.3 | 2.6 3.0 3.3 3.6 | 4.0 4.3 4.6 5.0
// d: 0.1 0.4 0.7 1.1 | 1.4 1.7 2.1 2.4 | 2.7 3.1 3.4 3.7 | 4.1 4.4 4.7 5.1
// g: 0.2 0.5 0.8 1.2 | 1.5 1.8 2.2 2.5 | 2.8 3.2 3.5 3.8 | 4.2 4.5 4.8 5.2

// Needed output register:
// t1: 0.0 0.1 0.2 1.0 | 1.1 1.2 2.0 2.1 | 2.2 3.0 3.1 3.2 | 4.0 4.1 4.2 5.0

__m512 a_, d_, g_, t1;  // a_, d_, g_ hold the input rows shown above

// Rotate d up by one lane (count 15) and merge it into lanes 1, 4, 7, 10, 13
// of a (mask 0x2492); all other lanes keep the value from a.
t1 = _mm512_castsi512_ps(
    _mm512_mask_alignr_epi32( _mm512_castps_si512( a_ ), 0x2492,
                              _mm512_castps_si512( d_ ), _mm512_castps_si512( d_ ), 15 ) );

// Rotate g up by two lanes (count 14) and merge it into lanes 2, 5, 8, 11, 14
// of t1 (mask 0x4924).
t1 = _mm512_castsi512_ps(
    _mm512_mask_alignr_epi32( _mm512_castps_si512( t1 ), 0x4924,
                              _mm512_castps_si512( g_ ), _mm512_castps_si512( g_ ), 14 ) );
[/cpp]

I thought using the alignr intrinsic might be better than using, for example, the swizzle operation, which has no corresponding assembly instruction. Or is there a better way to do similar operations on floats?
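For anyone who wants to check the masks and shift counts without the target hardware, here is a minimal scalar sketch of how I understand the masked alignr to behave: concatenate the two sources (first operand high, second operand low), shift right by `count` 32-bit lanes, and write only the lanes whose mask bit is set. The names `Vec16`, `mask_alignr_model`, `stride3_mask` and `interleave_first_third` are hypothetical helpers made up for this illustration, not part of any Intel header, and the model says nothing about performance.

```cpp
#include <array>
#include <cstdint>

using Vec16 = std::array<float, 16>;  // stand-in for one 512-bit register of floats

// Hypothetical scalar model of _mm512_mask_alignr_epi32(src, k, a, b, count).
// Lane i of the result is lane (i + count) of the 32-lane concatenation a:b
// (b occupies lanes 0..15, a occupies lanes 16..31) when bit i of k is set;
// otherwise the lane is copied from src.
Vec16 mask_alignr_model(const Vec16& src, std::uint16_t k,
                        const Vec16& a, const Vec16& b, unsigned count) {
    count &= 15u;  // assumption: only the low 4 bits of the count are used
    Vec16 out = src;
    for (unsigned i = 0; i < 16; ++i) {
        if ((k >> i) & 1u) {
            const unsigned j = i + count;  // index into the concatenation
            out[i] = (j < 16) ? b[j] : a[j - 16];
        }
    }
    return out;
}

// Builds a mask with every third bit set, starting at bit `offset`.
// stride3_mask(1) gives 0x2492 and stride3_mask(2) gives 0x4924.
std::uint16_t stride3_mask(unsigned offset) {
    std::uint16_t k = 0;
    for (unsigned i = offset; i < 16; i += 3) {
        k = static_cast<std::uint16_t>(k | (1u << i));
    }
    return k;
}

// Same two steps as in the snippet above, expressed with the scalar model.
Vec16 interleave_first_third(const Vec16& a, const Vec16& d, const Vec16& g) {
    Vec16 t1 = mask_alignr_model(a, stride3_mask(1), d, d, 15);
    return mask_alignr_model(t1, stride3_mask(2), g, g, 14);
}
```

Feeding the three input rows from the snippet above into `interleave_first_third` should reproduce the t1 row listed there, since lanes 1, 4, 7, 10, 13 come from d shifted up by one lane and lanes 2, 5, 8, 11, 14 come from g shifted up by two.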

 

Thanks

Patrick

0 Kudos
Kevin_D_Intel
Employee
1,097 Views

Our developer confirmed there should be no performance difference between masked and non-masked intrinsics (which map to the corresponding instructions with or without a writemask register). He also indicated that your use of the alignr intrinsic looks good, and he could not think of a more efficient method than yours for doing that transpose.

0 Kudos
Patrick_S_
New Contributor I
1,097 Views

Thank you.

0 Kudos