Performance of masked alignr intrinsics

Patrick_S_ · ‎01-02-2014

Hey all, is there a performance difference between the masked intrinsic _mm512_mask_alignr_epi32 and the unmasked intrinsic _mm512_alignr_epi32? Or do they have the same latency and throughput? Thanks, Patrick

Patrick_S_ · ‎01-02-2014

Sorry for the bad layout, I forgot the line breaks.

Loc_N_Intel · ‎01-03-2014

Hi Patrick,

I didn't find any information on the performance difference between _mm512_alignr_epi32 and _mm512_mask_alignr_epi32 in the compiler reference either. Let me investigate this and get back to you. Thank you.

Loc_N_Intel · ‎01-06-2014

Hi Patrick,

I wrote a simple program to check for a performance difference between the two intrinsics above. They don't seem to differ in performance at all. Hope this helps. Thank you.

Patrick_S_ · ‎01-06-2014

Hi Loc,

Thanks for that effort. That's nice to know. Does this mean that all other masked instructions have the same performance as their unmasked counterparts? According to the Intel Intrinsics Guide, that seems to be the case for all arithmetic instructions.

I use the alignr instruction very often to perform transpose operations like this:

[cpp]
// input:
// a:  0.0 0.3 0.6 1.0 | 1.3 1.6 2.0 2.3 | 2.6 3.0 3.3 3.6 | 4.0 4.3 4.6 5.0
// d:  0.1 0.4 0.7 1.1 | 1.4 1.7 2.1 2.4 | 2.7 3.1 3.4 3.7 | 4.1 4.4 4.7 5.1
// g:  0.2 0.5 0.8 1.2 | 1.5 1.8 2.2 2.5 | 2.8 3.2 3.5 3.8 | 4.2 4.5 4.8 5.2

// needed output register:
// t1: 0.0 0.1 0.2 1.0 | 1.1 1.2 2.0 2.1 | 2.2 3.0 3.1 3.2 | 4.0 4.1 4.2 5.0

__m512 a_, d_, g_, t1;  // a_, d_, g_ are loaded elsewhere

// Mask 0x2492 selects lanes 1, 4, 7, 10, 13; count 15 gives lane i the value d[i-1].
// All other lanes keep the value from a_.
t1 = _mm512_castsi512_ps(
    _mm512_mask_alignr_epi32( _mm512_castps_si512( a_ ), 0x2492,
        _mm512_castps_si512( d_ ), _mm512_castps_si512( d_ ), 15 ) );

// Mask 0x4924 selects lanes 2, 5, 8, 11, 14; count 14 gives lane i the value g[i-2].
// All other lanes keep the value from the previous t1.
t1 = _mm512_castsi512_ps(
    _mm512_mask_alignr_epi32( _mm512_castps_si512( t1 ), 0x4924,
        _mm512_castps_si512( g_ ), _mm512_castps_si512( g_ ), 14 ) );
[/cpp]
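The mask and shift constants above can be checked without the hardware using a scalar model of the masked alignr semantics. This is only a minimal sketch: the helper names `mask_alignr_model` and `interleave_example` are hypothetical, and the lane values are encoded as integers (a[i] = 10*i, d[i] = 10*i + 1, g[i] = 10*i + 2) purely for illustration instead of the 0.0 / 0.1 / 0.2 notation used in the comments.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16 = std::array<float, 16>;

// Scalar model of a masked alignr on 16 lanes of 32 bits (hypothetical helper,
// not an Intel API). Lanes whose mask bit is set take element (i + count) of
// the 32-element concatenation hi:lo, where lo occupies positions 0..15.
// Lanes whose mask bit is clear keep the value from src.
Vec16 mask_alignr_model(const Vec16& src, std::uint16_t mask,
                        const Vec16& hi, const Vec16& lo, int count) {
    assert(count >= 0 && count < 16);
    Vec16 out = src;
    for (int i = 0; i < 16; ++i) {
        if ((mask >> i) & 1u) {
            int j = i + count;
            out[i] = (j < 16) ? lo[j] : hi[j - 16];
        }
    }
    return out;
}

// Reproduces the two-step interleave from the post with assumed test data.
Vec16 interleave_example() {
    Vec16 a{}, d{}, g{};
    for (int i = 0; i < 16; ++i) {
        a[i] = 10.0f * i;         // stands in for row a
        d[i] = 10.0f * i + 1.0f;  // stands in for row d
        g[i] = 10.0f * i + 2.0f;  // stands in for row g
    }
    Vec16 t1 = mask_alignr_model(a, 0x2492, d, d, 15);  // lanes 1,4,7,10,13 <- d[i-1]
    t1 = mask_alignr_model(t1, 0x4924, g, g, 14);       // lanes 2,5,8,11,14 <- g[i-2]
    return t1;
}
```

Under this encoding, lane i of the result should hold 10*(i - i%3) + i%3, which corresponds to the a, d, g triples in the "needed output" comment.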

I thought using the alignr intrinsics might be better than using, e.g., the swizzle operation, which has no corresponding assembly instruction. Or is there a better way to do similar operations on floats?

Thanks

Patrick

Kevin_D_Intel · ‎01-09-2014

Our developer confirmed there should be no performance difference between masked and non-masked intrinsics (which map to the corresponding instructions with or without a writemask register). He also indicated that your use of the alignr intrinsic looks good, and he could not think of a more efficient method than yours for doing that transpose.
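To illustrate the "with or without writemask" relationship at the semantic level (this says nothing about timing), here is a small scalar sketch. The names `alignr_ref`, `mask_alignr_ref`, and `full_mask_matches_unmasked` are hypothetical helpers, not Intel APIs. It checks that, in this model, the masked form with an all-ones mask produces the same lanes as the unmasked form.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lanes = std::array<std::int32_t, 16>;

// Unmasked reference: lane i takes element (i + count) of the concatenation
// hi:lo, where lo occupies positions 0..15 and hi positions 16..31.
Lanes alignr_ref(const Lanes& hi, const Lanes& lo, int count) {
    assert(count >= 0 && count < 16);
    Lanes out{};
    for (int i = 0; i < 16; ++i) {
        int j = i + count;
        out[i] = (j < 16) ? lo[j] : hi[j - 16];
    }
    return out;
}

// Masked reference: same data movement, but lanes with a clear mask bit
// keep the value from src.
Lanes mask_alignr_ref(const Lanes& src, std::uint16_t mask,
                      const Lanes& hi, const Lanes& lo, int count) {
    Lanes shifted = alignr_ref(hi, lo, count);
    Lanes out = src;
    for (int i = 0; i < 16; ++i) {
        if ((mask >> i) & 1u) out[i] = shifted[i];
    }
    return out;
}

// Returns true when an all-ones mask reproduces the unmasked result
// for every shift count from 0 to 15.
bool full_mask_matches_unmasked() {
    Lanes hi{}, lo{}, src{};
    for (int i = 0; i < 16; ++i) {
        hi[i] = 100 + i;
        lo[i] = i;
        src[i] = -1;
    }
    for (int count = 0; count < 16; ++count) {
        if (mask_alignr_ref(src, 0xFFFF, hi, lo, count) != alignr_ref(hi, lo, count)) {
            return false;
        }
    }
    return true;
}
```

With a zero mask the same model returns src unchanged, which is the other extreme of the writemask behavior.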

Patrick_S_ · ‎01-09-2014

Thank you.