Hey all, is there a performance difference between the masked intrinsic _mm512_mask_alignr_epi32 and the unmasked intrinsic _mm512_alignr_epi32? Or do they have the same latency and throughput? Thanks, Patrick
Sorry for the bad layout; I forgot the line breaks.
Hi Patrick,
I couldn't find any documented performance difference between _mm512_alignr_epi32 and _mm512_mask_alignr_epi32 in the compiler reference either. Let me investigate this and get back to you. Thank you.
Hi Patrick,
I wrote a simple program to compare the performance of the two intrinsics above. They don't seem to differ in performance at all. Hope this helps. Thank you.
Hi Ioc,
Thanks for that effort. That's nice to know. Does this mean that all other masked instructions have the same performance as their unmasked counterparts? It seems to be the case for all arithmetic instructions, according to the Intel Intrinsics Guide.
I use the alignr instruction very often to perform transpose operations like this:
[cpp]
//input:
//a: 0.0 0.3 0.6 1.0 | 1.3 1.6 2.0 2.3 | 2.6 3.0 3.3 3.6 | 4.0 4.3 4.6 5.0
//d: 0.1 0.4 0.7 1.1 | 1.4 1.7 2.1 2.4 | 2.7 3.1 3.4 3.7 | 4.1 4.4 4.7 5.1
//g: 0.2 0.5 0.8 1.2 | 1.5 1.8 2.2 2.5 | 2.8 3.2 3.5 3.8 | 4.2 4.5 4.8 5.2
//needed output registers:
//t1: 0.0 0.1 0.2 1.0 | 1.1 1.2 2.0 2.1 | 2.2 3.0 3.1 3.2 | 4.0 4.1 4.2 5.0
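__m512 a_, d_, g_, t1; // a_, d_, g_ hold the inputs a, d, g shown above
// Step 1: rotate d_ up by one lane (alignr of d_:d_ by 15) and merge the
// result into lanes 1, 4, 7, 10, 13 of a_ (mask 0x2492).
t1 = _mm512_castsi512_ps(
    _mm512_mask_alignr_epi32( _mm512_castps_si512( a_ ), 0x2492,
        _mm512_castps_si512( d_ ), _mm512_castps_si512( d_ ), 15 ) );
// Step 2: rotate g_ up by two lanes (alignr of g_:g_ by 14) and merge the
// result into lanes 2, 5, 8, 11, 14 of t1 (mask 0x4924).
t1 = _mm512_castsi512_ps(
    _mm512_mask_alignr_epi32( _mm512_castps_si512( t1 ), 0x4924,
        _mm512_castps_si512( g_ ), _mm512_castps_si512( g_ ), 14 ) );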
[/cpp]
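For readers without the hardware at hand, the two masked alignr steps above can be checked with a plain scalar model. This is only a sketch: `alignr_model`, `mask_alignr_model`, and `transpose_step_matches_post` are hypothetical helper names, not Intel APIs, and the model assumes the usual alignr semantics (concatenate the two sources, shift right by `count` 32-bit elements, keep the low 16 elements; masked-off lanes keep the value from the first argument). It says nothing about latency or throughput.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16 = std::array<float, 16>;

// Scalar model of an unmasked alignr on 32-bit lanes (assumed semantics):
// concatenate hi:lo (lo in lanes 0..15, hi in lanes 16..31), shift right by
// `count` lanes, and keep the low 16 lanes.
Vec16 alignr_model(const Vec16& hi, const Vec16& lo, unsigned count) {
    assert(count < 16);
    Vec16 out{};
    for (unsigned i = 0; i < 16; ++i) {
        unsigned j = i + count;  // index into the 32-lane concatenation
        out[i] = (j < 16) ? lo[j] : hi[j - 16];
    }
    return out;
}

// Masked variant: lane i takes the shifted value when bit i of k is set,
// otherwise it keeps src[i].
Vec16 mask_alignr_model(const Vec16& src, std::uint16_t k,
                        const Vec16& hi, const Vec16& lo, unsigned count) {
    Vec16 shifted = alignr_model(hi, lo, count);
    Vec16 out = src;
    for (unsigned i = 0; i < 16; ++i) {
        if ((k >> i) & 1u) out[i] = shifted[i];
    }
    return out;
}

// Replays the two steps from the post on the example inputs a, d, g and
// compares the result with the "needed output" t1 listed there.
bool transpose_step_matches_post() {
    const Vec16 a = {0.0f, 0.3f, 0.6f, 1.0f, 1.3f, 1.6f, 2.0f, 2.3f,
                     2.6f, 3.0f, 3.3f, 3.6f, 4.0f, 4.3f, 4.6f, 5.0f};
    const Vec16 d = {0.1f, 0.4f, 0.7f, 1.1f, 1.4f, 1.7f, 2.1f, 2.4f,
                     2.7f, 3.1f, 3.4f, 3.7f, 4.1f, 4.4f, 4.7f, 5.1f};
    const Vec16 g = {0.2f, 0.5f, 0.8f, 1.2f, 1.5f, 1.8f, 2.2f, 2.5f,
                     2.8f, 3.2f, 3.5f, 3.8f, 4.2f, 4.5f, 4.8f, 5.2f};
    const Vec16 expected = {0.0f, 0.1f, 0.2f, 1.0f, 1.1f, 1.2f, 2.0f, 2.1f,
                            2.2f, 3.0f, 3.1f, 3.2f, 4.0f, 4.1f, 4.2f, 5.0f};

    Vec16 t1 = mask_alignr_model(a, 0x2492, d, d, 15);  // lanes 1,4,7,10,13 <- d
    t1 = mask_alignr_model(t1, 0x4924, g, g, 14);       // lanes 2,5,8,11,14 <- g
    return t1 == expected;
}
```

Because both sources of each alignr are the same register, the shift acts as a lane rotation: a count of 15 moves lane i-1 into lane i, and a count of 14 moves lane i-2 into lane i.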
I thought using the alignr intrinsics might be better than using, e.g., the swizzle operation, which has no corresponding assembly instruction. Or is there a better way to do similar operations on floats?
Thanks
Patrick
Our developer confirmed that there should be no performance difference between masked and non-masked intrinsics (which map to the corresponding instructions with or without a writemask register). He also indicated that your use of the alignr intrinsic looks good, and he could not think of a more efficient method than yours for doing that transpose.
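One way to picture "the same instruction with or without a writemask" is as a per-lane merge applied to the unmasked result. The sketch below is a conceptual model only, not a statement about how the hardware is implemented; `merge_mask` and `Lanes` are hypothetical names introduced for illustration. Under this model, the unmasked form behaves like the masked form with an all-ones mask (0xFFFF).

```cpp
#include <array>
#include <cstdint>

using Lanes = std::array<std::int32_t, 16>;

// Conceptual write-mask merge: lane i takes result[i] when bit i of k is set,
// otherwise it keeps the old value src[i].
Lanes merge_mask(const Lanes& src, std::uint16_t k, const Lanes& result) {
    Lanes out = src;
    for (unsigned i = 0; i < 16; ++i) {
        if ((k >> i) & 1u) out[i] = result[i];
    }
    return out;
}
```

With `k = 0xFFFF` every lane is overwritten, so the output equals the unmasked result; with `k = 0` the output equals `src`.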
Thank you.
