I need to load strided data into an AVX512 vector register. What is the best way to do this?
Suppose the stride is 1000 and I need to load the elements at indices 0, 1*1000, 2*1000, 3*1000, 4*1000, 5*1000, 6*1000 and 7*1000 into one AVX512 vector register.
What is the fastest way to do this, and which intrinsic should be used? The data are double-precision floating-point numbers.
Thanks for reaching out to us.
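Could you please try the _mm512_i32gather_pd intrinsic to load the data into an AVX512 register? It takes an index vector together with a scale factor (1, 2, 4 or 8 bytes) that is applied to each index, so the strided positions can be expressed as element indices.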
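As a minimal sketch of the suggestion above, assuming GCC or Clang on x86-64: the helper names gather8 and gather8_avx512 are made up for illustration, and the scalar loop is only a fallback for machines without AVX512F. The indices are 32-bit, so 7*stride must fit in an int.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX512_PATH 1

/* AVX512 path: one gather loads all eight strided doubles. */
__attribute__((target("avx512f")))
static void gather8_avx512(const double *base, int stride, double out[8])
{
    /* Element indices 0, stride, ..., 7*stride as 32-bit integers. */
    const __m256i vidx = _mm256_setr_epi32(0, stride, 2 * stride, 3 * stride,
                                           4 * stride, 5 * stride, 6 * stride,
                                           7 * stride);
    /* Scale 8 = sizeof(double): each index is multiplied by 8 bytes. */
    const __m512d v = _mm512_i32gather_pd(vidx, base, 8);
    _mm512_storeu_pd(out, v);
}
#endif

/* Hypothetical helper: out[k] = base[k * stride] for k = 0..7. */
void gather8(const double *base, int stride, double out[8])
{
    assert(stride > 0);
#ifdef HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512f")) {
        gather8_avx512(base, stride, out);
        return;
    }
#endif
    /* Scalar fallback when AVX512F is not available. */
    for (int k = 0; k < 8; ++k)
        out[k] = base[(size_t)k * (size_t)stride];
}
```

With a stride of 1000, gather8(a, 1000, out) fills out with a[0], a[1000], ..., a[7000], so the array a needs at least 7001 elements.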
Please refer to the link below for more details:
>>What is the fastest way to do this.
For efficiency, do not update the index vector on each iteration. Instead, keep the indices fixed and advance the base address to the first element of the next strided group to load (this saves a vector register).
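The idea of moving the base address rather than the indices could look roughly like the sketch below. It assumes GCC or Clang on x86-64 and a contiguous destination buffer; the names gather_groups and gather_groups_avx512 and the output layout are assumptions for illustration, not taken from the original code.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX512_PATH 1

__attribute__((target("avx512f")))
static void gather_groups_avx512(const double *src, int stride, size_t n,
                                 double *dst)
{
    /* The index vector is built once, outside the loop. */
    const __m256i vidx = _mm256_setr_epi32(0, stride, 2 * stride, 3 * stride,
                                           4 * stride, 5 * stride, 6 * stride,
                                           7 * stride);
    for (size_t j = 0; j < n; ++j) {
        /* Only the base pointer advances; vidx is never modified. */
        const __m512d v = _mm512_i32gather_pd(vidx, src + j, 8);
        _mm512_storeu_pd(dst + 8 * j, v);
    }
}
#endif

/* Hypothetical helper: dst[8*j + k] = src[j + k*stride]
   for j in [0, n) and k in [0, 8). */
void gather_groups(const double *src, int stride, size_t n, double *dst)
{
    assert(stride > 0);
#ifdef HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512f")) {
        gather_groups_avx512(src, stride, n, dst);
        return;
    }
#endif
    /* Scalar fallback when AVX512F is not available. */
    for (size_t j = 0; j < n; ++j)
        for (int k = 0; k < 8; ++k)
            dst[8 * j + k] = src[j + (size_t)k * (size_t)stride];
}
```

Because the index vector is loop-invariant, the loop body needs no integer vector arithmetic, only the gather and the store.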
Thanks & Regards,
Thanks. These intrinsics do what I need, but the performance is not what I was expecting. Copying the strided data using Cilk array notation is faster than using AVX512.
According to the Intel Intrinsics Guide, the latency of a gather is high compared to a load or store. However, I am seeing worse performance on the store operation than on the gather.
__m512d _A0 = _mm512_i64gather_pd(vidx , &AS[source_location], 8);
These copy operations are within nested loops that are parallelized using OpenMP. The location changes in every iteration. I observed that the store operation takes most of the time. Is there a good way to optimize it?