Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Load data in AVX512 vector with strides

Rizwan1
Beginner

Hi,

 

I need to load strided data into an AVX-512 vector register. What is the best way to do this?

Let's suppose the stride is 1000 and I need to load the elements at indices 0, 1*1000, 2*1000, 3*1000, 4*1000, 5*1000, 6*1000 and 7*1000 into one AVX-512 vector register.

What is the fastest way to do this, and which intrinsic should I use? The data are double-precision floating-point numbers.


6 Replies
NoorjahanSk_Intel
Moderator

Hi,


Thanks for reaching out to us.


Could you please try the _mm512_i32gather_pd intrinsic to load the data into an AVX-512 register? It takes an index vector together with a scale factor, so the strided element positions can be passed as indices.

Please refer to the link below for more details:

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#=undefined&ig_expand=3910,4260,6145,3907,3913&text=%25252520_mm512_i32gather
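As a rough illustration of the suggestion above, the sketch below gathers eight doubles spaced `stride` elements apart with _mm512_i32gather_pd. The helper name `load_strided8` and the scalar fallback are assumptions added for illustration, not part of the original reply; the intrinsic path is only compiled when the compiler defines `__AVX512F__` (for example with -mavx512f or -xCORE-AVX512).

```cpp
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: copies src[0], src[stride], ..., src[7*stride]
// into out[0..7]. Assumes 7*stride fits in a 32-bit signed integer.
void load_strided8(const double* src, int stride, double* out) {
#if defined(__AVX512F__)
    // Eight 32-bit element indices: 0, stride, 2*stride, ..., 7*stride.
    const __m256i vidx = _mm256_setr_epi32(
        0, stride, 2 * stride, 3 * stride,
        4 * stride, 5 * stride, 6 * stride, 7 * stride);
    // Scale 8 == sizeof(double): each index is multiplied by 8 to form
    // a byte offset from src.
    const __m512d v = _mm512_i32gather_pd(vidx, src, 8);
    _mm512_storeu_pd(out, v);
#else
    // Scalar fallback so the sketch also builds without AVX-512.
    for (int k = 0; k < 8; ++k) {
        out[k] = src[static_cast<std::size_t>(k) * stride];
    }
#endif
}
```

For the case in the question, calling `load_strided8(data, 1000, out)` would read the elements at indices 0, 1000, ..., 7000 of `data`, so `data` must hold at least 7001 doubles.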


>>What is the fastest way to do this.


For efficiency, do not update the index vector on every iteration. Instead, keep the indices constant and advance the base address to the first element of the next strided group to load (this conserves a vector register).
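The advice above can be sketched as a copy loop in which the index vector is built once and only the base pointer moves. The function name `copy_strided`, its signature, and the scalar tail loop are assumptions for illustration; the 64-bit gather variant is used here only so that large strides do not overflow the indices.

```cpp
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: dst[i] = src[i * stride] for i in [0, count).
void copy_strided(const double* src, std::size_t stride,
                  double* dst, std::size_t count) {
    std::size_t i = 0;
#if defined(__AVX512F__)
    const long long s = static_cast<long long>(stride);
    // Index vector built once and never modified inside the loop.
    const __m512i vidx =
        _mm512_setr_epi64(0, s, 2 * s, 3 * s, 4 * s, 5 * s, 6 * s, 7 * s);
    const double* base = src;
    for (; i + 8 <= count; i += 8) {
        // Gather eight doubles relative to the current base address.
        const __m512d v = _mm512_i64gather_pd(vidx, base, 8);
        _mm512_storeu_pd(dst + i, v);
        // Advance the base to the first element of the next group.
        base += 8 * stride;
    }
#endif
    // Scalar loop handles the tail (and everything without AVX-512).
    for (; i < count; ++i) {
        dst[i] = src[i * stride];
    }
}
```

Because `vidx` stays loop-invariant, the compiler can keep it in a single register for the whole loop; whether this is actually faster than other approaches depends on the hardware and the access pattern and should be measured.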


Thanks & Regards,

Noorjahan.


NoorjahanSk_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide an update on your issue?


Thanks & Regards,

Noorjahan.


Rizwan1
Beginner

Thanks. These intrinsics let me do this, but the performance is not what I was expecting. Copying the strided data using Cilk arrays is faster than using the AVX-512 intrinsics.

 

According to the Intel Intrinsics Guide, the latency of a gather is high compared to a load or store. However, in my case the store operation, rather than the gather, is where I see the performance degradation.

 

__m512d _A0 = _mm512_i64gather_pd(vidx , &AS[source_location], 8);
_mm512_storeu_pd(&AD[destination_location], _A0);

 

These copy statements are inside nested loops that are parallelized using OpenMP. The source and destination locations change in every iteration. I observed that the store operation takes most of the time. Is there a good way to optimize it?

 

Thanks

 

NoorjahanSk_Intel
Moderator

Hi,


Could you please provide us with a complete reproducer so that we can investigate your issue further?



Thanks & Regards

Noorjahan.


NoorjahanSk_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide an update on your issue along with the above-requested details?



Thanks & Regards,

Noorjahan.



NoorjahanSk_Intel
Moderator

Hi,


I have not heard back from you, so I will close this inquiry now. If you need further assistance, please post a new question.


Thanks & Regards,

Noorjahan.

