Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

Load data in AVX512 vector with strides

Rizwan1
Beginner
429 Views

Hi,

 

I need to load strided data into an AVX-512 vector register. What is the best way to do this?

Let's suppose the stride is 1000 and I need to load the elements at indices 0, 1*1000, 2*1000, 3*1000, 4*1000, 5*1000, 6*1000 and 7*1000 into one AVX-512 vector register.

What is the fastest way to do this, and which intrinsic should I use? The data are double-precision floating-point numbers.


6 Replies
NoorjahanSk_Intel
Moderator
389 Views

Hi,


Thanks for reaching out to us.


Could you please try the _mm512_i32gather_pd intrinsic to load the data into an AVX-512 register? It takes an index vector together with a scale factor, so the strided positions can be expressed as element indices.

Please refer to the link below for more details:

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#=undefined&ig_expand=3910,4...
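As a minimal sketch of the suggestion above, the following loads eight doubles that are `stride` elements apart and stores them contiguously. The helper name `gather8_strided` is hypothetical, and the scalar fallback is only there so the code also builds when AVX-512 is not enabled at compile time. It assumes that 7 * stride fits in a signed 32-bit integer, which holds for a stride of 1000.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: reads src[0], src[stride], ..., src[7*stride]
// and writes them to dst[0..7].
void gather8_strided(const double* src, int stride, double* dst) {
    assert(stride > 0);
#if defined(__AVX512F__)
    // Eight 32-bit element indices: 0, stride, 2*stride, ..., 7*stride.
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i vidx = _mm256_mullo_epi32(lane, _mm256_set1_epi32(stride));
    // scale = 8 converts element indices to byte offsets for doubles.
    const __m512d v = _mm512_i32gather_pd(vidx, src, 8);
    _mm512_storeu_pd(dst, v);
#else
    // Scalar fallback with the same result.
    for (int i = 0; i < 8; ++i) {
        dst[i] = src[static_cast<std::size_t>(i) * stride];
    }
#endif
}
```

Calling `gather8_strided(data, 1000, out)` would fill `out` with the elements at indices 0, 1000, ..., 7000 of `data`. The intrinsic path is compiled only when the compiler is told to target AVX-512 (for example with `-mavx512f` on GCC or Clang).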


>>What is the fastest way to do this.


For efficiency, do not update the index vector on every iteration. Instead, keep the indices fixed and advance the base address to the first element of the next strided group to load; this conserves a vector register.
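The advice above can be sketched as a loop in which the index vector is built once and only the base pointer changes. The function name `copy_strided_groups`, its parameters, and the output layout (eight contiguous doubles per group) are assumptions made for illustration, not something taken from the original code. The scalar branch is a fallback for builds without AVX-512.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: for each j in [0, count), gathers
// src[j], src[j + stride], ..., src[j + 7*stride] into dst[8*j .. 8*j + 7].
void copy_strided_groups(const double* src, long long stride,
                         std::size_t count, double* dst) {
    assert(stride > 0);
#if defined(__AVX512F__)
    // Built once, outside the loop: 0, stride, ..., 7*stride (64-bit lanes).
    const __m512i vidx = _mm512_setr_epi64(0, stride, 2 * stride, 3 * stride,
                                           4 * stride, 5 * stride,
                                           6 * stride, 7 * stride);
    for (std::size_t j = 0; j < count; ++j) {
        // Only the base address advances; vidx stays untouched.
        const __m512d v = _mm512_i64gather_pd(vidx, src + j, 8);
        _mm512_storeu_pd(dst + 8 * j, v);
    }
#else
    // Scalar fallback with the same result.
    for (std::size_t j = 0; j < count; ++j) {
        for (int k = 0; k < 8; ++k) {
            dst[8 * j + k] = src[j + static_cast<std::size_t>(k) * stride];
        }
    }
#endif
}
```

Because the indices never change, the compiler does not need a second vector register or extra integer arithmetic to rebuild them inside the loop.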


Thanks & Regards,

Noorjahan.


NoorjahanSk_Intel
Moderator
351 Views

Hi,


We haven't heard back from you. Could you please provide an update on your issue?


Thanks & Regards,

Noorjahan.


Rizwan1
Beginner
334 Views

Thanks. These intrinsics do what I need, but the performance is not what I was expecting. Copying the strided data with Cilk array notation is faster than using AVX-512.

 

According to the Intel Intrinsics Guide, the gather latency is high compared to a load or a store. However, I am seeing the performance degradation on the store operation rather than on the gather.

 

__m512d _A0 = _mm512_i64gather_pd(vidx , &AS[source_location], 8);
_mm512_storeu_pd(&AD[destination_location], _A0);

 

These copy operations are within nested loops that are parallelized using OpenMP. The source and destination locations change in every iteration. I observed that the store operation takes most of the time. Is there a good way to optimize it?

 

Thanks

 

NoorjahanSk_Intel
Moderator
319 Views

Hi,


Could you please provide us with a complete reproducer so that we can investigate your issue further?



Thanks & Regards

Noorjahan.


NoorjahanSk_Intel
Moderator
288 Views

Hi,


We haven't heard back from you. Could you please provide an update on your issue along with the above-requested details?



Thanks & Regards,

Noorjahan.



NoorjahanSk_Intel
Moderator
254 Views

Hi,


I have not heard back from you, so I will close this inquiry now. If you need further assistance, please post a new question.


Thanks & Regards,

Noorjahan.

