Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

Store performance using AVX512

Rizwan1
Beginner
556 Views

I am using AVX-512 in my code to copy data from one memory location to another, and the store performance is not what I expected.

For example:

__m512d _A0 = _mm512_loadu_pd(&A[0]);
__m512d _A1 = _mm512_loadu_pd(&A[1000]);

_mm512_storeu_pd(&A[1000], _A1);

If I store the data loaded into _A1 back to the same memory location, performance is fine. But if I update this memory location with the data from _A0, performance is bad:

_mm512_storeu_pd(&A[1000], _A0);

Even if I copy _A0 into _A1 and then store _A1, performance is again bad:

_A1 = _A0;
_mm512_storeu_pd(&A[1000], _A1);

What is the best way to handle this?

A follow-up question: how can I copy the contents of one zmm register into another without a performance penalty? My goal is to load data into a zmm register, manipulate it, and then store it back. Whenever I manipulate the data using a simple assignment (as in the example above, _A1 = _A0), overall performance is very bad.

How can I resolve this issue? Any ideas?


6 Replies
HemanthCH_Intel
Moderator
527 Views

Hi,

 

Thanks for posting in Intel Communities.

 

>>store back at the same memory location then performance is Fine.

Could you please let us know how you are comparing the performance of the intrinsics?

 

Thanks & Regards,

Hemanth

 

Rizwan1
Beginner
508 Views

Dear

 

Please see the example below. If I load data from index 0 and store it at index 1000, performance is bad, but if I store data back to the same index it was loaded from, without changing it, performance is fine.

__m512d _A0 = _mm512_loadu_pd(&A[0]);
__m512d _A1 = _mm512_loadu_pd(&A[1000]);

_mm512_storeu_pd(&A[1000], _A0);        // (Problem)
_mm512_storeu_pd(&A[1000], _A1);        // (Fine)

My goal is to load data from one location in memory and store it at another.

 

Rizwan1
Beginner
480 Views

Any update?

HemanthCH_Intel
Moderator
440 Views

Hi,


We are working on this internally and will get back to you soon.


Thanks & Regards,

Hemanth


Gernot_Intel
Moderator
399 Views

Hey Rizwan1,


Throughput to memory depends on several factors. If you work on cached memory such as DRAM and write the same values back, I am not surprised to see pretty high throughput, because there is no real need to write the data all the way back to DRAM. Compiler and/or hardware optimizations can catch this case and make it very fast. For uncached memory such as I/O space, on the other hand, there should not be such a difference.


Another factor is alignment. If your data is not properly aligned for SIMD instructions, you will have to implement special, slower prologue and epilogue loops to treat the unaligned portions. Furthermore, I would like to note that automatic compiler vectorization can do a pretty good job of using SIMD registers for simple kinds of loops.
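
To illustrate the alignment point, here is a minimal sketch, not a definitive implementation. The helper names is_aligned64 and copy8_aligned are made up for this example, and the array size of 1008 is an assumption chosen only so that A[1000] through A[1007] exist. The AVX-512 path is used only when the compiler defines __AVX512F__ (for example with an AVX-512 target option); otherwise a plain loop is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: true if p sits on a 64-byte boundary,
// the natural alignment of a 512-bit (zmm) load or store.
inline bool is_aligned64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// Hypothetical helper: copy 8 doubles (one zmm register's worth).
// Both pointers are assumed to be 64-byte aligned and non-overlapping.
inline void copy8_aligned(double* dst, const double* src) {
    assert(is_aligned64(dst) && is_aligned64(src));
#if defined(__AVX512F__)
    _mm512_store_pd(dst, _mm512_load_pd(src));  // aligned load, aligned store
#else
    for (std::size_t i = 0; i < 8; ++i) dst[i] = src[i];  // portable fallback
#endif
}

// alignas(64) places A[0] on a 64-byte boundary. A[1000] is at byte offset
// 8000 = 125 * 64, so it is 64-byte aligned as well.
alignas(64) double A[1008];
```

With this layout, copy8_aligned(&A[1000], &A[0]) needs no prologue or epilogue handling. An index such as A[1] would not be 64-byte aligned and would need the unaligned (loadu/storeu) variants instead.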


In any case, I would suggest starting with publicly available memcpy() and memset() reference implementations, e.g. those of Linux/glibc. These implementations have proven to be both stable and performant.
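
As a baseline for comparison, the copy from the question can be written with the standard library alone. This is only a sketch: copy_doubles is a hypothetical wrapper name, and which instructions the compiler emits for it depends on the compiler, the optimization level, and the target options.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy `count` doubles from src to dst with std::memcpy.
// The two ranges must not overlap (std::memmove handles overlapping ranges).
inline void copy_doubles(double* dst, const double* src, std::size_t count) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, count * sizeof(double));
}
```

For the case in this thread the call would be copy_doubles(&A[1000], &A[0], 8). Comparing its timing against the hand-written intrinsics is a quick way to check whether the intrinsics version is actually the slow part.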


I hope this sheds some light on your observations. If you still have open questions please provide a complete (source code) test case including performance measurements/expectations as well as quantitative results, so that we can take a deeper look at it.
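
A complete test case could look roughly like the sketch below. Everything in it is an assumption made for illustration: the names copy8 and time_copy, the layout (the first half of the buffer is copied into the second half), and the use of std::chrono::steady_clock for timing. It is not the code from the original question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Copy 8 doubles from src to dst. Uses unaligned AVX-512 load/store when the
// compiler targets AVX-512, and std::memcpy otherwise.
inline void copy8(double* dst, const double* src) {
#if defined(__AVX512F__)
    _mm512_storeu_pd(dst, _mm512_loadu_pd(src));
#else
    std::memcpy(dst, src, 8 * sizeof(double));
#endif
}

// Hypothetical harness: copies the first half of `data` into the second half
// `repeats` times and returns the elapsed wall-clock time in seconds.
inline double time_copy(std::vector<double>& data, int repeats) {
    assert(repeats > 0);
    // Round the half size down to a multiple of 8 so every copy8 is in range.
    const std::size_t half = data.size() / 2 / 8 * 8;
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        for (std::size_t i = 0; i < half; i += 8) {
            copy8(&data[half + i], &data[i]);
        }
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

When reporting numbers, please also state the buffer size, the repeat count, the compiler version and options, and the processor model. It is also worth checking the generated assembly, since an optimizer may simplify or remove copies whose results are never read.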


Thanks,

Gernot


Gernot_Intel
Moderator
321 Views

Since there do not seem to be any further questions, we consider this thread resolved and will no longer monitor it. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

