Solved: Store performance using AVX512

Rizwan1 · ‎05-18-2022

I am using AVX512 in my code and copying data from one location to another location. Store operation performance is not as expected.

For example

__m512d _A0 = _mm512_loadu_pd(&A[0]);

__m512d _A1 = _mm512_loadu_pd(&A[1000]);

_mm512_storeu_pd(&A[1000], _A1);

If I store the data loaded into _A1 back to the same memory location, performance is fine. But if I try to update this memory location with the data from _A0, performance is bad.

_mm512_storeu_pd(&A[1000], _A0);

Even if I copy the data from _A0 into _A1 and then store it back, performance is bad again.

_A1=_A0;

_mm512_storeu_pd(&A[1000], _A1);

What is the best way to handle this?

A follow-up question: how can I copy the contents of one zmm register into another zmm register without a performance penalty? My goal is to load data into one zmm register, manipulate it, and then store it back. Whenever I manipulate the data using a simple assignment (as in the example above, _A1=_A0), overall performance is very bad.

How can I resolve this issue? Any ideas?

HemanthCH_Intel · ‎05-19-2022

Hi,

Thanks for posting in Intel Communities.

>>store back at the same memory location then performance is Fine.

Could you please let us know how you are comparing the performance of the intrinsics?

Thanks & Regards,

Hemanth

Rizwan1 · ‎05-19-2022

Dear

Please see the example below. If I load data from index 0 and store it at index 1000, performance is much worse, but if I store it back at index 0 without changing the data, performance is fine.

__m512d _A0 = _mm512_loadu_pd(&A[0]);

__m512d _A1 = _mm512_loadu_pd(&A[1000]);

_mm512_storeu_pd(&A[1000], _A0); // problem: slow

_mm512_storeu_pd(&A[1000], _A1); // fine

My goal is to load data from one location and store it at another location in memory.

Rizwan1 · ‎05-22-2022

Any update?

HemanthCH_Intel · ‎05-27-2022

Hi,

We are working on this internally and will get back to you soon.

Thanks & Regards,

Hemanth

Gernot_Intel · ‎06-01-2022

Hey Rizwan1,

Throughput to memory depends on several factors. If you work on cached memory such as DRAM and write the same values back, then I am not surprised to see fairly high throughput. This is because there is no real need to write the data all the way back into DRAM. Compiler and/or hardware optimizations can catch this case and make it very fast. For uncached memory such as I/O space, on the other hand, there should not be such a difference.

Another factor is alignment. If your data is not properly aligned for SIMD instructions, you will have to implement special, slower prologue and epilogue loops to treat the nonaligned portions. Furthermore, I would like to note that automatic compiler vectorization can do a pretty good job of using SIMD registers for simple kinds of loops.
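To make the alignment point concrete, here is a minimal sketch in C of how one might check whether a pointer sits on a 64-byte boundary (the width of a zmm register) and how many elements a scalar prologue loop would need to handle first. The helper names `is_aligned_64` and `doubles_until_aligned` are hypothetical, and the sketch assumes the pointer is already naturally aligned for `double`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p lies on a 64-byte boundary. */
int is_aligned_64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}

/* Hypothetical helper: number of doubles a prologue loop must process
 * before p reaches the next 64-byte boundary (0 if already aligned).
 * Assumption: p is aligned to sizeof(double). */
size_t doubles_until_aligned(const double *p)
{
    uintptr_t addr = (uintptr_t)p;
    assert(addr % sizeof(double) == 0);

    size_t misalign = (size_t)(addr % 64);
    if (misalign == 0) {
        return 0;
    }
    return (64 - misalign) / sizeof(double);
}
```

With a result of `k`, elements `0..k-1` would go through the scalar prologue, and the vector loop could start at `p + k`. Whether aligned accesses actually change throughput on a given CPU is something to measure rather than assume.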

In any case, I would suggest starting with publicly available memcpy() and memset() reference implementations, e.g. those of Linux/glibc. These implementations have proved to be both stable and performant.
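As a rough illustration of using memcpy() as the baseline, the sketch below wraps a copy of `n` doubles in a hypothetical helper named `copy_doubles`. When the compiler defines `__AVX512F__` (for example with `-mavx512f` on GCC or Clang), it uses the same `_mm512_loadu_pd` / `_mm512_storeu_pd` intrinsics as in the question; otherwise it falls back to plain memcpy(). It assumes the source and destination ranges do not overlap, and it is not a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: copy n doubles from src to dst.
 * Assumption: the two ranges do not overlap. */
void copy_doubles(double *dst, const double *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if defined(__AVX512F__)
    size_t i = 0;
    /* Main loop: 8 doubles (64 bytes) per iteration, unaligned accesses. */
    for (; i + 8 <= n; i += 8) {
        __m512d v = _mm512_loadu_pd(src + i);
        _mm512_storeu_pd(dst + i, v);
    }
    /* Epilogue: the remaining 0..7 doubles. */
    if (i < n) {
        memcpy(dst + i, src + i, (n - i) * sizeof(double));
    }
#else
    /* Baseline: let the library implementation pick the strategy. */
    memcpy(dst, src, n * sizeof(double));
#endif
}
```

Comparing both build variants against each other on the same buffers is one way to see whether the hand-written intrinsics gain anything over the library routine.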

I hope this sheds some light on your observations. If you still have open questions please provide a complete (source code) test case including performance measurements/expectations as well as quantitative results, so that we can take a deeper look at it.
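For such a test case, a very small timing harness might look like the sketch below. The helper names `time_copy_seconds` and `copy_gb_per_second` are hypothetical; the sketch uses the standard `clock()` function, which reports CPU time, and reads one element into a `volatile` variable after each copy to discourage the compiler from dropping repeated copies. Treat the numbers it produces as indicative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: run `reps` copies of n doubles from src to dst and
 * return the elapsed CPU time in seconds as reported by clock(). */
double time_copy_seconds(double *dst, const double *src, size_t n, int reps)
{
    assert(dst != NULL && src != NULL && n > 0 && reps > 0);

    volatile double sink = 0.0; /* discourages eliding the copies */
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        memcpy(dst, src, n * sizeof(double));
        sink += dst[(size_t)r % n];
    }
    clock_t end = clock();
    (void)sink;

    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: convert a timing into GB/s of data copied
 * (returns 0 if the measured time is not positive). */
double copy_gb_per_second(size_t n, int reps, double seconds)
{
    if (seconds <= 0.0) {
        return 0.0;
    }
    return (double)n * sizeof(double) * reps / seconds / 1e9;
}
```

Running it once with the destination equal to the region just loaded and once with a different destination, for buffer sizes both smaller and larger than the caches, would give the kind of quantitative comparison requested above.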

Thanks,

Gernot

Gernot_Intel · ‎06-22-2022

Since there do not seem to be any further questions, we consider this thread resolved and will no longer monitor it. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.