Data alignment apparently provides no gain in performance with AVX512

dtroncho · ‎08-29-2022

Hello,

We have been doing some testing with this environment:

Microsoft Visual Studio Community 2022 (64-bit) - Current Version 17.2.6

Intel C++ Compiler 2022 v143

Compiler's command line: /Qfma /QxCORE-AVX512

OpenMP Support: Generate Parallel Code (/Qiopenmp)

I run everything on an 11th Gen Intel Core i7-1165G7 (which apparently supports AVX512).

In the C/C++ code, we have aligned the data to 64 bytes (following this document: https://www.intel.com/content/www/us/en/developer/articles/technical/vectorization-essential.html) by:

Using _mm_malloc and _mm_free, instead of the malloc and free standard calls.
Additionally, in the OpenMP loops, I have added: #pragma omp simd aligned(pMatrixA:ALIGNMENT_BYTES) aligned(pMatrixB:ALIGNMENT_BYTES)

We have not done anything else to use aligned data.

The problem is that we ran several tests with aligned and unaligned data, and we did not measure any gain in performance.

Are we missing anything? Should we know/do anything else?
We may have run the tests in either Release or Debug: should we run them only in Release? That said, we think we did not find any gain in either configuration.

By the way: should data alignment also be beneficial for AVX2? If so, at 64 bytes, at 32 bytes, or does it depend?

I look forward to your help and thanks in advance.

Best regards,

David.

NoorjahanSk_Intel · ‎08-30-2022

Hi,

Thanks for reaching out to us.

>>Microsoft Visual Studio Community 2022 (64-bit) - Current Version 17.2.6

Sorry for the inconvenience, Visual Studio Version 17.2.6 is not supported as of now with the latest Intel oneAPI toolkit. Try using the supported versions and let us know if you are still facing any issues.

Please refer to the below link for the supported versions of the Intel oneAPI toolkit.

https://www.intel.com/content/www/us/en/developer/articles/reference-implementation/intel-compilers-compatibility-with-microsoft-visual-studio-and-xcode.html

If you still face the same issue, please provide us with the following details:

1. Sample reproducer and steps you have followed so that we can try it from our end

2. How do you measure the performance gain?

Thanks & Regards,

Noorjahan.

dtroncho · ‎09-02-2022

Hello,

Thanks for your response.

Regarding the VS version: I cannot go back to the indicated version because Microsoft, as far as I know, does not provide downloads of specific releases. What can I do, then, to get your support?

Regarding your question 1: Sample reproducer and steps you have followed so that we can try it from our end

This is what I do:

Compile the program with option: /QxCORE-AVX512
Reserve memory with: _mm_malloc(pLngSize, 64)
I only inform the compiler of the aligned memory in my omp directives, like this: #pragma omp simd reduction(+:lSngTot) aligned(pMatrixA:64) aligned(pMatrixB:64)
Free memory with: _mm_free(pPtr)
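
The four steps above can be put together in a minimal sketch. It assumes an x86 compiler that provides _mm_malloc/_mm_free through <immintrin.h> and that OpenMP SIMD support is enabled (otherwise the pragma is ignored). The names dot_aligned and dot_demo are hypothetical, not taken from the original program; only pMatrixA, pMatrixB, lSngTot and ALIGNMENT_BYTES come from the posts.

```c
#include <immintrin.h>  /* _mm_malloc, _mm_free */
#include <stddef.h>

#define ALIGNMENT_BYTES 64  /* width of one AVX-512 register */

/* Dot product of two arrays that the caller promises are 64-byte aligned.
   The aligned() clauses are a promise to the compiler, not a check:
   passing an unaligned pointer here is undefined behavior. */
float dot_aligned(const float *pMatrixA, const float *pMatrixB, size_t n)
{
    float lSngTot = 0.0f;
    #pragma omp simd reduction(+:lSngTot) \
        aligned(pMatrixA:ALIGNMENT_BYTES) aligned(pMatrixB:ALIGNMENT_BYTES)
    for (size_t i = 0; i < n; ++i)
        lSngTot += pMatrixA[i] * pMatrixB[i];
    return lSngTot;
}

/* Hypothetical driver: allocate, fill, reduce, free.
   Returns -1.0f if an allocation fails. */
float dot_demo(size_t n)
{
    float *a = (float *)_mm_malloc(n * sizeof(float), ALIGNMENT_BYTES);
    float *b = (float *)_mm_malloc(n * sizeof(float), ALIGNMENT_BYTES);
    float result = -1.0f;
    if (a && b) {
        for (size_t i = 0; i < n; ++i) {
            a[i] = 1.0f;
            b[i] = 2.0f;
        }
        result = dot_aligned(a, b, n);
    }
    if (a) _mm_free(a);
    if (b) _mm_free(b);
    return result;
}
```

Note that the aligned clause only describes the pointer's base address. A loop that starts at an offset into the array (for example, a matrix row whose length is not a multiple of 16 floats) may not actually begin on a 64-byte boundary even though the allocation does.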

Regarding your question 2: How do you measure the performance gain?

My program calculates the estimated time to complete with both aligned and unaligned memory, and there is no significant difference between them. On the contrary, I just measured worse performance with the aligned approach.
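
One way to isolate the effect of alignment from everything else in the program is a small A/B timing harness: same kernel, same data values, only the starting address differs. The following is a sketch under assumptions, not the measurement used in the original program. The names now_seconds, sum_of_squares, best_time and compare_alignment are hypothetical, and the kernel deliberately carries no aligned() clause so that it remains valid for the shifted pointer.

```c
#include <immintrin.h>  /* _mm_malloc, _mm_free */
#include <stddef.h>
#include <time.h>       /* timespec_get (C11) */

/* Wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Kernel under test. No aligned() clause, so any pointer is valid. */
static float sum_of_squares(const float *p, size_t n)
{
    float tot = 0.0f;
    #pragma omp simd reduction(+:tot)
    for (size_t i = 0; i < n; ++i)
        tot += p[i] * p[i];
    return tot;
}

/* Best time over `reps` runs for one pointer. Writing to the volatile
   `sink` keeps the optimizer from discarding the loop. */
static double best_time(const float *p, size_t n, int reps, volatile float *sink)
{
    double best = 1e30;
    for (int r = 0; r < reps; ++r) {
        double t0 = now_seconds();
        *sink = sum_of_squares(p, n);
        double t1 = now_seconds();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Times n floats starting at a 64-byte-aligned address and at an address
   shifted by one float (4 bytes). Returns 0 on success, -1 on failure. */
int compare_alignment(size_t n, int reps, double *t_aligned, double *t_shifted)
{
    volatile float sink = 0.0f;
    float *base = (float *)_mm_malloc((n + 1) * sizeof(float), 64);
    if (!base)
        return -1;
    for (size_t i = 0; i <= n; ++i)
        base[i] = 1.0f;
    *t_aligned = best_time(base, n, reps, &sink);
    *t_shifted = best_time(base + 1, n, reps, &sink);
    _mm_free(base);
    return 0;
}
```

Taking the best of several repetitions reduces noise from cache warm-up and scheduling. Such a comparison is only meaningful in an optimized (Release) build, since a Debug build typically does not vectorize the loop at all.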

Am I missing anything?

I look forward to your help and thanks in advance.

NoorjahanSk_Intel · ‎09-09-2022

Hi,

Thanks for providing the details.

It would be a great help if you provide a complete reproducer so that we could test it from our end.

If you do not want to share the code publicly, you can send your source code by private message. If you are willing to send it, please let us know so that we can contact you privately.

Thanks & Regards,

Noorjahan.

NoorjahanSk_Intel · ‎09-16-2022

Hi,

We haven't heard back from you. Could you please provide an update on your issue?

Thanks & Regards,

Noorjahan.

dtroncho · ‎09-19-2022

Hello Noorjahan,

Please open the attached file. As you can see, there are 4 analyses:

NOT ALIGNED 2L: execution of 2 epochs learning the MNIST handwritten digits with a neural network of 2 layers, executed with memory not aligned.
ALIGNED 2L: exactly the same as before, but memory is aligned, following what is indicated above.
NOT ALIGNED 3L: same as NOT ALIGNED 2L but with a neural network of 3 layers.
ALIGNED 3L: same as ALIGNED 2L but with a neural network of 3 layers.

In all 4 cases:

We are showing, in descending order, the loops that consume the most time of our program.
Executions analyzed with Intel Advisor 2022.1.
Same source code.
Intel C++ Compiler 2022, with command line option: /QxCORE-AVX512
Source code uses OpenMP, highly optimized (I think).
Column "Self GFLOP" has been added to validate that the same (or very very similar) number of operations are performed in the case of aligning and not aligning the memory. As you can see, that is apparently correct.

Questions that arise:

The sum of all Self time is less for the ALIGNED 2L versus the NOT ALIGNED 2L, and the same for the 3L:
- Does that mean that aligning memory results in faster execution?
- Is there any other column in Intel Advisor which would be better to measure faster execution?
Please look at line 3423 (row 12 in the Excel file): the aligned versions are not vectorized. At that source line we were neither informing the compiler of aligned memory (OpenMP clause: aligned) nor forcing vectorization (OpenMP directive: for simd). Therefore, when memory is aligned, the compiler apparently decides not to vectorize that loop. Why? We have now changed the source code to force vectorization of that loop and to inform the compiler of aligned memory in it.
Please look at line 1973 (row 11 in the Excel file): with a 2-layer neural network the compiler apparently decided not to vectorize that loop, whilst with 3 layers it decided to vectorize it with AVX512. The source code of that loop was forcing vectorization (directive: omp simd) and informing the compiler of aligned memory (OpenMP clause: aligned). Therefore, I assume that the compiler generated assembly code that decides at run time, according to the number of iterations of the loop, whether to take the vectorized path or not. Could this be?
This is not a question, just a comment: please look at line 1964 (row 13 in the Excel file). Lines 1964 and 1973 are the outer and inner loop respectively, so it makes sense that when the compiler vectorizes the inner loop (1973), it does not vectorize the outer one (1964).

Probably, our most important challenge now is:

Most of our loops are "memory bound". I assume that our only way to make our program faster is to understand the particularities of our NUMA architecture, understand how data travels, and optimize this so that the cores get a fast and continuous data flow. But that looks like tying our program to particular hardware architectures, which is something we do not want to do. Therefore: what would be your recommendations to increase performance with OpenMP when the loops are "memory bound" and we want to stay hardware independent?

I look forward to your answers and help and thank you in advance.

Best regards,

David.

NoorjahanSk_Intel · ‎09-22-2022

Hi,

Thanks for providing the details.

Data alignment increases the efficiency of data loads and stores to and from the processor.

You have to inform the compiler that the data is aligned at the point where the data is actually used in the program; otherwise the compiler cannot take advantage of the alignment.

Please refer to the below link for more details.

https://www.intel.com/content/www/us/en/developer/articles/technical/data-alignment-to-assist-vectorization.html
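
As an illustration of "informing the compiler where the data is used", here is a minimal sketch using __builtin_assume_aligned, a GCC/Clang builtin. Whether a given toolchain accepts it is an assumption; the Intel classic compilers also offer __assume_aligned(ptr, 64) for the same purpose. The function names saxpy_aligned and saxpy_demo are hypothetical. The key point is that the alignment hint sits inside the function that contains the loop, because an allocation made elsewhere with _mm_malloc is not necessarily visible to the compiler there.

```c
#include <stddef.h>

/* y[i] += a * x[i] over arrays the CALLER guarantees are 64-byte aligned.
   __builtin_assume_aligned returns its argument, annotated with the
   promised alignment; the loop must use the returned pointers. */
void saxpy_aligned(float *restrict y, const float *restrict x, float a, size_t n)
{
    float *ya = (float *)__builtin_assume_aligned(y, 64);
    const float *xa = (const float *)__builtin_assume_aligned(x, 64);
    for (size_t i = 0; i < n; ++i)
        ya[i] += a * xa[i];
}

/* Hypothetical check on two 64-byte-aligned stack arrays
   (16 floats = 64 bytes each, aligned with C11 _Alignas). */
float saxpy_demo(void)
{
    _Alignas(64) float x[16];
    _Alignas(64) float y[16];
    for (int i = 0; i < 16; ++i) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }
    saxpy_aligned(y, x, 2.0f, 16);
    return y[15];  /* 1 + 2 * 15 */
}
```

As with the OpenMP aligned clause, this is a promise rather than a check: if the pointer is not really aligned, the behavior is undefined.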

You can use the Hotspots analysis in Intel VTune to find where your program spends its execution time.

Please refer to the below link for more details:

https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/algorithm-group/basic-hotspots-analysis.html

You can also refer to the below link for more details regarding Intel Advisor

https://www.intel.com/content/www/us/en/develop/documentation/advisor-user-guide/top.html

Thanks & Regards,

Noorjahan.

NoorjahanSk_Intel · ‎09-29-2022

Hi,

We haven't heard back from you. Could you please provide an update on your issue?

Thanks & Regards,

Noorjahan.

NoorjahanSk_Intel · ‎10-07-2022

Hi,

I have not heard back from you, so I will close this inquiry now. If you need further assistance, please post a new question.

Thanks & Regards,

Noorjahan.

Olórin · ‎12-01-2025

Yet another example (among many others) of no valuable help whatsoever being provided regarding your C/C++ compiler. And in the end, as often, the OP gives up and you close the question, increasing your quota of resolved issues when in this case you have resolved nothing, except increasing the time the OP spent trying to get help from you while you perfectly know you won't help. Nice.