Solved: data alignment question

Wobo · 06-28-2022

We have user code that runs fine with the AVX2.0 IPP library functions but crashes with an exception when it is run with the AVX512 IPP library functions.
I implemented some code that forces the program to stop using AVX512 when I want it to.

The computer platform is based on two Intel Xeon Skylake processors and Windows 10 Enterprise LTSC 2019.
I wonder whether it could have something to do with bad data alignment.

Here is something written about it: https://www.intel.com/content/www/us/en/develop/documentation/dev-guide-ipp-for-oneapi/top/programming-considerations/managing-memory-allocations.html

First question: Can the strange exceptions we experience when working with the AVX512 IPP library functions be caused by data alignment that is not done 100% properly in the user code (e.g. partially not using ippsMalloc_xx for every piece of data that is passed to IPP functions)?

Second question: In a user function:

int F1(void)
{
    Ipp32f abc[64];

    ...use abc with IPP functions...

}

In a user function like F1(), where the local variable is probably on the stack, is "abc" not properly aligned because it was not created with ippsMalloc_32f(), and can this lead to problems with IPP library functions, especially under AVX512?
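For a plain local array, the C++ language only guarantees the alignment of the element type (4 bytes for a 32-bit float), so whether it lands on a 64-byte boundary is up to the compiler. A minimal sketch of how this can be checked, and how a 64-byte boundary can be requested with `alignas`, is shown here. It is standard C++ only: `Sample` is a stand-in for `Ipp32f` so that the code builds without the IPP headers, and `is_aligned` and `f1_buffer_is_64_byte_aligned` are hypothetical helper names, not IPP functions.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for Ipp32f (a 32-bit float), used so the sketch builds without IPP headers.
using Sample = float;

// True if p is a multiple of `alignment` bytes (alignment must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical variant of F1(): the local array is explicitly requested
// on a 64-byte boundary instead of relying on the default alignment.
inline bool f1_buffer_is_64_byte_aligned() {
    alignas(64) Sample abc[64];
    abc[0] = 0.0f;  // touch the array so it is not optimized away
    return is_aligned(abc, 64);
}
```

Calling `is_aligned(abc, 64)` on the original, unannotated array would show what the compiler in use actually does for that function.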

Regards

NoorjahanSk_Intel · 06-29-2022

Hi,

Thanks for reaching out to us.

Could you please provide sample reproducer code, along with the steps to reproduce the issue, so that we can investigate further on our end?

Thanks & Regards,

Noorjahan

Wobo · 06-30-2022

Hello,

The code is really huge and very complex and consists of many different threads. Exceptions occur in different places, always when the (IPP) DFT tries to compute. However, I wrote a simpler test program in which I also use some IPP functions and the DFT computation. This simpler program (which of course also computes in 24 parallel threads on 24 Skylake cores, and in which I always aligned the data properly with ippsMalloc_xx) does not show any exception, with either AVX2.0 or AVX512 IPP functionality. So I don't want to blame the IPP DFT; I could imagine that we do something odd or bad concerning data alignment as IPP requires it, especially when using AVX512 IPP functionality (when using only SIMD up to AVX2.0, we have no problems with the mentioned code).

We use the (Embarcadero) C++ Builder 64-bit clang compiler. Even if I could send you our problematic code (I can't, it is confidential), you probably could not use it because of the different compiler. However, can somebody answer the two questions I asked in my previous post? I think somebody, maybe from Intel, who knows the working principles of IPP very well could answer them. I repeat them here for convenience.

First question: Can the strange exceptions we experience when working with the AVX512 IPP library functions be caused by data alignment that is not done 100% properly in the user code (e.g. not using ippsMalloc_xx for every piece of data that is passed to IPP functions)?

(https://www.intel.com/content/www/us/en/develop/documentation/dev-guide-ipp-for-oneapi/top/programming-considerations/managing-memory-allocations.html)

Second question: In a user function:

int F1(void)
{
    Ipp32f abc[64];

    ...use abc with IPP functions...

}

In a user function like F1(), where the local variable is probably on the stack, is "abc" not properly aligned because it was not created with ippsMalloc_32f, and can this lead to problems with IPP library functions, especially under AVX512?

Regards

Wobo · 07-06-2022

Hello again,

Since I did not get an answer to either of my two questions, I evaluated the situation further by myself.

The main problem is: why does the DFT, when working with AVX-512, sporadically cause exceptions of different kinds (floating point overflow, illegal floating point operation, ...)? They are mostly seen after about 2 minutes, about 15 minutes or about 27 minutes of heavy DFT computation.
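One way to narrow down where such floating-point conditions arise is to poll the floating-point status flags around a suspect block of work. The following is only a sketch using the standard `<cfenv>` facilities, not an IPP feature: `fp_flags_raised_by` is a hypothetical helper name, and the callable passed to it stands for whatever computation is being examined. Whether a raised flag also turns into a trapped exception depends on the floating-point control settings of the runtime in use.

```cpp
#include <cfenv>
#include <string>

// Runs `work`, then reports which floating-point status flags it raised.
// The flags are per thread, so this must run on the thread doing the work.
template <typename Work>
std::string fp_flags_raised_by(Work work) {
    std::feclearexcept(FE_ALL_EXCEPT);       // start from a clean flag state
    work();                                  // computation under examination
    const int raised = std::fetestexcept(FE_ALL_EXCEPT);

    std::string names;
    if (raised & FE_OVERFLOW)  names += "overflow ";
    if (raised & FE_INVALID)   names += "invalid ";
    if (raised & FE_DIVBYZERO) names += "divbyzero ";
    return names;                            // empty if none of the three was raised
}
```

Wrapping individual processing stages this way can show which stage first raises an overflow or invalid-operation flag, independent of where the exception is eventually reported.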

When DFT is used with SIMD up to AVX2.0, I see absolutely no problem (always using the same code, just with AVX-512 enabled or disabled for the IPP libraries, using 64-bit compile/DLLs). The code also uses quite a lot of other IPP functions in between.

Finally, I completely replaced DFT with FFT. This way the program works absolutely stably, with no more sporadic exceptions. Everything is fine, except that the computation takes longer because FFT requires a power-of-2 sample length. This was the reason for using DFT in the first place: to avoid unnecessary calculation time. However, DFT does not work as stably as FFT does (at least with our code).

However, back to my two questions from my previous post: I now think the DFT problem is probably not related to data alignment issues, because the FFT counterpart obviously does not care about this and works totally stably (even when using AVX-512) in the exact same code and delivers the expected results.
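The cost of the power-of-2 requirement can be estimated from the sample length alone. The following sketch assumes the signal is zero-padded up to the next power of two before the FFT; `next_pow2` and `padding_overhead` are hypothetical helper names, and the result is a ratio of sample counts, not a measured runtime.

```cpp
#include <cstddef>

// Smallest power of two that is >= n (returns 1 for n == 0).
// Assumes n is small enough that the result fits in std::size_t.
constexpr std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Fraction of extra samples processed when a length-n signal
// is zero-padded to the next power of two.
constexpr double padding_overhead(std::size_t n) {
    return n == 0 ? 0.0
                  : static_cast<double>(next_pow2(n) - n) / static_cast<double>(n);
}
```

For example, a length of 1025 pads to 2048, so almost twice as many samples are transformed, while a length of 1024 needs no padding at all.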

Behavior of our code in short:

DFT SIMD up to AVX2.0, everything fine
FFT SIMD up to AVX2.0, everything fine
DFT SIMD up to AVX-512, sporadic exceptions
FFT SIMD up to AVX-512, everything fine

Of course I can't provide the code to anybody else; it is confidential. A quite simple (but also multi-threaded, 24 threads) DFT test program with only 2 other IPP functions in use (ippsZero_32fc, ippsCopy_32fc) did not cause this kind of DFT exception when working with AVX-512, even after hours of computation. Therefore, the whole thing remains a mystery to me...

Regards

NoorjahanSk_Intel · 07-06-2022

Hi,

We are working on your issue. We will get back to you soon.

Thanks & Regards,

Noorjahan.

Ruqiu_C_Intel · 07-08-2022

Hi,

IPP does not force users to align their data to a particular byte boundary in their applications. But if the data is aligned, say to 64 bytes for AVX512, it will perform better.

Based on our internal testing, we were unable to reproduce the issue. So please provide a sample reproducer for us to investigate.

Have a good day.

BRs,

Ruqiu

Wobo · 07-08-2022

>>IPP does not force users to align their data to a particular byte boundary in their applications.
OK, thank you for your answer. So I think the sporadic crashes of DFT under AVX-512 do not come from bad data alignment.

>>But if the data is aligned, say to 64 bytes for AVX512, it will perform better.
I tested this yesterday too, and yes, I saw a small performance gain when the data used was properly aligned.
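For heap data that does not come from ippsMalloc_xx, a 64-byte boundary can also be requested in standard C++. This is a minimal sketch assuming a C++17 compiler with aligned `operator new`; `AlignedDelete64`, `AlignedFloats` and `make_aligned_floats` are hypothetical names, and the code is not a replacement for the IPP allocation functions.

```cpp
#include <cstddef>
#include <memory>
#include <new>

// Deleter matching the aligned form of operator new used below.
struct AlignedDelete64 {
    void operator()(float* p) const noexcept {
        ::operator delete(static_cast<void*>(p), std::align_val_t(64));
    }
};

using AlignedFloats = std::unique_ptr<float[], AlignedDelete64>;

// Allocates `count` uninitialized floats on a 64-byte boundary (C++17).
inline AlignedFloats make_aligned_floats(std::size_t count) {
    void* raw = ::operator new(count * sizeof(float), std::align_val_t(64));
    return AlignedFloats(static_cast<float*>(raw));
}
```

The `unique_ptr` releases the buffer with the matching aligned delete, so the alignment value only has to be stated in one place.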

>>Based on our internal testing, we were unable to reproduce the issue.
Reproducing this issue is very, very hard. In most cases you have to do DFT calculations (together with lots of other IPP functions in our code) for at least about 30 minutes. After that time it is fairly certain that the DFT crash occurs (when working with AVX-512).

This never happens if you force the library not to use the AVX-512 instruction set and to use SIMD up to AVX2.0 only. I have seen this effect for quite a long time: years ago on a dual Haswell/Windows platform, lately on a dual Skylake/Windows platform (2 x Xeon 6146), and yesterday I found the same on an Ice Lake/Windows platform (2 x Xeon 6346). Over the years I saw this with a few different versions of the IPP library (currently using the latest).

However, runtime tests showed that on Skylake, using SIMD up to AVX2.0 only is a couple of minutes faster than using SIMD up to AVX-512.

On the Ice Lake platform, AVX2.0 is slightly faster than or maybe equal to AVX-512, at least with our application, which makes heavy use of IPP functions with many float and some double floating point operations.

Hence, we have no advantage from using AVX-512, so we stick to AVX2.0 and DFT. This works fastest for us and, most importantly, is stable!

>>So please provide a sample reproducer for us to investigate.
A simple test program I wrote has not reproduced the mentioned issue so far, but it is far simpler than our production code (which uses many, many different IPP functions), which has the issue.

Hence, I currently don't have any code that I could send you to reproduce this issue. If this changes someday, I could send you code in the form of an Embarcadero C++ project (including the corresponding C++ source files). I don't believe that you could use it directly in that form, but we only use that IDE.
Our production code cannot be sent to you.

Thanks for your effort.

Regards

Ruqiu_C_Intel · 08-25-2022

Hello,

We can't reproduce your issue without a simple reproducer. We hope you have already fixed the issue.

Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.

Regards,

Ruqiu

Wobo · 08-25-2022

Hello,

I wrote a much more complex test program to check the situation with DFT in combination with AVX-512. This test program uses a similar number of IPP functions (and has similar functionality) as the problematic program. However, my more complex test program works absolutely stably, even when DFT is used with AVX-512. So I tend to see the problem somewhere in our problematic program (even though it works stably using DFT under AVX2) and not in any IPP function. Maybe we will find the real reason/bug in our program some day. However, it is not very urgent now, because the program runs stably and as expected under AVX2.

Regards