I have observed some unexplained behavior with IPP running on a Xeon Gold 5218N processor and could use some additional information.
I am performing multiple 8K FFTs using Intel's IPP library from the oneAPI toolkit. All data structures are pre-allocated, and IPP is configured to use the AVX-512 code path. I noticed that after AVX-512 has been idle for some period of time, the first FFT takes significantly longer than the follow-up FFTs: roughly 3.5-4x slower.
Based on other comments about AVX-512 behavior, I have ruled out caching, because I can keep AVX-512 "warm" by issuing dummy AVX-512 OR instructions via inline asm. With that warmup in place, the first FFT is only marginally slower than the follow-up FFTs (which I am happy to attribute to cache effects). I also do not see this extreme timing variance when I use AVX2 instead.
I would like to have a correct understanding of what causes this behavior and how to mitigate it, rather than relying on empirical testing that will miss corner cases and causes. My first thought was that this is an architecture design question for Intel, but when I reached out they indicated that this was an IPP issue and that I should seek help here instead.
Hi Eric,
Thanks for posting in Intel Communities.
Thanks for sharing the details. Could you please confirm the environment details (IPP version, OS, etc.)? It would also be a great help if you could share a sample reproducer, as that would help us reproduce the issue in our environment and assist you accordingly.
Best Regards,
Shanmukh.SS
Thank you for your patience.
I am running on RedHawk 7 with Intel oneAPI 2022.2.
Please see the attached file for the code; there may be a few typos that need correcting.
Note the startup time that occurs after the first sleep, and then the significantly reduced startup time after the second sleep, where I manually run some AVX-512 instructions beforehand. Critically, this only works if I run AVX-512 instructions: if I run 256-bit or 128-bit vector instructions instead, the startup delay does not change.
In my other code it is sufficient to run the occasional AVX-512 instruction as part of a polling loop.
Hi Eric,
Thanks for sharing the details. We were able to run the shared source code.
We would like to request that you use the latest version of Intel oneAPI (2023.1.0), available for the listed supported OS/configurations, and let us know if the issue persists.
Please find the link below, which provides information on the system requirements for Intel® Integrated Performance Primitives (Intel® IPP).
Best Regards,
Shanmukh.SS
Hi Eric,
A gentle reminder:
Could you please get back to us on whether the issue persists with the latest version of Intel oneAPI (2023.1.0), available for the listed supported OS/configurations?
Best Regards,
Shanmukh.SS
Hi Shanmukh, thank you for your patience.
I compiled the above code using the most recent oneAPI (2023.1.0) and ran it on RHEL 8.7, on the same hardware as the previous tests.
I observed the same behavior: significant slowdowns in IPP/AVX-512 performance after the program has been waiting, which can be alleviated by calling dummy AVX-512 instructions beforehand.
-Eric
Hi Eric,
Thanks for your reply.
Depending on the specific IPP functions you're using, there may be initialization or setup operations that need to be performed before the actual computation. These initialization steps can introduce additional overhead in the first iteration; in subsequent iterations that work has already been done, resulting in faster execution.
Modern processors also have multiple levels of cache memory that store frequently accessed data. During the initial iteration, the necessary data may not yet be present in the cache and must be fetched from main memory, which slows performance. In subsequent iterations the data is more likely to be in the cache, resulting in faster access times and improved performance.
In addition, note that the performance behavior may vary depending on the specific code, compiler optimizations, hardware architecture, and other factors. Could you please get back to us if you need any other information from our side?
Best Regards,
Shanmukh.SS
Hi Eric,
A gentle reminder:
Has the information provided helped? Could you please let us know if we could help you with any other information?
Best Regards,
Shanmukh.SS