
geetanjali_b_ · ‎07-23-2019

Recently replacing a loop with the "ippsMulC_32f_I" function introduced a performance slowdown on the first few iterations.

After a few iterations, the performance is good.

IPP Version: IPP 7.0

Linking Type: Static Linking

Is there any way to avoid this initial slowdown and improve the performance?

My application code is compiled with the -mtune=corei7-avx optimization.

I assume IPP 7.0 is optimized for SSE3. Could an AVX-to-SSE transition cause a problem? If so, why does it affect only the first few iterations, with good performance afterwards?

Igor_A_Intel · ‎07-24-2019

Hi,

IPP 7.0 has an AVX-optimized code path for ippsMulC_32f_I, so if you mean the AVX->SSE transition penalty, it should not be the root cause of the issue you are facing. First, check that the E9 version of the IPP code (for x64; G9 for ia32) is dispatched. To do so, insert the following lines before the IPP function call:

    const IppLibraryVersion *lib;

    lib = ippsGetLibVersion();
    printf( "CPU       : %s\n", lib->targetCpu );
    printf( "Name      : %s\n", lib->Name );
    printf( "Version   : %s\n", lib->Version );
    printf( "Build date: %s\n", lib->BuildDate );

regards, Igor

geetanjali_b_ · ‎07-24-2019

Hello Igor,

Thanks for your reply. I have just observed this behaviour. Could you please also answer the few questions posted at the end of this comment?

1. Here are the results of the code:

CPU : m7
Name : ippsm7_l.a
Version : 7.0 build 205.105
Build date: Apr 7 2012

2. With the ippInit() call:

ippGetCpuType() returns 70 (0x46) => ippCpuAVX
ippGetEnabledCpuFeatures() returns 4063 => AVX is enabled

3. Without the ippInit() call:

ippGetCpuType() returns 70 (0x46) => ippCpuAVX
ippGetEnabledCpuFeatures() returns 7 => SSE2

So it's clear that the ippInit() call is what enables the "AVX" optimizations for the detected CPU type.

I could also see a slight improvement in performance after adding "ippSetNumThreads(1)", even though this call returns ippStsNoOperation.
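As a rough illustration of the two masks reported above, the sketch below decodes them bit by bit. The FEAT_* names and the three helper functions are hypothetical, and the bit values are assumed to mirror the ippCPUID_* flags in ippdefs.h, so check them against the header shipped with your IPP version.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names; values assumed to match the ippCPUID_* flags in ippdefs.h. */
enum {
    FEAT_MMX    = 0x001, FEAT_SSE   = 0x002, FEAT_SSE2  = 0x004,
    FEAT_SSE3   = 0x008, FEAT_SSSE3 = 0x010, FEAT_MOVBE = 0x020,
    FEAT_SSE41  = 0x040, FEAT_SSE42 = 0x080, FEAT_AVX   = 0x100,
    FEAT_AVX_OS = 0x200, FEAT_AES   = 0x400, FEAT_CLMUL = 0x800
};

/* Returns 1 if every bit of `bits` is set in `mask`. */
int has_feature(unsigned long long mask, unsigned long long bits)
{
    assert(bits != 0);
    return (mask & bits) == bits;
}

/* AVX code can only run if the CPU has it and the OS saves the YMM state. */
int avx_usable(unsigned long long mask)
{
    return has_feature(mask, FEAT_AVX | FEAT_AVX_OS);
}

/* Prints the names of all known feature bits set in `mask`. */
void print_features(unsigned long long mask)
{
    static const char *const names[] = {
        "MMX", "SSE", "SSE2", "SSE3", "SSSE3", "MOVBE",
        "SSE4.1", "SSE4.2", "AVX", "AVX(OS)", "AES", "CLMUL"
    };
    for (int i = 0; i < 12; ++i)
        if (mask & (1ULL << i))
            printf("%s ", names[i]);
    printf("\n");
}
```

Under these assumed bit values, the mask 7 decodes to MMX, SSE and SSE2 only, while 4063 (0xFDF) has every listed bit set except MOVBE, including both AVX bits.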

Questions:

1. Other than ippInit(), are any other IPP calls required to avoid the slowness on the first iteration?

2. If the libraries are linked statically, the number of threads is always 1. How can "ippSetNumThreads(1)" still add a performance benefit on the first iteration?

3. Is there any other way to increase the performance for first iteration?

geetanjali_b_ · ‎07-29-2019

Any update on the above questions?

Igor_A_Intel · ‎07-30-2019

Hi,

The optimized version of the code is significantly longer than a simple C loop - it has unrolling, an alignment prolog and a tail-processing epilog - and therefore takes significantly more space in the instruction cache. This is one of the reasons why the 1st iteration is always slower than the following ones. I haven't seen your measuring code, but if you use the same data buffers for the 1st and the following measuring loops, then the same reasoning applies to the data caches. Another reason is the branch prediction unit: it gathers branch statistics during the 1st loop and applies them to the following ones.

Of course there can be other reasons, which can be found and analysed with a special tool - for example Intel(r) Amplifier. All the reasons can therefore become clear only after a thorough investigation of your code with such a tool. At this point we can only offer some guesses and some general reasons.

Regarding your numbered questions:

1) In the latest IPP versions, if ippInit() has not been called before the first call to a processing function, the library initialization (detecting the CPU capabilities and selecting the most appropriate code path) is performed during the 1st call to any IPP processing function. That can be a reason for the slowness of the 1st iteration, but all further calls will use the best code path. For 7.0 the only way to initialize the library is ippInit(). No other calls are required.

2) ippSetNumThreads() does nothing in a single-threaded static library. A call to this function can therefore influence performance only if its code lies on the same code page as the code executed afterwards (the probability that the MulC code is on the same page is low, but for example ippGetCpuClocks() from the same core domain has a high probability - if it is used for the measurements).

3) I guess not. To warm the instruction and data caches (and the branch predictor and other units) you have to execute this code at least once.
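The advice above (initialize once, then run the code at least once before measuring) can be sketched as a small timing harness. This is only a sketch under stated assumptions: scale_in_place(), init_once(), time_calls() and the USE_IPP macro are hypothetical names, and without -DUSE_IPP a plain C loop stands in for ippsMulC_32f_I so the code builds without IPP.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#ifdef USE_IPP              /* opt-in: build with -DUSE_IPP and link IPP */
#include <ipp.h>
#endif

/* One-time setup: let IPP select the CPU-specific code path. */
void init_once(void)
{
#ifdef USE_IPP
    ippInit();
#endif
}

/* In-place multiply of buf[0..len) by val. */
void scale_in_place(float *buf, int len, float val)
{
    assert(buf != NULL && len >= 0);
#ifdef USE_IPP
    ippsMulC_32f_I(val, buf, len);
#else
    for (int i = 0; i < len; ++i)   /* plain C stand-in for the IPP call */
        buf[i] *= val;
#endif
}

/* Runs `warmup` untimed calls, then `iters` timed calls;
   stores the seconds taken by each timed call in out[0..iters). */
void time_calls(float *buf, int len, float val,
                int warmup, int iters, double *out)
{
    for (int w = 0; w < warmup; ++w)
        scale_in_place(buf, len, val);
    for (int i = 0; i < iters; ++i) {
        clock_t t0 = clock();
        scale_in_place(buf, len, val);
        out[i] = (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
}
```

Calling init_once() at startup and then time_calls() with warmup set to 0 and to 1 should make the first-call penalty visible as the difference in out[0]. clock() is coarse, so a large buffer or a finer timer (such as the ippGetCpuClocks() mentioned above) is probably needed for meaningful numbers.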

regards, Igor

geetanjali_b_ · ‎07-30-2019

Hello Igor,

Thanks for your response. I will try collecting further details with Intel(r) Amplifier.

Please find my responses to your assumptions below.

"if you use the same data buffers for the 1st and the next measuring loops - then the same reason is true for the data caches"

Ans: Yes, I am using the same data for consecutive iterations as well.

"1) in the latest IPP versions, if ippInit() has not been called before any call to some processing function, the initialization of the library (detecting CPU capabilities and selecting the most appropriate code path) is performed during the 1st call to any processing IPP function - that can be the reason of the 1st iteration slowness - but all further calls will use the best code path. For 7.0 the only possible way to initialize the library is ippInit(). No other calls are required."

Question:

With IPP 7.0, can the second iteration use an optimized code path, or is it set to "SSE2" for all iterations?

"(it's low probability that MulC code is on the same page, but for example ippGetCpuClocks() from the same core domain has high probability - if it is used for measurements)."

Question:

So having this call will not hurt the performance, and it may even add some benefit?

Igor_A_Intel · ‎07-30-2019

1) Hm... If you use the same data for consecutive operations, this is the main reason for the picture you see. All other reasons, such as warming the instruction cache and the branch predictor, are in 10th place compared with warming the data cache: loading data from memory is measured in hundreds of nanoseconds, while loading from the different cache levels is measured in CPU clocks, from 2 (1st level) to ~50 (last level).

2) In 7.0 (and all other versions before 9.0 or 9.1 - I don't remember exactly) the static libraries have no auto-initialization feature. Therefore, if ippInit() is not called, the lowest optimization level in the library (the default code version) is used - SSE3 (m7, Intel64) in your case.

3) Regarding "add some benefit" - it was only an assumption. For the most part it should not affect performance at all. But of course there is a non-zero probability that other calls have an influence if their code is located on the same code page (usually 4 KB) as your function of interest - this is the usual case for dynamic libraries, which are loaded page by page on Windows.
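To make the "same code page" condition concrete, here is a minimal sketch of the check. same_page() is a hypothetical helper, and the 4096-byte page size used in the usage note is an assumption that should be queried from the OS in real code.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addresses a and b fall in the same page of page_size bytes.
   page_size must be a power of two (4096 on typical x86 systems). */
int same_page(uintptr_t a, uintptr_t b, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    /* Clearing the low bits leaves the page base address. */
    return (a & ~(page_size - 1)) == (b & ~(page_size - 1));
}
```

For example, same_page(0x1000, 0x1FFF, 4096) is true, while same_page(0x1FFF, 0x2000, 4096) is false. Applying it to function addresses requires casting a function pointer to uintptr_t, which common x86 compilers accept but the C standard does not guarantee.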

regards, Igor

geetanjali_b_ · ‎07-30-2019

Hello Igor,

Thanks for your clarifications. :) :)

It's clear now. I will continue the analysis with "Amplifier" and post my observations if I find anything.

Using ippsMulC_32f_I introduces some initial performance slowdown