ippsFIR64f_32f unusual processor usage

s_smirnov · ‎01-12-2012

We found an issue with ippsFIR64f_32f() (IPP 6.0.1.070), which we use for real-time data filtering. When the data length changes from 1600 samples to 1601 samples, CPU usage jumps from 2% to nearly 100%. The complete Visual Studio 2010 test project is attached.

Thomas_Jensen1 · ‎01-13-2012

You allocate the taps in C# code, but I think you should use the IPP allocation functions for all or most array inputs to IPP functions. That ensures proper memory alignment and thus the highest processing speed.

s_smirnov · ‎01-13-2012

Following your suggestion, we now do all memory allocation through IPP functions, but without success. You can check the attached Visual Studio project. Note that both cases use the same memory block; only one parameter (the block length) changes.

igorastakhov · ‎01-13-2012

1600 is the threshold for switching from single-threaded (ST) to multithreaded (MT) code. If you don't see any benefit from multithreading, you should link with the non-threaded static library or set the number of IPP threads to 1.

Regards,
Igor

Thomas_Jensen1 · ‎01-14-2012

I have also seen that certain IPP functions switch implementation modes when parameters exceed certain limits; this is smart coding in my opinion.

I do miss documentation of these implementation modes for each function that has them.

Thomas_Jensen1 · ‎01-14-2012

I guess the memory alignment was not the reason it was slow.

So, if IPP switches from single threading to multithreading when you go from 1600 to 1601 samples, and you then see CPU usage rise to 100%, I would say your code is not set up correctly for that.

Of course, I don't know how you multithread, if you multithread.

Your app can do its own multithreading, but then you should let IPP run single-threaded.
If your app has no multithreading, you should let IPP do the multithreading by calling SetNumThreads(NumCPUs_That_Are_Not_used_in_Other_Threaded_Code) and by using the properly multithreaded IPP libraries, using OpenMP for instance.

If you have a dual-core HT (hyper-threaded) CPU, then I guess you should use 2 threads, since the two extra HT threads do not provide full CPU performance.

My point is, if you over-thread your app, performance will suffer.
Over-threading is when you tell your code to use more threads than your CPU can run at 100%. So a 2-core HT CPU should run 2 threads, a 2-core non-HT CPU 2 threads, a 4-core non-HT CPU 4 threads, and a 4-core HT CPU 4 threads. Let the slower HT threads be used by the OS or the UI.
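
The rule of thumb above can be written down as a small C helper. This is only a sketch: the function name and its parameters are made up for illustration and are not part of IPP. It counts physical cores only (HT siblings are ignored, as suggested above) and subtracts the threads the application already keeps busy.

```c
#include <assert.h>

/* Hypothetical helper, not an IPP function.
 * physical_cores: number of physical cores (HT siblings NOT counted).
 * app_threads:    worker threads the application already runs itself.
 * Returns the thread count to hand to the library, never less than 1. */
static int recommended_ipp_threads(int physical_cores, int app_threads)
{
    assert(physical_cores >= 1 && app_threads >= 0);
    int free_cores = physical_cores - app_threads;
    return free_cores > 1 ? free_cores : 1;
}
```

The result would then be passed to the IPP thread-count setter (ippSetNumThreads in the IPP releases of that period, as far as I know). A return value of 1 corresponds to the "app threads itself, IPP stays single-threaded" case.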

igorastakhov · ‎01-15-2012

Thomas,

yes, we don't document all internal criteria; they are specific to each architecture. For example, this particular one is as follows:

#ifdef _OPENMP
#include
#define STRT_OMP_DIR_R 1600
#define STRT_OMP_FFT_R 1600
#define STRT_OMP_DIR_C 800
#ifdef FIR_OPT_HT
#define STRT_OMP_FFT_C 800
#else
#define STRT_OMP_FFT_C 800
#endif
#endif

so you see, there is one more implementation (FIR via FFT) and a separate criterion for HT. We can't overload the documentation with all this stuff...
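
To make the effect of such a threshold concrete, here is a minimal C sketch of a length-based dispatch. It is not the IPP source: the enum and the function are hypothetical, and only the 1600 value and the "1600 stays single-threaded, 1601 goes threaded" behavior come from this thread.

```c
#include <assert.h>

#define STRT_OMP_DIR_R 1600   /* threshold value quoted in this thread */

typedef enum { FIR_PATH_SINGLE, FIR_PATH_THREADED } fir_path;

/* Hypothetical dispatcher: pick the threaded path only when more than
 * one thread is allowed AND the block is longer than the threshold. */
static fir_path choose_fir_path(int num_samples, int num_threads)
{
    assert(num_samples >= 0 && num_threads >= 1);
    if (num_threads > 1 && num_samples > STRT_OMP_DIR_R)
        return FIR_PATH_THREADED;
    return FIR_PATH_SINGLE;
}
```

With this model, a 1600-sample block stays on the single-threaded path and a 1601-sample block crosses over, which matches what was reported. Setting the thread count to 1 keeps every block on the single-threaded path.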

The 100% CPU load is an issue with the OMP version used. Try setting the block time at the beginning of the application via either an environment variable or an API call, e.g.

set KMP_BLOCKTIME=200

or

kmp_set_defaults("KMP_BLOCKTIME=200");

or

kmp_set_blocktime(200);

This should decrease CPU usage. There is no oversubscription; nested threading is disabled by default.

Regards,
Igor

s_smirnov · ‎01-16-2012

Unfortunately, we could not call kmp_set_blocktime() from C#.

This code:

[DllImport("libiomp5md.dll")]
static extern void kmp_set_blocktime(int value);

kmp_set_blocktime(200);

causes an error:

A call to PInvoke function 'IppsFIR64f_32f_Test!IppsFIR64f_32f_Test.MainForm::kmp_set_blocktime' has unbalanced the stack. This is likely because the managed PInvoke signature does not match the unmanaged target signature. Check that the calling convention and parameters of the PInvoke signature match the target unmanaged signature.

Please could you help us to correct this code?

igorastakhov · ‎01-16-2012

First of all, we need to confirm that the issue is really connected with the block time, so could you try setting the environment variable

set KMP_BLOCKTIME=200

If that solves your issue, then we can think about how to call OMP runtime functions from C#.

Regards,
Igor

s_smirnov · ‎01-16-2012

Thank you for the information. We ran a test with the environment variable, and the value "KMP_BLOCKTIME=0" solved the CPU usage problem. Now we are trying to find a way to set this variable from C# code.
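
One possible approach is to set the variable for the current process from code, before the first threaded IPP call. Below is a minimal sketch in C. The helper name is made up, and it is an assumption that the OpenMP runtime reads KMP_BLOCKTIME only when it initializes, so the call would have to come first.

```c
#define _POSIX_C_SOURCE 200112L   /* for setenv() on POSIX systems */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: sets KMP_BLOCKTIME for the current process.
 * Returns 0 on success, nonzero on failure. */
static int set_kmp_blocktime_env(const char *millis)
{
    assert(millis != NULL);
#if defined(_WIN32)
    return _putenv_s("KMP_BLOCKTIME", millis);    /* MSVC CRT */
#else
    return setenv("KMP_BLOCKTIME", millis, 1);    /* 1 = overwrite */
#endif
}
```

Calling set_kmp_blocktime_env("0") at program start mirrors the "KMP_BLOCKTIME=0" test above. In C# the corresponding call would be Environment.SetEnvironmentVariable("KMP_BLOCKTIME", "0"), under the same assumption about ordering.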

igorastakhov · ‎01-16-2012

"kmp" functions are OMP functions, so I guess you need their prototypes or the "omp.h" file for the Intel OMP implementation (libguideXXX.dll).

# if defined(_WIN32)
#  define __KAI_KMPC_CONVENTION __cdecl
# else
#  define __KAI_KMPC_CONVENTION
# endif

extern void __KAI_KMPC_CONVENTION kmp_set_blocktime (int);
extern void __KAI_KMPC_CONVENTION kmp_set_defaults (char const *);

Regards,
Igor
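
The prototypes above declare the functions as __cdecl on Windows, which is consistent with the "unbalanced the stack" error reported earlier: on Windows, DllImport uses the stdcall convention unless told otherwise, so a C# declaration would probably need CallingConvention = CallingConvention.Cdecl (not verified against this project). The C sketch below shows the same idea with a function pointer whose type carries the convention. The macro name, the typedef, and the stand-in function are all hypothetical; the stand-in replaces the real runtime call so the sketch runs without the OpenMP DLL.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_WIN32)
#  define KMP_CONVENTION __cdecl   /* same convention as the header above */
#else
#  define KMP_CONVENTION
#endif

/* Pointer type that matches "void kmp_set_blocktime(int)" plus its convention. */
typedef void (KMP_CONVENTION *kmp_set_blocktime_fn)(int);

/* Stand-in for the real runtime function; it only records the value. */
static int last_blocktime = -1;
static void KMP_CONVENTION fake_kmp_set_blocktime(int ms) { last_blocktime = ms; }

/* Calls through the pointer; caller and callee agree on who cleans the stack. */
static void apply_blocktime(kmp_set_blocktime_fn fn, int ms)
{
    assert(fn != NULL && ms >= 0);
    fn(ms);
}
```

In a real program the pointer would come from the loaded runtime library rather than from the stand-in.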

levicki · ‎01-17-2012

Igor, I disagree about not having such dynamic behavior documented. If it is unpredictable and can cause issues and headaches for developers (as it turned out here), then it has to be documented.

I cringe whenever I look at IPP documentation which looks machine generated and which always presumes that those who use IPP must know everything on the particular subject.

igorastakhov · ‎01-17-2012

Hi Igor,

agree, the documentation should be improved (it's not "machine generated"); that is one of the main goals for upcoming releases. Anyway, almost all functionality and algorithms used in IPP are compatible with Matlab, so our documentation provides enough information on function parameters and return statuses, and you can always pick up additional information on DSP or image processing from the web, Wikipedia, Matlab help, etc. IPP manuals are not primer textbooks; they are technical manuals.

Regarding FIR:

- you see that at least 3 algorithms are used for single-threaded code, and they have complex criteria based on tapsLen, vector length, data type, and Intel architecture (SSE2, SSE3, SSSE3, SSE4.1, AVX, etc.). For multithreaded code these criteria are extended with one more. These criteria are IPP internals and can change from release to release based on current performance data, so they are not something that should or can be documented. We state in the documentation that the dynamic libraries are threaded, and we provide a list of threaded functions. I guess it's evident that each threaded function has internal, parameter-based criteria for when to use single-threaded code and when to use multithreaded code. Threading always introduces some overhead, so it provides a visible benefit only above a certain amount of work; below that threshold you'd see a significant slowdown, which is not permissible for performance libraries. So every threaded IPP function has such an internal criterion, and it is different for each supported architecture. If you don't want to see any "unpredictable" algorithm switching, please use the single-threaded static libraries and external threading. We are currently considering fully removing the OMP code from IPP functions: threading at the primitive level is not as efficient as threading at the application level, and the DMIP sample proves this statement 200%.

Regards,
Igor