Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Static Lib FASTER than DLL Lib?

Intel_C_Intel
Employee
299 Views

I'm running tests on our application in three different configurations, and I'm actually seeing it run faster when I link with the static libraries. The application itself is multi-threaded (and the documentation recommends that I link with the DLL version in this case, but as I said, the application is working and runs faster with the static libs). Tests are run on a system with 8 CPUs.

Fastest: Config 1: Static libs - link with 'merged' and 'emerged' libraries in the lib dir, calling ippStaticInit at start. When this runs, I see one or two CPUs at 100% (as expected given the application is multi-threaded and one or two of the threads have most of the work to do) and an overall CPU load of ~35%.

Slower: Config 2: DLL libs - link with libs in stublib sub-dir, IPP dlls are in path, no call to ippStaticInit, environment variable OMP_NUM_THREADS is NOT set. When this runs, the IPP reports that 8 threads are running (ippGetThreads result), all the CPUs are at 100% and the overall CPU load is (of course) 100% - however, total processing time INCREASES for the same data (used with all configs).

Slower still: Config 3: same as Config 2, except that OMP_NUM_THREADS is set to 1. This time, the IPP again reports 8 threads, however all CPUs hover around 96% (producing a similar 'overall' usage percent).

My guess (unless I'm doing something wrong - and please let me know if so) why the DLL configs are slower than the static one is that the IPP functions I'm using have not been optimized for the multi-threaded support provided by the IPP DLLs, but the threads are created in any case by the DLL. It is their thrashing, looking for something to do, that increases the overall processing load, and increases my total processing time.

For Config 3, the DLLs seem to have still created 8 threads (since that was reported by ippGetThreads and I could still see all 8 CPUs working hard) but only 1 of them was allowed to do work (as determined by setting OMP_NUM_THREADS to 1).

Can anyone confirm this theory? Please let me know what you think, I'd certainly like to see a performance boost from using the DLL versions.

3 Replies
Vladimir_Dudnik
Employee

Hello, could you please specify which IPP version you are using, and which processors and operating system you are running on?

It is also important to specify which IPP functions you use. It would be nice if you could provide a simple test case.

Regards,
Vladimir

astrasel
Beginner

The app is running on Windows XP professional, version 2002, service pack 2.

This is output from the IPP Version Structure:
Version Numbers: 5.2.108.412
Target CPU: v8
Name: ippsv8-5.2.dll
Version: 5.2
Build date: Apr 4 2007

We use primarily functions from the signal processing library. I can't easily produce a sample application to post here.

Before I try that, I'd like to get an answer to the basic question of whether the theory is plausible: could the extra overhead of the 'thread pool', combined with using functions that have not been optimized for threading and/or relatively small vectors for processing, actually make the static library faster than the DLL version?

Vladimir_Dudnik
Employee

Here are my thoughts on that:
Threading in IPP is implemented through the OpenMP API. That API allocates threads starting from the first logical processor, up to the thread limit you specify or the number of available processors (you can limit the number of threads spawned by IPP with an ippSetNumThreads call). If you also want to parallelize your application on top of IPP using the system threading API (and you have enough processors, like 8 in your case), you may want to limit the number of IPP threads (for example, 4 for IPP and the remaining 4 for your application) and launch your own threads, explicitly binding them to the highest logical processors with an appropriate affinity mask.
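As a rough illustration of the affinity-mask idea above, here is a minimal sketch in plain C. The helper names high_cpu_mask and app_thread_mask are hypothetical (they are not part of IPP or any system API); they only compute the bit masks, assuming bit i of the mask stands for logical processor i. On Windows, such a mask could then be cast to DWORD_PTR and passed to SetThreadAffinityMask for each application thread, after limiting IPP with ippSetNumThreads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an IPP or OS call): returns a mask covering the
 * highest `app_threads` logical processors out of `total_cpus`.
 * Assumption: bit i of the mask corresponds to logical processor i. */
uint64_t high_cpu_mask(unsigned total_cpus, unsigned app_threads)
{
    assert(total_cpus >= 1 && total_cpus <= 64);
    assert(app_threads <= total_cpus);
    if (app_threads == 0) {
        return 0;
    }
    /* Mask of all processors present in the system. */
    uint64_t all = (total_cpus == 64) ? UINT64_MAX
                                      : ((UINT64_C(1) << total_cpus) - 1);
    /* Processors left to the IPP/OpenMP threads (the lowest ones). */
    unsigned low = total_cpus - app_threads;   /* at most 63 here */
    uint64_t low_mask = (UINT64_C(1) << low) - 1;
    return all & ~low_mask;
}

/* Hypothetical helper: single-processor mask for the k-th application
 * thread (k counts from 0), placed in the block of highest processors. */
uint64_t app_thread_mask(unsigned total_cpus, unsigned app_threads, unsigned k)
{
    assert(total_cpus >= 1 && total_cpus <= 64);
    assert(app_threads <= total_cpus);
    assert(k < app_threads);
    return UINT64_C(1) << (total_cpus - app_threads + k);
}
```

With 8 processors split 4 and 4, high_cpu_mask(8, 4) gives 0xF0, so the application threads stay on processors 4 to 7 while the IPP threads can use processors 0 to 3.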

If you also parallelize your application with the OpenMP API, there is a chance that your threads will compete with the IPP threads for the same logical processors, which will result in a slowdown.

When you try to launch more threads (including those spawned internally by IPP) than the number of available logical processors, you will also get some slowdown.
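The oversubscription condition can be checked with simple arithmetic. The sketch below is a simplified model, not an IPP feature: excess_threads is a hypothetical helper, and it assumes all counted threads are busy at the same time. The figure of 2 busy application threads is only illustrative, taken from the "one or two CPUs at 100%" observation in the original question.

```c
#include <assert.h>

/* Hypothetical helper: number of busy threads beyond the available logical
 * processors, or 0 if everything fits.
 * Assumption: every counted thread is runnable at the same time. */
unsigned excess_threads(unsigned app_threads, unsigned ipp_threads,
                        unsigned logical_cpus)
{
    assert(logical_cpus > 0);
    unsigned total = app_threads + ipp_threads;
    return (total > logical_cpus) ? (total - logical_cpus) : 0;
}
```

Under this model, 2 busy application threads plus 8 IPP threads on 8 processors gives excess_threads(2, 8, 8) == 2, while a 4 and 4 split gives 0.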

Vladimir
