Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Limited multi-threaded performance gain

Patrick_R_1
Beginner

CPU: Intel (R) Xeon(R) X5660 2.8 GHz (Westmere 2 Hex-core with Hyper-threading)

OS: Linux CentOS 5

IPP: Version 6.0

I have a small multi-threaded C++ application that uses IPP in single-threaded mode, i.e. ippSetNumThreads(1). The application has N workers, each of which executes the same series of instructions, including some IPP calls (p8_ownsMul_32fc, p8_ownsAdd_32f_I, p8_ippsTone_Direct_32fc). The N workers share a thread pool of configurable size to complete the work in parallel.

I have timed how long the N workers take to complete the work with various thread pool sizes. Interestingly, the best performance (shortest completion time) was with a pool of 6 threads, which gave a speedup of about 5x. Adding more threads beyond that made things worse: with 12 threads the speedup was only 2.45x. A CPU profiler shows that the additional time is spent in the IPP calls.

I have run this application without IPP and seen performance scale almost linearly as I increase the number of threads. Am I configuring IPP properly for the hardware I'm using? Are there any limitations in IPP (such as internal mutexes or locks) that would prevent performance from increasing with more than 6 application threads each making the same IPP calls?

3 Replies
Chao_Y_Intel
Moderator

Hello,

How is Intel IPP linked into the application? If you link with the non-threaded version of Intel IPP (rather than calling ippSetNumThreads(), which may still rely on the OpenMP libraries), does the problem still happen?

Thanks,
Chao

Patrick_R_1
Beginner

I am linking with the non-threaded static merged libraries.

Sergey_K_Intel
Employee

Hi Patrick,

The IPP functions don't contain any multi-threading synchronization objects, and this is especially true of the single-threaded version of the IPP library.

The only constraint I can see in your case is the amount of data each worker thread works with. The X5660 CPU has 12 MB of cache, which is shared between its hardware cores.

If the total amount of working data (source/destination/temporary arrays, local data, etc.) across the worker threads fits within this limit, performance should scale well. If it does not, the application will spend more and more time waiting for data to arrive from memory.
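To make that check concrete, here is a rough sketch of the arithmetic, not a measurement of your application. The `WorkerBuffers` struct, its field names, and the sample counts in the note below are hypothetical; the only figures taken from this thread are the 12 MB cache size and the element sizes implied by the 32fc (two 32-bit floats, 8 bytes) and 32f (4 bytes) suffixes.

```cpp
#include <cassert>
#include <cstddef>

// Shared cache size quoted above for the X5660 (assumed figure; adjust for your machine).
const std::size_t kSharedCacheBytes = 12u * 1024u * 1024u;

// Hypothetical description of the data one worker touches per pass.
// The field names are illustrative, not taken from the original application.
struct WorkerBuffers {
    std::size_t complexSamples; // 32fc elements, 8 bytes each
    std::size_t realSamples;    // 32f elements, 4 bytes each
};

// Bytes one worker touches per pass.
inline std::size_t workingSetBytes(const WorkerBuffers& w) {
    return w.complexSamples * 8 + w.realSamples * 4;
}

// True if `threads` workers, each with the same buffers, fit in the shared cache together.
inline bool fitsInSharedCache(const WorkerBuffers& w, std::size_t threads,
                              std::size_t cacheBytes = kSharedCacheBytes) {
    assert(threads > 0);
    return workingSetBytes(w) * threads <= cacheBytes;
}
```

For example, with a made-up worker that touches 131072 complex and 131072 real samples (1.5 MB in total), six such workers need 9 MB and fit, while twelve need 18 MB and do not. Whether your real buffers behave this way depends on their actual sizes.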

Another point of concern could be dynamic memory operations in the threads. The standard allocators (including the IPP ippMalloc functions) are serialized, i.e. every malloc/free call is another point of inter-thread synchronization, which in some cases makes threads wait for each other.
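One common way to avoid this is to allocate each worker's scratch buffers once, before the timed section, and reuse them for every item. The following is a minimal sketch of that pattern assuming a C++11 compiler; `Scratch`, `processItem`, and `runWorkers` are hypothetical names, and `processItem` is only a stand-in for the real IPP calls.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-worker scratch space, allocated once up front.
struct Scratch {
    std::vector<float> tmp;
    explicit Scratch(std::size_t n) : tmp(n) {}
};

// Stand-in for the real per-item work. It only reads and writes the
// preallocated buffer, so it never calls malloc/free.
inline float processItem(Scratch& s, float seed) {
    for (std::size_t i = 0; i < s.tmp.size(); ++i)
        s.tmp[i] = seed + static_cast<float>(i);
    float sum = 0.0f;
    for (float v : s.tmp) sum += v;
    return sum;
}

// Runs `threads` workers; each processes `iterations` items using its own Scratch.
// Returns the result of the last item for each worker.
inline std::vector<float> runWorkers(std::size_t threads, std::size_t iterations,
                                     std::size_t n) {
    assert(threads > 0);
    // All allocation happens here, on one thread, before any worker starts.
    std::vector<Scratch> scratch;
    scratch.reserve(threads);
    for (std::size_t t = 0; t < threads; ++t) scratch.emplace_back(n);

    std::vector<float> results(threads, 0.0f);
    std::vector<std::thread> pool;
    pool.reserve(threads);
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&scratch, &results, t, iterations] {
            float last = 0.0f;
            for (std::size_t k = 0; k < iterations; ++k)
                last = processItem(scratch[t], static_cast<float>(k));
            results[t] = last; // one write per worker, after the loop
        });
    }
    for (std::thread& th : pool) th.join();
    return results;
}
```

Each worker only ever touches its own `Scratch`, so the hot loop needs no locking and no allocator calls. If your profiler shows malloc or free inside the worker loop, moving those allocations out in this way is worth trying before looking elsewhere.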

Regards,
Sergey
