First of all, please excuse my English.
I am developing a graphics processing application, and one part of it compresses images with a JPEG codec.
I tried the UIC JPEG codec from the latest IPP 7.0.5 and got very good results on my mid-range desktop PC with an Intel i5-2300 CPU.
Testing JPEG compression performance on a 1920x1080 RGB24 source image gives the following results:
using 1 compression thread - approx. 55 fps
using 2 compression threads - approx. 95 fps
using 3 compression threads - approx. 95 fps
using 4 compression threads - approx. 91 fps
These results show very good compression performance, and they also show that compressing with more than 2 threads is useless (in all cases Task Manager showed the CPU cores loaded to 100%, so with 3 and 4 threads they do useless work :-)).
But the problem begins when I run the tests on the target server with dual Xeon E5620 CPUs.
The same test program, run with the same source image, gives the following results (when all threads run on one of the CPUs):
using 1 compression thread - approx. 37 fps
using 2 compression threads - approx. 46 fps
using 3 compression threads - approx. 50 fps
using 4 compression threads - approx. 51 fps
Using more than 4 threads gives steadily slower results, from 50 down to 35-40 fps...
Also, if even one thread runs on the other CPU, the results get worse (a slowdown of approx. 10 fps).
Turning off HyperThreading in the BIOS slightly improves the results, by approx. 10 fps, but they are still 2 times worse than on the consumer i5 CPU... :-(
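To put the scaling in numbers, here is a small sketch that divides each result by the single-thread result. The helper names speedups and efficiencies are made up for illustration and are not part of IPP or UIC.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of IPP or UIC): speedup of each
// measurement relative to the first (single-thread) measurement.
std::vector<double> speedups(const std::vector<double>& fps) {
    assert(!fps.empty() && fps[0] > 0.0);
    std::vector<double> out;
    out.reserve(fps.size());
    for (double f : fps) out.push_back(f / fps[0]);
    return out;
}

// Parallel efficiency: speedup divided by the thread count, assuming
// element i was measured with i + 1 threads.
std::vector<double> efficiencies(const std::vector<double>& fps) {
    std::vector<double> s = speedups(fps);
    for (std::size_t i = 0; i < s.size(); ++i)
        s[i] /= static_cast<double>(i + 1);
    return s;
}
```

With the i5 figures {55, 95, 95, 91} the speedup at 2 threads is about 1.73. With the Xeon figures {37, 46, 50, 51} it is only about 1.24 at 2 threads and about 1.38 at 4.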
So my question is: are these results expected and normal, or am I doing something wrong?
I expected that a 2.5 times more expensive Xeon CPU would show better results than a regular i5...
I can understand a performance slowdown when the worker threads run on different physical CPUs (memory access issues and so on), but when I run the test under the same conditions on only one of the Xeon CPUs (using SetProcessAffinityMask), why is it 2 times slower than the i5?
The E5620 has 12 MB of cache and the 1920x1080 source image is only 6 MB, so the whole image can simply fit in the cache...
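The arithmetic behind that estimate, as a minimal sketch. It assumes packed RGB24 rows with no padding and counts only the source frame, not the compressed output or any working buffers. The names imageBytes, kRgb24Frame and kL3Cache are made up for illustration.

```cpp
#include <cstddef>

// Size in bytes of a packed image with no row padding (an assumption;
// real buffers may align each row to a larger stride).
constexpr std::size_t imageBytes(std::size_t width, std::size_t height,
                                 std::size_t bytesPerPixel) {
    return width * height * bytesPerPixel;
}

// 1920 * 1080 * 3 = 6,220,800 bytes for one RGB24 frame.
constexpr std::size_t kRgb24Frame = imageBytes(1920, 1080, 3);

// 12 MB cache size as quoted above, taken here as 12 * 2^20 bytes.
constexpr std::size_t kL3Cache = 12u * 1024u * 1024u;

static_assert(kRgb24Frame < kL3Cache,
              "the source frame is smaller than the cache");
```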
Thank you in advance.
PS: I compiled the UIC JPEG codec with the latest Intel C++ Compiler XE for applications running on IA-32, Version 188.8.131.528 Build 20111011, with the /Ox /QaxSSE4.1 /QxSSE4.1 /Qparallel /Qopenmp switches.
My desktop PC runs Windows 7 Professional 32-bit.
My server runs Windows Server 2008 R1 32-bit.
I use the ippSetNumThreads() function to set the number of processing threads.
Thanks for your report. I have some questions about how threading is used in your application. For UIC JPEG encoding, the code is threaded at the sample code level. It looks to me as though you are also using another level of threading: the ippSetNumThreads() function, which sets the internal threading of the IPP-level functions. If you are using threading at the sample code level, I suggest disabling threading at the IPP function level.
It also looks as though you are threading your application by adding the compiler switch /Qparallel. Be careful that this does not create overthreading together with the threading at the sample code level.
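As a rough illustration of why stacked threading levels matter, here is a minimal sketch of the worst-case arithmetic. It assumes every outer thread calls code that spawns its own inner team; whether the UIC sample and IPP actually nest this way in your build is an assumption to verify. The helper names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: worst-case number of runnable threads when an outer
// level with `outer` threads calls code that itself uses `inner` threads.
constexpr unsigned worstCaseThreads(unsigned outer, unsigned inner) {
    return outer * inner;
}

// Ratio of runnable threads to logical cores; a value above 1.0 means
// the cores are oversubscribed and threads must take turns.
double oversubscription(unsigned outer, unsigned inner, unsigned logicalCores) {
    assert(logicalCores > 0);
    return static_cast<double>(worstCaseThreads(outer, inner)) / logicalCores;
}
```

For example, 4 sample-level threads that each call IPP functions set to 4 internal threads give up to 16 runnable threads. On the 8 logical cores of one E5620 with HyperThreading on, that is a factor of 2.0.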
How is the application linked with the IPP libraries, statically or dynamically? If it is linked statically, it needs to call the ippInit() function to select the correct dispatching code.
Thank you for the reply.
As far as I understand, multithreading of the UIC JPEG encoder at the sample code level is done using OpenMP. If I compile the codec without the /Qopenmp switch, performance hardly depends on requesting more than one thread with ippSetNumThreads().
As for the /Qparallel switch, I can't check how it affects performance because I don't have access to the server right now, but I will try to do it in the next few days.
I link the libraries statically, I know about the ippInit() function, and I call it.
I also tried dynamic linking, to rule out possible static linking issues, but got exactly the same results.
Can you tell me: is the computing performance of the Xeon E5620 CPU significantly higher than that of the Core i5-2300, so that my problem lies in wrong usage of multithreading or compiler switches, or is their performance REALLY as my tests show, so that any further investigation is a useless waste of time?
>>...using more that 4 thread shown continuous slowly down results from 50 to 35-40 fps...
In the 2nd case '...if all threads run on 1 of CPU...' there are more context switches on that CPU.
>>...if even one threads run on another CPU then results become more badly
>>(slowdown approx. 10 fps)...
The VTune application lets you monitor the threading performance of an application, and it lets you
see how many context switches happen between threads, etc.
>>...I use ippSetNumThreads() function to set number of processing threads...
How many? Also, since you compiled the project with OpenMP support it also creates some number
of threads, right?
I would try to get numbers for how many threads are actually created in both cases.
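A minimal sketch of one way to count them, assuming you can add a call at the start of each worker's code and have a C++11 standard library. ThreadCensus and countWorkers are names I made up, and plain std::thread stands in here for the OpenMP/IPP threads.

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Records the id of every thread that calls note(); size() then gives the
// number of distinct threads that ran the instrumented code.
class ThreadCensus {
public:
    void note() {
        std::lock_guard<std::mutex> lock(mutex_);
        ids_.insert(std::this_thread::get_id());
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ids_.size();
    }
private:
    mutable std::mutex mutex_;
    std::set<std::thread::id> ids_;
};

// Stand-in for the real workers: starts n threads that each report once.
// All threads are created before any is joined, so no id is reused.
std::size_t countWorkers(unsigned n) {
    ThreadCensus census;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&census] { census.note(); });
    for (std::thread& t : workers) t.join();
    return census.size();
}
```

Calling note() from inside the compression code in both the desktop and the server runs would show whether the two machines really execute it on the same number of threads.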