Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Dual core speed increasing

michael_kube
Beginner
461 Views
Hello,

We are using many of the IPP image functions (ippiAdd etc.) for our calculations. The library is loaded dynamically.

We could not measure any difference between a single-core Pentium system and a dual-core system. Even when I restrict the number of threads with ippSetNumThreads( 1 ), the calculation time is the same as with ippSetNumThreads( 2 ).

Best regards, Michael
0 Kudos
5 Replies
Vladimir_Dudnik
Employee
461 Views

Hello Michael,

First of all, not all IPP functions use threading internally. Second, whether internal threading is actually used depends on many factors, such as data size, target processor, and OS. So most probably you are just looking at functions that do not use threading.

Regards,
Vladimir

0 Kudos
Zhe_F_
Beginner
461 Views

I ran into a similar issue.

The text file ThreadedFunctionsList.txt that comes with 5.3.2 lists the following functions:

==========================================

ippiAdd_8u/16s_C1RSfs/C3RSfs/C4RSfs/AC4RSfs
ippiAdd_8u/16s_C1IRSfs/C3IRSfs/C4IRSfs/AC4IRSfs
ippiAdd_32f_C1R/C3R/C4R/AC4R
ippiAdd_32f_C1IR/C3IR/C4IR/AC4IR

==========================================

which confirms that ippiAdd... are threaded functions.

I also did a test using ippiAddWeighted_8u32f_C1IR that is also a threaded function according to the ThreadedFunctionsList.txt.

However, I got almost identical performance results with 4 threads and with 1 thread, using either the dynamic or the static library.

The function was fed a 640 by 480 image and run 10000 times to get the timing results.

Here is the code:
======================================================
int nt = -1;
int iw = 640;
int ih = 480;
int n = 10000;                      // number of timed calls per run

// Image32f / Image8u: image wrapper classes, not shown in this post
Image32f img0;
img0.InitAlloc( iw, ih );
img0.Fill( 1.0 );
Ipp32f alpha = 0.01f;
Image8u imgSrc( iw, ih, 1 );
imgSrc.Fill( 1 );

// Run 1: request 4 threads
ippGetNumThreads( &nt );
IppStatus s = ippSetNumThreads( 4 );
Ipp64u start = ippGetCpuClocks();
clock_t cstart, cend;
cstart = clock();
for( int i = 0; i < n; i++ )
{
    s = ippiAddWeighted_8u32f_C1IR( imgSrc.GetDataPtr(), imgSrc.GetStride(),
                                    img0.GetDataPtr(), img0.GetStride(),
                                    imgSrc.GetSize(), alpha );
}
cend = clock();
Ipp64u end = ippGetCpuClocks();
cout << "number of threads = 4: " << end - start << ";" << cend - cstart << endl;

// Run 2: request 1 thread
ippGetNumThreads( &nt );
s = ippSetNumThreads( 1 );
start = ippGetCpuClocks();
cstart = clock();
for( int i = 0; i < n; i++ )
{
    s = ippiAddWeighted_8u32f_C1IR( imgSrc.GetDataPtr(), imgSrc.GetStride(),
                                    img0.GetDataPtr(), img0.GetStride(),
                                    imgSrc.GetSize(), alpha );
}
cend = clock();
end = ippGetCpuClocks();
cout << "number of threads = 1: " << end - start << ";" << cend - cstart << endl;
return 0;
========================================================
Did I miss something simple?
Mike
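A comparison like the one above can be sensitive to run order (the 4-thread loop runs first, so it may absorb any one-time start-up cost) and to what the timer measures (on some platforms clock() reports CPU time rather than elapsed time). The following is a minimal sketch of a wall-clock timing helper, not part of IPP: the name best_batch_ms and its parameters are hypothetical. It warms up first, then reports the fastest of several timed batches.

```cpp
#include <chrono>

// Hypothetical helper (not an IPP function): calls `work` `warmup` times
// untimed, then times `repeats` batches of `iters` calls each and returns
// the fastest batch in milliseconds of wall-clock time.
// Returns -1.0 if `repeats` is zero or negative.
template <typename F>
double best_batch_ms(F&& work, int warmup, int repeats, int iters)
{
    // Untimed warm-up so one-time costs do not land in the first measurement.
    for (int i = 0; i < warmup; ++i) work();

    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) work();
        const auto t1 = std::chrono::steady_clock::now();

        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best) best = ms;   // keep the fastest batch
    }
    return best;
}
```

In the benchmark above, `work` would be a lambda wrapping the ippiAddWeighted_8u32f_C1IR call, and the helper would be called once after ippSetNumThreads( 1 ) and once after ippSetNumThreads( 4 ).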
0 Kudos
Zhe_F_
Beginner
461 Views
I forgot to mention that my machine has a single quad-core CPU running Windows XP.
0 Kudos
nizanh
Beginner
461 Views
You shouldn't expect to see much (or any) performance gain from running ippiAdd on more than one core. Add is a very fast operation, especially when done with SSE instructions. The CPU spends on average less than one cycle per pixel, which means the operation is limited by data access, not by the CPU.
If all the relevant data fits in the L2 cache, you may see some improvement.
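To make the data-access argument concrete for the 640 by 480 ippiAddWeighted_8u32f_C1IR test earlier in the thread, here is a rough sketch of the bytes involved per call. The struct and function names are hypothetical, row padding from the stride is ignored, and the cache size is a parameter because the thread does not state it for any of the machines.

```cpp
#include <cstddef>

// Hypothetical bookkeeping for one in-place 8u -> 32f accumulate call.
struct Traffic {
    std::size_t working_set_bytes;  // distinct bytes touched per call
    std::size_t bytes_moved;        // bytes read plus bytes written per call
};

// The source (1 byte per pixel) is read once; the 32f accumulator
// (4 bytes per pixel) is both read and written. Stride padding is ignored.
inline Traffic add_weighted_8u32f_traffic(std::size_t width, std::size_t height)
{
    const std::size_t pixels = width * height;
    const std::size_t src = pixels * 1;  // Ipp8u
    const std::size_t dst = pixels * 4;  // Ipp32f
    return Traffic{ src + dst, src + 2 * dst };
}

// True if the whole working set would fit in a cache of `cache_bytes`.
// The cache size is an assumption supplied by the caller.
inline bool fits_in_cache(const Traffic& t, std::size_t cache_bytes)
{
    return t.working_set_bytes <= cache_bytes;
}
```

For 640 by 480 this gives a working set of 1,536,000 bytes and 2,764,800 bytes moved per call, so whether the data stays cache-resident depends on the L2 size of the particular CPU.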
0 Kudos
peter03436
Beginner
461 Views
If you're interested in seeing the threading work, try a more computationally intensive routine, as has been suggested. Normalized cross-correlation should work for this purpose. I have dual quad-core CPUs running Linux; here is what you can expect for various numbers of threads when correlating a 640x480 image with a 38x41 template:

NCC

Threads     1     2     3     4     5     6     7     8
Time (ms)   15.3  8.9   6.2   5.1   4.6   4.2   3.6   3.3
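The NCC timings can be turned into speedup (time on 1 thread divided by time on n threads) and parallel efficiency (speedup divided by n). The sketch below uses hypothetical names and assumes the input vector is ordered by thread count starting at 1.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result record for one thread count.
struct Scaling {
    int    threads;
    double speedup;     // t(1) / t(n)
    double efficiency;  // speedup / n
};

// time_ms[i] is assumed to be the time measured with i + 1 threads.
inline std::vector<Scaling> scaling_from_times(const std::vector<double>& time_ms)
{
    std::vector<Scaling> out;
    if (time_ms.empty()) return out;

    const double t1 = time_ms[0];  // single-thread baseline
    for (std::size_t i = 0; i < time_ms.size(); ++i) {
        const int n = static_cast<int>(i) + 1;
        const double s = t1 / time_ms[i];
        out.push_back({ n, s, s / n });
    }
    return out;
}
```

Fed with the times above (15.3 down to 3.3 ms), this gives a speedup of about 4.6 at 8 threads, which is an efficiency of roughly 58 percent.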


0 Kudos