So now I've called ippSetNumThreads(1) to disable OpenMP and created two threads of my own: thread 1 processes the top half of a 1280x960 image and thread 2 the lower half. I do this simply by giving the second thread an offset into the image; each thread processes 960/2 = 480 lines.
This also does not work, and I can't imagine why not. The total execution time on this machine for a series of arithmetic functions is about 16 msec per loop, whether I use a single thread to process the full image or two threads to process each half.
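The split described above can be sketched as follows (a minimal stand-in: std::thread and a plain per-pixel subtraction take the place of the worker threads and the IPP call; subRows and subHalves are illustrative names, not IPP APIs):

```cpp
#include <cstddef>
#include <thread>

// Stand-in for one IPP arithmetic call: dst = a - b over `rows` rows,
// starting at row `rowOffset` of a `width`-wide float image.
inline void subRows(const float* a, const float* b, float* dst,
                    int width, int rowOffset, int rows)
{
    const std::size_t begin = static_cast<std::size_t>(rowOffset) * width;
    const std::size_t end   = begin + static_cast<std::size_t>(rows) * width;
    for (std::size_t i = begin; i < end; ++i)
        dst[i] = a[i] - b[i];
}

// Top half on one thread, lower half on another, exactly as described:
// the second thread just gets a row offset into the same image.
inline void subHalves(const float* a, const float* b, float* dst,
                      int width, int height)
{
    const int half = height / 2;
    std::thread top(subRows, a, b, dst, width, 0, half);
    std::thread bottom(subRows, a, b, dst, width, half, height - half);
    top.join();
    bottom.join();
}
```

Each thread touches a disjoint half of the buffers, so no synchronization beyond the final joins is needed.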
Can someone suggest what might be going on here?
Paul Gauthier
Other than the image-processing work, is your application running other threads, such as a C# GUI or a control task? If so, those tasks might be sharing a core with IPP's image-processing threads, blocking each other (it all depends on your application architecture and flow).
In that case you would only "feel" the speedup on a quad-core machine or above, where you can run IPP on cores separate from the ones your other tasks use.
So now I've called ippSetNumThreads(1) to disable OpenMP and created two threads of my own that process either the top half of a 1280x960 image (thread 1) or the lower half (thread 2). I do this by simply giving the second thread an offset into the image and processing 960/2 or 480 lines.
...
[SergeyK] Did you try to call the SetThreadAffinityMask Win32 API function for both threads? If Thread1 works on CPU1 and Thread2 works on CPU2, there should be a performance improvement.
This also does not work and I can't imagine why not. The total execution time on this machine for a series of arithmetic functions is about 16 msec per loop whether I use a single thread to process the full image or two threads to process each half of the image.
Can someone suggest what might be going on here?
A simple test case would help to identify the problem.
Best regards,
Sergey
a. Image-supplier thread: are the images supplied by a camera, and is the camera using a callback that might consume CPU time? (Remember, it's the same application, so it's hard to detect in Task Manager.)
b. Not all IPPI functions are multithreaded; I would recheck the documentation.
Good luck
Sergey, I've not called SetThreadAffinityMask, but Task Manager tells me both cores are fully engaged while my loop is running, so I didn't think it necessary. I'll try it anyway.
It's acting as if the IPPI functions have an Enter/LeaveCriticalSection in them, preventing simultaneous execution.
Could there be a problem with the two threads operating on the same memory image (one top-half, the other bottom-half)?
Paul G.
Sergey, I've not called SetThreadAffinityMask, but Task Manager tells me both cores are fully engaged while my loop is running, so I didn't think it necessary. I'll try it anyway.
[SergeyK] Don't forget to call 'Sleep(0)' right after the call to 'SetThreadAffinityMask(...)',
because the CPU needs some time.
It's acting as if the IPPI functions have an Enter/LeaveCriticalSection in them, preventing simultaneous execution.
[SergeyK] It would be nice to hear some technical details from IPP's software developers.
Could there be a problem with the two threads operating on the same memory image (one top-half, the other bottom-half)?
[SergeyK] I don't think so. I used the same technique to do linear algebra processing for a
matrix on two CPUs.
Paul G.
Best regards,
Sergey
1) IPP functions don't use Enter/LeaveCriticalSection.
2) Could you provide a list of the functions you use? Not all IPP functions have internal threading.
Regards,
Igor
Here's the info about the IPPI lib I'm using:
CPU : v8
Name : ippiv8-7.0.dll
Version : 7.0 build 205.85
Build date: Nov 26 2011
I'm using the following IPP calls:
ippiSub_32f_C1R
ippiSqr_32f_C1R
ippiAdd_32f_C1IR
ippiSqrt_32f_C1R
I've included my little test program below. When I set the number of cores to use to 2 (on my 2-core system) I get the following output:
Ipp init: ippStsNoErr: No error, it's OK
Number of cores = 2
Testing with 2 processors
2000 Iterations in 30.086 seconds (15.043 msec per iteration)
When I set the number of cores to use to 1, I get the following output:
Ipp init: ippStsNoErr: No error, it's OK
Number of cores = 2
Testing with 1 processors
2000 Iterations in 30.251 seconds (15.1255 msec per iteration)
As you can see, in both cases the average time per loop is about the same.
__________________________________
Here's the entire test program. It just allocates some 1280x960 images and then calculates:
Result = sqrt( sqr(A - B) + sqr(C - D) )
__________________________________________________________
int _tmain(int argc, _TCHAR* argv[])
{
    int Width = 1280, Height = 960;
    IppiSize iSize; iSize.width = Width; iSize.height = Height;
    float* pIm[9];
    for (int i = 0 ; i < 9 ; i++)
    {
        pIm[i] = new float[Width * Height];
        ippiSet_32f_C1R((float)100, pIm[i], Width*sizeof(float), iSize);
    }
    libInfo();
    ippInit();
    IppStatus sts;
    sts = ippInitCpu(ippCpuC2D);
    cout << "Ipp init: " << ippGetStatusString( sts ) << endl;
    cout << "Number of cores = " << ippGetNumCoresOnDie() << endl;
    int NumProcessors = 1;
    ippSetNumThreads(NumProcessors);
    ippGetNumThreads(&NumProcessors);
    cout << "Testing with " << NumProcessors << " processors" << endl;
    clock_t tStart = clock();
    int NumRepeats = 2000;
    for (int i = 0 ; i < NumRepeats ; i++)
    {
        // Ref1 = A - B
        ippiSub_32f_C1R(pIm[0], Width*sizeof(float), pIm[1], Width*sizeof(float), pIm[2], Width*sizeof(float), iSize);
        // Ref2 = C - D
        ippiSub_32f_C1R(pIm[3], Width*sizeof(float), pIm[4], Width*sizeof(float), pIm[5], Width*sizeof(float), iSize);
        // Temp3 = Sqr(Ref1), Temp4 = Sqr(Ref2)
        ippiSqr_32f_C1R(pIm[2], Width*sizeof(float), pIm[6], Width*sizeof(float), iSize);
        ippiSqr_32f_C1R(pIm[5], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);
        // Result = Sqrt(Temp3 + Temp4)
        ippiAdd_32f_C1IR(pIm[6], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);
        ippiSqrt_32f_C1R(pIm[7], Width*sizeof(float), pIm[8], Width*sizeof(float), iSize);
    }
    clock_t tEnd = clock();
    double tSec = double(tEnd-tStart)/CLOCKS_PER_SEC;
    double tMsec = 1000.0*tSec / NumRepeats;
    cout << NumRepeats << " Iterations in " << tSec << " seconds (" << tMsec << " msec per iteration)" << endl;
    getchar();
    for (int i = 0 ; i < 9 ; i++)
        delete[] pIm[i];
    return 0;
}
Paul G.
I'll test your example and come back soon.
Attached is the list of threaded functions. All the functions you use are threaded, so I'll try to find the problem.
(Sorry, I don't know how to attach a txt file, so here are only a few lines:)
...............
ippiAddC_16s_C1IRSfs
ippiAddC_16s_C1RSfs
ippiAddC_16s_C3IRSfs
ippiAddC_16s_C3RSfs
ippiAddC_32f_AC4IR
ippiAddC_32f_C1R
ippiAddC_32f_C3IR
ippiAddC_32f_C3R
ippiAddC_8u_C1IRSfs
ippiAddC_8u_C1RSfs
ippiAddC_8u_C3IRSfs
ippiAddC_8u_C3RSfs
ippiAddProduct_16u32f_C1IMR
ippiAddProduct_16u32f_C1IR
ippiAddProduct_32f_C1IMR
ippiAddProduct_32f_C1IR
ippiAddProduct_8s32f_C1IMR
ippiAddProduct_8s32f_C1IR
ippiAddProduct_8u32f_C1IMR
ippiAddProduct_8u32f_C1IR
ippiAddSquare_16u32f_C1IMR
ippiAddSquare_16u32f_C1IR
ippiAddSquare_32f_C1IMR
ippiAddSquare_32f_C1IR
ippiAddSquare_8s32f_C1IMR
ippiAddSquare_8s32f_C1IR
ippiAddSquare_8u32f_C1IMR
ippiAddSquare_8u32f_C1IR
ippiAddWeighted_16u32f_C1IMR
ippiAddWeighted_16u32f_C1IR
ippiAddWeighted_32f_C1IMR
ippiAddWeighted_32f_C1IR
ippiAddWeighted_8s32f_C1IMR
ippiAddWeighted_8s32f_C1IR
ippiAddWeighted_8u32f_C1IMR
ippiAddWeighted_8u32f_C1IR
ippiAdd_16s_C1IRSfs
ippiAdd_16s_C1RSfs
ippiAdd_16s_C3IRSfs
ippiAdd_16s_C3RSfs
ippiAdd_16s_C4IRSfs
ippiAdd_16s_C4RSfs
ippiAdd_16u32f_C1IMR
ippiAdd_16u32f_C1IR
ippiAdd_32f_C1IMR
ippiAdd_32f_C1IR
ippiAdd_32f_C1R
ippiAdd_32f_C3IR
ippiAdd_32f_C3R
ippiAdd_32f_C4IR
ippiAdd_32f_C4R
ippiAdd_8s32f_C1IMR
ippiAdd_8s32f_C1IR
ippiAdd_8u32f_C1IMR
ippiAdd_8u32f_C1IR
ippiAdd_8u_C1IRSfs
ippiAdd_8u_C1RSfs
ippiAdd_8u_C3IRSfs
ippiAdd_8u_C3RSfs
ippiAdd_8u_C4IRSfs
ippiAdd_8u_C4RSfs
.............
ippiSqrt_16s_AC4IRSfs
ippiSqrt_16s_AC4RSfs
ippiSqrt_16s_C1IRSfs
ippiSqrt_16s_C1RSfs
ippiSqrt_16s_C3IRSfs
ippiSqrt_16s_C3RSfs
ippiSqrt_16u_AC4IRSfs
ippiSqrt_16u_AC4RSfs
ippiSqrt_16u_C1IRSfs
ippiSqrt_16u_C1RSfs
ippiSqrt_16u_C3IRSfs
ippiSqrt_16u_C3RSfs
ippiSqrt_32f_AC4IR
ippiSqrt_32f_AC4R
ippiSqrt_32f_C1IR
ippiSqrt_32f_C1R
ippiSqrt_32f_C3IR
ippiSqrt_32f_C3R
ippiSqrt_32f_C4IR
ippiSqrt_8u_AC4IRSfs
ippiSqrt_8u_AC4RSfs
ippiSqrt_8u_C1IRSfs
ippiSqrt_8u_C1RSfs
ippiSqrt_8u_C3IRSfs
ippiSqrt_8u_C3RSfs
ippiSub128_JPEG_8u16s_C1R
ippiSubC_16s_C1IRSfs
ippiSubC_16s_C1RSfs
ippiSubC_16s_C3IRSfs
ippiSubC_16s_C3RSfs
ippiSubC_32f_AC4IR
ippiSubC_32f_C1R
ippiSubC_32f_C3IR
ippiSubC_32f_C3R
ippiSubC_8u_C1IRSfs
ippiSubC_8u_C1RSfs
ippiSubC_8u_C3IRSfs
ippiSubC_8u_C3RSfs
ippiSub_16s_C1IRSfs
ippiSub_16s_C1RSfs
ippiSub_16s_C3IRSfs
ippiSub_16s_C3RSfs
ippiSub_16s_C4IRSfs
ippiSub_16s_C4RSfs
ippiSub_32f_C1IR
ippiSub_32f_C1R
ippiSub_32f_C3IR
ippiSub_32f_C3R
ippiSub_32f_C4IR
ippiSub_32f_C4R
ippiSub_8u_C1IRSfs
ippiSub_8u_C1RSfs
ippiSub_8u_C3IRSfs
ippiSub_8u_C3RSfs
ippiSub_8u_C4IRSfs
ippiSub_8u_C4RSfs
..............................
Regards,
Igor
In this example, nine float images of size 1280 x 960 are processed. That is far too much for all of these images to fit in the cache at the same time. On the other hand, the arithmetic functions don't require great computing resources, so the bottleneck for this example is memory access, and parallelization doesn't lead to a performance improvement.
However, if we process the images sequentially in small pieces, block by block, we can succeed.
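Block-by-block processing of the same pipeline can be sketched like this (a plain loop stands in for the four IPP calls; pipelineSliced and rowsPerSlice are illustrative names, and the slice height would be tuned so one slice of all the buffers stays cache-resident):

```cpp
#include <cmath>
#include <cstddef>

// Compute Result = sqrt((A-B)^2 + (C-D)^2) a few rows at a time, so the
// working set of one slice stays in cache across all the arithmetic steps.
// A plain per-pixel body stands in for the IPP calls on each slice.
inline void pipelineSliced(const float* A, const float* B,
                           const float* C, const float* D,
                           float* result, int width, int height,
                           int rowsPerSlice)
{
    for (int row = 0; row < height; row += rowsPerSlice) {
        const int rows = (row + rowsPerSlice <= height) ? rowsPerSlice
                                                        : height - row;
        const std::size_t begin = static_cast<std::size_t>(row) * width;
        const std::size_t end   = begin + static_cast<std::size_t>(rows) * width;
        for (std::size_t i = begin; i < end; ++i) {
            const float r1 = A[i] - B[i];              // Sub
            const float r2 = C[i] - D[i];              // Sub
            result[i] = std::sqrt(r1 * r1 + r2 * r2);  // Sqr, Add, Sqrt
        }
    }
}
```

With real IPP calls, each slice would be passed as a sub-image (pointer offset plus the same step), and the four calls would run on one slice before moving to the next.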
I decreased the image sizes and the number of images, and also changed the code slightly.
After that, we can see a performance improvement as the number of threads increases.
Ipp init: ippStsNoErr: No error, it's OK
Number of cores = 2
Testing with 1 processors
2000 Iterations in 0.38 seconds (0.19 msec per iteration)
Ipp init: ippStsNoErr: No error, it's OK
Number of cores = 2
Testing with 2 processors
2000 Iterations in 0.22 seconds (0.11 msec per iteration)
The code with changes:
int _tmain(int argc, _TCHAR* argv[])
{
    // int Width = 1280, Height = 960;
    int Width = 4096, Height = 16;
    IppiSize iSize; iSize.width = Width; iSize.height = Height;
    float* pIm[9];
    for (int i = 0 ; i < 9 ; i++)
    {
        pIm[i] = new float[Width * Height];
        ippiSet_32f_C1R((float)(100-i), pIm[i], Width*sizeof(float), iSize);
    }
    // libInfo();
    ippInit();
    IppStatus sts;
    sts = ippInitCpu(ippCpuC2D);
    cout << "Ipp init: " << ippGetStatusString( sts ) << endl;
    cout << "Number of cores = " << ippGetNumCoresOnDie() << endl;
    int NumProcessors = 1; /*=2;*/
    ippSetNumThreads(NumProcessors);
    ippGetNumThreads(&NumProcessors);
    cout << "Testing with " << NumProcessors << " processors" << endl;
    clock_t tStart = clock();
    int NumRepeats = 2000;
    /*
    for (int i = 0 ; i < NumRepeats ; i++)
    {
        // Ref1 = A - B
        ippiSub_32f_C1R(pIm[0], Width*sizeof(float), pIm[1], Width*sizeof(float), pIm[2], Width*sizeof(float), iSize);
        // Ref2 = C - D
        ippiSub_32f_C1R(pIm[3], Width*sizeof(float), pIm[4], Width*sizeof(float), pIm[5], Width*sizeof(float), iSize);
        // Temp3 = Sqr(Ref1), Temp4 = Sqr(Ref2)
        ippiSqr_32f_C1R(pIm[2], Width*sizeof(float), pIm[6], Width*sizeof(float), iSize);
        ippiSqr_32f_C1R(pIm[5], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);
        // Result = Sqrt(Temp3 + Temp4)
        ippiAdd_32f_C1IR(pIm[6], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);
        ippiSqrt_32f_C1R(pIm[7], Width*sizeof(float), pIm[8], Width*sizeof(float), iSize);
    }
    */
    for (int i = 0 ; i < NumRepeats ; i++)
    {
        // Ref1 = A - B
        ippiSub_32f_C1R(pIm[0], Width*sizeof(float), pIm[1], Width*sizeof(float), pIm[2], Width*sizeof(float), iSize);
        // Ref2 = C - D
        ippiSub_32f_C1R(pIm[3], Width*sizeof(float), pIm[4], Width*sizeof(float), pIm[5], Width*sizeof(float), iSize);
        // Ref1 = Sqr(Ref1), Ref2 = Sqr(Ref2), via the threaded Mul
        /*
        ippiSqr_32f_C1IR(pIm[2], Width*sizeof(float), iSize);
        ippiSqr_32f_C1IR(pIm[5], Width*sizeof(float), iSize);
        */
        ippiMul_32f_C1IR(pIm[2], Width*sizeof(float), pIm[2], Width*sizeof(float), iSize);
        ippiMul_32f_C1IR(pIm[5], Width*sizeof(float), pIm[5], Width*sizeof(float), iSize);
        // Result = Sqrt(Ref1 + Ref2), via the threaded 1D Sqrt
        ippiAdd_32f_C1IR(pIm[2], Width*sizeof(float), pIm[5], Width*sizeof(float), iSize);
        ippsSqrt_32f_I(pIm[5], Width*Height);
    }
    clock_t tEnd = clock();
    double tSec = double(tEnd-tStart)/CLOCKS_PER_SEC;
    double tMsec = 1000.0*tSec / NumRepeats;
    cout << NumRepeats << " Iterations in " << tSec << " seconds (" << tMsec << " msec per iteration)" << endl;
    getchar();
    for (int i = 0 ; i < 9 ; i++)
        delete[] pIm[i];
    return 0;
}
Thanks,
Ivan
It also showed me that different IPP functions that do basically the same thing can differ significantly in execution time (ippsSqrt_32f_I() compared to ippiSqrt_32f_C1R(), for example).
I'll experiment with processing our images in small blocks that can fit in the cache.
Paul G.
There are several problems:
1) Your code uses 9 images/memory buffers, 5 MB each. The operations are very simple (add, sub, sqr, sqrt), but each one performs 1 load and 1 store per pixel, so all the work is concentrated on the memory bus - and you can't speed up a "copy" operation with multiple threads, because you have only one memory bus. This means you should optimize your code for cache size and data reuse, i.e. perform the processing in rather small slices.
2) ippiSqr is not threaded - this is why Ivan used Mul instead of Sqr. Imagine that Sub is threaded, so the work is divided between 2 CPUs and the data between their caches; then you call Sqr - it is not threaded, so all the data is processed by 1 CPU and the data in the cache of the 2nd CPU must be transferred to the cache of the 1st CPU; then you perform the Add operation - it is threaded, which means all the data must again be spread between the 2 caches... Mul, by contrast, is threaded.
3) ippiSqrt is marked as threaded, but not fully: 2D Sqrt is based on 1D Sqrt (row by row), and 2D Sqrt doesn't have special 2D threading - only the 1D Sqrt is threaded, and it has an internal threshold of 4K, so it is threaded only for vectors >= 4K. This is why Ivan called the 1D Sqrt directly - to guarantee that the threaded code runs.
4) For your case, the best approach (from the performance point of view) is to restructure your code as a row-by-row loop, link with the non-threaded static IPP libraries, and use #pragma omp parallel for before the loop. Threading at the primitive level is not as efficient as threading at the application level - this is why we are promoting DMIP and are going to deprecate threading at the primitive level in IPP 8.0.
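The row-by-row approach in point 4 can be sketched as follows (a plain per-pixel body stands in for the per-row IPP calls; pipelineOmp is an illustrative name, and without OpenMP enabled the pragma is ignored and the loop simply runs serially):

```cpp
#include <cmath>
#include <cstddef>

// Application-level threading: one row-by-row loop with a single
// `#pragma omp parallel for` in front of it. With the non-threaded
// static IPP libraries, each row would instead be handed to IPP calls.
inline void pipelineOmp(const float* A, const float* B,
                        const float* C, const float* D,
                        float* result, int width, int height)
{
    #pragma omp parallel for
    for (int row = 0; row < height; ++row) {
        const std::size_t base = static_cast<std::size_t>(row) * width;
        for (int x = 0; x < width; ++x) {
            const float r1 = A[base + x] - B[base + x];
            const float r2 = C[base + x] - D[base + x];
            result[base + x] = std::sqrt(r1 * r1 + r2 * r2);
        }
    }
}
```

Each OpenMP worker handles a disjoint set of rows, so the rows it touches stay in its own core's cache for the whole pipeline, which avoids the cache ping-pong described in point 2.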
Regards,
Igor
Yes, I agree that disabling IPP's threading and doing the threading in our own application is the better approach for us with our big images. When I do that for one of our functions, I get only a minor speedup (5%) with the two processors in my test system. Since both external threads call the same IPPI functions to process different slices of the same image, I think the cache limitations come into play.
Thanks for your help.
Paul