External Multi-threading not working for IPPI functions

paulsgauthier · ‎01-06-2012

I'm trying to speed up our image processing software that uses some image arithmetic functions from IPPI. I first attempted to upgrade to the latest IPP (7.0) and let it do the OpenMP threading internally. That did not work. On a Core 2 Duo processor (Win7) I got no speedup at all for two threads over one (although Task Manager showed that both hardware processors were pegged at 100%). I followed all the suggestions that I could find on this forum but nothing worked.

So now I've called ippSetNumThreads(1) to disable OpenMP and created two threads of my own that process either the top half of a 1280x960 image (thread 1) or the lower half (thread 2). I do this by simply giving the second thread an offset into the image and processing 960/2 or 480 lines.

This also does not work and I can't imagine why not. The total execution time on this machine for a series of arithmetic functions is about 16 msec per loop whether I use a single thread to process the full image or two threads to process each half of the image.

Can someone suggest what might be going on here?

Paul Gauthier
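
For reference, the half-image split described above can be sketched as follows. This is a minimal illustration, not the poster's code: `RowSlice`, `sliceRows`, `subRows` and `subInParallel` are hypothetical names, and a plain loop stands in for an IPP call such as ippiSub_32f_C1R. Each thread receives pointers offset by `firstRow * width` and processes only its own rows.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: rows [first, first + count) handled by one thread.
struct RowSlice {
    int first;
    int count;
};

// Split `height` rows into `parts` contiguous slices; the last slice takes any remainder.
RowSlice sliceRows(int height, int parts, int index) {
    assert(parts > 0 && index >= 0 && index < parts);
    int base = height / parts;
    int first = index * base;
    int count = (index == parts - 1) ? height - first : base;
    return {first, count};
}

// Stand-in for an IPP primitive: dst = a - b over `rows` rows of `width` floats.
void subRows(const float* a, const float* b, float* dst, int width, int rows) {
    std::size_t n = static_cast<std::size_t>(width) * rows;
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = a[i] - b[i];
}

// Run the stand-in on `numThreads` slices, one std::thread per slice.
void subInParallel(const float* a, const float* b, float* dst,
                   int width, int height, int numThreads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        RowSlice s = sliceRows(height, numThreads, t);
        // Offset into the image: the second thread starts at row 480 of 960.
        std::size_t offset = static_cast<std::size_t>(s.first) * width;
        workers.emplace_back(subRows, a + offset, b + offset, dst + offset, width, s.count);
    }
    for (auto& w : workers)
        w.join();
}
```

With height 960 and two threads, `sliceRows(960, 2, 1)` gives the lower half starting at row 480, matching the 960/2 split in the post. The slices never overlap, so the threads need no locking.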

OKohl · ‎01-06-2012

Other than the image processing itself, is your application running other threads such as a C# GUI or a control task? If so, your other tasks might be sharing a core with IPP's image processing threads and blocking each other (it all depends on your application architecture and flow).
In that case you can only "feel" the speedup on a quad core and above, where you run IPP on cores separate from the ones your other tasks use.

paulsgauthier · ‎01-06-2012

Thanks for the reply. Yes, it's true the two hardware processors are used by many other threads, in our application as well as in the operating system. But while the image processing is going on there are no cpu-intensive activities happening. The GUI thread is waiting for the user to press a button. According to Task Manager there are no other applications popping up to steal time.

SergeyKostrov · ‎01-06-2012

Quoting paulsgauthier

...
So now I've called ippSetNumThreads(1) to disable OpenMP and created two threads of my own that process either the top half of a 1280x960 image (thread 1) or the lower half (thread 2). I do this by simply giving the second thread an offset into the image and processing 960/2 or 480 lines.
...
[SergeyK] Did you try to call the SetThreadAffinityMask Win32 API function for both threads? If Thread1 works on CPU1 and Thread2 works on CPU2, there must be a performance improvement.

This also does not work and I can't imagine why not. The total execution time on this machine for a series of arithmetic functions is about 16 msec per loop whether I use a single thread to process the full image or two threads to process each half of the image.

Can someone suggest what might be going on here?

[SergeyK] A simple test case would help to identify the problem.

Best regards,
Sergey
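
A minimal sketch of the affinity suggestion above, assuming a Windows build. `affinityMaskForCore` and `pinCurrentThreadToCore` are hypothetical helper names; only `SetThreadAffinityMask` and `GetCurrentThread` are Win32 calls. On other platforms the sketch deliberately does nothing and reports failure.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Bit mask selecting a single logical processor (core 0 -> 0x1, core 1 -> 0x2, ...).
std::uintptr_t affinityMaskForCore(unsigned core) {
    assert(core < sizeof(std::uintptr_t) * 8);
    return static_cast<std::uintptr_t>(1) << core;
}

// Pin the calling thread to one core.
// Returns false on failure, and always on non-Windows builds where this sketch is a no-op.
bool pinCurrentThreadToCore(unsigned core) {
#ifdef _WIN32
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    DWORD_PTR previous = SetThreadAffinityMask(
        GetCurrentThread(), static_cast<DWORD_PTR>(affinityMaskForCore(core)));
    return previous != 0;
#else
    (void)core;
    return false;
#endif
}
```

Each worker thread would call `pinCurrentThreadToCore(0)` or `pinCurrentThreadToCore(1)` at the start of its thread function, before any image processing.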

OKohl · ‎01-07-2012

Two more things I would check:

a. Image supplier thread: are the images supplied by a camera, and is the camera using a callback that might consume CPU time? (Remember it's the same application, so it's hard to detect in Task Manager.)

b. Not all IPPI functions are multithreaded; I would recheck the documentation.

good luck

paulsgauthier · ‎01-07-2012

Thanks OKohl, the images are already in memory. There's no camera callback involved. Also, the functions I'm calling are listed in the threaded function list.

Sergey, I've not called SetThreadAffinityMask but Task Manager is telling me both cores are fully engaged when my loop is running so I didn't think it necessary. I'll try it anyway.

It's acting as if the IPPI functions have an Enter/LeaveCriticalSection in them preventing simultaneous execution.

Could there be a problem with the two threads operating on the same memory image (one top-half, the other bottom-half)?

Paul G.

SergeyKostrov · ‎01-07-2012

Quoting paulsgauthier

Thanks OKohl, the images are already in memory. There's no camera callback involved. Also, the functions I'm calling are listed in the threaded function list.

Sergey, I've not called SetThreadAffinityMask but Task Manager is telling me both cores are fully engaged when my loop is running so I didn't think it necessary. I'll try it anyway.

[SergeyK] Don't forget to call 'Sleep(0)' right after the call to 'SetThreadAffinityMask(...)'
because the CPU needs some time.

It's acting as if the IPPI functions have an Enter/LeaveCriticalSection in them preventing simultaneous execution.

[SergeyK] It would be nice to hear some technical details from IPP's software developers.

Could there be a problem with the two threads operating on the same memory image (one top-half, the other bottom-half)?

[SergeyK] I don't think so. I used the same technique to do linear algebra processing for a
matrix on two CPUs.

Paul G.

Best regards,
Sergey

igorastakhov · ‎01-08-2012

Hi,

1) IPP functions don't use Enter/LeaveCriticalSection
2) could you provide a list of functions you use - not all IPP functions have internal threading

Regards,
Igor

paulsgauthier · ‎01-08-2012

Igor,

Here's the info about the IPPI lib I'm using:

CPU : v8
Name : ippiv8-7.0.dll
Version : 7.0 build 205.85
Build date: Nov 26 2011

I'm using the following IPP calls:

ippiSub_32f_C1R
ippiSqr_32f_C1R
ippiAdd_32f_C1IR
ippiSqrt_32f_C1R

I've included my little test program below. When I set the number of cores to use to 2 (on my 2-core system) I get the following output:

Ipp init: ippStsNoErr: No error, it's OK
Number of cores = 2
Testing with 2 processors
2000 Iterations in 30.086 seconds (15.043 msec per iteration)

When I set the number of cores to use to 1, I get the following output:

Ipp init: ippStsNoErr: No error, it's OK
Number of cores = 2
Testing with 1 processors
2000 Iterations in 30.251 seconds (15.1255 msec per iteration)

As you can see, in both cases the average time per loop is about the same.

__________________________________

Here's the entire test program. It just allocates some 1280x960 images and then calculates:

Result = sqrt( sqr(A - B) + sqr(C - D) )

__________________________________________________________

#include <tchar.h>
#include <ctime>
#include <cstdio>
#include <iostream>
#include <ipp.h>
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    int Width = 1280, Height = 960;
    IppiSize iSize; iSize.width = Width; iSize.height = Height;
    float* pIm[9];   // nine buffers, pIm[0]..pIm[8]
    for (int i = 0 ; i < 9 ; i++)
    {
        pIm[i] = new float[Width * Height];
        ippiSet_32f_C1R((float)100, pIm[i], Width*sizeof(float), iSize);
    }
    libInfo();   // helper not shown in the post; prints the library info listed above
    ippInit();
    IppStatus sts;
    sts = ippInitCpu(ippCpuC2D);
    cout << "Ipp init: " << ippGetStatusString( sts ) << endl;
    cout << "Number of cores = " << ippGetNumCoresOnDie() << endl;
    int NumProcessors = 1;
    ippSetNumThreads(NumProcessors);
    ippGetNumThreads(&NumProcessors);
    cout << "Testing with " << NumProcessors << " processors" << endl;
    clock_t tStart = clock();
    int NumRepeats = 2000;
    for (int i = 0 ; i < NumRepeats ; i++)
    {
        // Temp1 (pIm[2]) = difference of A (pIm[0]) and B (pIm[1])
        ippiSub_32f_C1R(pIm[0], Width*sizeof(float),pIm[1], Width*sizeof(float),pIm[2], Width*sizeof(float), iSize);
        // Temp2 (pIm[5]) = difference of C (pIm[3]) and D (pIm[4])
        ippiSub_32f_C1R(pIm[3], Width*sizeof(float),pIm[4], Width*sizeof(float),pIm[5], Width*sizeof(float), iSize);
        // Temp3 (pIm[6]) = Sqr(Temp1), Temp4 (pIm[7]) = Sqr(Temp2)
        ippiSqr_32f_C1R(pIm[2], Width*sizeof(float), pIm[6], Width*sizeof(float), iSize);
        ippiSqr_32f_C1R(pIm[5], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);
        // Result (pIm[8]) = Sqrt(Temp3 + Temp4); the in-place Add leaves the sum in pIm[7]
        ippiAdd_32f_C1IR(pIm[6], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);
        ippiSqrt_32f_C1R(pIm[7], Width*sizeof(float), pIm[8], Width*sizeof(float), iSize);
    }
    clock_t tEnd = clock();
    double tSec = double(tEnd-tStart)/CLOCKS_PER_SEC;
    double tMsec = 1000.0*tSec / NumRepeats;
    cout << NumRepeats <<" Iterations in " << tSec << " seconds (" << tMsec << " msec per iteration)" << endl;
    getchar();
    for (int i = 0 ; i < 9 ; i++)
        delete[] pIm[i];
    return 0;
}

Paul G.

igorastakhov · ‎01-09-2012

thank you, Paul,

I'll test your example and come back soon,
attached is the list of threaded functions - all functions you use are threaded - so I'll try to find the problem.

(sorry, don't know how to attach txt file - so only few lines)
...............

ippiAddC_16s_C1IRSfs

ippiAddC_16s_C1RSfs

ippiAddC_16s_C3IRSfs

ippiAddC_16s_C3RSfs

ippiAddC_32f_AC4IR

ippiAddC_32f_C1R

ippiAddC_32f_C3IR

ippiAddC_32f_C3R

ippiAddC_8u_C1IRSfs

ippiAddC_8u_C1RSfs

ippiAddC_8u_C3IRSfs

ippiAddC_8u_C3RSfs

ippiAddProduct_16u32f_C1IMR

ippiAddProduct_16u32f_C1IR

ippiAddProduct_32f_C1IMR

ippiAddProduct_32f_C1IR

ippiAddProduct_8s32f_C1IMR

ippiAddProduct_8s32f_C1IR

ippiAddProduct_8u32f_C1IMR

ippiAddProduct_8u32f_C1IR

ippiAddSquare_16u32f_C1IMR

ippiAddSquare_16u32f_C1IR

ippiAddSquare_32f_C1IMR

ippiAddSquare_32f_C1IR

ippiAddSquare_8s32f_C1IMR

ippiAddSquare_8s32f_C1IR

ippiAddSquare_8u32f_C1IMR

ippiAddSquare_8u32f_C1IR

ippiAddWeighted_16u32f_C1IMR

ippiAddWeighted_16u32f_C1IR

ippiAddWeighted_32f_C1IMR

ippiAddWeighted_32f_C1IR

ippiAddWeighted_8s32f_C1IMR

ippiAddWeighted_8s32f_C1IR

ippiAddWeighted_8u32f_C1IMR

ippiAddWeighted_8u32f_C1IR

ippiAdd_16s_C1IRSfs

ippiAdd_16s_C1RSfs

ippiAdd_16s_C3IRSfs

ippiAdd_16s_C3RSfs

ippiAdd_16s_C4IRSfs

ippiAdd_16s_C4RSfs

ippiAdd_16u32f_C1IMR

ippiAdd_16u32f_C1IR

ippiAdd_32f_C1IMR

ippiAdd_32f_C1IR

ippiAdd_32f_C1R

ippiAdd_32f_C3IR

ippiAdd_32f_C3R

ippiAdd_32f_C4IR

ippiAdd_32f_C4R

ippiAdd_8s32f_C1IMR

ippiAdd_8s32f_C1IR

ippiAdd_8u32f_C1IMR

ippiAdd_8u32f_C1IR

ippiAdd_8u_C1IRSfs

ippiAdd_8u_C1RSfs

ippiAdd_8u_C3IRSfs

ippiAdd_8u_C3RSfs

ippiAdd_8u_C4IRSfs

ippiAdd_8u_C4RSfs

.............

ippiSqrt_16s_AC4IRSfs

ippiSqrt_16s_AC4RSfs

ippiSqrt_16s_C1IRSfs

ippiSqrt_16s_C1RSfs

ippiSqrt_16s_C3IRSfs

ippiSqrt_16s_C3RSfs

ippiSqrt_16u_AC4IRSfs

ippiSqrt_16u_AC4RSfs

ippiSqrt_16u_C1IRSfs

ippiSqrt_16u_C1RSfs

ippiSqrt_16u_C3IRSfs

ippiSqrt_16u_C3RSfs

ippiSqrt_32f_AC4IR

ippiSqrt_32f_AC4R

ippiSqrt_32f_C1IR

ippiSqrt_32f_C1R

ippiSqrt_32f_C3IR

ippiSqrt_32f_C3R

ippiSqrt_32f_C4IR

ippiSqrt_8u_AC4IRSfs

ippiSqrt_8u_AC4RSfs

ippiSqrt_8u_C1IRSfs

ippiSqrt_8u_C1RSfs

ippiSqrt_8u_C3IRSfs

ippiSqrt_8u_C3RSfs

ippiSub128_JPEG_8u16s_C1R

ippiSubC_16s_C1IRSfs

ippiSubC_16s_C1RSfs

ippiSubC_16s_C3IRSfs

ippiSubC_16s_C3RSfs

ippiSubC_32f_AC4IR

ippiSubC_32f_C1R

ippiSubC_32f_C3IR

ippiSubC_32f_C3R

ippiSubC_8u_C1IRSfs

ippiSubC_8u_C1RSfs

ippiSubC_8u_C3IRSfs

ippiSubC_8u_C3RSfs

ippiSub_16s_C1IRSfs

ippiSub_16s_C1RSfs

ippiSub_16s_C3IRSfs

ippiSub_16s_C3RSfs

ippiSub_16s_C4IRSfs

ippiSub_16s_C4RSfs

ippiSub_32f_C1IR

ippiSub_32f_C1R

ippiSub_32f_C3IR

ippiSub_32f_C3R

ippiSub_32f_C4IR

ippiSub_32f_C4R

ippiSub_8u_C1IRSfs

ippiSub_8u_C1RSfs

ippiSub_8u_C3IRSfs

ippiSub_8u_C3RSfs

ippiSub_8u_C4IRSfs

ippiSub_8u_C4RSfs

..............................

Regards,
Igor

Ivan_Z_Intel · ‎01-11-2012

Hi!

In the example, 9 float images are processed, each 1280 * 960 in size. That is too much data for all of these images to fit in the cache at the same time. On the other hand, arithmetic functions do not need much computing power. So the bottleneck in this example is memory access, and parallelizing does not improve performance.
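
The size argument can be checked with a little arithmetic. The sketch below is only an estimate under stated assumptions: the helper names are hypothetical, the pass count simply tallies one image-sized read or write per operand of each call in the original loop, and write-allocate traffic is ignored.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by `count` single-channel 32-bit float images.
constexpr std::size_t imageBytes(int width, int height, int count = 1) {
    return static_cast<std::size_t>(width) * height * sizeof(float) * count;
}

// Effective memory traffic in GB/s for a given number of bytes moved in `msPerIteration`.
constexpr double gigabytesPerSecond(std::size_t bytesPerIteration, double msPerIteration) {
    return (bytesPerIteration / 1e9) / (msPerIteration / 1e3);
}

// Image-sized passes per iteration of the original loop (an assumption of this sketch):
// Sub 3 + Sub 3 + Sqr 2 + Sqr 2 + in-place Add 3 + Sqrt 2 = 15.
constexpr int kPassesPerIteration = 15;
```

Nine 1280 x 960 float images come to `imageBytes(1280, 960, 9)` = 44,236,800 bytes (about 42 MiB), far beyond the few megabytes of L2 cache typical of a Core 2 Duo (the exact cache size is not given in this thread). Nine 4096 x 16 images come to 2,359,296 bytes. Under the pass count assumed here, the original loop moves 15 x 4,915,200 = 73,728,000 bytes per iteration, which at the reported 15.043 msec is roughly 4.9 GB/s with either one or two threads, consistent with a memory-bound loop.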

However, if we process the images sequentially in small pieces, block by block, we can succeed.

I decreased the image sizes and the number of images. I also changed the code slightly.

After that we can see a performance improvement as the number of threads increases.

Ipp init: ippStsNoErr: No error, it's OK

Number of cores = 2

Testing with 1 processors

2000 Iterations in 0.38 seconds (0.19 msec per iteration)

Ipp init: ippStsNoErr: No error, it's OK

Number of cores = 2

Testing with 2 processors

2000 Iterations in 0.22 seconds (0.11 msec per iteration)

The code with changes:

int _tmain(int argc, _TCHAR* argv[])
{
    // int Width = 1280, Height = 960;
    int Width = 4096, Height = 16;
    IppiSize iSize; iSize.width = Width; iSize.height = Height;
    float* pIm[9];   // nine buffers are allocated; only pIm[0]..pIm[5] are used below
    for (int i = 0 ; i < 9 ; i++)
    {
        pIm[i] = new float[Width * Height];
        ippiSet_32f_C1R((float)(100-i), pIm[i], Width*sizeof(float), iSize);
    }
    // libInfo();
    ippInit();
    IppStatus sts;
    sts = ippInitCpu(ippCpuC2D);
    cout << "Ipp init: " << ippGetStatusString( sts ) << endl;
    cout << "Number of cores = " << ippGetNumCoresOnDie() << endl;
    int NumProcessors = 1; /*=2;*/
    ippSetNumThreads(NumProcessors);
    ippGetNumThreads(&NumProcessors);
    cout << "Testing with " << NumProcessors << " processors" << endl;
    clock_t tStart = clock();
    int NumRepeats = 2000;
    /* Original loop:
    for (int i = 0 ; i < NumRepeats ; i++)
    {
        ippiSub_32f_C1R(pIm[0], Width*sizeof(float),pIm[1], Width*sizeof(float),pIm[2], Width*sizeof(float), iSize);
        ippiSub_32f_C1R(pIm[3], Width*sizeof(float),pIm[4], Width*sizeof(float),pIm[5], Width*sizeof(float), iSize);
        ippiSqr_32f_C1R(pIm[2], Width*sizeof(float), pIm[6], Width*sizeof(float), iSize);
        ippiSqr_32f_C1R(pIm[5], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);
        ippiAdd_32f_C1IR(pIm[6], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);
        ippiSqrt_32f_C1R(pIm[7], Width*sizeof(float), pIm[8], Width*sizeof(float), iSize);
    }
    */
    for (int i = 0 ; i < NumRepeats ; i++)
    {
        // Temp1 (pIm[2]) = difference of A (pIm[0]) and B (pIm[1])
        ippiSub_32f_C1R(pIm[0], Width*sizeof(float),pIm[1], Width*sizeof(float),pIm[2], Width*sizeof(float), iSize);
        // Temp2 (pIm[5]) = difference of C (pIm[3]) and D (pIm[4])
        ippiSub_32f_C1R(pIm[3], Width*sizeof(float),pIm[4], Width*sizeof(float),pIm[5], Width*sizeof(float), iSize);
        // Square both differences in place, using Mul instead of Sqr
        /*
        ippiSqr_32f_C1IR(pIm[2], Width*sizeof(float), iSize);
        ippiSqr_32f_C1IR(pIm[5], Width*sizeof(float), iSize);
        */
        ippiMul_32f_C1IR(pIm[2], Width*sizeof(float), pIm[2], Width*sizeof(float), iSize);
        // The posted code passed pIm[2] as the destination here; pIm[5] is what the comment above intends
        ippiMul_32f_C1IR(pIm[5], Width*sizeof(float), pIm[5], Width*sizeof(float), iSize);
        // Result (pIm[5]) = Sqrt(Temp1 squared + Temp2 squared), using the 1D Sqrt on the whole buffer
        ippiAdd_32f_C1IR(pIm[2], Width*sizeof(float), pIm[5], Width*sizeof(float), iSize);
        ippsSqrt_32f_I(pIm[5],Width*Height);
    }
    clock_t tEnd = clock();
    double tSec = double(tEnd-tStart)/CLOCKS_PER_SEC;
    double tMsec = 1000.0*tSec / NumRepeats;
    cout << NumRepeats <<" Iterations in " << tSec << " seconds (" << tMsec << " msec per iteration)" << endl;
    getchar();
    for (int i = 0 ; i < 9 ; i++)
        delete[] pIm[i];
    return 0;
}

Thanks,
Ivan

paulsgauthier · ‎01-11-2012

Thanks for your effort Ivan, this is very interesting. It shows that the speedup from multiple cores is highly dependent on the use of the processor's memory cache.

It also showed me that different IPP functions that do basically the same thing can take significantly different times to execute (ippsSqrt_32f_I() is much faster than ippiSqrt_32f_C1R(), for example).

I'll experiment with processing our images in small blocks that can fit in the cache.

Paul G.
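
Processing in small blocks, as discussed above, might look like the following sketch. It is an illustration under assumptions, not code from the thread: plain loops stand in for the IPP primitives, `magnitudeByStrips` and the helper names are hypothetical, and the strip height is left as a parameter to be tuned to the cache. The whole pipeline runs on one strip of rows before moving to the next, so the intermediate data stays in a small scratch buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-ins for the IPP primitives, each operating on n contiguous floats.
void sub(const float* x, const float* y, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = x[i] - y[i];
}
void squareInPlace(float* x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) x[i] *= x[i];
}
void addInPlace(const float* src, float* srcDst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) srcDst[i] += src[i];
}
void sqrtInPlace(float* x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) x[i] = std::sqrt(x[i]);
}

// result = sqrt((a - b)^2 + (c - d)^2), computed strip by strip.
void magnitudeByStrips(const float* a, const float* b, const float* c, const float* d,
                       float* result, int width, int height, int stripRows) {
    assert(width > 0 && height > 0 && stripRows > 0);
    // One strip-sized scratch buffer replaces the full-size temporary images.
    std::vector<float> scratch(static_cast<std::size_t>(width) * stripRows);
    for (int row = 0; row < height; row += stripRows) {
        int rows = std::min(stripRows, height - row);   // last strip may be shorter
        std::size_t off = static_cast<std::size_t>(row) * width;
        std::size_t n = static_cast<std::size_t>(rows) * width;
        sub(a + off, b + off, scratch.data(), n);
        sub(c + off, d + off, result + off, n);
        squareInPlace(scratch.data(), n);
        squareInPlace(result + off, n);
        addInPlace(scratch.data(), result + off, n);
        sqrtInPlace(result + off, n);
    }
}
```

A reasonable starting point is a strip small enough that the four input strips, the output strip and the scratch buffer together fit in the cache; the right value has to be found by measurement.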

igorastakhov · ‎01-12-2012

Paul,

there are several problems:

1) Your code uses 9 images (memory buffers) of about 5 MB each. The operations (add, sub, sqr, sqrt) are very simple, yet each one needs at least one load and one store per pixel, so all of the work is concentrated on the memory bus. You can't speed up a "copy" operation with multiple threads, because there is only one memory bus. This means you should optimize your code for cache size and data reuse, so you should process the images in rather small slices.
2) ippiSqr is not threaded, which is why Ivan used Mul instead of Sqr. Suppose Sub is threaded: the work is divided between 2 CPUs and the data between their caches. Then you call Sqr, which is not threaded, so all data is processed by 1 CPU and all data in the 2nd CPU's cache must be transferred to the cache of the 1st CPU. Then you perform the Add operation, which is threaded, so all the data must again be spread between the 2 caches... Mul is threaded.
3) ippiSqrt is marked as threaded, but it is not fully so. 2D Sqrt is based on 1D Sqrt (row by row) and has no special 2D threading. Only 1D Sqrt is threaded, and it has an internal threshold of 4K, so it is threaded only for vectors >= 4K. This is why Ivan called 1D Sqrt directly, to guarantee that the threaded code runs.
4) For your case the best approach (from the performance point of view) is to restructure your code as a row-by-row loop, link with the non-threaded static IPP library, and put #pragma omp parallel for before the loop. Threading at the primitive level is not as efficient as threading at the application level. This is why we are promoting DMIP and are going to remove (deprecate) threading at the primitive level in IPP 8.0.
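
The row-by-row approach in point 4 can be sketched roughly as below. This is a hedged illustration, not Intel's recommended code: `magnitudeRow` and `magnitudeImage` are hypothetical names, and a plain loop stands in for the single-threaded IPP calls that would process each row. The OpenMP pragma distributes rows across threads at the application level.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// One row: dst = sqrt((a - b)^2 + (c - d)^2).
// In a real build this body would call the non-threaded IPP primitives on the row.
void magnitudeRow(const float* a, const float* b, const float* c, const float* d,
                  float* dst, int width) {
    for (int x = 0; x < width; ++x) {
        float ab = a[x] - b[x];
        float cd = c[x] - d[x];
        dst[x] = std::sqrt(ab * ab + cd * cd);
    }
}

// Whole image, one row per loop iteration, rows shared out among OpenMP threads.
void magnitudeImage(const float* a, const float* b, const float* c, const float* d,
                    float* dst, int width, int height) {
    assert(width > 0 && height > 0);
    // Built without OpenMP support, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int row = 0; row < height; ++row) {
        std::size_t off = static_cast<std::size_t>(row) * width;
        magnitudeRow(a + off, b + off, c + off, d + off, dst + off, width);
    }
}
```

Because each row is read once and its result written once, a row stays in the cache of the core that processes it for the whole computation, which avoids the cache-to-cache transfers described in point 2.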

Regards,
Igor

paulsgauthier · ‎01-12-2012

Igor,

Yes, I agree that disabling the IPP threading and doing the threading in our own application is the better approach for us with our big images. When I do that for one of our functions I get a minor speedup (5%) with the two processors in my test system. Since both external threads are calling the same IPPI functions to process different slices of the same image, I think the cache limitations come into play.

Thanks for your help.

Paul