Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Trying to achieve parallel execution in using IPP function ippiAlphaCompC

mmdev
Beginner
528 Views
Hello:

I have written an application that creates and runs multiple threads on a multi-core Intel Core i7 system. My system has eight cores, and my application uses four threads.

Each of the threads, which are not synchronized with each other, calls into the ippiAlphaCompC IPP function. The threads are all created and started from the same application main thread.

Through some analysis, it appears that ippiAlphaCompC blocks completely: when thread1 on core1 has invoked ippiAlphaCompC, a call to ippiAlphaCompC from thread2 running on core2 must wait until thread1's call has completely finished. What I'm trying to achieve is parallelism: processing multiple sets of data at once across multiple threads/cores using common IPP functions.

Does this make sense? Would one expect this function to block? Is there a way to achieve parallelism with this function across multiple threads/cores (without, e.g., loading the IPP DLLs into multiple separate processes)?

I'm using dynamic linking and dispatching, and I have set the number of threads in IPP to 1 (although leaving it at the default value of 8 on my system does not seem to make a difference).

Thanks.
10 Replies
Vladimir_Dudnik
Employee
IPP functions are not blocking. What led you to the conclusion that you are seeing a blocking effect?

Regards,
Vladimir
mmdev
Beginner
Hi Vladimir,

I have a simple test application with some timing benchmarks.

If I call this function several times within one thread, serializing the calls, the timing I observe for each call is what I would expect: approximately 3 msec.

Thread1()
{
    call functionx() // 3 msec
    call functionx() // 3 msec
}

If I separate the calls into two separate threads, run on two separate cores, wouldn't one still expect the result to be the same? They should run in parallel. Yet the timing returned in this second case is quite a bit higher.

Thread1() // on core1
{
    call functionx()
}

Thread2() // on core2
{
    call functionx()
}

Thanks.


Vladimir_Dudnik
Employee
I would expect roughly half the total time for the parallel run. Note, though, that the total execution time is the sum of the time required to start/end a thread and the time for the actual processing of the data. If the amount of data is small enough, the threading overhead may outweigh the benefit of parallel execution.

Vladimir

mmdev
Beginner
Vladimir,

My timing measurements do not include the threading overhead. My code measures only the time spent inside the IPP function call. The threading overhead is important, but for now I'm focused just on the work done in the IPP function (and the time to do that work).

If the two calls occur on two separate threads on two separate cores (on two completely separate sets of image data), would you expect the timing result for each call to be about the same as in the serialized, single-thread case?

Thanks.

-Mike

Vladimir_Dudnik
Employee
If your serial code calls the IPP function twice, giving you 3 + 3 = 6 msec of running time, then running those calls in parallel I would expect a total time of about 3 msec (exactly 3 msec being the ideal case, not achievable in practice).

Vladimir
mmdev
Beginner
Okay. That's what I'm expecting to see as well: each call in the serialized case is 3 msec, so I would expect each single call in the multithreaded/multi-core case to be 3 msec too. Since I'm not observing this, and the numbers can sometimes be quite a bit higher (sometimes close to twice as high), I've drawn the conclusion (perhaps incorrectly) that the IPP call is blocking. In this simple test case, with no other processing occurring, I'm having a difficult time thinking of another cause.

Thanks.

-Mike
Vladimir_Dudnik
Employee
Yeah, that is unexpected. Could you please try linking with the non-threaded static libraries to see if there is any difference? Also, what version of IPP are you using? Perhaps the latest, IPP 6.1 update 1?

Vladimir

mmdev
Beginner
Okay. I'll try that and report back my findings. I'm using IPP 6.1, but perhaps not the latest update (update 1?).

Thanks.

Mike
Vladimir_Dudnik
Employee
If you can upload your test case here, we would also like to investigate what the problem might be.

Vladimir
Mikael_Grev
Beginner

Guys,

3 msec is too short a time to test this, since residual processing from the loading of the test app can saturate a core during those 3 msec. Put a for loop around every call and try again.

Also, depending on which timer you use, timer resolution can be a problem as well. Never benchmark intervals much below 1 sec.

Cheers,
Mikael

