Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

OpticalFlowPyrLK performance

Rasmus_Debitsch
Beginner
465 Views
Hi,
I want to replace the OpenCV KLT tracker with the IPP implementation. The IPP function has a nice interface for reusing the image pyramids.

But with the IPP function, performance drops by a factor of two or more. Profiling the program with VTune (see the list of top consumers below) shows that most of the time is spent on threading overhead.

I tried static and dynamic linking with IPP but the result is the same. Decreasing the number of threads using ippSetNumThreads improves performance.

Any idea what is wrong?

Thanks,
Rasmus

Used: IPP 7.0.5, static IPP libraries, Release|x64, Visual Studio 2008 C++ compiler


"Function / Call Stack","CPU Time","Module","Function (Full)"
"SleepEx","253.3485215","KERNELBASE.dll","SleepEx"
"_kmp_fork_call","153.1760516","libiomp5md.dll","_kmp_fork_call"
"RtlEnterCriticalSection","148.438029","ntdll.dll","RtlEnterCriticalSection"
"y8_ippiOpticalFlowPyrLK_8u_C1R","130.25269","SFMPlugin.dll","y8_ippiOpticalFlowPyrLK_8u_C1R"
"vcomp_for_static_simple_init","70.78258687","libiomp5md.dll","vcomp_for_static_simple_init"
"_kmpc_omp_taskyield","65.49529073","libiomp5md.dll","_kmpc_omp_taskyield"
"y8_ownCopySubpix_8u16u_C1R_Sfs_U8","9.835349407","SFMPlugin.dll","y8_ownCopySubpix_8u16u_C1R_Sfs_U8"
"_kmp_invoke_microtask","9.093397538","libiomp5md.dll","_kmp_invoke_microtask"
"RtlLeaveCriticalSection","6.78507023","ntdll.dll","RtlLeaveCriticalSection"
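
To put a number on the list above, here is a rough sketch that adds up the CPU time per category. It rests on one assumption that the profile itself does not prove: every entry outside SFMPlugin.dll (the plugin that contains the statically linked IPP code) is counted as threading or synchronization overhead. The struct and function names are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// One row of the VTune hotspot list quoted above.
struct Hotspot {
    const char* function;
    double cpuTime;      // CPU time as reported by VTune
    const char* module;
};

static const Hotspot kHotspots[] = {
    {"SleepEx",                           253.3485215, "KERNELBASE.dll"},
    {"_kmp_fork_call",                    153.1760516, "libiomp5md.dll"},
    {"RtlEnterCriticalSection",           148.438029,  "ntdll.dll"},
    {"y8_ippiOpticalFlowPyrLK_8u_C1R",    130.25269,   "SFMPlugin.dll"},
    {"vcomp_for_static_simple_init",      70.78258687, "libiomp5md.dll"},
    {"_kmpc_omp_taskyield",               65.49529073, "libiomp5md.dll"},
    {"y8_ownCopySubpix_8u16u_C1R_Sfs_U8", 9.835349407, "SFMPlugin.dll"},
    {"_kmp_invoke_microtask",             9.093397538, "libiomp5md.dll"},
    {"RtlLeaveCriticalSection",           6.78507023,  "ntdll.dll"},
};

// ASSUMPTION: anything not running inside SFMPlugin.dll is treated as
// threading/synchronization overhead (OpenMP runtime, sleeps, locks).
bool isThreadingOverhead(const Hotspot& h) {
    return std::strcmp(h.module, "SFMPlugin.dll") != 0;
}

// Fraction of the listed CPU time that falls into the overhead category.
double threadingOverheadShare() {
    double overhead = 0.0;
    double total = 0.0;
    for (const Hotspot& h : kHotspots) {
        total += h.cpuTime;
        if (isThreadingOverhead(h)) {
            overhead += h.cpuTime;
        }
    }
    assert(total > 0.0);
    return overhead / total;
}

void printThreadingOverheadShare() {
    std::printf("threading/sync share of listed CPU time: %.1f%%\n",
                100.0 * threadingOverheadShare());
}
```

Under that assumption, a little over 83% of the listed CPU time is spent outside the optical flow code itself. The list only covers the top consumers, so the share over the whole run may differ.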


8 Replies
SergeyKostrov
Valued Contributor II
...
I tried static and dynamic linking with IPP but the result is the same. Decreasing the number of threads using ippSetNumThreads improves performance.
...

- There are fewer context switches between threads
- There are fewer "fights" over data when a synchronization object (a critical section) is used
- It is possible that the CPU's cache lines are used more efficiently

How big are the data sets?
How many threads did you set, before and after the change?

Best regards,
Sergey

Rasmus_Debitsch
Beginner
Thank you for the answer.

I'm using a 640x480 video with approximately 600 frames. Each frame is processed. The OpenCV implementation needs ~25 s for processing (not all of it for optical flow). Switching to the IPP implementation raises the time to ~30 s with one thread. Two threads take about 50 s, and 8 threads (the default, because I have a quad-core CPU with Hyper-Threading) take more than 100 s.
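
To make these figures easier to compare, the sketch below converts the totals into average time per frame and slowdown relative to the OpenCV run. It uses only the numbers quoted above; the frame count is approximate, the 8-thread total is a lower bound ("more than 100 s"), and the totals include work other than optical flow. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Average processing time per frame in milliseconds.
double msPerFrame(double totalSeconds, int frames) {
    assert(frames > 0);
    return totalSeconds * 1000.0 / frames;
}

// Slowdown of a run relative to a baseline run (1.0 means equal speed).
double slowdown(double totalSeconds, double baselineSeconds) {
    assert(baselineSeconds > 0.0);
    return totalSeconds / baselineSeconds;
}

struct Run {
    const char* label;
    double totalSeconds;  // whole-pipeline time, not optical flow alone
};

void printTimings() {
    const int frames = 600;             // "approximately 600 frames"
    const double opencvSeconds = 25.0;  // OpenCV baseline
    const Run runs[] = {
        {"OpenCV",         25.0},
        {"IPP, 1 thread",  30.0},
        {"IPP, 2 threads", 50.0},
        {"IPP, 8 threads", 100.0},      // lower bound: "more than 100 s"
    };
    for (const Run& r : runs) {
        std::printf("%-15s %6.1f ms/frame  %.1fx\n", r.label,
                    msPerFrame(r.totalSeconds, frames),
                    slowdown(r.totalSeconds, opencvSeconds));
    }
}
```

With these inputs the averages are about 42 ms per frame for OpenCV, 50 ms for IPP with one thread, 83 ms with two threads, and at least 167 ms with eight threads.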

I tried with some other test videos and the results are the same.

Threading seems to work: the load on my machine is proportional to the number of threads configured. As far as I understand the VTune result, a significant amount of time is spent in the OpenMP code?

I've attached a snippet, tracking.cpp, containing the initialization code and the implementation of the track function.

Best regards,
Rasmus

Ying_H_Intel
Employee
Hi Rasmus,

It seems we can't expect OpenMP threading to always bring benefits for a particular application, especially when the application uses OpenMP threads only intermittently. Two quick points:

1) You mentioned static IPP linking. Do you link the threaded IPP libraries (ipp*_t.lib) or the serial static IPP libraries (ipp*_l.lib)?

2) It is good to reuse the pyramid. Is it also possible to reuse the "state" for all frames? It should be the same in each operation, so reusing it saves the repeated work.

For example, the functions below should be called only once for the whole processing run if the image size remains unchanged:

IppiPyramid* pyr = NULL;

stat = ippiPyramidInitAlloc(&pyr, maxLevel, roi, rate);
if (stat != ippStsNoErr)
{
    USES_CONVERSION;
    VGDebug(L"function %s status %s\n", L"ippiPyramidInitAlloc", A2T(ippGetStatusString(stat)));
    return NULL;
}

IppiPyramidDownState_8u_C1R** pyrState = (IppiPyramidDownState_8u_C1R**) &(pyr->pState);
Ipp8u** pyr0 = pyr->pImage;
int* pStep = pyr->pStep;
IppiSize* pRoi = pyr->pRoi;

ippiPyramidLayerDownInitAlloc_8u_C1R(pyrState, roi, rate, kernel, _countof(kernel), IPPI_INTER_LINEAR);

pyr0[0] = (Ipp8u*) img.data;
pStep[0] = (int) img.step;
pRoi[0] = roi;
for (int i = 1; i <= maxLevel; ++i)
{
    pyr->pImage[i] = ippiMalloc_8u_C1(pRoi[i].width, pRoi[i].height, pStep + i);
    ...
}

and
stat = ippiOpticalFlowPyrLKInitAlloc_8u_C1R(&of, roi, winSize.width, hint);
and all the corresponding free operations.
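
The allocate-once idea can be sketched without IPP as follows. This is not the IPP API: `PyramidBuffers` is a hypothetical stand-in built on `std::vector`, the pyramid rate is assumed to be 2, and the downsampling is plain pixel subsampling rather than the kernel-filtered downsampling IPP performs. It only illustrates that the buffers are created once and refilled for every frame.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-level image buffers of a pyramid.
// All buffers are allocated in the constructor and reused for every frame.
class PyramidBuffers {
public:
    struct Level {
        int width;
        int height;
        std::vector<unsigned char> pixels;
    };

    PyramidBuffers(int width, int height, int maxLevel)
        : levels_(static_cast<std::size_t>(maxLevel >= 0 ? maxLevel + 1 : 0)),
          allocations_(0) {
        assert(width > 0 && height > 0 && maxLevel >= 0);
        int w = width, h = height;
        for (Level& level : levels_) {
            level.width = w;
            level.height = h;
            level.pixels.resize(static_cast<std::size_t>(w) * h);
            ++allocations_;
            w = (w + 1) / 2;  // ASSUMPTION: pyramid rate of 2
            h = (h + 1) / 2;
        }
    }

    // Refill the existing buffers from a new frame; nothing is allocated here.
    void update(const std::vector<unsigned char>& frame) {
        assert(frame.size() == levels_[0].pixels.size());
        std::copy(frame.begin(), frame.end(), levels_[0].pixels.begin());
        for (std::size_t i = 1; i < levels_.size(); ++i) {
            const Level& src = levels_[i - 1];
            Level& dst = levels_[i];
            // Plain subsampling: take every second pixel of the level above.
            for (int y = 0; y < dst.height; ++y) {
                for (int x = 0; x < dst.width; ++x) {
                    dst.pixels[static_cast<std::size_t>(y) * dst.width + x] =
                        src.pixels[static_cast<std::size_t>(2 * y) * src.width + 2 * x];
                }
            }
        }
    }

    int allocationCount() const { return allocations_; }
    std::size_t levelCount() const { return levels_.size(); }
    const Level& level(std::size_t i) const { return levels_[i]; }

private:
    std::vector<Level> levels_;
    int allocations_;  // number of level buffers allocated so far
};
```

Constructing one `PyramidBuffers` object before the frame loop and calling `update` per frame keeps the allocation count fixed at the number of levels, no matter how many frames are processed.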

Best Regards
Ying

Rasmus_Debitsch
Beginner
Hi Ying,
1) I'm using the threaded ipp libraries.
2) Reusing the pyramids was my idea. Thank you for your advice on avoiding the repeated memory allocation, but that is not the problem. The IPP LK tracker is slower than the OpenCV version, and it gets slower if I add more threads. So it seems that I'm doing something wrong when calling OpticalFlowPyrLK. As long as I can't fix this issue, the IPP LK tracker is not usable.
Unfortunately, OpenCV computes the pyramids internally, so I can't pass them to the tracker for reuse. I wanted to overcome this limitation by using the IPP implementation.
And even more - usually I expect that IPP is faster than OpenCV.
Best Regards,
Rasmus

Ying_H_Intel
Employee
Hi Rasmus,

What is the result if you use the serial library?

Or could you please provide a small, runnable, and comparable test case so we can evaluate what the problem is (including some basic info, such as the OpenCV version)? You can attach it privately if it is confidential.

Best Regards,
Ying

Rasmus_Debitsch
Beginner
Hi Ying,

I extracted the tracking part and built a small demo program. The demo has an OpenCV tracker and an IPP tracker.

KLTTest.cpp

Please note that the feature extraction is slightly different in the original program.

Environment: OpenCV 2.3.1, Visual Studio 2008, Release|64bit, Unicode, IPP support added to the project using the Intel Composer.


Best regards,
Rasmus

Ying_H_Intel
Employee
Hi Rasmus,

I have escalated your problem to the IPP engineering team.

Your current result looks plausible, given what I read on the OpenCV website. The release notes for version 2.3.1 (August 2011) make the following claim:

Optimization

  • Performance of the sparse Lucas-Kanade optical flow has been greatly improved. On 4-core machine it is now 9x faster than the previous version.

Best Regards,
Ying

Rasmus_Debitsch
Beginner
Hi Ying,

Thank you for the info. If your engineering team needs any further information, feel free to contact me.

Best Regards,
Rasmus