I want to replace the OpenCV KLT with the IPP implementation. The IPP function has a nice interface for reusing the image pyramids.
But with the IPP function, performance drops by a factor of two or more. Profiling the program with VTune (see the list of top consumers below) shows that most of the time is spent on thread management.
I tried static and dynamic linking with IPP but the result is the same. Decreasing the number of threads using ippSetNumThreads improves performance.
Any idea what is wrong?
Thanks,
Rasmus
Used: IPP 7.0.5, static IPP libraries, Release|x64, Visual Studio 2008 C++ compiler
"Function / Call Stack","CPU Time","Module","Function (Full)"
"SleepEx","253.3485215","KERNELBASE.dll","SleepEx"
"_kmp_fork_call","153.1760516","libiomp5md.dll","_kmp_fork_call"
"RtlEnterCriticalSection","148.438029","ntdll.dll","RtlEnterCriticalSection"
"y8_ippiOpticalFlowPyrLK_8u_C1R","130.25269","SFMPlugin.dll","y8_ippiOpticalFlowPyrLK_8u_C1R"
"vcomp_for_static_simple_init","70.78258687","libiomp5md.dll","vcomp_for_static_simple_init"
"_kmpc_omp_taskyield","65.49529073","libiomp5md.dll","_kmpc_omp_taskyield"
"y8_ownCopySubpix_8u16u_C1R_Sfs_U8","9.835349407","SFMPlugin.dll","y8_ownCopySubpix_8u16u_C1R_Sfs_U8"
"_kmp_invoke_microtask","9.093397538","libiomp5md.dll","_kmp_invoke_microtask"
"RtlLeaveCriticalSection","6.78507023","ntdll.dll","RtlLeaveCriticalSection"
I tried static and dynamic linking with IPP but the result is the same. Decreasing the number of threads using ippSetNumThreads improves performance.
...
- There are fewer context switches between threads
- There are fewer "fights" over data when a synchronization object (a critical section) is used
- The CPU's cache lines may be used more efficiently
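The points above can be illustrated with a minimal sketch, assuming a team-per-call threading pattern built on std::thread. This is not IPP or OpenMP code: `sum_serial` and `sum_fork_join` are hypothetical helpers, and a real OpenMP runtime keeps a thread pool rather than creating threads on each call. It only shows that when the work per call is small (for example one tracking window), the fork/join cost is paid every time while the result stays the same.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum a buffer on the calling thread.
long long sum_serial(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}

// Sum the same buffer by forking `num_threads` workers and joining them.
// Thread start-up and the join are paid on every call, regardless of how
// little work each worker receives.
long long sum_fork_join(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    // Split the buffer into contiguous chunks, one per worker.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0LL);
        });
    }
    for (auto& w : workers) w.join();  // the "join" half of fork/join
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Timing a few thousand calls of each function on a small buffer (say 15x15 = 225 values) with std::chrono should show the effect: the serial version does almost no work per call, while the fork/join version is dominated by thread management.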
How big are data sets?
How many threads did you set, that is before and after?
Best regards,
Sergey
I'm using a 640x480 video with approximately 600 frames. Each frame is processed. The OpenCV implementation needs ~25s for processing (not all of it for optical flow). Switching to the IPP implementation raises the time to ~30s with one thread. Two threads take about 50s, and 8 threads (the default, because I have a quad-core with hyperthreading) more than 100s.
I tried with some other test videos and the results are the same.
Threading seems to work: the CPU load on my machine is proportional to the number of threads configured. As far as I understand the VTune result, a significant amount of time is spent in the OpenMP code?
I've attached a snippet, tracking.cpp, containing the initialization code and the implementation of the track function.
Best regards,
Rasmus
It seems we can't expect OpenMP threading to always benefit a particular application, especially when the application uses the OpenMP threads intermittently. Two quick points:
1) You mentioned static IPP linking. Do you link the threaded static IPP (ipp*_t.lib) or the serial static IPP (ipp*_l.lib)?
2) It is good to reuse the pyramid. Is it also possible to reuse the "state" for all frames? It should be the same in each operation, which would save the repeated initialization.
For example, the functions below should be called only once across the whole processing run if the image size remains unchanged:
IppiPyramid* pyr = NULL;
stat = ippiPyramidInitAlloc(&pyr, maxLevel, roi, rate);
if (stat != ippStsNoErr)
{
USES_CONVERSION;
VGDebug(L"function %s status %s\n", L"ippiPyramidInitAlloc", A2T(ippGetStatusString(stat)));
return NULL;
}
IppiPyramidDownState_8u_C1R** pyrState = (IppiPyramidDownState_8u_C1R**) &(pyr->pState);
Ipp8u** pyr0 = pyr->pImage;
int* pStep = pyr->pStep;
IppiSize* pRoi = pyr->pRoi;
ippiPyramidLayerDownInitAlloc_8u_C1R(pyrState, roi, rate, kernel, _countof(kernel), IPPI_INTER_LINEAR);
pyr0[0] = (Ipp8u*) img.data;
pStep[0] = (int) img.step;
pRoi[0] = roi;
for (int i = 1; i <= maxLevel; ++i)
{
pyr->pImage[i] = ippiMalloc_8u_C1(pRoi[i].width, pRoi[i].height, pStep + i);
}
and
stat = ippiOpticalFlowPyrLKInitAlloc_8u_C1R(&of, roi, winSize.width, hint);
and all the corresponding free operations.
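The allocate-once idea can be sketched without IPP as follows. This is only an illustration under stated assumptions: `PyramidBuffers` is a hypothetical stand-in type, a rate of 2 is assumed, and the 2x2 box average is a placeholder for the kernel-based filter that the IPP pyramid functions apply. The point is that all level buffers are allocated in the constructor and only refilled for each frame.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for per-level pyramid storage; not an IPP type.
struct PyramidBuffers {
    struct Level {
        int width;
        int height;
        std::vector<std::uint8_t> pixels;
    };

    std::vector<Level> levels;
    int allocations = 0;  // counts buffer allocations, for illustration only

    // Allocate every level once. Level 0 is full size; each further level
    // is half the previous one (rounded up), i.e. a rate of 2 is assumed.
    PyramidBuffers(int width, int height, int max_level) {
        assert(width > 0 && height > 0 && max_level >= 0);
        for (int i = 0; i <= max_level; ++i) {
            const std::size_t count = static_cast<std::size_t>(width) * height;
            levels.push_back({width, height, std::vector<std::uint8_t>(count)});
            ++allocations;
            width = (width + 1) / 2;
            height = (height + 1) / 2;
        }
    }

    // Refill the pyramid for a new frame, reusing the existing storage.
    // `frame` must hold levels[0].width * levels[0].height bytes.
    void build(const std::uint8_t* frame) {
        std::copy(frame, frame + levels[0].pixels.size(), levels[0].pixels.begin());
        for (std::size_t i = 1; i < levels.size(); ++i) {
            const Level& src = levels[i - 1];
            Level& dst = levels[i];
            for (int y = 0; y < dst.height; ++y) {
                for (int x = 0; x < dst.width; ++x) {
                    // Placeholder filter: 2x2 box average, clamped at edges.
                    const int x0 = 2 * x, y0 = 2 * y;
                    const int x1 = std::min(x0 + 1, src.width - 1);
                    const int y1 = std::min(y0 + 1, src.height - 1);
                    const int sum = src.pixels[y0 * src.width + x0] +
                                    src.pixels[y0 * src.width + x1] +
                                    src.pixels[y1 * src.width + x0] +
                                    src.pixels[y1 * src.width + x1];
                    dst.pixels[y * dst.width + x] =
                        static_cast<std::uint8_t>((sum + 2) / 4);
                }
            }
        }
    }
};
```

In a tracking loop, one `PyramidBuffers` object per image (previous and current) would be constructed before the first frame and `build` called once per frame, so no allocation or state initialization happens inside the loop.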
Best Regards
Ying
What is the result if you use the serial library?
Or could you please provide a small, runnable, comparable test case so we can evaluate what the problem is (including some basic info, such as the OpenCV version)? You can attach it privately if it is confidential.
Best Regards,
Ying
I extracted the tracking part and built a small demo program. The demo has an OpenCV and an IPP tracker.
KLTTest.cpp
Please note that the feature extraction is a little different in the original program.
Environment: OpenCV 2.3.1, Visual Studio 2008, Release|64bit, Unicode, IPP support added to the project using the Intel Composer.
Best regards,
Rasmus
I have escalated your problem to the IPP engineering team.
Your current result looks plausible, given what I read on the OpenCV website.
The release notes for version 2.3.1 (August 2011) make this claim:
Optimization
- Performance of the sparse Lucas-Kanade optical flow has been greatly improved. On 4-core machine it is now 9x faster than the previous version.
Best Regards,
Ying
