Play video sequent

nikkey_c · ‎04-11-2010

Hi, i am new to TBB, I am using tbb's pipelining for my project. Its about video processing (normal background subtraction). for that i divide my program in to three,

1-InputFilter -> read video frames one by one

2-TransformFilter-> do background subtraction

3- OutputFilter-> display that video (not writing as video file)

I did as in the TBB pipeline tutorial example. It works, but the problem is I couldn't get smooth output: it plays sometimes, stays for some seconds, and plays again. So I want to display the video at a particular time interval. How do I do such a thing using a pipeline?

Thanks

Nike

robert-reed · ‎04-12-2010

What happens if you leave out the background subtraction filter (connect 1 to 3)? Do you get the desired isochrony? What regulates the delivery of frames in the output stage? (i.e. who's doing the buffering?) Where's the time being spent when the transform filter is included (hot spot analysis)? Does stage 2 take more time than the rest? Is it declared serial or parallel? Any chance to parallelize the processing within the transform filter, like with a parallel_for?

nikkey_c · ‎04-12-2010

here stage 2 is important one, because it is doing background subtraction. so it takes more times. what i am doing:

1 in stage one, i read frame by frame and store those frames into array. after that array filled (assume we can store only 8images), i store those data into buffer. this stage runs in serial.

2 after that, background subtraction runs. it runs in parallel, and result image store into another array, it takes long time for computation. It subtract previous frame from current frame. and do some modification, in current frame. store that subtracted image.

3 then above result will display in serial.

Problem is i am not controlling the output filter, i am just displaying images that are passed from stage two. Due to that, the displayed video is not smooth. What i want to do is display images at a fixed rate (say x fps); for that i have to assign a thread buffer, and display it in sequence. But i dont know how to implement separate thread for output filter within pipeline? may i know how to do that?

In above three stages, second stage takes lot of time compare to others.

ya, we are running some for loops in the second stage; some run few iterations, but some run for a long time. but i didnt use parallel_for for that. i am doing only pipeline.

If i use pipeline, is it possible to use parallel_for also? If i use only parallel_for, is it possible to improve performance compared to the pipeline? because we read every frame and do background subtraction within one main for loop, and we are reading every pixel by using two big for loops.

My small video clip is 36 sec one and 25 fps

but when i run using pipeline, it takes 57 sec. but if i read frames and display frames (no background subtraction), then it takes less than 15 sec.

RafSchietekat · ‎04-12-2010

"But i dont know how to implement separate thread for output filter within pipeline? may i know how to do that?"
tbb_thread and concurrent_queue? (Added) I mean a tbb_thread outside of the pipeline; the last filter would just add the work to the queue for delivery to the tbb_thread.

"If i use pipeline, is it possible to use parallel_for also?"
Nesting parallel_for inside pipeline is a good way of creating parallel slack (opportunities to execute things in parallel), assuming the units of work are sufficient to avoid excessive parallel overhead, which seems to be satisfied from your description.

Note that your pipeline will behave as if stages 1 and 2 are just one stage, if that helps to understand what is going on. In computation-bound work, assumed by TBB, that is probably a good thing, but it also means that any latency in getting more data is going to result in a stalled worker thread, i.e., no background subtraction work being done during that time. I wouldn't be surprised if the solution ended up being several threads, with TBB only used for a parallel_for to do the background subtraction, which it should do very well. (Added) Or reasonably well: I wanted to balance off the gloom with an accolade, but I'm sure that design could still be improved.

nikkey_c · ‎04-12-2010

My all system performance is depend on second stage, that is background subtraction. as i mention earlier, it takes less than 10 sec to read all video, but rest 44 sec is for background subtraction in my video.

In background subtraction there is two big for loop. that reads all the pixels of image two times. this happens every frames. so if i use only parallel_for will it give best performance than pipeline?

I think if i use parallel_for inside pipeline to read those two image then it will be too overhead for system. is it?

RafSchietekat · ‎04-12-2010

"I think if i use parallel_for inside pipeline to read those two image then it will be too overhead for system. is it?"
You would probably read the data once and then let parallel_for process it one range at a time, so you wouldn't be reading the data multiple times (although there would be some inter-cache traffic, but I'm not sure how that could be avoided). The overhead you should be concerned with would be in task creation (don't use a simple_partitioner with a grainsize of one pixel: always process one or several full horizontal lines at once by using a single-dimensional vertically oriented blocked_range), and in the barrier to wind down parallel_for processing (although auto_partitioner should behave well enough for work that is fairly uniformly distributed like this seems to be).
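The row-band idea above can be sketched roughly as follows. This is only an illustration, not code from the thread: it uses plain std::thread as a stand-in so that it runs without TBB, whereas with TBB the bands would come from a tbb::parallel_for over a one-dimensional tbb::blocked_range of rows. The Frame struct, the 8-bit grayscale layout, the threshold test against a reference frame, and the names subtract_rows and subtract_background are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical frame layout: 8-bit grayscale, row-major, width * height bytes.
struct Frame {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;
};

// Process full rows [row_begin, row_end): a pixel becomes 255 (foreground) when
// it differs from the reference frame by more than `threshold`, otherwise 0.
void subtract_rows(const Frame& current, const Frame& reference, Frame& out,
                   int row_begin, int row_end, int threshold) {
    for (int y = row_begin; y < row_end; ++y) {
        for (int x = 0; x < current.width; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * current.width + x;
            const int diff = std::abs(int(current.pixels[i]) - int(reference.pixels[i]));
            out.pixels[i] = diff > threshold ? 255 : 0;
        }
    }
}

// Split the frame into horizontal bands of `rows_per_band` full rows and process
// each band on its own thread. One thread per band keeps the sketch short; a task
// scheduler such as TBB would map the bands onto a fixed pool of workers instead.
void subtract_background(const Frame& current, const Frame& reference, Frame& out,
                         int rows_per_band, int threshold) {
    assert(rows_per_band > 0);
    assert(current.pixels.size() == reference.pixels.size());
    assert(current.pixels.size() == out.pixels.size());
    std::vector<std::thread> workers;
    for (int row = 0; row < current.height; row += rows_per_band) {
        const int end = std::min(row + rows_per_band, current.height);
        workers.emplace_back(subtract_rows, std::cref(current), std::cref(reference),
                             std::ref(out), row, end, threshold);
    }
    for (auto& w : workers) w.join();  // barrier: wait until every band is done
}
```

Each band writes to a disjoint range of rows in the output, so no locking is needed inside the loop; the only synchronization is the join at the end, which corresponds to the wind-down barrier mentioned above.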

nikkey_c · ‎04-12-2010

sorry to ask same question again,

I want to display image smoothly like video, that is i have to display an image every x secs from the second stage output buffer. so i have to create a thread to monitor the output buffer of the second stage, that is the third stage. this thread wakes every x sec and checks whether a background subtracted image is available in the output buffer; if so it displays it, then waits another x sec and does the same again. how can i implement that?

RafSchietekat · ‎04-12-2010

You can't do that directly; you would instead have to send the output of the third stage, which would probably be ordered_serial, to a thread, probably by way of a concurrent_queue. If you waited inside TBB instead, that worker thread would be unavailable to do anything else. Although I'm not entirely sure that a pipeline is going to be at all useful.
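A minimal sketch of that arrangement, assuming standard C++ threads in place of tbb_thread and a small mutex-protected FIFO in place of concurrent_queue: the last pipeline stage would only call push, and a separate thread outside the pipeline does the timed display. FrameQueue, run_display, and the use of integer frame ids instead of real image pointers are hypothetical names and simplifications for the example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Minimal thread-safe FIFO standing in for a concurrent queue.
// Frames are represented by integer ids to keep the sketch short.
class FrameQueue {
public:
    void push(int frame_id) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push_back(frame_id);
        }
        cv_.notify_one();
    }
    // Signals that no more frames will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Blocks until a frame is available; returns false once closed and drained.
    bool pop(int& frame_id) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        frame_id = q_.front();
        q_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<int> q_;
    bool closed_ = false;
};

// Display loop, meant to run on its own thread outside the pipeline.
// It shows frames in queue order and leaves at least `period` between frames.
std::vector<int> run_display(FrameQueue& queue, std::chrono::milliseconds period) {
    assert(period.count() >= 0);
    std::vector<int> shown;
    auto earliest = std::chrono::steady_clock::now();
    int frame_id = 0;
    while (queue.pop(frame_id)) {
        std::this_thread::sleep_until(earliest);  // hold the frame until its slot
        shown.push_back(frame_id);                // a real program would draw here
        earliest = std::chrono::steady_clock::now() + period;
    }
    return shown;
}
```

Because the waiting happens on the display thread, no pipeline worker is ever blocked on the timer. If the processing stages fall behind, the display thread simply waits in pop, so this paces the output but cannot by itself make up for a pipeline that is slower than the target frame rate.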

nikkey_c · ‎04-12-2010

thanks, my final question is, is it good idea to use concurrent_queue to store output of second stage image (second stage runs in parallel)? because we need display image in correct order but second stage runs parallel, therefore frame number may change after second stage finished. But we are displaying image in third stage using pops.

RafSchietekat · ‎04-12-2010

You seem set on waiting inside the third stage, ignoring my advice, aren't you?

If you care about the order of the frames, you should only add them to a queue in an ordered serial stage.

nikkey_c · ‎04-12-2010

ya,

but i have to store background subtracted images into array and pass that array to output filter. that is i am sending input frame buffer from stage-1 to stage-2. there i am doing background subtraction and store result image (new image) to another array and send that to third stage. therefore will it make problem for order of frame in third stage?

RafSchietekat · ‎04-12-2010

Each stage takes a cookie and returns a (potentially different) cookie, typed void* because it is typically a pointer to data. You have to be prepared for more than one data item in flight at any one time (otherwise the pipeline would degenerate into a simple loop), i.e., multiple frames may exist at the same time, but if you pass the correct values in each stage, the ordered-serial output filter will automagically process the frames in the same order as the serial input filter.
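To make the ordering guarantee less magical, here is a conceptual sketch of how frames that finish a parallel stage out of order can be released in input order again. It is not TBB's actual implementation, only an illustration of the idea: each frame is tagged with a sequence number at the input, and early arrivals are held back until their predecessors have passed. The Reorderer class and the integer frame ids are hypothetical, and the class is not thread-safe on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Holds back frames that arrive early and releases them in sequence order.
class Reorderer {
public:
    // Accepts a frame tagged with its input sequence number and returns every
    // frame that can now be released, in order. The result may be empty.
    std::vector<int> submit(std::size_t seq, int frame_id) {
        assert(seq >= next_ && pending_.count(seq) == 0);
        pending_[seq] = frame_id;
        std::vector<int> ready;
        // Release the run of consecutive sequence numbers starting at next_.
        for (auto it = pending_.find(next_); it != pending_.end();
             it = pending_.find(next_)) {
            ready.push_back(it->second);
            pending_.erase(it);
            ++next_;
        }
        return ready;
    }
private:
    std::size_t next_ = 0;                // sequence number expected next
    std::map<std::size_t, int> pending_;  // early arrivals, keyed by sequence
};
```

For example, if frames 1 and 2 finish before frame 0, nothing is released until frame 0 arrives, at which point all three come out together in the original order.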

nikkey_c · ‎04-12-2010

i think my question is wrong.. I think I have to maintain same buffer for all stages. am i right?

robert-reed · ‎04-12-2010

OK, you have a sequence of buffers that are filled by the input filter; each buffer contains a single frame of the sequence. It sounds like the threads only read these buffers, writing their changes to separate buffers, right? However, the intermediate stage needs access to two adjacent buffers to do the background subtraction processing. Likewise it sounds like you need a set of buffers for the output side, filled by the middle filter and emptied by the output filter. It would be better if you can reuse both the input buffers (returning them to the input filter after the second background subtraction process to use them finishes) and the output buffers (returning them to the middle filter for refilling after the output filter has marshalled them on to the video output stream) rather than creating and destroying them in the process. I have an old set of blogs that talk a bit about such buffer management in the processing of a TBB pipeline that might provide some details you can use.
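The buffer reuse described above could look something like the following sketch, which assumes a fixed number of equally sized frame buffers handed out and returned through a small pool. It is an illustration under those assumptions rather than the scheme from the blogs mentioned; BufferPool, acquire, and release are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

// Fixed set of frame buffers that stages borrow and return instead of
// allocating and freeing a new buffer for every frame.
class BufferPool {
public:
    BufferPool(std::size_t count, std::size_t bytes_per_buffer) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(std::make_unique<Buffer>(bytes_per_buffer));
    }
    // Returns a free buffer, or nullptr when all buffers are currently in use.
    std::unique_ptr<Buffer> acquire() {
        std::lock_guard<std::mutex> lock(m_);
        if (free_.empty()) return nullptr;
        std::unique_ptr<Buffer> b = std::move(free_.back());
        free_.pop_back();
        return b;
    }
    // Hands a buffer back once the last stage that reads it has finished.
    void release(std::unique_ptr<Buffer> b) {
        assert(b != nullptr);
        std::lock_guard<std::mutex> lock(m_);
        free_.push_back(std::move(b));
    }
private:
    std::mutex m_;
    std::vector<std::unique_ptr<Buffer>> free_;
};
```

In a setup like the one discussed, the input filter would acquire a buffer per frame and the stage that last reads it would release it. Since the subtraction needs two adjacent frames, an input buffer can only go back to the pool after the second subtraction that uses it has completed.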

The key to providing sufficient processing to keep ahead of the output filter is efficient buffer management and dividing the work efficiently between available threads to minimize the background subtraction process, which it sounds varies in the amount of work based on the input data (presumably rapidly changing frames require more processing to determine what is the background). It also sounds like just making the middle filter parallel did not provide enough processing to accomplish the goal. I go back to an idea I suggested originally, of employing a parallel_for within the middle filter, using frame banding like Raf suggested or some similar process to cleanly divide the work among available threads. This would be an important addition to provide load balancing between the lightly processed frames and the more heavily processed ones--if each frame is represented as a single, indivisible task, there's no opportunity to balance the frames among the threads.

nikkey_c · ‎04-12-2010

actually my task is not only a background subtraction, i have to track, encode and some other process for every frame, thats why i chose pipeline than only parallel_for. Then there would be nearly three to four parallel stage will happen in the intermediate stage. Most of intermediate stages have big for loop (to read every pixels of image, if pixel size is 400x400 then we have two big for loop and each of them have 400 iteration). i think in this situation i can use pipeline instead of parallel_for. am i right? or any other idea?

RafSchietekat · ‎04-12-2010

#13 "However, the intermediate stage needs access to two adjacent buffers to do the background subtraction processing."
Really? I didn't pick up on that. Guess I don't know what background subtraction really does (I thought each frame would be processed relative to a still reference frame), or why the work would vary significantly from frame to frame.

#14 "i think in this situation i can use pipeline instead of parallel_for."
A pipeline is a great way to divide a big workload, but perhaps not if latency is a concern, or if simultaneously processing frames compete for the cache (400x400 pixels at 3 bytes per pixel is half a megabyte). In such a situation you'd want to limit the number of frames "in flight", and use parallel_for on horizontal strips (or "bands"?) to provide parallel slack. Actually, there may also be an affinity concern in that each strip may need to be processed entirely on a single core (TBB doesn't do NUMA yet, I think, so staying with the same core seems like the best tactic in such situations); perhaps Robert knows whether affinity_partitioner does anything for parallel_for inside a pipeline, but without such assurance I might even opt for parallel_for instead of a pipeline (not just inside one), and then, unless I'm mistaken, affinity_partitioner would still help to coordinate strips of the varying frames with the corresponding strips of the reference frame (or whatever related constraint applies).

Does that make sense?

nikkey_c · ‎04-13-2010

ya... thanks for your reply, it helps to develop my task. once again thank you.

I have another question, I have to run one thread (say thread to display image) in every 30ms and check queue and display. Then what is the efficient way to implement using tbb?

robert-reed · ‎04-13-2010

This topic has been discussed elsewhere in the forum, though I haven't paid enough attention to it recently to know if it has been resolved. You might start by looking at this thread.