Neeley
Beginner
52 Views

Pipeline buffer between stages?

I have set up a pipeline that reads image data from a file -> performs a filter -> performs a filter -> outputs the result to the screen. Just to be sure that I had things set up correctly (and understood what I am doing), I replace the first and last pixel with a frame counter just after reading the data from the file in my input stage. Each stage takes precautions to maintain the original frame counter in these pixels. After checking these pixels at my output stage, I am finding that these counters get out of sync: sometimes the frame numbers do not stay sequential, and sometimes the first and last pixel do not match. I have even set each filter to serial_in_order and still get these odd results. This leads me to believe I should place buffers between my stages. So here are my questions:

1) Should I place a buffer between each stage of the pipeline?
2) How large should these buffers be?
3) Should I use concurrent_queue to implement these buffers?
4) If I push and pop from within each stage's operator(), what would I return as a token?
5) Is there a better solution to this problem?

I know that debugging a problem without seeing the actual code is hard, but any advice on this subject would be greatly appreciated.
17 Replies
RafSchietekat
Black Belt

"sometimes the first and last pixel does not match"
I would suggest holding a magnifying glass over your own code first.
Anton_Pegushin
New Contributor II

Quoting - Neeley
Hi,

What about using data-dependent breakpoints in a debugger to find out which thread, and from which function, write-accesses the first and the last pixel in the image? Since you're saying that you changed all of the stages to be serial_in_order but the error did not go away, I'd assume that the problem is not with parallelism but with the image-processing code.
Neeley
Beginner

Quoting - Anton_Pegushin

I have fixed some of the problems that I was having. This is what I am doing now. In the input stage of the code I replace the first and last pixel with the frame number. Then in each of the pipeline stages that follow, the first operation is to copy from the token that is passed in into a local array. Then I store the first and last pixel into a local variable. When I leave the stage I replace the first and last pixel with the stored values and copy into the token that I am passing out. This helped a lot but still did not completely fix the problem. I then switched all the stages but the first and last to parallel; now the two stages after the input act as expected, but the last still has problems. I also started printing out the test pixels to individual txt files, one for each stage, instead of printing to the screen.
RafSchietekat
Black Belt

Why copy to a "local array" (seems expensive and useless)? What happens if you initialise task_scheduler_init with argument 1? If this does not make the problem go away, what if you actually make the program serial? Are you using vector instructions or plain C++? Note that parallel filters are run right after any preceding filter on the same thread, so if there is a difference with running them serially that should provide a clue.
Neeley
Beginner

>> Why copy to a "local array" (seems expensive and useless)?

Raf, how would you do this?

I originally was not doing this, but then I realized that since we are passing a (void*) between filters, if you do not protect the data while filter 2 is reading its input (filter 1's output), filter 1 is changing the data. I am now thinking that I should probably wrap the memcpy in a CRITICAL_SECTION to further ensure that I am not reading and writing at the same time.

>> What happens if you initialise task_scheduler_init with argument 1? If this does not make the problem go away, what if you actually make the program serial? Are you using vector instructions or plain C++? Note that parallel filters are run right after any preceding filter on the same thread, so if there is a difference with running them serially that should provide a clue.

I started with a serial implementation of this code, so yes, things work right when I run it serially. The only reason I started inserting the frame count into the first and last pixel was to convince myself that I did not have concurrency problems with the threaded implementation.
Alexey_K_Intel3
Employee

Why not allocate a memory buffer for each token (a portion of data processed by filters at once), pass this buffer through all pipeline stages, and free it in the last one after the buffer is no longer necessary? This way you should have no conflicts between filters. Sorry if I missed the reasons for the buffer being shared.
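As a plain-C++ sketch of this token lifecycle (names illustrative; the stages are shown as plain functions rather than TBB filter bodies): the first stage allocates, each middle stage works in place on the buffer it was handed, and the last stage frees:

```cpp
#include <cstddef>
#include <vector>

// One token = one frame's worth of pixels, owned by whoever holds the pointer.
struct Frame {
    std::vector<unsigned char> pixels;
    explicit Frame(std::size_t n) : pixels(n, 0) {}
};

// Input stage: allocate a fresh buffer per token and stamp the frame counter
// into the first and last pixel, as in the original test setup.
Frame* input_stage(unsigned char frame_no, std::size_t size) {
    Frame* f = new Frame(size);
    f->pixels.front() = frame_no;
    f->pixels.back()  = frame_no;
    return f;
}

// Middle stage: transform in place; nothing else touches this buffer now.
Frame* filter_stage(Frame* f) {
    for (std::size_t i = 1; i + 1 < f->pixels.size(); ++i)
        f->pixels[i] ^= 0xFF;  // stand-in for a real image filter
    return f;
}

// Output stage: consume the result and free the buffer, ending the token's life.
unsigned char output_stage(Frame* f) {
    unsigned char first = f->pixels.front();
    unsigned char last  = f->pixels.back();
    delete f;
    return first == last ? first : 0xFF;  // 0xFF would signal a mismatch
}
```

Because each token owns its own buffer from allocation to deletion, no two filters ever touch the same memory at the same time.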
Neeley
Beginner

Quoting - Alexey_K_Intel3

Thanks for the input. I think this answers my original question the best so far. I am starting to understand that the pipeline does help with the threading, but memory concurrency is still up to the user. I should have realized this as soon as I saw that the pipeline passes pointers. I think I have found that yes, I do need some kind of buffer in between each stage of the pipeline, but those buffers only have to be large enough to ensure that no data races (concurrency issues) occur.
RafSchietekat
Black Belt

#5 "I originally was not doing this, but then I realized that since we are passing a (void*) between filters, if you do not protect the data while filter 2 is reading its input (filter1's output), filter 1 is changing the data."
No, only one filter at a time is using that particular void*, and the referenced data is implicitly synchronised from one filter to the next if only plain C++ is being used. Don't handicap your program's performance with needless protection against imaginary races. The idea is actually that filters are visiting the data, not the other way around, which is important for cache locality. If you need to transform an image, then a void* value can point to two adjacent buffers, with each filter transforming the data directly from one buffer to the other (without an intermediate buffer!), although it would be better still to reuse one buffer if the transformation only needs local data (something like colour shift instead of image warp).
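One way to read the two-adjacent-buffers idea, as a plain-C++ sketch (names are illustrative, no TBB calls): the token owns a single allocation holding both halves, and each filter writes from the current front half into the back half and flips them:

```cpp
#include <cstddef>
#include <vector>

// One token carrying two adjacent buffers: each filter reads from the "front"
// half and writes to the "back" half, then flips which half is the front.
struct PingPong {
    std::vector<unsigned char> data;  // two images back to back
    std::size_t size;                 // pixels per image
    bool front_is_first;              // which half currently holds the input

    explicit PingPong(std::size_t n)
        : data(2 * n, 0), size(n), front_is_first(true) {}

    unsigned char* src() { return data.data() + (front_is_first ? 0 : size); }
    unsigned char* dst() { return data.data() + (front_is_first ? size : 0); }
    void flip() { front_is_first = !front_is_first; }
};

// A filter that needs the whole input image (e.g. a warp) writes the result
// into the other half instead of into an intermediate local buffer.
void invert_filter(PingPong& t) {
    const unsigned char* in = t.src();
    unsigned char* out = t.dst();
    for (std::size_t i = 0; i < t.size; ++i)
        out[i] = static_cast<unsigned char>(255 - in[i]);
    t.flip();  // the output is now the next filter's input
}
```

A single pointer to the PingPong token is all that travels through the pipeline, and no copy to a separate local array is ever needed.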
Neeley
Beginner

>>No, only one filter at a time is using that particular void*, and the referenced data is implicitly synchronised from one filter to the next if only plain C++ is being used. Don't handicap your program's performance with needless protection against imaginary races.

This is exactly what I was hoping, but the test did not seem to show this. I must be doing something wrong.

>>The idea is actually that filters are visiting the data, not the other way around, which is important for cache locality.

I will have to get my head around this. Are these buffers created outside the filters or are they members of the filters?

>>If you need to transform an image, then a void* value can point to two adjacent buffers,

Do you mean to use a memory block that is twice as large as the image, where the first half is the input and the second half is the output? If not, I really do not understand how one pointer can point to two buffers.

>> although it would be better still to reuse one buffer if the transformation only needs local data (something like colour shift instead of image warp).

I have examples of both, and will also need to go through an image and compile a list, then in the next filter work on that list.

Are there other examples for pipeline? I have been looking for some but cannot find anything but simple string manipulations.

I thank everyone that has commented on this topic. I have tried to figure this out on my own (I even bought and read Reinders' book) and only posted the questions here as a last resort. All of your comments are a big help.
RafSchietekat
Black Belt

"I will have to get my head around this. Are these buffers created outside the filters or are they members of the filters?"
Dealing with buffers is your job; the pipeline only passes void* values from one filter to the next. The values can change from input to output, and the pipeline will try to apply successive filters on the same thread to improve locality. I don't know how important that is here, but for understanding what you are seeing you should probably know that successive parallel filters are executed one after the other on the same thread.

"Do you mean to use a memory block that is twice as large of the image where the first half is the input and the second half is the output? If not I really do not understand how one pointer can point to two buffers?"
Yes, just to avoid an intermediate copy. Or you keep them separate, and let each filter read from the input and write to the newly allocated output, which is then passed to the next filter after the input is discarded. Just avoid using an intermediate local buffer with a wasted copying action.
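A plain-C++ sketch of that second option, with a made-up brighten filter standing in for a real transformation: each stage allocates its output, fills it from its input, and frees the input before handing the token on:

```cpp
#include <cstddef>
#include <vector>

using Image = std::vector<unsigned char>;

// Each filter reads from its input buffer and writes to a newly allocated
// output buffer; the input is freed as soon as the output exists, so there
// is never an extra intermediate copy.
Image* brighten(Image* in, unsigned char delta) {
    Image* out = new Image(in->size());
    for (std::size_t i = 0; i < in->size(); ++i)
        (*out)[i] = static_cast<unsigned char>((*in)[i] + delta);
    delete in;   // the input buffer is dead once the output exists
    return out;  // ownership moves on with the token to the next filter
}
```

Chaining such stages keeps exactly one live buffer per token per stage, with ownership handed off along the pipeline.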

Sorry I couldn't help with the actual problem, which remains a mystery.
Neeley
Beginner

Quoting - Raf Schietekat
"I will have to get my head around this. Are these buffers created outside the filters or are they members of the filters?"
Dealing with buffers is your job, the pipeline only passes void* values from one filter to the next. The values can change from input to output, but the pipeline will try to apply successive filters on the same thread to improve locality. I don't know how important that is here, though, but for understanding what you are seeing you should probably know that successive parallel filters are executed one after the other on the same thread.

"Do you mean to use a memory block that is twice as large of the image where the first half is the input and the second half is the output? If not I really do not understand how one pointer can point to two buffers?"
Yes, just to avoid an intermediate copy. Or you keep them separate, and let each filter read from the input and write to the newly allocated output, which is then passed to the next filter after the input is discarded. Just avoid using an intermediate local buffer with a wasted copying action.

Sorry I couldn't help with the actual problem, which remains a mystery.


Thank you for your time and comments on this subject.

I see how the approach you explain above would make data races a non-issue, since you are not reusing any memory but creating it as needed and deleting it after it has been used. I had assumed that reusing a buffer created at the beginning of the process would be faster than creating and deleting memory during the process, but by reusing the memory it does cause data race conditions.

I now plan to take a step back in my design and investigate which of these approaches are better for what I am trying to accomplish.

RafSchietekat
Black Belt

"by reusing the memory it does cause data race conditions"
Not necessarily: there is no such problem if the memory is not used across data item contexts. Refurbishing buffers instead of going through the allocator is always a good idea, but that's a different issue.
RafSchietekat
Black Belt

Quoting - Raf Schietekat
"by reusing the memory it does cause data race conditions"
Not necessarily: there is no such problem if the memory is not used across data item contexts. Refurbishing buffers instead of going through the allocator is always a good idea, but that's a different issue.
Well, almost always.

Let's have some clarity on what you meant by "local copy": filter instance variable (wouldn't work in a parallel filter, because there is only a single instance, which is invoked in parallel), or automatic variable inside the invocation implementation (still needlessly expensive, but at least OK for thread safety)?
Neeley1
Beginner

Quoting - Raf Schietekat

I was talking about an array that was a member of the filter class and allocated in the constructor of the filter.
After considering the things I have learned from this post, I am doing it differently. I have discarded my old code and started again. I am even considering investigating the VS2010B2 Asynchronous Agents Library to build my pipeline.
RafSchietekat
Black Belt

Quoting - Neeley
Don't put anything related to a data item in a parallel filter instance variable (unless you really know what you're doing): at best this would sabotage scalability (with correct synchronisation), and more likely it would make the program fail (without correct synchronisation). Even in a serial filter, you should give serious thought to any costs related to reusing a resource that stays with a filter instead of with the data item, such as copying data into and out of a buffer, and you should certainly not assume that such state stays valid long enough for access from a subsequent filter. If you want to refurbish a resource between data items (recycle from the last filter to be picked up again by the first filter), which is often a good idea, you have to write correctly synchronised code yourself (unfortunately), but it may still be worth it, more than reusing a resource that stays with a filter. Maybe the documentation could be made more explicit about expectations, to avoid such incorrect usage? I don't see any need to use an alternative, though, once this has been cleared up.
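The recycle-between-data-items idea can be sketched as a small synchronised free list; std::mutex and std::queue here are a plain-C++ stand-in (tbb::concurrent_queue would do the same job without an explicit lock):

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <vector>

using Buffer = std::vector<unsigned char>;

// A synchronised free list: the last filter returns buffers here, and the
// first filter grabs one instead of hitting the allocator every frame.
class BufferPool {
public:
    explicit BufferPool(std::size_t buffer_size) : size_(buffer_size) {}

    std::unique_ptr<Buffer> acquire() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!free_.empty()) {
                std::unique_ptr<Buffer> b = std::move(free_.front());
                free_.pop();
                return b;
            }
        }
        ++allocations_;  // counted only when we really had to allocate
        return std::unique_ptr<Buffer>(new Buffer(size_, 0));
    }

    void release(std::unique_ptr<Buffer> b) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push(std::move(b));
    }

    std::size_t allocations() const { return allocations_.load(); }

private:
    std::size_t size_;
    std::mutex mutex_;
    std::queue<std::unique_ptr<Buffer>> free_;
    std::atomic<std::size_t> allocations_{0};
};
```

The pool lives outside the filters, so no data-item state ever sits in a filter instance variable; each in-flight token still has exclusive use of its buffer.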



Quoting - Raf Schietekat
Don't put anything related to a data item in a parallel filter instance variable (unless you really know what you're doing): at best this would sabotage scalability (with correct synchronisation), and more likely it would make the program fail (without correct synchronisation).
Golden advice. Unfortunately too late for me; I had to learn this at hard-knock university :-)

1) I moved all member variables into the work item, except some statistics (about filter performance).
2) This was bad of course, because the innocuous-looking statistics were unprotected.
3) Even if they were protected, it would severely affect concurrency due to data contention.

So perhaps the rule of thumb ought to be "no member variables except const" in a filter class.

Another option could be to enhance TBB to clone the user-supplied filter class and map these clones to worker threads. (Which is how I thought things worked prior to the aforementioned hard-knock training.)
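The "no member variables except const" rule of thumb might look like this (ThresholdFilter and its statistic are made-up names; the point is that all mutable state rides in the work item):

```cpp
#include <cstddef>

// All mutable state, such as a per-frame statistic, travels with the work
// item itself, so parallel invocations never share anything writable.
struct WorkItem {
    std::size_t pixels_touched = 0;  // per-item statistic, never shared
};

class ThresholdFilter {
public:
    explicit ThresholdFilter(int threshold) : threshold_(threshold) {}

    // Safe to invoke from many threads at once: the only member is const,
    // and everything written goes into the caller's own WorkItem.
    void process(WorkItem& item, std::size_t pixel_count) const {
        // ... real filtering against threshold_ would happen here ...
        item.pixels_touched += pixel_count;
    }

private:
    const int threshold_;  // const-only state: read-only configuration
};
```

A last serial stage can then fold the per-item statistics into a grand total without any locking in the parallel stages.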



Neeley
Beginner




More info that I did not get from reading all the documentation. Back to the drawing board.