First, the process: I have a server that sits and listens for requests. For the sake of this example, let's say the requests come in on a named pipe (I've also used sockets for this; it all depends on what the customer wants, and some secure sites won't allow ANY open sockets, even for communication on the same machine). When a request comes in, the server does some complicated computation and returns the answer to the client.
I thought this would be a great fit for a parallel_while. I have the pop_if_present function listen for a connection on the named pipe or socket (blocking), and when it gets something from the client, it returns true and populates the argument_type& with the request and a pointer to the client to send the reply to.
The body of the while loop takes the data, does the computation, sends the reply when it's done, and then closes the socket.
The pop_if_present function looks like this:
bool pop_if_present(Message*& pMessage)
NamedPipe* Pipe=new NamedPipe(PipeName);
printf("creating pipe "); fflush(stdout);
printf("reading pipe "); fflush(stdout);
printf("dispatching %s ",Command.c_str()); fflush(stdout);
And the body of the while loop looks like this:
void operator()(Message* pMessage) const
printf("hello! "); fflush(stdout);
printf("receiving command ");
printf("sending result '%s' ",Result);
When I run the program, it listens on the pipe and the clients can properly send data to it; however, it seems that the body of the loop is never called. I get all the debug output for creating the pipe, reading the pipe, and dispatching the message, but I don't even get the "hello" printed from the body of the while loop.
Basically, I'll have a big processing engine running tbb that will accept requests from clients to solve hard problems. The order in which the clients' requests get processed doesn't matter too much, although a queue is probably best so that no client gets shut out for too long. The pattern in which requests are made is unpredictable; it all depends on how many users will be running clients.
No, parallel_while and its successor parallel_do won't work for you in the current design. The reason is that in your design, returning false from pop_if_present seems to be the norm (though I might be wrong here), while for those algorithms it signals the end of their initial feeding. For parallel_do, we made this more obvious, since it is fed with a pair of iterators, the typical interface to a container.
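To make the "end of feeding" point concrete, here is a minimal serial model of a feeder-driven loop. It is only a sketch and not how parallel_while is implemented: the name drain_until_false is made up for illustration, and nothing here runs in parallel. It only shows that the first false from the feeder ends the loop for good, so a request that shows up afterwards is never picked up.

```cpp
#include <cstddef>

// Hypothetical serial model (NOT TBB code): keep asking the feeder for items
// and stop permanently at the first `false`.
// `pop_if_present` has the shape bool(Item&); `body` has the shape void(const Item&).
template <typename Item, typename Feeder, typename Body>
std::size_t drain_until_false(Feeder pop_if_present, Body body) {
    std::size_t processed = 0;
    Item item;
    while (pop_if_present(item)) {  // `false` is read as "no more input, ever"
        body(item);
        ++processed;
    }
    return processed;               // requests arriving after this point are lost
}
```

So a feeder that returns false merely because no client is connected right now would shut such a loop down early.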
I think tbb::pipeline might suit your needs better. You would have (at least) two stages, one for input and one for processing. In the input stage, you should have a loop waiting for data (as with parallel_while, there has to be an "end of input" notion for pipeline, and returning NULL from the input stage serves this purpose). The irregular nature of the data requests dictates the need for backoff in the busy-wait loop, unless the Pipe->Read command blocks. Once you get the request, you exit the input stage, passing the data (via return) on for further processing in the pipeline.
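The shape of that two-stage design can be sketched with the standard library alone. To be clear, this is not tbb::pipeline and does not use its interface: Request, run_two_stage, read_next and process are all hypothetical names, and std::async merely stands in for the task scheduler. What it does show is a serial input stage that may block, a null return meaning "end of input", and a processing stage that runs concurrently.

```cpp
#include <future>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical request type; the real Message class is not shown in this thread.
struct Request { std::string command; };

// Serial input stage feeding a concurrent processing stage.
// `read_next` may block and returns nullptr to signal end of input.
// `process` takes a const Request& and returns the reply as std::string.
template <typename ReadNext, typename Process>
std::vector<std::string> run_two_stage(ReadNext read_next, Process process) {
    std::vector<std::future<std::string>> in_flight;
    while (std::unique_ptr<Request> req = read_next()) {      // stage 1: serial
        std::shared_ptr<Request> shared = std::move(req);
        in_flight.push_back(std::async(std::launch::async,   // stage 2: concurrent
            [shared, process] { return process(*shared); }));
    }
    std::vector<std::string> results;
    for (auto& f : in_flight) results.push_back(f.get());     // wait for all replies
    return results;
}
```

Unlike a real pipeline, this sketch puts no bound on the number of requests in flight and starts one thread per request, so treat it as an illustration of the control flow only.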
Though it might look like the same waiting trick would work in your parallel_while-based implementation, I think the pipeline is better for you for a few reasons:
- The trick won't work with parallel_while, because the algorithm assumes its feeder is full and tries to get a few items at a time and spawn them at once, for efficiency. By the way, this, rather than my initial assumption, may be why it does not work for you. In any case, this won't change: since parallel_while has been "replaced" by parallel_do, we don't want to invest in it and intend to deprecate it over time. This is the showstopper reason, but I will give you a couple more :)
- With parallel_while, you can only implement two stages. If your processing is logically divided into more stages, and especially if some stages are "ordered" (i.e. should preserve the initial order of data), pipeline seems a better fit.
- With pipeline, you have some control over how many pieces of data are processed simultaneously: you set the limit before pipeline execution, though you can't change it dynamically. The intent is to impose space-driven restrictions. With parallel_while, you have no control over how many data pieces are taken out at once, nor over how many pieces are processed at any given time.
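The idea of a fixed limit on in-flight items can be illustrated with a small counting helper. This is a sketch under my own assumptions: TokenLimiter is a hypothetical class, not part of TBB and not how pipeline is implemented internally. An item takes a token before entering the first stage and gives it back after leaving the last one, so at most max_tokens items exist at any time.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical helper (NOT part of TBB): caps how many items are in flight.
class TokenLimiter {
public:
    // The limit is fixed at construction and cannot be changed afterwards.
    explicit TokenLimiter(std::size_t max_tokens) : free_(max_tokens) {}

    // Blocks until a token is free, then takes it.
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return free_ > 0; });
        --free_;
    }

    // Non-blocking variant; returns false when all tokens are in use.
    bool try_acquire() {
        std::lock_guard<std::mutex> lock(m_);
        if (free_ == 0) return false;
        --free_;
        return true;
    }

    // Gives a token back once an item has left the last stage.
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++free_;
        }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t free_;  // tokens currently available
};
```

With a limit of 2, a third request simply waits in acquire() until one of the first two has been answered, which is the space-driven restriction described above.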
A lowlight: so far, there is a known issue that worker threads can spin idle if the pipeline input stage is blocked waiting for data. However, the fix is ready and will be available in the next OSS developer release of TBB.
I edited the above post to try to clarify it, but it is still not enough.
If Pipe->Read blocks, you certainly don't need any loop (so disregard the related parts of the post), and your design is fine for pipeline (apart from the required interface changes) but not for parallel_while, due to the "showstopper reason" I explained above. In your testing, if you issued a number of requests (four, if I remember right), you would see their processing start, but this is not what you want :)
I hope I was able to explain the reasons for the behavior you saw, and also to give you some perspective :)
The pipeline works great, even with a blocking Read call. There's only one small issue, which I have a workaround for, but I'd appreciate a more elegant solution. If nTokens for the pipeline is greater than nThreads, then the second step of the pipeline doesn't always get called. If nTokens <= nThreads, then it works as I'd expect. Example:
Pipeline step1: Open pipe, read from pipe, pass command to step 2
Pipeline step2: Process command, send result back on pipe, close pipe
If I start the server with nTokens==nThreads (in my case 2, for my dual-core test machine) and then start three clients, client 1 sends a long job, and clients 2 and 3 send something quick to process.
The server starts processing client 1's job; while that is running, it gets the requests from clients 2 and 3, processes them, and returns the answers. Then, when it finally finishes client 1's job, it returns that answer, just as I expect.
But if I do the same thing with nTokens>nThreads (I tested values of nTokens from 3 up to 10), then client 1's job starts, and when clients 2 and 3 send their requests, only one of them gets an answer; the other never starts, even when client 1's job is done.
I suspect that this is something like the parallel_while issue, where to save time it tries to batch-start a bunch of steps once all the tokens are full?
Is there a way in tbb to ask how many worker threads have been started (if tbb::task_scheduler_init::automatic was used) so that I can make nTokens scale properly? This can be running on many different types of hardware, where I'm not going to know in advance how many processors/cores are available.
What you have described is a known flaw in the "old" design of the pipeline (and I will explain why I call it "old"). In that design, nTokens items are taken out of the input stage right after the pipeline starts, and passed to stage 2. Of those, nThreads items pass stage 2. But after that, instead of processing the rest of the items already taken from the input, every worker wants to take a new item from the input. Not a good idea, and besides, that was the reason for the idle spinning: the input stage is always serial, so if one worker is blocked there, all other workers spin idle in attempts to enter the input stage. In your case, if your input stage returned NULL at some moment, you would see the third item being processed after that. Again, not quite what you want.
The idle spinning issue was reported to us, and as I said, we have the fix ready for it. The good news for you is that the fix actually replaces that pipeline design (which I now call "old") with a much better one, and this issue with "hanging" items should go away as well; now no items are pre-taken from the input, and a worker only goes for the next item when it can't proceed with the current items.
We plan to release the update this week or maybe early next week, and it will contain the fix. Meanwhile, in current development releases, there is functionality to detect the default number of workers created by task_scheduler_init if no explicit number was given. Use task_scheduler_init::default_num_threads for it. Alas, this function is not available in any commercial-aligned release.
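Until the fixed pipeline is available, the workaround from the question (keeping nTokens <= nThreads) can be sketched as below. Note the assumptions: detected_threads and clamp_tokens are hypothetical helper names, and std::thread::hardware_concurrency() is used only as a standard-library stand-in for the thread count. In a TBB build that has it, the value from task_scheduler_init::default_num_threads would be the one to pass in instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Stand-in for the scheduler's default thread count (an assumption; a TBB
// build would query task_scheduler_init::default_num_threads instead).
std::size_t detected_threads() {
    unsigned n = std::thread::hardware_concurrency();  // may be 0 if unknown
    return n == 0 ? 1 : static_cast<std::size_t>(n);
}

// Workaround discussed in this thread: never ask for more tokens than threads.
// Always returns at least 1 so the pipeline can make progress.
std::size_t clamp_tokens(std::size_t wanted_tokens, std::size_t n_threads) {
    assert(n_threads > 0);
    return std::max<std::size_t>(1, std::min(wanted_tokens, n_threads));
}
```

On the dual-core test machine from the question, clamp_tokens(10, 2) gives 2, which is the configuration that was reported to behave as expected.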
Today's blog post, http://softwareblogs.intel.com/2008/02/26/threading-building-blocks-commercial-aligned-version-20_017/, leads me to think that it has been released, but no changes to the pipeline were mentioned. I am able to move off the CAL releases and am looking for the most stable release that has the new pipeline fixes.
Mike, it doesn't look like you'll have to wait long. A new development release was posted today, and it includes pipeline performance improvements. I don't know if it contains everything Alexey discussed earlier in this thread, but the spin waiting has been removed, according to the CHANGES file.