Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Parallel_for with small grainsize gets stuck

Anssi_K_
Beginner

Hi,

I have tested my code with TBB 4.0 and 4.2, and both have the same problem. I use Fedora 18 Linux, kernel 3.10.14. The processor is a 4-core AMD Phenom II X4 955 and the compiler is gcc 4.7.2. The code is compiled with: g++ -Wall -g -o stuck -ltbbmalloc -ltbb -std=c++11 stuck.cpp.

The plain serial loop, and the first parallel_for whose grainsize is large enough that one core does all the work, both run fine: the program prints the numbers 0-79 twice. However, in the second parallel_for, where the grainsize is 1, the program prints nothing and gets stuck.

[cpp]

#include <iostream>
#include <vector>
#include <tbb/task_scheduler_init.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/concurrent_queue.h>
#include <tbb/parallel_invoke.h>

const int NUM_OF_QUEUES = 8;
const int GRAINSIZE_1   = 10;
const int GRAINSIZE_2   = 1;

typedef tbb::concurrent_bounded_queue<int> int_queue_t;
typedef std::vector<int_queue_t> int_queue_vector_t;

int_queue_vector_t int_queues;

void generate_ints(int num_of_ints, int index)
{
    for (int i = index * num_of_ints; i < (index + 1) * num_of_ints; ++i) {
        int_queues[index].push(i);
    }
    int_queues[index].push(-1);   // sentinel: end of this queue's values
}

void print_ints(int index)
{
    int i;

    // Blocking pop; the -1 sentinel ends the loop.
    int_queues[index].pop(i);
    while (i >= 0) {
        std::cout << i << '\n';
        int_queues[index].pop(i);
    }
}

void generate_and_print_ints(int num_of_ints, int index)
{
    // Capacity 1: the producer blocks as soon as one value sits unconsumed,
    // so the consumer must run concurrently for either to finish.
    int_queues[index].set_capacity(1);

    tbb::parallel_invoke(
        [&] { generate_ints(num_of_ints, index); },
        [&] { print_ints(index); });
}

int main(int argc, char **argv)
{
    const int num_of_ints = 10;
   
    tbb::task_scheduler_init init;
   
    int_queues.resize(NUM_OF_QUEUES);
 
    // Serial reference run: works.
    for (int i = 0; i < NUM_OF_QUEUES; i++) {
        generate_and_print_ints(num_of_ints, i);
    }

    // Grainsize 10 >= NUM_OF_QUEUES, so the range is never split: works.
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, NUM_OF_QUEUES, GRAINSIZE_1),
        [&] (const tbb::blocked_range<size_t>& range) {
            for (size_t i = range.begin(); i != range.end(); ++i) {
                generate_and_print_ints(num_of_ints, i);
            }
        });

    // Grainsize 1 allows one chunk per queue: prints nothing and hangs.
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, NUM_OF_QUEUES, GRAINSIZE_2),
        [&] (const tbb::blocked_range<size_t>& range) {
            for (size_t i = range.begin(); i != range.end(); ++i) {
                generate_and_print_ints(num_of_ints, i);
            }
        });
}

[/cpp]

Regards, Anssi

RafSchietekat
Valued Contributor III

Your program is bound (no pun intended) to fail at least sometimes because of the concurrency it requires. As long as you have fewer potential producers than software threads, at least one consumer will always be executing to drain its queue, and so on until the program finishes. Otherwise it may well happen that you flood the task scheduler with producers, which fill up their queues and block before they can push the -1 and finish, causing deadlock.

My expectation is that, once there can be 4 producer tasks in flight at the same time (with the 8 queues in the code and 4 hardware threads, that means a grainsize small enough for the range to be split into at least 4 chunks), the percentage of failed runs gradually increases as the number of queues grows and the grainsize shrinks.
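To illustrate that relationship, here is a small probe (my own sketch, not code from Anssi's post) that counts how many chunks parallel_for actually hands to the body for the 8-element range at a given grainsize; each concurrently running chunk is a potential blocked producer, and once there are as many of those as worker threads, nobody is left to run a consumer.

[cpp]

#include <atomic>
#include <iostream>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Counts how many leaf chunks parallel_for hands to the body for a given grainsize.
// A blocked_range is only split while its size exceeds the grainsize, so grainsize 10
// over 8 queues always yields exactly 1 chunk, while grainsize 1 allows up to 8.
// (The exact count for intermediate grainsizes can vary with the default partitioner.)
int count_chunks(size_t num_of_queues, size_t grainsize)
{
    std::atomic<int> chunks(0);
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, num_of_queues, grainsize),
        [&](const tbb::blocked_range<size_t>&) { ++chunks; });
    return chunks;
}

int main()
{
    for (size_t g : {10, 5, 3, 1})
        std::cout << "grainsize " << g << " -> " << count_chunks(8, g) << " chunks\n";
}

[/cpp]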

You should redesign your program so that it does not require concurrency (TBB is great with optional parallelism, but there is no preemption), or use plain old threads instead.
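A minimal sketch of the plain-threads alternative (my illustration, not code from this thread; the helper name is made up), keeping the same bounded-queue idea as Anssi's program: each queue gets a dedicated producer thread and a dedicated consumer thread, so progress never depends on a TBB worker being free. Compile with -pthread.

[cpp]

#include <iostream>
#include <thread>
#include <tbb/concurrent_queue.h>

// One dedicated producer thread and one dedicated consumer thread per queue.
// The OS preempts and schedules them, so a full queue only stalls the producer
// until the consumer catches up; it cannot deadlock the whole program.
void generate_and_print_ints_threaded(int num_of_ints, int index)
{
    tbb::concurrent_bounded_queue<int> queue;
    queue.set_capacity(1);

    std::thread producer([&] {
        for (int i = index * num_of_ints; i < (index + 1) * num_of_ints; ++i)
            queue.push(i);
        queue.push(-1);                      // sentinel: end of values
    });
    std::thread consumer([&] {
        int i;
        for (queue.pop(i); i >= 0; queue.pop(i))
            std::cout << i << '\n';
    });

    producer.join();
    consumer.join();
}

[/cpp]

This is the heavyweight route: two OS threads per queue, but correctness no longer hinges on how parallel_for chops up the range.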

Greg__Trost
Beginner

Does this mean that if I take two bodies of code, each of which can successfully use TBB, and combine them into a single program that runs both bodies of TBB-based code concurrently, I can end up with a program that no longer runs? This isn't a hypothetical question; I think I've seen this sort of thing before.

jiri
New Contributor I

If those two parts are well designed, then running them concurrently may be (at worst) bad for performance, but it should not result in a faulty program. The original software in this thread required high enough concurrency (or a really lucky scheduler) to work. If the required number of worker threads wasn't available or the scheduler wasn't lucky (work-stealing is random), it deadlocked. That is not good design.

If you only use the algorithms provided by TBB and don't use any locking in your code, then your program should not get stuck no matter what... If you need locks or you use the scheduler directly (creating and spawning your own tasks), you can still create programs that run just fine, but you have to be much more careful.

RafSchietekat
Valued Contributor III

I don't know the sufficient or necessary conditions for composability, but not requiring parallelism remains a valuable guideline.

(jiri's statement about "well designed" seems more like a definition than a guarantee, and you need look no further than the given program for an illustration of how you can get yourself into trouble with just standard data structures and algorithms.)

(Corrected 2013-10-24) Toned down the description of the guideline.

jiri
New Contributor I

The example uses tbb::concurrent_bounded_queue, which is also a lock - you can wait on it (possibly indefinitely) and create deadlock that way.
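To make that concrete, a tiny example of my own (not from the thread): a blocking pop on a bounded queue that is never fed behaves exactly like acquiring a lock that is never released.

[cpp]

#include <tbb/concurrent_queue.h>

int main()
{
    tbb::concurrent_bounded_queue<int> q;
    int x;
    // Nothing is ever pushed, so this pop blocks forever - in a wait-for graph
    // it plays the same role as a lock acquisition that can never succeed.
    q.pop(x);
}

[/cpp]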

And you are right, the "well designed" part was supposed to mean "you should build your software carefully, to make sure that ..."

RafSchietekat
Valued Contributor III

I think it's a bit of a stretch to say that a bounded queue is a lock just because its methods can block.

But my point was that there's more to it than these statements seemed to imply. If you give a guideline, make clear it's a guideline. If you state something that looks like a guarantee, make sure it's valid and not open to interpretation. There's a difference between understanding something and being able to communicate that understanding, and it's counterproductive to believe or pretend otherwise.

Greg__Trost
Beginner

Thanks for the clarifications.

So, to go back to Anssi's original question: I think the sample program should be rewritten to use TBB's parallel_pipeline or its equivalent, where one pipeline stage generates the integers and the other prints them. That gives him the same flow-through behavior but avoids the deadlock.

Bill

jiri
New Contributor I

Raf, what I meant was that the queue can act as a lock in the wait-for graph, forming a deadlock.

Not being a native English speaker, I'm not really up for a discussion about what looks like a guideline or a guarantee and what does not. Despite that, I still think that a TBB-based program should be designed so that it can run fine even if the available concurrency is limited. And I still don't know of any example where a standard TBB algorithm can deadlock if you don't introduce any lock in your code (directly or in a call). This is not true for the flow graph, but IIRC the flow graph is not considered to be an algorithm.

I'm not sure I understand the comment about understanding (no pun intended).

RafSchietekat
Valued Contributor III

jiri wrote:

I'm not sure I understand the comment about understanding (no pun intended).

It seems clear to me that you understand what you're doing, but you're also writing things that are not strictly true (at face value), so perhaps somebody with less experience might not get the right meaning.

jiri
New Contributor I

OK, I'll try to be more careful about that the next time.

Anssi_K_
Beginner

I did as Bill suggested and changed my program to use parallel_pipeline. Now there are no more deadlocks.

[cpp]

#include <iostream>
#include <tbb/task_scheduler_init.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/pipeline.h>

const int NUM_OF_QUEUES = 8;
const int GRAINSIZE_1   = 10;
const int GRAINSIZE_2   = 1;

void generate_and_print_ints(int num_of_ints, int index)
{
    int i = index * num_of_ints;

    // Two serial filters with at most one token in flight: the first stage
    // generates this index's ints, the second stage prints them.
    tbb::parallel_pipeline(1,
        tbb::make_filter<void, int>(
            tbb::filter::serial,
            [&](tbb::flow_control& fc) -> int {
                if (i < (index + 1) * num_of_ints) {
                    return i++;
                }
                fc.stop();
                return -1;
            }) &
        tbb::make_filter<int, void>(
            tbb::filter::serial,
            [](int i) {
                std::cout << i << '\n';
            }));
}

int main(int argc, char **argv)
{
    const int num_of_ints = 10;
   
    tbb::task_scheduler_init init;
 
    // Serial reference run.
    for (int i = 0; i < NUM_OF_QUEUES; i++) {
        generate_and_print_ints(num_of_ints, i);
    }

    // Both parallel_for calls now finish, regardless of grainsize.
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, NUM_OF_QUEUES, GRAINSIZE_1),
        [&] (const tbb::blocked_range<size_t>& range) {
            for (size_t i = range.begin(); i != range.end(); ++i) {
                generate_and_print_ints(num_of_ints, i);
            }
        });

    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, NUM_OF_QUEUES, GRAINSIZE_2),
        [&] (const tbb::blocked_range<size_t>& range) {
            for (size_t i = range.begin(); i != range.end(); ++i) {
                generate_and_print_ints(num_of_ints, i);
            }
        });
}

[/cpp]

RafSchietekat
Valued Contributor III

Note that this way you won't have full concurrency across all queues (only 4 queues are serviced at a time). Generally speaking, though probably not manifested here, concurrency could be reduced even further by mutual starvation (all the pipelines run in the same arena, so a thread might steal work from another pipeline and stay there for a while). Still, it is quite efficient (assuming the pipeline does real work rather than just printing).

You could achieve full concurrency by running each pipeline from its own application thread (heavy), or by using enqueued tasks (complicated), ...
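For the first option, a rough sketch (my illustration; the helper name is made up, and it assumes the generate_and_print_ints() from Anssi's parallel_pipeline version above): each queue's pipeline is driven by its own std::thread, so every pipeline has an application thread of its own and all queues can make progress concurrently.

[cpp]

#include <thread>
#include <vector>

// Runs one parallel_pipeline per queue, each from its own application thread.
// Assumes generate_and_print_ints(int, int) from the pipeline version above.
void run_all_pipelines(int num_of_ints, int num_of_queues)
{
    std::vector<std::thread> threads;
    for (int i = 0; i < num_of_queues; ++i)
        threads.emplace_back([=] { generate_and_print_ints(num_of_ints, i); });
    for (auto& t : threads)
        t.join();
}

[/cpp]

This is the heavy variant Raf mentions: one extra OS thread per queue just to host its pipeline.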
