Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

How to use parallel_reduce to avoid a race condition in a parallel_for loop?

lascondes
Beginner
I have the following code that represents a simplified version of a larger problem. I would like to have the outer for loop run in parallel. If I change the outer for loop to a parallel_for, I create a race condition on the contacts list held in each element and on the contactlist holding all contacts.

The "lista" needs to be a std::list because I add and remove lots of elements outside of this part of the code.

In practice the "lista" will have 10,000 - 100,000 elements and the inner loop will cover a much smaller subset of the entire list (< 100 elements). The simple example condition stands in for more complex, time-consuming calculations.

a) With Intel Threading Building Blocks, can I use the parallel_for function for this problem?
b) What, and hopefully how, do I need to protect the

(*it)->contacts.push_front(contact);
(*it2)->contacts.push_front(contact);
contactlist.push_front(contact);

to avoid a race condition?

c) Is there a way to use the parallel_reduce to combine the linked lists as the threads return and not have to use locks?

I have tried to keep the concurrency problem I am trying to solve as clear as possible in the code example below.

Thanks,

LasCondes

#include <iostream>
#include <list>
#include <tchar.h>   // for _tmain / _TCHAR

using namespace std;

class Contact;

class Element {
public:
    Element() {value = 0;}
    Element(double Value) {value = Value;}
    list<Contact *> contacts;   // contacts this element participates in
    double value;
};

class Contact {
public:
    Contact(Element *Ain, Element *Bin, double value) {A = Ain; B = Bin; Value = value;}
    Element *A;
    Element *B;
    double Value;
};

int _tmain(int argc, _TCHAR* argv[])
{
    int N = 15;

    list<Element *> lista;
    for (int i = 0; i < N; i++) {
        lista.push_front(new Element(i));
    }

    list<Contact *> contactlist;

    list<Element *>::iterator it;
    // I want to do this outer loop with a parallel_for or parallel_for_each
    for (it = lista.begin(); it != lista.end(); it++) {
        list<Element *>::iterator it2 = it; // initialize it2 = it

        for (++it2 /* start it2 one past it */; it2 != lista.end(); it2++) {
            cout << (*it)->value << "," << (*it2)->value << " ";
            double sum = (*it)->value + (*it2)->value;
            if (sum == 4 || sum == 13) { // Simple example condition
                Contact *contact = new Contact(*it, *it2, sum);

                // How do I isolate these three calls?
                (*it)->contacts.push_front(contact);
                (*it2)->contacts.push_front(contact);
                contactlist.push_front(contact);
                // End How do I isolate these three calls?
            }
        }
        cout << endl;
    }

    cout << "Output of contacts linked to the elements" << endl;
    for (it = lista.begin(); it != lista.end(); it++) {
        cout << (*it)->value << ": ";
        list<Contact *>::iterator it3;
        for (it3 = (*it)->contacts.begin(); it3 != (*it)->contacts.end(); it3++) {
            if ((*it3)->A == *it) { // looking for the memory location to be equal!
                cout << (*it3)->B->value << " ";
            } else {
                cout << (*it3)->A->value << " ";
            }
        }
        cout << endl;
    }

    cout << "Output of contacts" << endl;
    list<Contact *>::iterator it3;
    for (it3 = contactlist.begin(); it3 != contactlist.end(); it3++) {
        cout << (*it3)->A->value << " " << (*it3)->B->value << endl;
    }

    // Free memory
    list<Element *>::iterator it4;
    for (it4 = lista.begin(); it4 != lista.end(); it4++) {
        delete *it4;
    }
    lista.clear();

    list<Contact *>::iterator it5;
    for (it5 = contactlist.begin(); it5 != contactlist.end(); it5++) {
        delete *it5;
    }
    contactlist.clear();

    return 0;
}

Output from the above code:

14,13 14,12 14,11 14,10 14,9 14,8 14,7 14,6 14,5 14,4 14,3 14,2 14,1 14,0
13,12 13,11 13,10 13,9 13,8 13,7 13,6 13,5 13,4 13,3 13,2 13,1 13,0
12,11 12,10 12,9 12,8 12,7 12,6 12,5 12,4 12,3 12,2 12,1 12,0
11,10 11,9 11,8 11,7 11,6 11,5 11,4 11,3 11,2 11,1 11,0
10,9 10,8 10,7 10,6 10,5 10,4 10,3 10,2 10,1 10,0
9,8 9,7 9,6 9,5 9,4 9,3 9,2 9,1 9,0
8,7 8,6 8,5 8,4 8,3 8,2 8,1 8,0
7,6 7,5 7,4 7,3 7,2 7,1 7,0
6,5 6,4 6,3 6,2 6,1 6,0
5,4 5,3 5,2 5,1 5,0
4,3 4,2 4,1 4,0
3,2 3,1 3,0
2,1 2,0
1,0

Output of contacts linked to the elements
14:
13: 0
12: 1
11: 2
10: 3
9: 4
8: 5
7: 6
6: 7
5: 8
4: 0 9
3: 1 10
2: 11
1: 3 12
0: 4 13
Output of contacts
3 1
4 0
7 6
8 5
9 4
10 3
11 2
12 1
13 0

RafSchietekat
Valued Contributor III
"a) With the Intel Threading Building Blocks can I use the parallel_for function for this problem?"
parallel_for() (or parallel_reduce(), for that matter) and std::list do not go together well (random access vs. sequential access). How exactly do you "add and remove lots of elements outside of this part of the code" (information required for choosing an alternative)?

"b) What, and hopefully how, do I need to protect the

(*it)->contacts.push_front(contact);
(*it2)->contacts.push_front(contact);
contactlist.push_front(contact);

to avoid a race condition?"
You could use an intrusive list with atomic operations (most efficient), or a good old spin_mutex (nothing wrong with that, especially if TBB is almost alone on the machine so its worker threads are not likely to get preempted while holding a lock), or perhaps concurrent_vector (somewhat heavy here, but OK for a first iteration just to get you going).
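For example, a minimal sketch of the spin_mutex option, assuming the variable names from your code (the mutex name contacts_mutex is illustrative):

#include "tbb/spin_mutex.h"

tbb::spin_mutex contacts_mutex; // one lock protecting the per-element lists and contactlist

// ... inside the inner loop, where the new contact is created:
{
    tbb::spin_mutex::scoped_lock lock(contacts_mutex);
    (*it)->contacts.push_front(contact);
    (*it2)->contacts.push_front(contact);
    contactlist.push_front(contact);
} // lock is released when the scoped_lock goes out of scope

A single coarse lock like this is the least code; finer-grained locking (for example, one mutex per Element plus one for contactlist) could reduce contention later if profiling shows it matters.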

"c) Is there a way to use the parallel_reduce to combine the linked lists as the threads return and not have to use locks?"
What do you mean by that?
lascondes
Beginner
Quoting - Raf Schietekat
"a) With the Intel Threading Building Blocks can I use the parallel_for function for this problem?"
parallel_for() (or parallel_reduce(), for that matter) and std::list do not go together well (random access vs. sequential access). How exactly do you "add and remove lots of elements outside of this part of the code" (information required for choosing an alternative)?

"b) What, and hopefully how, do I need to protect the

(*it)->contacts.push_front(contact);
(*it2)->contacts.push_front(contact);
contactlist.push_front(contact);

to avoid a race condition?"
You could use an intrusive list with atomic operations (most efficient), or a good old spin_mutex (nothing wrong with that, especially if TBB is almost alone on the machine so its worker threads are not likely to get preempted while holding a lock), or perhaps concurrent_vector (somewhat heavy here, but OK for a first iteration just to get you going).

"c) Is there a way to use the parallel_reduce to combine the linked lists as the threads return and not have to use locks?"
What do you mean by that?

a) This is part of a numerical simulation - an outer loop working through time. Elements are added individually at various points in time to one end of the list, and as items move out of the modeling space they are removed; the removed items may be anywhere in the list. So I am using a list to handle the variable length and the removals at arbitrary positions.

b) For the contacts lists, one idea that has come up is to use a concurrent_queue, as the length of the list may be 0 to 30+. Would this be reasonable?

c) At least for the contactlist I should be able to create local copies in each thread and combine them as the threads finish, creating the entire list. The order of the list is not important. I think this is done with parallel_reduce in TBB, but I am not sure. I don't see how this can be done with the per-element contacts lists.
RafSchietekat
Valued Contributor III
a) What are the constraints on visiting the elements? Are the "it2"s allowed to overtake each other?

b) Perhaps, although it wouldn't be my first choice. I wouldn't readily dismiss using even a locked std::vector, with enough elements reserve()'d, because then I would not have to pass by the memory allocator so often. It depends on the exact behaviour of the program and of the environment. Pick something that requires the least effort to code, get your program to work, and then perhaps try to find a better solution, if you think it will make a difference; then tell us. Or go straight to my first suggestion if you absolutely do not want to revise anything later on.

c) I'm guessing you suspect heavy contention there. parallel_reduce() is an algorithm with its own logic, and I don't think it is relevant here. Maybe if the threads were otherwise independent, you might benefit a lot from merging the data only at the very end, but since they're in each other's hair about the main list already, I wouldn't be too optimistic about doing much better than with whatever you chose for the contacts lists. Still, you could try to use thread-local storage to keep track of a slot in a global array-like collection of bags where the data can be added.

Maybe I should reread earlier postings, unless it's just déjà vu, because I'm wondering whether inheriting data between tasks, and checking is_stolen_task() to know whether to override it, might be significantly faster than directly using thread-local storage?

Steve_Nuchia
New Contributor I

parallel_reduce is functionally equivalent to parallel_for except that it accumulates and returns some result: like a for loop with an output variable.

The theoretical justification (and the name) come from functional programming but there's nothing of great value to the C++ programmer in that fact.

Mathematically, any commutative and associative operator T x T -> T works, so long as there is an identity element for x in T. Some value y of type T is computed in each loop iteration and accumulated like A_sub_thread = A_sub_thread x y. As subranges complete, the join operation is invoked to combine subrange accumulators.

Practically, this works for concepts like "sum across array" but also for "find the index of the largest element". You transform one into "subrange sum" and the other into "index of largest element in subrange".

The operation needs to be commutative and associative because you can't predict the order in which subranges will be combined. You do, however, control the order in which elements within a "leaf" subrange are combined so I guess some relaxation of those requirements is possible. But I wouldn't go there.

For "filter to list" the accumulator type is "list of items", the semantics of an accumulated list are "items selected from subrange", there is a trivial type conflict at the purely theoretical level, resolved by formally treating a single item as a one-entry list. And the join operation is just list concatenation.

-swn
Steve_Nuchia
New Contributor I

General thought: you selected lists too early. If you don't care about order, you are working with sets, multisets, or something like that. Describe your algorithm at the highest possible level first, then select the concrete data structures that best balance the competing requirements.

Also, your algorithm smells like an application of some pretty common graph algorithm building blocks. Can it be expressed in those terms?
Alexey-Kukanov
Employee
Quoting - Steve Nuchia

parallel_reduce is functionally equivalent to parallel_for except that it accumulates and returns some result: like a for loop with an output variable.

The theoretical justification (and the name) come from functional programming but there's nothing of great value to the C++ programmer in that fact.

Mathematically, any commutative and associative operator T x T -> T works, so long as there is an identity element for x in T. Some value y of type T is computed in each loop iteration and accumulated like A_sub_thread = A_sub_thread x y. As subranges complete, the join operation is invoked to combine subrange accumulators.

Practically, this works for concepts like "sum across array" but also for "find the index of the largest element". You transform one into "subrange sum" and the other into "index of largest element in subrange".

The operation needs to be commutative and associative because you can't predict the order in which subranges will be combined. You do, however, control the order in which elements within a "leaf" subrange are combined so I guess some relaxation of those requirements is possible. But I wouldn't go there.

For "filter to list" the accumulator type is "list of items", the semantics of an accumulated list are "items selected from subrange", there is a trivial type conflict at the purely theoretical level, resolved by formally treating a single item as a one-entry list. And the join operation is just list concatenation.

-swn

A small comment: TBB does not require the reduction operator to be commutative; associativity is enough for TBB's parallel_reduce. There is a guarantee that two partial results are only joined if they are for neighbouring subranges, and the order of subranges is preserved by the join.

Other than that TBB-specific detail, you are absolutely correct.
lascondes
Beginner

Raf: I am not completely clear about what you mean by "it2"s overtaking each other. I've got a list of N items and need to check for contact between each pair of them. I've already done some work so that these items are very likely in contact. The order of the checks between it and it2 does not matter.

On parallel_reduce for the contactlist: I am looking at that because there is only one contact list for the whole algorithm that would need to be brought together from all the threads. For the contacts lists (or arrays) there is one per element, and each could be needed in multiple threads - much more complex.

Steve: I hadn't thought about sets or multisets. I'll take a look at them. This is a contact search algorithm with simple objects - spheres. I need to keep track of the contacts between timesteps, which makes things a little more complicated.

RafSchietekat
Valued Contributor III
"Raf: Not completely clear about what you mean about "it2"s overtaking each other. Ive got a list of N items and need to check for contact between each of them. Ive already done some work so that these items are very likely in contact. The order of the checks between it and its does not matter."
Well, with the serial algorithm, an "it2" sent off from a later "it" would implicitly not cause any of the elements to disappear before an "it2" sent off from an earlier "it" had a chance to visit them, which seemed kind of relevant. This would have complicated matters, like with ordered filters in a pipeline (looking at the implementation side, not the user side). But if it's no problem, then how about parallel_do or parallel_while or a two-stage pipeline? These should be able to deal with the non-random access to the sequence, growing at one end and suffering attrition anywhere.
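To make the parallel_do option concrete, here is a minimal sketch, assuming the Element and Contact classes from the original post, the shared contactlist, and a spin_mutex as discussed earlier; the function name find_contacts and the use of std::less to visit each pair once are purely illustrative, and a C++11 lambda is used for brevity:

#include <functional>
#include <list>
#include "tbb/parallel_do.h"
#include "tbb/spin_mutex.h"

void find_contacts(std::list<Element*>& lista,
                   std::list<Contact*>& contactlist,
                   tbb::spin_mutex& contacts_mutex) {
    // Each body invocation processes one element "a"; the inner walk stays serial.
    tbb::parallel_do(lista.begin(), lista.end(), [&](Element* a) {
        for (std::list<Element*>::iterator it2 = lista.begin(); it2 != lista.end(); ++it2) {
            Element* b = *it2;
            if (!std::less<Element*>()(a, b))   // visit each unordered pair exactly once
                continue;                       // (this also skips the a == b case)
            double sum = a->value + b->value;
            if (sum == 4 || sum == 13) {        // the simple example condition
                Contact* contact = new Contact(a, b, sum);
                tbb::spin_mutex::scoped_lock lock(contacts_mutex);
                a->contacts.push_front(contact);
                b->contacts.push_front(contact);
                contactlist.push_front(contact);
            }
        }
    });
}

The trade-off is that every element now scans the whole list instead of only its tail, but no random access is required and the elements can stay in a std::list.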

"On the parallel_reduce for the contactlist I am looking at that because there is only one contact list for the algorithm that would need to be brought together from all the threads. For the contacts list (or array) there is one for each element and each could be needed in multiple threads - much more complex."
parallel_reduce would do nothing for you. There are only ever going to be a small number of worker threads (unless you're running this on a 4096-core machine or so), so you only need to worry about not forgetting any data before you bring it together. Managing worker threads should not be a concern for you (I don't think observing the scheduler should be part of the algorithm per se); that's why I suggested mapping threads to elements in a global array c/o thread-local storage, ideally with array elements large enough to prevent false sharing. Each element in that global array can then administer a subset of what will be the contactlist, with no need for mutual synchronisation before the merge. A list type seems appropriate for the subsets (extremely cheap to concatenate).
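One possible way to code those per-thread "bags" is tbb::enumerable_thread_specific (a hedged sketch; the names contact_bags and merge_bags are illustrative, and Contact is the class from the original post). Each worker thread appends to its own std::list without locking, and the lists are spliced together once the parallel phase is over:

#include <list>
#include "tbb/enumerable_thread_specific.h"

typedef tbb::enumerable_thread_specific< std::list<Contact*> > ContactBags;
ContactBags contact_bags;

// Inside the parallel loop body, instead of pushing to a locked global list:
//     contact_bags.local().push_front(contact);
// local() returns this thread's own std::list, so no lock is needed here.

// After the parallel phase, concatenate the per-thread lists (O(1) per splice):
std::list<Contact*> merge_bags(ContactBags& bags) {
    std::list<Contact*> all;
    for (ContactBags::iterator b = bags.begin(); b != bags.end(); ++b)
        all.splice(all.end(), *b);
    return all;
}

The per-element contacts lists are a separate matter, since several threads can touch the same Element; those still need atomic operations or a lock as discussed above.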