tbb::atomic<> operations on elements of vector (double*)

Handsome_T · ‎09-21-2012

Hi everyone,

I am trying to parallelize a subroutine that populates the contents of a double vector using parallel_for.

Because of the way the vector is populated, race conditions will certainly occur, so I need to guard against them somehow.

The vector is defined/allocated as a double *v.

What is the correct way of passing this vector to operator() so that I can use tbb::atomic?

I was thinking of doing something like this and wanted to know whether it is enough to make sure that [cpp]my_v += some_new_value;[/cpp] is atomic.

[cpp]

double *v = new double[N];
tbb::atomic<double *> va = v;
parallel_for(tbb::blocked_range<size_t>(0,N), Foo(va));

[/cpp]

where Foo is defined as:

[cpp]

class Foo {
    double *my_v;
public:
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i < r.end(); ++i) {
            size_t idx = _compute_index(i);   // may repeat across threads
            my_v[idx] += some_new_value;      // plain read-modify-write, not atomic
        }
    }
    Foo(double *in_v) : my_v(in_v) { }
};

[/cpp]

(Line 6 above, the _compute_index(i) call, can produce identical values of 'idx' on different threads, so I have to make sure the += on line 7 is atomic.)

Thanks in advance for any help/input.

RafSchietekat · ‎09-21-2012

You should define an array of atomics, not an atomic pointer to a non-atomic array.

Handsome_T · ‎09-21-2012

Raf Schietekat wrote:

You should define an array of atomics, not an atomic pointer to a non-atomic array.

Thanks for your reply, Raf. The subroutine is supposed to return that array to the calling function as a regular double result array. How do I convert the array of atomics to a regular array of doubles if I have memory constraints and cannot allocate a new array of N doubles? Thanks again.

RafSchietekat · ‎09-21-2012

You can always try casting: since atomic types occupy the same space as the underlying type, casting either the array or individual elements should do. But wait, isn't += unimplemented for atomic double? And if you really have concurrent access, using an array may cost a lot of performance because of false sharing.

Handsome_T · ‎09-21-2012

Raf Schietekat wrote:

But wait, isn't += unimplemented for atomic double?

Oh, yes, you are right. This parallelization won't be as easy as I thought it would be... maybe scoped locking?

RafSchietekat · ‎09-22-2012

Assuming it's the right thing to do (we only know this aspect), you could roll your own += based on CAS (pun intended).
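A hand-rolled += built on compare-and-swap could look roughly like the sketch below. It uses C++11 std::atomic<double> and compare_exchange_weak as a stand-in so that it compiles without TBB; with tbb::atomic the same retry loop would be written around its compare_and_swap member, assuming tbb::atomic<double> is available in your TBB version. The names atomic_add and scatter_add are hypothetical, and the array of atomics follows the earlier advice in this thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: atomically performs target += delta with a CAS loop.
inline void atomic_add(std::atomic<double>& target, double delta) {
    double expected = target.load();
    // On failure compare_exchange_weak stores the current value into
    // `expected`, so the sum is recomputed from fresh data on each retry.
    while (!target.compare_exchange_weak(expected, expected + delta)) {
    }
}

// Hypothetical loop body: what each task would run over its sub-range.
// Repeated indices are safe because every update goes through atomic_add.
inline void scatter_add(std::vector<std::atomic<double>>& v,
                        const std::vector<std::size_t>& idx,
                        double value) {
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < v.size());
        atomic_add(v[idx[i]], value);
    }
}
```

Inside the parallel_for body, the line my_v[idx] += some_new_value; would then become a call such as atomic_add(my_v[idx], some_new_value). The loop only retries when another thread changed the same element in between, but heavy contention on neighbouring elements can still suffer from the false sharing mentioned above.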

Handsome_T · ‎09-24-2012

Assuming you are referring to 'compare & swap', rolling my own += could be a very good exercise for me. However, since the array in question comes from a finite element mesh, I can do one of two things:

Partition the mesh and then build the array for each partition by iterating over elements. In that case, the race condition will only happen for the elements lying on the interfaces between partitions. I am hoping I can then handle those elements separately (either with a single thread or with a rather expensive lock mechanism).

Or, instead of iterating over elements, iterate over nodes. The drawback of this approach is that the code will need some extra book-keeping and will require more memory.

By the way, I tried the spin_lock mutex approach and performance took a big hit (as expected). Would you mind telling me which TBB files I should look at if I want to write my own +=? Thanks again for your help.
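The node-based alternative mentioned above can be sketched as a gather: each node sums the contributions of the elements that touch it, so each result entry has exactly one writer. This is only an illustration under assumed data structures. The adjacency list node_elems (the extra book-keeping and memory) and the per-element values elem_value are hypothetical, and the loop is written serially here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// node_elems[n]  : ids of the elements adjacent to node n (assumed precomputed)
// elem_value[e]  : value that element e contributes to each of its nodes
std::vector<double> assemble_by_node(
        const std::vector<std::vector<std::size_t>>& node_elems,
        const std::vector<double>& elem_value) {
    std::vector<double> result(node_elems.size(), 0.0);
    // Iteration n writes only result[n], so this outer loop could be
    // handed to parallel_for without atomics or locks.
    for (std::size_t n = 0; n < node_elems.size(); ++n) {
        double sum = 0.0;
        for (std::size_t e : node_elems[n]) {
            assert(e < elem_value.size());
            sum += elem_value[e];
        }
        result[n] = sum;
    }
    return result;
}
```

The trade-off is the one already stated: element data is read once per adjacent node instead of once per element, and the adjacency list has to be stored.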

jimdempseyatthecove · ‎09-25-2012

You have a 3rd method, a little more code, no additional data.
Partition the elements (typically called tiling).
Identify shared boundaries
parallel-ize interior of tile (plus unshared boundaries/perimeter tiles)
(join)
of shared boundaries, identify cells that intersect (cross points of tile boundaries)
parallel-ize non-intersecting portion of boundaries
(join)
serialize or parallelize intersection of boundaries
(note, intersection of boundaries non-adjacent unless tile is 3x3 or smaller)
Jim Dempsey

Handsome_T · ‎09-26-2012

JimDempseyAtTheCove wrote:
(join)
of shared boundaries, identify cells that intersect (cross points of tile boundaries)
parallel-ize non-intersecting portion of boundaries
(join)

Jim, thanks for your help. I am not completely following the logic here. Aren't the 'non-intersecting portions of boundaries' already taken care of in the first step, when we parallelize the interior of the partitions?

JimDempseyAtTheCove wrote:
(note, intersection of boundaries non-adjacent unless tile is 3x3 or smaller)

Would you mind elaborating on this one as well? Thanks again.

jimdempseyatthecove · ‎09-26-2012

>>Aren't the 'non-intersecting portions of boundaries' already taken care of in the first step when we parallelize interior of partitions.

When the elements are wholly contained, there are no intersections. However, when elements interact with adjacent elements (e.g. beam segment ends in a Finite Element simulation), you have a boundary issue with intersections, where perimeter cells of one tile interact with (adjacent) perimeter cells of a neighboring tile. When calculations are such that multiple threads could potentially update the same cell (variable), you need to take proactive action such as a critical section, a compare-and-swap loop, or thread scheduling maneuvers to avoid issues (or serialization).

In a 3x3 tile where cell computations interact, no interacting computation can be made without interacting with a perimeter cell. The same holds for 3xn, nx3, or smaller.

Example: in a Finite Element simulation you may choose to use parallel_for to slice up the beams, determine the strain, and thus the tension force, and then accumulate the force at the beam's end point objects (connection points). Beams in adjacent slices may (will) share endpoints along the perimeter of the slice (tile). This presents a potential for two threads performing

EndPoint(N) += Force;
EndPoint(N+1) -= Force;

When an end point connects beams from adjacent tiles (slices), two threads can perform the Read, Add, Store sequence at the same time. This will introduce an error in the calculation.

Jim Dempsey
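To make the beam example concrete, here is a minimal two-phase sketch for a 1-D chain, assuming beam b joins end points b and b+1. It is a simplification of the tiling scheme described above, not Jim's actual code. The names apply_beams and accumulate_tiled are hypothetical, and both tile loops are written serially with comments marking where parallel_for and the join would go.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Applies EndPoint(b) += F and EndPoint(b+1) -= F for beams in [first, last).
static void apply_beams(const std::vector<double>& force,
                        std::vector<double>& end_point,
                        std::size_t first, std::size_t last) {
    for (std::size_t b = first; b < last; ++b) {
        end_point[b]     += force[b];
        end_point[b + 1] -= force[b];
    }
}

std::vector<double> accumulate_tiled(const std::vector<double>& force,
                                     std::size_t tile) {
    assert(tile >= 2);  // one-beam tiles would make the phase-2 beams adjacent
    const std::size_t beams = force.size();
    std::vector<double> end_point(beams + 1, 0.0);

    // Phase 1 (interior): each tile [lo, hi) skips its first beam and so
    // writes only end points lo+1..hi. Tiles touch disjoint end points,
    // which makes this loop a candidate for parallel_for.
    for (std::size_t lo = 0; lo < beams; lo += tile) {
        const std::size_t hi = std::min(lo + tile, beams);
        apply_beams(force, end_point, lo + 1, hi);
    }
    // (join)
    // Phase 2 (shared boundaries): the first beam of each tile writes end
    // points lo and lo+1. Consecutive boundary beams are at least `tile`
    // apart, so they do not overlap and could also run in parallel.
    for (std::size_t lo = 0; lo < beams; lo += tile) {
        apply_beams(force, end_point, lo, lo + 1);
    }
    return end_point;
}
```

No atomics are needed because no two tasks within a phase write the same end point. The cost is the extra join between the phases. Real 2-D or 3-D meshes also need the third step from the earlier post for cells where tile boundaries cross.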