Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

How do I use TBB to eliminate data races?

Frank_F
Beginner

    Hello everyone!     

When I parallelize the program with parallel_for and then run Inspector, it detects a data race (a read/write race). I cannot find a detailed method for eliminating the race. Can anyone give me a suggestion? Thank you very much!

21 Replies
Adri_C__S_
Beginner

Hi Frank!

It's hard to help you with so little information. It would be better if you posted the code you have and perhaps a screenshot of Inspector where the race is detected.

 

jimdempseyatthecove
Honored Contributor III

Read http://en.wikipedia.org/wiki/Race_condition for an overview of race conditions.

Jim Dempsey

Frank_F
Beginner

Adri C. S. wrote:

Hi Frank!

It's hard to help you with so little information. It would be better if you posted the code you have and perhaps a screenshot of Inspector where the race is detected.

 

It is a read/write data race. As the picture shows, the variable jy is flagged for a read and a write race. What can I do? Would an atomic work? Or should I use a mutex or lock?

 

jimdempseyatthecove
Honored Contributor III

The question we need you to answer is:

a) Do all worker threads use the same [iter->number] range?
or
b) Does each worker have its own (different) [iter->number] range?

In the event of a), consider using a reducer (each worker accumulates += values independently, and the partial results are reduced once at the end).
In the event of b), the only conflict is between one worker thread issuing its first [iter->number] and the adjacent worker thread issuing its last [iter->number]. In that case you only need a mutex (or reducer) at the first and last iter->number values.
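
A minimal tbb::parallel_reduce sketch of the case-a idea, on a plain array of doubles (illustrative only; parallel_sum and A are stand-in names, not your actual data structures):

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

// Each worker sums its own sub-range into a private local value; the
// locals are then combined pairwise, so no shared variable is written
// concurrently.
double parallel_sum(const double A[], size_t n)
{
  return tbb::parallel_reduce(
    tbb::blocked_range<size_t>(0, n),
    0.0,                                          // identity value
    [=](const tbb::blocked_range<size_t>& r, double local) {
      for(size_t i = r.begin(); i != r.end(); ++i)
        local += A[i];                            // private accumulation
      return local;
    },
    [](double x, double y) { return x + y; }      // combine partial sums
  );
}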

Jim Dempsey

Frank_F
Beginner

jimdempseyatthecove wrote:

The question we need you to answer is:

a) Do all worker threads use the same [iter->number] range?
or
b) Does each worker have its own (different) [iter->number] range?

In the event of a), consider using a reducer (each worker accumulates += values independently, and the partial results are reduced once at the end).
In the event of b), the only conflict is between one worker thread issuing its first [iter->number] and the adjacent worker thread issuing its last [iter->number]. In that case you only need a mutex (or reducer) at the first and last iter->number values.

Jim Dempsey

Can you give me some more information about case a), using a reducer? I am a novice.

jimdempseyatthecove
Honored Contributor III

You can look at the TBB parallel_reduce examples or you might want to consult a recursion example such as: C:\Program Files (x86)\Intel\Composer XE 2011 SP1\tbb\examples\task\tree_sum

(or wherever you installed the samples on Linux).

or (untested code)

#include <tbb/parallel_invoke.h>

const size_t MIN_N_FOR_SUM = 4096;   // serial cutoff (choose to taste)

double sum(double A[], size_t n)
{
  if(n < MIN_N_FOR_SUM)
  {
    double ret = 0.0;
    for(size_t i=0; i<n; ++i)
      ret += A[i];
    return ret;
  }
  else
  {
    // split near the middle, rounded down to a cache-line multiple of doubles
    size_t n1 = (n / 2) - ((n / 2) % (64 / sizeof(double)));
    size_t n2 = n - n1;
    double sum1, sum2;
    tbb::parallel_invoke(
      [&]() { sum1 = sum(A, n1); },
      [&]() { sum2 = sum(A+n1, n2); });
    return sum1 + sum2;
  }
}

Jim Dempsey

 

Frank_F
Beginner

jimdempseyatthecove wrote:

You can look at the TBB parallel_reduce examples or you might want to consult a recursion example such as: C:\Program Files (x86)\Intel\Composer XE 2011 SP1\tbb\examples\task\tree_sum

(or wherever you installed the samples on Linux).

or (untested code)

#include <tbb/parallel_invoke.h>

const size_t MIN_N_FOR_SUM = 4096;   // serial cutoff (choose to taste)

double sum(double A[], size_t n)
{
  if(n < MIN_N_FOR_SUM)
  {
    double ret = 0.0;
    for(size_t i=0; i<n; ++i)
      ret += A[i];
    return ret;
  }
  else
  {
    // split near the middle, rounded down to a cache-line multiple of doubles
    size_t n1 = (n / 2) - ((n / 2) % (64 / sizeof(double)));
    size_t n2 = n - n1;
    double sum1, sum2;
    tbb::parallel_invoke(
      [&]() { sum1 = sum(A, n1); },
      [&]() { sum2 = sum(A+n1, n2); });
    return sum1 + sum2;
  }
}

Jim Dempsey

 

Thank you very much, my dear friend! My code looks like this; when I tried to make it a parallel_for, the problem arose.

for( iter=cell_n+3; iter!=iter_end-3; iter++ )
{                        
    iter->fp = (iter-1)->fp_old1 - pidt1 * (iter-1)->jy;    
    iter->gm = (iter-1)->gm_old1 - pidt1 * (iter-1)->jz;
    iter->fm = (iter+1)->fm - pidt1 * iter->jy;
    iter->gp = (iter+1)->gp - pidt1 * iter->jz;
    iter->ey = iter->fp + iter->fm;   iter->bz = iter->fp - iter->fm;
    iter->ez = iter->gp + iter->gm;   iter->by = iter->gp - iter->gm;
    iter->ex -= 2.0 * pidt1 * iter->jx;
}

In the parallel_for version, each thread accesses "fm" and "gp" in the 5th and 6th lines of the code, so there is a data race. How do I remove this race? Thank you!

RafSchietekat
Valued Contributor III

(Note that Jim's code sample emphasises clarity, not performance.)

At first sight fm and gp are the target of a "scan" operation over jy and jz from right to left, so you would have to move them into a separate loop running in that direction. Use tbb::parallel_scan() for large intervals (one forward for fp and gm, one backward for fm and gp). You could parallel_invoke() those scans and then use tbb::parallel_for to compute e{x,y,z} and b{y,z} (clearest), or run one of the scans first and then everything else in the other scan (tbb::parallel_for() stuff integrated in the final pass of the second scan, which might make better use of the cache).
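
For orientation, here is a minimal sketch of the classic tbb::parallel_scan body shape, written as a left-to-right running (prefix) sum over plain arrays; the fm/gp recurrence above would need the mirror-image, right-to-left variant, and PrefixSumBody, in and out are illustrative names only:

#include <tbb/parallel_scan.h>
#include <tbb/blocked_range.h>

// Computes out[i] = in[0] + in[1] + ... + in[i] (an inclusive prefix sum).
class PrefixSumBody {
    double running_sum;
    const double* in;
    double* out;
public:
    PrefixSumBody(double* out_, const double* in_)
        : running_sum(0), in(in_), out(out_) {}
    PrefixSumBody(PrefixSumBody& b, tbb::split)          // splitting constructor
        : running_sum(0), in(b.in), out(b.out) {}
    template<typename Tag>
    void operator()(const tbb::blocked_range<size_t>& r, Tag) {
        double temp = running_sum;
        for(size_t i = r.begin(); i != r.end(); ++i) {
            temp += in[i];
            if(Tag::is_final_scan())                      // only write in the final pass
                out[i] = temp;
        }
        running_sum = temp;
    }
    void reverse_join(PrefixSumBody& left) { running_sum = left.running_sum + running_sum; }
    void assign(PrefixSumBody& b) { running_sum = b.running_sum; }
};

// Usage:
//   PrefixSumBody body(out, in);
//   tbb::parallel_scan(tbb::blocked_range<size_t>(0, n), body);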

Frank_F
Beginner

Raf Schietekat wrote:

(Note that Jim's code sample emphasises clarity, not performance.)

At first sight fm and gp are the target of a "scan" operation over jy and jz from right to left, so you would have to move them into a separate loop running in that direction. Use tbb::parallel_scan() for large intervals (one forward for fp and gm, one backward for fm and gp). You could parallel_invoke() those scans and then use tbb::parallel_for to compute e{x,y,z} and b{y,z} (clearest), or run one of the scans first and then everything else in the other scan (tbb::parallel_for() stuff integrated in the final pass of the second scan, which might make better use of the cache).

Thank you! Could you recommend some material for learning TBB, some books or websites? I also have a question about code like this:

for( iter = part_n, iter_end = part_n + (vector_part.end() - vector_part.begin()); iter != iter_end; iter++ )
{
    has_to_change_cell( cell_n, iter );   // put particles on stack
}

inline void propagate::has_to_change_cell( struct cell *cell, struct particle *part )
{
  if ( part->x < cell[part->number].x )
  {
    cell[part->number].np[part->species]--;
    cell[part->number-1].np[part->species]++;
    cell[part->number].npart--;
    cell[part->number-1].npart++;
    part->number--;
  }
  else if ( part->x >= cell[part->number].x + dx )
  {
    cell[part->number].np[part->species]--;
    cell[part->number+1].np[part->species]++;
    cell[part->number].npart--;
    cell[part->number+1].npart++;
    part->number++;
  }
}

Then, between the threads, "cell[part->number].np[part->species]" will have a data race. What can I do? Thank you very much!
 
RafSchietekat
Valued Contributor III

At TBB's own website you will find the online "User Guide and Design Patterns", and a picture of the book "Intel Threading Building Blocks" with a link to a large online shop for books and other stuff. The book might not be up to date anymore, but it's still a good read.

Unless the problem has some relevant structure that I missed, you could make np and npart atomic.

Frank_F
Beginner

Raf Schietekat wrote:

At TBB's own website you will find the online "User Guide and Design Patterns", and a picture of the book "Intel Threading Building Blocks" with a link to a large online shop for books and other stuff. The book might not be up to date anymore, but it's still a good read.

Unless the problem has some relevant structure that I missed, you could make np and npart atomic.

Thank you! I tried to use atomic operations before; maybe I wrote the wrong statement. Could you tell me which statement I can use?

RafSchietekat
Valued Contributor III

++ and -- are implemented by tbb::atomic (at least for integers).
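
A minimal sketch of what that looks like in the struct (NUM_SPECIES, the array size, and the member list are placeholders for your real declaration):

#include <tbb/atomic.h>

enum { NUM_SPECIES = 4 };               // placeholder value

struct cell {
    double x;
    tbb::atomic<int> np[NUM_SPECIES];   // was plain int
    tbb::atomic<int> npart;             // was plain int
    // ... other members unchanged ...
};

// The existing statements then compile unchanged and become atomic updates:
//   cell[part->number].np[part->species]--;
//   cell[part->number-1].npart++;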

jimdempseyatthecove
Honored Contributor III

While you can solve this using atomics, the problem is performance. I would suggest something along the following lines:

inline void propagate::has_to_change_cell( struct cell *cell, struct particle *part )
{
  if ( part->x < cell[part->number].x )
  {
    // particle moves to the prior cell: tally it in this cell's np_prior
    cell[part->number].np[part->species]--;
    cell[part->number].np_prior[part->species]++;
    cell[part->number].npart--;
    part->number--;
  }
  else if ( part->x >= cell[part->number].x + dx )
  {
    // particle moves to the following cell: tally it in this cell's np_following
    cell[part->number].np[part->species]--;
    cell[part->number].np_following[part->species]++;
    cell[part->number].npart--;
    part->number++;
  }
}

inline int propagate::get_np( struct cell *cell, struct particle *part )
{
  // this cell's own count plus what the neighbouring cells have tallied for it
  return cell[part->number].np[part->species]
       + cell[part->number-1].np_following[part->species]
       + cell[part->number+1].np_prior[part->species];
}

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

It goes without saying that you can (occasionally) run a parallel_for to reconcile (collapse) np_following and np_prior into np.
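
A sketch of such a reconcile pass (n_cells, NUM_SPECIES and the cell array c are placeholder names, with struct cell carrying the np_prior/np_following arrays from the previous post); each iteration writes only its own cell's np and reads only the neighbours' tallies, so the pass itself is race-free:

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void reconcile( struct cell *c, size_t n_cells )
{
  // Fold the neighbours' hand-over tallies into each cell's own count.
  tbb::parallel_for( tbb::blocked_range<size_t>(0, n_cells),
    [&](const tbb::blocked_range<size_t>& r) {
      for(size_t i = r.begin(); i != r.end(); ++i)
        for(int s = 0; s < NUM_SPECIES; ++s) {
          if(i > 0)           c[i].np[s] += c[i-1].np_following[s];
          if(i + 1 < n_cells) c[i].np[s] += c[i+1].np_prior[s];
        }
    });

  // Clearing the tallies must be a separate pass; otherwise a neighbour
  // might read a tally that has already been zeroed.
  tbb::parallel_for( tbb::blocked_range<size_t>(0, n_cells),
    [&](const tbb::blocked_range<size_t>& r) {
      for(size_t i = r.begin(); i != r.end(); ++i)
        for(int s = 0; s < NUM_SPECIES; ++s)
          c[i].np_prior[s] = c[i].np_following[s] = 0;
    });
}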

Jim Dempsey

RafSchietekat
Valued Contributor III

I like the idea, but I'm afraid it won't work here, because the -1 and +1 are on "part->number", not on "part".

A simple situation that would fail is if all "number" and "species" values are the same for each *part.

jimdempseyatthecove
Honored Contributor III

In both the original code and my suggested code there is a requirement that all part->number's are unique. Though, all part->species need not be unique.

My code suggestion removes the +1 and -1 on the cell subscript, and adds two arrays the same size as the np array. The tallies of the adjacent cells will then contain the proper counts for the changes.

These tallies are what would have been added/subtracted from the +1 and -1 entries (which can be summed when fetching the new counts).

Jim Dempsey

Frank_F
Beginner

jimdempseyatthecove wrote:

In both the original code and my suggested code there is a requirement that all part->number's are unique. Though, all part->species need not be unique.

My code suggestion removes the +1 and -1 on the cell subscript, and adds two arrays the same size as the np array. The tallies of the adjacent cells will then contain the proper counts for the changes.

These tallies are what would have been added/subtracted from the +1 and -1 entries (which can be summed when fetching the new counts).

Jim Dempsey

Thank you very much! The situation is as Raf Schietekat says: the "number" and "species" values can be the same for each *part. Within one cell there are several particles, and each particle's number is the same as that cell's number.

jimdempseyatthecove
Honored Contributor III

My assumption is:

You have a parallel_for that iterates on part numbers [0:nParts) and passes the part object pointer, such as pParts[i].

If this is the case, then no two threads will have the same part numbers, and thus, will not have the same indexes of cell.

*** However, you have an issue with the +1 and -1 at the boundaries of the thread subsections.

Note, you can insert into your code a test to see whether the current cell index is the first or last iteration of the subrange. In those cases, perform an atomic operation (__sync_fetch_and_add); on interior locations simply use ++ or -- as you do now.

Jim Dempsey

RafSchietekat
Valued Contributor III

And my interpretation is that "number" means "cell number", with possibly multiple part objects belonging to the same cell. The iteration starts at part_n, which seems to be a random-access iterator, not an index.
