giacomo1988
Beginner
73 Views

How to realize a barrier?

I have this situation:


// outer for executes in parallel
for (int i = r.begin(); i != r.end(); i++) {

    parallel_for(blocked_range<int>(0, m_Ndelay, 1000), First_Loop(i, m_nel, m_direct_i, m_direct_q), simple_partitioner()); // FOR 1

    // I want the second inner loop (FOR 2) to start only after FOR 1 has finished
    parallel_for(blocked_range<int>(0, m_Ndelay + m_nel, 800), Second_Loop(m_temp_echo_i, m_temp_echo_q), simple_partitioner()); // FOR 2

    // I want the third inner loop (FOR 3) to start only after FOR 2 has finished
    parallel_for(blocked_range<int>(0, m_numDopp, 100), Third_Loop(i, m_output, m_numDopp), simple_partitioner()); // FOR 3
}

How can I resolve this problem?

Thanks
0 Kudos
12 Replies
RafSchietekat
Black Belt
73 Views

Each parallel_for implicitly executes a barrier before returning, so there is no problem.

(Added) Well, except that the name "barrier" does not seem entirely appropriate in the context of tasks instead of threads. Perhaps it's better to just say that all work related to parallel_for "happens before" anything that comes after it in the program, in the sense that all work is finished and all writes are visible.
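Raf's point about the implicit barrier can be sketched with plain std::thread standing in for TBB. Everything below is made up for illustration (parallel_for_chunks, the fixed thread count, the chunking scheme); real TBB uses work stealing rather than one thread per chunk, but the guarantee is the same: the call does not return until all its work is done.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for tbb::parallel_for: split [begin, end) into chunks,
// run each chunk on its own thread, and JOIN before returning. The joins are
// the "barrier": this function never returns while its work is still running.
void parallel_for_chunks(int begin, int end,
                         const std::function<void(int)>& body) {
    const int nthreads = 4;
    std::vector<std::thread> workers;
    int chunk = (end - begin + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        int lo = begin + t * chunk;
        int hi = std::min(lo + chunk, end);
        if (lo >= hi) break;
        workers.emplace_back([lo, hi, &body] {
            for (int i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& w : workers) w.join();  // implicit barrier on return
}

std::vector<int> run_two_phases(int n) {
    std::vector<int> a(n), b(n);
    // Phase 1 (like FOR 1): fill a.
    parallel_for_chunks(0, n, [&](int i) { a[i] = i * i; });
    // Phase 2 (like FOR 2): safe to read a, because phase 1 has fully finished.
    parallel_for_chunks(0, n, [&](int i) { b[i] = a[i] + 1; });
    return b;
}
```

So two parallel_for calls written one after the other already give the FOR 1 → FOR 2 ordering the question asks for, with no explicit barrier object.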
giacomo1988
Beginner
73 Views

Thanks for the answer.

While executing I can see (thanks to cout << "for1"; etc.) that the first, second, and third loops execute in parallel, but I want FOR 2 to start when FOR 1 has ended and FOR 3 to start when FOR 2 has ended.

Did you notice that FOR 1, FOR 2, and FOR 3 are inside an OUTER loop? Maybe the problem comes from this, but I am not sure.

Now I also get a segmentation fault, whereas if I try only one loop everything is OK.

Thanks for the answer

RafSchietekat
Black Belt
73 Views

I see, the outer loop is also parallel... Well, in TBB, parallel_for doesn't mean that the different tasks are guaranteed to progress at the same time, because that would mean required parallelism, and TBB is all about optional parallelism. parallel_for will start executing one or more chunks in parallel, which may occupy all available worker threads but aren't necessarily a full partition of the complete range, before tackling more chunks as worker threads become available, so the concept of an overall barrier doesn't apply as it does with threads. Instead, you should probably distribute the contents of the outer body over successive invocations of the outer parallel_for, each of which happens before the next one. When doing that, consider whether each inner parallel_for really offers an opportunity for additional parallelism or merely increases parallel overhead, in which case it might as well be serial instead.

akhal
Beginner
73 Views

Hi,

I face the same problem: each of my inner loops #1 must finish before loop #2, and loop #2 must finish before loop #3, while the outer loop is parallelized. I don't understand the suggestion in the last answer: "Instead, you should probably distribute the contents of the outer body over successive invocations of the outer parallel_for, each of which happens before the next one. When doing that, consider whether each inner parallel_for really offers an opportunity for additional parallelism or merely increases parallel overhead, in which case it might as well be serial instead." Can it be explained a little? Thanks anyway.

Does anybody have a good solution for this scenario?

RafSchietekat
Black Belt
73 Views

"Can it be explained a little?"
You wanted to do parallel_for(before; barrier; after). Instead, do parallel_for(before); parallel_for(after). Also, inside the before and after code, you may see little benefit, or even an adverse effect, from attempted parallelisation in the case of a relatively small loop body with many iterations in the outer loop.

(Added 2011-09-08) With parallel_for iterating twice over the same range, an affinity_partitioner may be useful.
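The restructuring Raf describes can be sketched with std::async standing in for TBB tasks. The function name and the loop bodies below are hypothetical; the point is only the shape: two waves of tasks, with a wait between them supplying the ordering the original mid-body barrier was meant to provide.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Sketch of parallel_for(before); parallel_for(after). "before" writes a[i];
// "after" reads a[n-1-i], so it must not start until EVERY "before" task is
// done. Splitting the body into two waves gives exactly that ordering.
std::vector<int> before_then_after(int n) {
    std::vector<int> a(n), b(n);
    {   // Wave 1: all of the "before" work.
        std::vector<std::future<void>> fs;
        for (int i = 0; i < n; ++i)
            fs.push_back(std::async(std::launch::async, [&a, i] { a[i] = i; }));
        for (auto& f : fs) f.wait();  // the barrier between the two waves
    }
    {   // Wave 2: all of the "after" work; a[] is now fully written.
        std::vector<std::future<void>> fs;
        for (int i = 0; i < n; ++i)
            fs.push_back(std::async(std::launch::async,
                                    [&a, &b, i, n] { b[i] = a[n - 1 - i]; }));
        for (auto& f : fs) f.wait();
    }
    return b;
}
```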
akhal
Beginner
73 Views

Thanks for the kind response. But I have already tried that, and got no speedup. Let me write it out in a little more detail below:
>>>>>>>>>>>>>>>>>>>>>>>>Serial-version<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
for i=0:size
{
for j=i+1:size
------ do something (only one line of computation) ------

for j=i+1:size
for k=i+1:size
--- do something (also just one line of computation) ---
}

>>>>>>>>>>>>> I have tried parallelizing it like below (as you also suggested) <<<<<<<<<<<<<
for i=0:size
{
parallel_for(i+1:size)

parallel_for(i+1:size)
}

But this has drastically reduced my speedup, even well below serial execution :( I rather need a parallelized structure that parallelizes from the outer loop, like:
>>>>>>>>>>>>>>>>>>>>>>>> Your suggested example<<<<<<<<<<<<<<<<<<<<<<<<<<<
parallel_for(i,size)
{
for j=i+1:size
------do something------

/*------ SYNCHRONIZATION BARRIER ------*/ This is the "barrier" I need

for j=i+1:size
for k=i+1:size
---do something---
}
>>>>>>>>>>>>>>>>>>>>>>Another I tried<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
parallel_for(i,size)
{
parallel_for( j=i+1:size)
------do something------

parallel_for( j=i+1:size)
for k=i+1:size
---do something---
}
i.e., a synchronization barrier of the MPI kind, so that the second looping structure only begins when the first has finished; I guess then I would get some speedup... I have no idea whether Intel TBB has this kind of barrier; I looked through the synchronization structures, but they only cover mutexes for shared-variable access...
Both these versions give wrong output. To elaborate further, I have successfully implemented this structure in OpenMP as follows:
>>>>>>>>>>>>>>>>>>>OpenMP Version<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#pragma omp parallel
for(i=0:size)
{
#pragma omp for
for j=i+1:size
------do something------

#pragma omp for
for j=i+1:size
for k=i+1:size
---do something---
}

I need urgent help with this; thanks so much in advance...
jimdempseyatthecove
Black Belt
73 Views

You state your serial program is:

for i=0:size
{
for j=i+1:size
------do something(only one line computation------
for j=i+1:size
for k=i+1:size
---do something(also just one line computation---
}

What you do not state is the contents of the "do something". Without that, we cannot determine whether your statements are temporally dependent. For example, if you were to change the serial version's outer loop from

for i=0:size

to (pseudo code)

for (i=size-1; i .ge. 0; --i)

Then would the results be correct?

What if the i values in range 0:size-1 were taken in random order (once)?

Jim Dempsey
akhal
Beginner
73 Views

for k=0:size-1
{
for#1 i=k+1:size-1
a = a/a

for#2 i=k+1:size-1
for j=k+1:size-1
a=a - (a*a)
}

This is exactly what I am doing. for#2 depends on for#1 and should start only after for#1 has finished, for a particular value of the outer loop index k. It also seems the outer loop must run serially and cannot be parallelized; only the inner loops can be parallelized, and even that needs a barrier so that for#2 starts only when for#1 finishes within any particular iteration of the outer loop.
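A runnable serial reference may help pin down the dependence. The arithmetic below is made up for illustration (the original a = a/a and a = a - a*a are placeholders); only the loop nest and the for#1-before-for#2 ordering match the post.

```cpp
#include <cassert>
#include <vector>

// Serial reference for the structure above, with a concrete (invented) body so
// the dependence is visible: for#2 reads values that for#1 just wrote in the
// same outer iteration, so for#1 must fully finish before for#2 starts.
long long serial_reference(int size) {
    std::vector<long long> a(size, 1);
    for (int k = 0; k < size; ++k) {
        // for#1: every a[i] with i > k is updated first...
        for (int i = k + 1; i < size; ++i)
            a[i] += a[k];
        // for#2: ...and only then read again; any overlap with for#1
        // would change the result.
        for (int i = k + 1; i < size; ++i)
            for (int j = k + 1; j < size; ++j)
                a[i] += a[j] % 3;
    }
    long long sum = 0;
    for (long long v : a) sum += v;
    return sum;
}
```

Any parallelization has to reproduce exactly these sums, which is why the two inner loops need a happens-before edge between them in every outer iteration.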
akhal
Beginner
73 Views

And if I also try something like:

for k=0:size-1
{
parallel_for#1( k+1:size)

parallel_for#2(k+1:size-1)
}

The output is correct now, but it runs too slowly, even many times slower than the serial version :(
RafSchietekat
Black Belt
73 Views

Still wondering how the OpenMP version could have provided correct output (I would think that it would potentially eliminate parallelism from the inner loops, not the outer loop), or be more performant. Does anybody have answers to that? I'd like to know...

You'll probably need to specify a grainsize (which may seriously restrict the level of parallelism), and probably an affinity_partitioner. I haven't looked closely at the code (yet), though.

(Added) Oh yes: http://lmgtfy.com/?q=gauss+elimination+tbb+site%3Asoftware.intel.com
akhal
Beginner
73 Views

Of course the OpenMP code works, since in each outer-loop iteration the parallel threads jointly complete the iterations of the first inner loop and then synchronize on returning from that loop, then jointly work on the second inner loop and synchronize again before going on to the next outer-loop iteration, and so on. I want this implementation in TBB too, but so far I could only do something like:

static affinity_partitioner ap;
for (int k = 0; k < size; k++) {
    parallel_for(blocked_range<int>(k, size, (size-k)/2), lud_division(), ap);    // for(i=r.begin()+1; i ...
    parallel_for(blocked_range<int>(k, size, (size-k)/2), lud_elimination(), ap); // for(i=r.begin()+1; i ...
}

But this performs no better than the serial version. I wonder how this structure could be parallelized further, as OpenMP gives great speedup even with 2 threads. My grain size is (size-k)/2 since I have only two threads available. Any suggestions?
RafSchietekat
Black Belt
73 Views

Saying "of course" doesn't make it so. Is this OpenMP behaviour even documented? Is the compiler doing some fancy optimisations with data flow control to avoid the race inherent in the provided source code?

But you should probably first try to work on this a bit more by yourself: e.g., by eliminating the first inner loop, not abusing grainsize to impose a desired number of threads, maybe exchanging the order of the innermost loops or using a 2-dimensional range, perhaps rolling your own blocked_range to take into account that not all subranges are created equal, and certainly looking at what others have already discovered. In software engineering, as in other branches of engineering, some creativity beyond the straightforward implementation of a mathematical formula is often needed. After you've done that and taken new measurements, make sure to provide all the relevant information, because otherwise you might unwittingly be withholding something essential.

As for why the OpenMP version seemed to fare better, I'm still interested in an expert opinion.
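One way to read the "not all subranges are created equal" remark: in a triangular iteration space like this one, equal-width row bands carry very unequal work. Below is a hypothetical helper (not a TBB API; the name and interface are invented) that splits the rows into contiguous bands of roughly equal area instead of equal height; a real solution would wrap something like this in a custom Range type.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Split the rows of a lower-triangular iteration space (row i has i+1 cells)
// into `parts` contiguous half-open bands [first, second) with roughly equal
// total AREA, i.e. roughly equal work per band.
std::vector<std::pair<int, int>> balanced_bands(int n, int parts) {
    std::vector<std::pair<int, int>> bands;
    long long total = 1LL * n * (n + 1) / 2;        // total cells in triangle
    long long target = (total + parts - 1) / parts; // work aimed at per band
    int start = 0;
    long long acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += i + 1;                               // work contributed by row i
        if (acc >= target || i == n - 1) {
            bands.push_back({start, i + 1});        // close band [start, i+1)
            start = i + 1;
            acc = 0;
        }
    }
    return bands;
}
```

For n = 4 and 2 parts this yields bands {0,3} and {3,4} (areas 6 and 4), whereas an equal-height split {0,2},{2,4} would give areas 3 and 7, leaving one worker idle for much of the time.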