// outer for: executed in parallel (r is the outer blocked_range)
for(int i = r.begin(); i != r.end(); i++) {
    parallel_for(blocked_range<...>(...), ...);   // for1
    // I want the second inner for to start executing only after for1 has finished
    parallel_for(blocked_range<...>(...), ...);   // for2
    // I want the third inner for to start executing only after for2 has finished
    parallel_for(blocked_range<...>(...), ...);   // for3
}
How can I resolve this problem?
Thanks
(Added) Well, except that the name "barrier" does not seem entirely appropriate in the context of tasks instead of threads. Perhaps it's better to just say that all work related to parallel_for "happens before" anything that comes after it in the program, in the sense that all work is finished and all writes are visible.
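A minimal sketch of that guarantee (illustrative only; data, n, f and g are made-up names, and the lambdas assume a C++11-capable compiler):

#include "tbb/parallel_for.h"

float f(int i);   // hypothetical per-element computations, not from this thread
float g(int i);

void two_phases(float* data, int n) {
    tbb::parallel_for(0, n, [&](int i) { data[i] = f(i); });   // "for1"
    // parallel_for returns only when every iteration above has finished,
    // and all of its writes are visible from here on.
    tbb::parallel_for(0, n, [&](int i) { data[i] += g(i); });  // "for2" can safely use for1's results
}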
During execution I see (thanks to cout<<"for1"; etc.) that the first, second, and third for loops are executed in parallel, but I want for2 to start when for1 has ended and for3 to start when for2 has ended.
Have you seen that for1, for2, and for3 are nested inside an OUTER loop? Maybe the problem is related to this, but I am not sure.
Now I also have a segmentation fault problem, whereas if I try only one for loop, everything is OK.
Thanks for the answer.
I see, the outer loop is also parallel... Well, in TBB, parallel_for doesn't mean that the different tasks can progress at the same time, because that would mean required parallelism, and TBB is all about optional parallelism. parallel_for will start executing one or more chunks in parallel, which may occupy all available worker threads but aren't necessarily a full partition of the complete range, before tackling more chunks as worker threads become available, so the concept of an overall barrier doesn't apply like it does with threads. Instead, you should probably distribute the contents of the outer body over successive invocations of the outer parallel_for, each of which happens before the next one (a sketch follows below). When doing that, consider whether each inner parallel_for really offers an opportunity for additional parallelism or merely increases parallel overhead, in which case it might as well be serial instead.
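Here is a rough sketch of that suggestion, assuming each phase becomes plain serial work per outer index (phase1/phase2/phase3 and n are placeholder names, not taken from the original code, and the lambdas assume a C++11-capable compiler):

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

void phase1(int i);   // the work of inner "for1" for outer index i (placeholder)
void phase2(int i);   // inner "for2"
void phase3(int i);   // inner "for3"

void outer_in_three_passes(int n) {
    // Pass 1: all of "for1", parallel over the outer index.
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [&](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) phase1(i);
        });
    // Everything phase1 wrote is finished and visible before pass 2 starts.
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [&](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) phase2(i);
        });
    // Likewise, pass 2 is complete before pass 3 starts.
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [&](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) phase3(i);
        });
}

Note that this orders all of for1 before any of for2, which is stricter than the per-index ordering of the original structure, but still correct if the per-index ordering was.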
I face the same problem: each of my inner loops #1 must finish before loop #2, and loop #2 must finish before loop #3... while the outer loop is parallelized. I don't understand the suggestion in the last answer: "Instead, you should probably distribute the contents of the outer body over successive invocations of the outer parallel_for, each of which happens before the next one. When doing that, consider whether each inner parallel_for really offers an opportunity for additional parallelism or merely increases parallel overhead, in which case it might as well be serial instead." Can it be explained a little? Thanks anyway.
Does anybody have a good solution for this scenario?
You wanted to do parallel_for(before;barrier;after). Instead, do parallel_for(before);parallel_for(after). Also, inside the before and after code, you may have little benefit or even an adverse effect from attempted parallelisation in case of a relatively small loop body with many iterations in the outer loop.
(Added 2011-09-08) With parallel_for iterating twice over the same range, an affinity_partitioner may be useful (see the sketch below).
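For example (a sketch only; before_work, after_work and n are placeholders, and the lambdas assume a C++11-capable compiler), reusing one affinity_partitioner across both passes asks TBB to map the same subranges to the same worker threads, which can help cache reuse when the second pass touches the same data:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

void before_work(int i);   // placeholder "before" body
void after_work(int i);    // placeholder "after" body

void before_then_after(int n) {
    tbb::affinity_partitioner ap;   // reused by both calls below
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [&](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) before_work(i);
        }, ap);
    // The call above has returned, so all "before" work is finished and visible here.
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [&](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) after_work(i);
        }, ap);
}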
>>>>>>>>>>>>>>>>>>>>>>>> Serial version <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
for i=0:size
{
    for j=i+1:size
        ------ do something (only one line of computation) ------
    for j=i+1:size
        for k=i+1:size
            --- do something (also just one line of computation) ---
}
>>>>>>>>>>>>> I have tried parallelizing it like below (you also suggested the same) <<<<<<<<<<<<<
for i=0:size
{
    parallel_for(i+1:size)
    parallel_for(i+1:size)
}
But this has drastically reduced my speedup, even well below serial execution :( I would rather need a parallelized structure which parallelizes from the outer loop, like:
>>>>>>>>>>>>>>>>>>>>>>>> Your suggested example <<<<<<<<<<<<<<<<<<<<<<<<<<<
parallel_for(i, size)
{
    for j=i+1:size
        ------ do something ------
    /*------ SYNCHRONIZATION BARRIER ------*/ This is what I need: a "barrier"
    for j=i+1:size
        for k=i+1:size
            --- do something ---
}
>>>>>>>>>>>>>>>>>>>>>> Another one I tried <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
parallel_for(i, size)
{
    parallel_for(j=i+1:size)
        ------ do something ------
    parallel_for(j=i+1:size)
        for k=i+1:size
            --- do something ---
}
i.e., a synchronization barrier of the MPI kind, so that the second looping structure only begins when the first one has finished; then I guess I would get some speedup... I have no idea if Intel TBB has this kind of barrier; I looked through the synchronization structures, but they only talk about mutexes for shared-variable accesses...
Both of these versions give wrong output. To elaborate a bit more, I have successfully implemented this structure in OpenMP as follows:
>>>>>>>>>>>>>>>>>>> OpenMP version <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#pragma omp parallel
for(i=0:size)
{
    #pragma omp for
    for j=i+1:size
        ------ do something ------
    #pragma omp for
    for j=i+1:size
        for k=i+1:size
            --- do something ---
}
Need urgent help with this, and thanks so much in advance....
for i=0:size
{
    for j=i+1:size
        ------ do something (only one line of computation) ------
    for j=i+1:size
        for k=i+1:size
            --- do something (also just one line of computation) ---
}
What you do not state is the contents of the "do something". Without that, we cannot determine whether your statements are temporally dependent. For example, if you were to change the serial version's outer loop from
for i=0:size
to (pseudo code)
for (i=size-1; i >= 0; --i)
then would the results still be correct?
What if the i values in the range 0:size-1 were taken in random order (once each)?
Jim Dempsey
{
    for#1 i=k+1:size-1
        a
    for#2 i=k+1:size-1
        for j=k+1:size-1
            a
}
This is exactly what I am doing. For#2 depends on for#1 and should be started only after for#1 has finished, for a particular value of the outer loop index k. It also seems the outer loop has to run serially and cannot be parallelized; only the inner loops can be parallelized, and even that needs a barrier so that for#2 starts only when for#1 finishes within any particular iteration of the outer loop index k.
for k=0:size-1
{
    parallel_for#1(k+1:size)
    parallel_for#2(k+1:size-1)
}
The output is correct now, but it runs too slowly, even many times slower than the serial version :(
You'll probably need to specify a grainsize (which may seriously restrict the level of parallelism), and probably an affinity_partitioner. I haven't looked closely at the code (yet), though.
(Added) Oh yes: http://lmgtfy.com/?q=gauss+elimination+tbb+site%3Asoftware.intel.com
static affinity_partitioner ap;
for(int k=0; k<size; k++) {
    parallel_for(blocked_range<int>(k, size, (size-k)/2), lud_division(), ap);    /* for(i=r.begin()+1; i<r.end(); i++) ... */
    parallel_for(blocked_range<int>(k, size, (size-k)/2), lud_elimination(), ap); /* for(i=r.begin()+1; i<r.end(); i++) ... */
}
But this gives performance no better than the serial version. I wonder how this structure could be parallelized further, since OpenMP gives a great speedup even with 2 threads. My grain_size is (size-k)/2 since I have only two threads available. Any suggestions?
But you should probably first try to work on this a bit more by yourself: e.g., by eliminating the first inner loop, not abusing grainsize to impose a desired number of threads, maybe exchanging the order of the innermost loops or using a 2-dimensional range, perhaps rolling your own blocked_range to take into account that not all subranges are created equal, and certainly looking at what others have already discovered. In software engineering, as in other branches of engineering, some creativity beyond the straightforward implementation of a mathematical formula is often needed. After you've done that and taken new measurements, make sure to provide all the relevant information, because otherwise you might unwittingly be withholding something essential.
As for why the OpenMP version seemed to fare better, I'm still interested in an expert opinion.
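For what it's worth, here is a rough sketch combining two of those suggestions: capping the thread count with the old-style task_scheduler_init instead of abusing grainsize, and covering the doubly nested elimination loop with a 2-dimensional range. It is purely illustrative; A, size and update() are placeholders for whatever the real elimination step does, and whether this pays off depends on the matrix size:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"
#include "tbb/task_scheduler_init.h"

void update(double** A, int i, int j, int k);   // placeholder for the one-line elimination update

void eliminate(double** A, int size) {
    tbb::task_scheduler_init init(2);           // cap the scheduler at 2 threads, rather than via grainsize

    for (int k = 0; k < size; ++k) {            // the outer loop stays serial
        // (the short "division" loop may well be better left serial if it is too small to split)
        tbb::parallel_for(
            tbb::blocked_range2d<int>(k + 1, size,    // rows i
                                      k + 1, size),   // columns j
            [&](const tbb::blocked_range2d<int>& r) {
                for (int i = r.rows().begin(); i != r.rows().end(); ++i)
                    for (int j = r.cols().begin(); j != r.cols().end(); ++j)
                        update(A, i, j, k);
            });
        // parallel_for returns only when the whole 2-D block has been processed
    }
}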
