2463 Discussions

## Help on parallel_for

Beginner
706 Views

Hi All,

I am having problem with parallel_for. Basically, I am trying toparallelize the two dimensional FFT of a squared matrix.the function, apply_fft_column( ), implements column-wiseFFT of two NxN square matrices, Via and Vqa. A similar function is used row-wise. The functions work fine serially. When I use it with parallel_for as shown in the codebelow, the matrices return with wrong results. MatDoub and VecDoubare matrix andvector classes,implemented by the new Numerical Recipe. I have used parallel_for successfully before, with separate input and output matrices. In this case, the matrices Via and Vqa are being used both for input and output. Any suggestion? Thanks!

Regards,

Long

`void apply_fft_column(int j, int N, MatDoub &Via, MatDoub &Vqa ){   VecDoub fft_real_in(N);   VecDoub fft_imag_in(N);   VecDoub fft_real_out(N);   VecDoub fft_imag_out(N);   for(int i=0; i   {       fft_real_in = Via;       fft_imag_in = Vqa;       fft_real_in[i+1] = -Via[i+1];       fft_imag_in[i+1] = -Vqa[i+1];   }   fft(N,&fft_real_in[0],&fft_imag_in[0],&fft_real_out[0],&fft_imag_out[0]);   for(int i=0; i   {       Via = fft_real_out;       Vqa = fft_imag_out;   } }class parallel_forward_fft_column{   const int my_N;   MatDoub &my_Via;   MatDoub &my_Vqa;   public:   void operator()(const blocked_range<int>& r) const   {      // const declaration for input data into functor       const int N = my_N;      // non-const reference for output data back to the caller      MatDoub &Via = my_Via;      MatDoub &Vqa = my_Vqa;      for(int j=r.begin(); j!=r.end(); ++j)           apply_fft_column(j,N,Via,Vqa);    }    // default constructor    parallel_forward_fft_column(const int N,                                MatDoub &Via,MatDoub &Vqa) :                                 my_N(N),                                my_Via(Via),my_Vqa(Vqa)    {    }};// auto grain sizevoid Parallel_forward_fft_column(const int N, MatDoub &Via, MatDoub &Vqa){     parallel_forward_fft_column parallel_fft_column(N,Via,Vqa);     parallel_for(blocked_range<int>(0,N),parallel_fft_column,auto_partitioner());}`
14 Replies
Beginner
706 Views
`Some new info on the parallel_for problem. Since the last post, `
`I've discovered that the fft() routine that I used is not thread safe. `
`It uses global variables. I've replaced it with the routine, four1(), `
`from Numerical Recipe. The function to be parallel is now:`
`void apply_fft_column(int j, int N, MatDoub &Via, MatDoub &Vqa){VecDoub data(2*N);int k=0;for(int i=0; i{data = Via;data[k+1] = Vqa;data[k+2] = -Via[i+1];data[k+3] = -Vqa[i+1];k+=4;}four1(data,-1);int l=0;for(int i=0; i{Via = data;Vqa = data[l+1];l+=2;} }With the new routine, the parallel_for from TBB is still giving wrong results. However, if I compiled the code using OpenMP library as shown below provided by Microsoft Visual studio 2008 professional compiler, the routine gives correct answer.`
`#pragma omp parallel for for(int j=0; j{apply_fft_column(j,N,Via,Vqa);}Am I doing something wrong with parallel_for? Perhaps parallel_for can not be used this way?Any suggestions, Thanks!Regards,LongP.S, I am using TBB v2.0 (tbb20_020oss) commercial aligned release.`
New Contributor III
706 Views
At the first glance your TBB code looks OK (except a few tiny nits that are not important right now). Just to be sure, when you compiled OpenMP version of your app, have you added "/openmp" command line option to CL? Without it pragma omp is silently ignored and the code works serially. Thus you could just have overlooked some other problem in the transformation code

Beginner
706 Views

Yes, /openmp is enabled in the project. The application requires vcomp90.dll and definitely run in parallel mode. The interesting thing is thatthe application actually uses mostly TBB and only openmp for the parallel FFT code. I have also ported the application to use openmp solely, and it works just fine. I would like to find out what I do wrong with parallel_for. Any help will be greatly appreciated!

Regards,

Long

P.S. The 2 dimentional parallel FFT routine, my first thought, seems to be an easy program to write. Each FFT is independent from each other, and ease of implementation in openmp confirms it. There must be something I overlook in parallel_for code.

Employee
706 Views

Andrey and metried hard to find a problem in how you use TBB, without any success. Also I looked at four1 (as listed at http://www.ddj.com/cpp/199500857); the function uses only local variables and arguments, and, being called for a local array of data, is thread-safe.

The only weird suggestion I have is to explicitly define a copy constructor in your class used with parallel_for. There is nothing in the class that would cause troubles for a compiler to generate an implicit copy constructor, but who knows...

Also you could try a newer version of TBB, e.g. the most recent commercial-aligned release corresponding to just-released TBB 2.1. It's unlikely it will change anything for you, but again, who knows...

Beginner
706 Views

Thanks. I'll try the suggestions and report back the results.

Regards,

Long

Beginner
706 Views

Hello Alexey, Andrey

I've tried your suggestions of using the explicitly defined copy constructors in my classesand the newTBB 2.1 (commercial aligned release.) I could not get the application to function correctly. Thanks!

Regards,

Long

Employee
706 Views

Hi, Long,

Just for testing, could you try defining a separate out matricesand using them to get the results back instead of writing the results back into the input matrices?

-wooyoung

Beginner
706 Views

I'll give that a try.

Long

Beginner
706 Views

Hello wooyoung,

It's still not working. I have removed the copy constructior from the classes. The code I used is shown below. Sorry for the format, I don't know how to get it back. May be you can spot the errors. Thanks!

p.s. I make sure the matrix size is power of 2. :)

`///////////////////////////////////////////////////////////////////////////////////////////////`
`void`
four1(Doub *data, const Int n, const Int isign) { ...
`}`
`void`
four1(VecDoub_IO &data, const Int isign) {
`four1(&data[0],data.size()/2,isign);`
`}`
`void`
_apply_fft_column(int j, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
`{`
`VecDoub data(2*N);`
int k=0;
for(int i=0; i
`{`
`data = Via_temp;`
`data[k+1] = Vqa_temp;`
`data[k+2] = -Via_temp[i+1];`
`data[k+3] = -Vqa_temp[i+1];`
`k+=4;`
`}`
`four1(data,-1);`
int l=0;
for(int i=0; i
`{`
`Via = data;`
`Vqa = data[l+1];`
`l+=2;`
`} `
`}`
`void`
_apply_fft_row(int i, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
`{`
`VecDoub data(2*N);`
`int k=0;`
for(int j=0; j
`{`
`data = Via_temp;`
`data[k+1] = Vqa_temp;`
`data[k+2] = -Via_temp[j+1];`
`data[k+3] = -Vqa_temp[j+1];`
`k+=4;`
`}`
`four1(data,-1);`
int l=0;
for(int j=0; j
`{`
`Via = data;`
`Vqa = data[l+1];`
`l+=2;`
`} `
`}`
`/////////////////////////////////////////////////////////////////////////////////////`
`class`
parallel_forward_fft_column
`{`
const int my_N;
`MatDoub_I &my_Via_temp;`
`MatDoub_I &my_Vqa_temp;`
`MatDoub_O &my_Via;`
`MatDoub_O &my_Vqa;`
public:
void operator()(const blocked_range<int>& r) const
`{`
// const declaration for input data into functor
const int N = my_N;
`MatDoub_I &Via_temp = my_Via_temp;`
`MatDoub_I &Vqa_temp = my_Vqa_temp;`
// non-con st reference for output data back to the caller
`MatDoub_O &Via = my_Via;`
`MatDoub_O &Vqa = my_Vqa;`
for(int j=r.begin(); j!=r.end(); ++j)
`_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);`
`}`
// default constructor
`parallel_forward_fft_column(`
const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,
`MatDoub_O &Via, MatDoub_O &Vqa) :`
`my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),my_Via(Via),my_Vqa(Vqa)`
`{`
`}`
`};`
`// auto grain size`
`void`
Parallel_forward_fft_column(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
`{`
`parallel_forward_fft_column parallel_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);`
`parallel_for(blocked_range<`
int>(0,N),parallel_fft_column,auto_partitioner());
`}`
`class`
parallel_forward_fft_row
`{`
const int my_N;
`MatDoub_I &my_Via_temp;`
`MatDoub_I &my_Vqa_temp;`
`MatDoub_O &my_Via;`
`MatDoub_O &my_Vqa;`
`public`
:
void operator()(const blocked_ra nge<int>& r) const
`{`
// const declaration for input data into functor
const int N = my_N;
`MatDoub_I &Via_temp = my_Via_temp;`
`MatDoub_I &Vqa_temp = my_Vqa_temp;`
// non-const reference for output data back to the caller
`MatDoub_O &Via = my_Via;`
`MatDoub_O &Vqa = my_Vqa;`
for(int j=r.begin(); j!=r.end(); ++j)
`_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);`
`}`
// default constructor
`parallel_forward_fft_row(`
const int N,MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,
`MatDoub_O &Via, MatDoub_O &Vqa) : `
`my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),`
`my_Via(Via),my_Vqa(Vqa)`
`{`
`}`
`};`
`// auto grain size`
`void`
Parallel_forward_fft_row(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
`{`
`parallel_forward_fft_row parallel_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);`
`parallel_for(blocked_range<`
int>(0,N),parallel_fft_row,auto_partitioner());
`}`
`int main(...)`
`{`
`   ...`
`task_scheduler_init init;`
`MatDoub Via(N,N),Vqa(N,N);`
`...`
`MatDoub Via_temp(N,N),Vqa_temp(N,N);`
for(int j=0;j
for(int i=0;i
`{`
`Via_temp = Via; `
`Vqa_temp = Vqa;`
`}`
`Parallel_forward_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);Parallel_forward_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);...}`
Employee
706 Views

Hi,

Was it failing on both _row and _column versions?

I noticed that _apply_fft_column() is called in parallel_forward_fft_row::().

`class parallel_forward_fft_row`
`{`
`const int my_N;`
`MatDoub_I &my_Via_temp;`
`MatDoub_I &my_Vqa_temp;`
`MatDoub_O &my_Via;`
`MatDoub_O &my_Vqa;`
`public`
:
void operator()(const blocked_range<int>& r) const
`{`
// const declaration for input data into functor
const int N = my_N;
`MatDoub_I &Via_temp = my_Via_temp;`
`MatDoub_I &Vqa_temp = my_Vqa_temp;`
// non-const reference for output data back to the caller
`MatDoub_O &Via = my_Via;`
`MatDoub_O &Vqa = my_Vqa;`
for(int j=r.begin(); j!=r.end(); ++j)
`_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);`
`}`

And, I suspected that since the same matrices were used for inputas well asoutput, when the last row of a subrage (in the column-based version)was read,the output of the first rowof the next subrange was read instead of the inputrow.

Other than the two,the code looks o.k... Maybe you want toput the TBB versionnext to the OMPversion and compareeach rowto see where thetwo versions generate different results?

best regards,

-wooyoung

Beginner
706 Views

Thanks for the good catch! I knew I was doing something wrong.

Yes, it was my mistake when copying and pasting the source codes. I made another error by not re-initializing the Via_temp and Vqa_temp to new values of Via and Vqa After the function _apply_fft_column() call. After these two corrections, the program produces the correct result. The application generates 2D image as final output. So I can tell whether the FFTs are doing their job.

I guess the parallel_for, as currently implemented, CAN NOT be used to process in-place matrices. The input and output matrices or arrays can not be the same. This requirement will double the memory requirements of the application. The reason I try to parallel the FFT codes is for working with large matrices. With large matrices(ex, 20,000 x 20,000), it becomes impractical to use parallel_for. I guess, in this case, openMP is the way to go.

Regards,

Long

The final working code is shown below:

```///////////////////////////////////////////////////////////////////////////////////////////////void four1(Doub *data, const Int n, const Int isign) {}void four1(VecDoub_IO &data, const Int isign) {four1(&data[0],data.size()/2,isign);}void _apply_fft_column(int j, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa){VecDoub data(2*N);int k=0;for(int i=0; i{data = Via_temp;data[k+1] = Vqa_temp;data[k+2] = -Via_temp[i+1];data[k+3] = -Vqa_temp[i+1];k+=4;}four1(data,-1);int l=0;for(int i=0; i
{Via = data;Vqa = data[l+1];l+=2;} }void _apply_fft_row(int i, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa){VecDoub data(2*N);int k=0;for(int j=0; j{data = Via_temp;data[k+1] = Vqa_temp;data[k+2] = -Via_temp[j+1];data[k+3] = -Vqa_temp[j+1];k+=4;}four1(data,-1);int l=0;for(int j=0; j{Via = data;Vqa = data[l+1];l+=2;} }/////////////////////////////////////////////////////////////////////////////////////class parallel_forward_fft_column{const int my_N;MatDoub_I &my_Via_temp;MatDoub_I &my_Vqa_temp;MatDoub_O &my_Via;MatDoub_O &my_Vqa;public:void operator()(const blocked_range<int>& r) const{
;// const declaration for input data into functorconst int N = my_N;MatDoub_I &Via_temp = my_Via_temp;MatDoub_I &Vqa_temp = my_Vqa_temp;// non-const reference for output data back to the callerMatDoub_O &Via = my_Via;MatDoub_O &Vqa = my_Vqa;for(int j=r.begin(); j!=r.end(); ++j)_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);}// default constructorparallel_forward_fft_column(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,MatDoub &Via, MatDoub &Vqa) :my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),my_Via(Via),my_Vqa(Vqa){}};// auto grain sizevoid Parallel_forward_fft_column(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa){parallel_forward_fft_column parallel_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);parallel_for(blocked_range<int>(0,N),parallel_fft_column,auto_partitioner());}class parallel_forward_fft_row{const int my_N;MatDoub_I &my_Via_temp;MatDoub_I &my_Vqa_temp;MatDoub_O &my_Via;MatDoub_O &my_Vqa;public:void operator()(const blocked_range<int>& r) const{// const declaration for input data into functorconst int N = my_N;MatDoub_I &Via_temp = my_Via_temp;MatDoub_I &Vqa_temp = my_Vqa_temp;// non-const reference for output data back to the callerMatDoub_O &Via = my_Via;MatDoub_O &Vqa = my_Vqa;for(int j=r.begin(); j!=r.end(); ++j)_apply_fft_row(j,N,Via_temp,Vqa_temp,Via,Vqa);}// default constructorparallel_forward_fft_row(const int N,MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,MatDoub_O &Via, MatDoub_O &Vqa) : my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),my_Via(Via),my_Vqa(Vqa){}};// auto grain sizevoid Parallel_forward_fft_row(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa){parallel_forward_fft_row parallel_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);parallel_for(blocked_range<int>(0,N),parallel_fft_row,auto_partitioner());}int main(int argc, char *argv[]){MatDoub Via(N,N),Vqa(N,N);...// initialize parallel code heretask_scheduler_init init;MatDoub Via_temp(N,N),Vqa_temp(N,N);for(int j=0;jfor(int i=0;i{Via_temp = Via; Vqa_temp = Vqa;}cout << "2D FFT to spatial domain..." << endl;Parallel_forward_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);for(int j=0;jfor(int i=0;i{Via_temp = Via; Vqa_temp = Vqa;}Parallel_forward_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);...}```
Employee
706 Views

long.to@navy.mil:
I guess the parallel_for, as currently implemented, CAN NOT be used to process in-place matrices. The input and output matrices or arrays can not be the same. This requirement will double the memory requirements of the application. The reason I try to parallel the FFT codes is for working with large matrices. With large matrices(ex, 20,000 x 20,000), it becomes impractical to use parallel_for.

Could you test your guess? It seems to me the actual problem is unrelated to in-place processing. In fact, TBB does nothing with your arrays except passing them by reference into Body objects and across threads.I think this should not prevent in-place processing as long as there is no data dependency between loop iterations. But in the latter case any parallelization wouldn't work correctly.

Beginner
706 Views

Hi,

Each FFT(thread safe version)is doneindependently for each loop iteration. The row operations will not start until all column operations are finished. There should not be any data dependency between loop iterations, unless parallel_for breaks eachFFT loop down to many smaller ones.

Regards,

Long

Employee
706 Views

The work in a user-defined body on the given range of iterations is not broken down further, unless explicitly specified by a nested parallel algortihm.

Thus it seems to me your assumption about in-place processing being impossible with tbb::parallel_for is not correct. You might give it another try now, as the real problem has been identified and fixed.