Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Help on parallel_for

long_to
Beginner
748 Views

Hi All,

I am having problem with parallel_for. Basically, I am trying toparallelize the two dimensional FFT of a squared matrix.the function, apply_fft_column( ), implements column-wiseFFT of two NxN square matrices, Via and Vqa. A similar function is used row-wise. The functions work fine serially. When I use it with parallel_for as shown in the codebelow, the matrices return with wrong results. MatDoub and VecDoubare matrix andvector classes,implemented by the new Numerical Recipe. I have used parallel_for successfully before, with separate input and output matrices. In this case, the matrices Via and Vqa are being used both for input and output. Any suggestion? Thanks!

Regards,

Long

void

apply_fft_column(int j, int N, MatDoub &Via, MatDoub &Vqa )

{

VecDoub fft_real_in(N);

VecDoub fft_imag_in(N);

VecDoub fft_real_out(N);

VecDoub fft_imag_out(N);

for(int i=0; i

{

fft_real_in = Via;

fft_imag_in = Vqa;

fft_real_in[i+1] = -Via[i+1];

fft_imag_in[i+1] = -Vqa[i+1];

}

fft(N,&fft_real_in[0],&fft_imag_in[0],&fft_real_out[0],&fft_imag_out[0]);

for(int i=0; i

{

Via = fft_real_out;

Vqa = fft_imag_out;

}

}

class

parallel_forward_fft_column

{

const int my_N;

MatDoub &my_Via;

MatDoub &my_Vqa;

public

:

void operator()(const blocked_range<int>& r) const

{

// const declaration for input data into functor

const int N = my_N;

// non-const reference for output data back to the caller

MatDoub &Via = my_Via;

MatDoub &Vqa = my_Vqa;

for(int j=r.begin(); j!=r.end(); ++j)

apply_fft_column(j,N,Via,Vqa);

}

// default constructor

parallel_forward_fft_column(

const int N,

MatDoub &Via,MatDoub &Vqa) :

my_N(N),

my_Via(Via),my_Vqa(Vqa)

{

}

};

// auto grain size

void

Parallel_forward_fft_column(const int N, MatDoub &Via, MatDoub &Vqa)

{

parallel_forward_fft_column parallel_fft_column(N,Via,Vqa);

parallel_for(blocked_range<

int>(0,N),parallel_fft_column,auto_partitioner());

}

0 Kudos
14 Replies
long_to
Beginner
748 Views
Some new info on the parallel_for problem. Since the last post, 
I've discovered that the fft() routine that I used is not thread safe. 
It uses global variables. I've replaced it with the routine, four1(), 
from Numerical Recipe. The function to be parallel is now:

void

apply_fft_column(int j, int N, MatDoub &Via, MatDoub &Vqa)

{

VecDoub data(2*N);

int k=0;

for(int i=0; i

{

data = Via;

data[k+1] = Vqa;

data[k+2] = -Via[i+1];

data[k+3] = -Vqa[i+1];

k+=4;

}

four1(data,-1);

int l=0;

for(int i=0; i

{

Via = data;

Vqa = data[l+1];

l+=2;

}

}

With the new routine, the parallel_for from TBB is still giving wrong results. However,

if I compiled the code using OpenMP library as shown below provided by Microsoft Visual

studio 2008 professional compiler, the routine gives correct answer.

#pragma

omp parallel for

for(int j=0; j

{

apply_fft_column(j,N,Via,Vqa);

}

Am I doing something wrong with parallel_for? Perhaps parallel_for can not be used this way?

Any suggestions, Thanks!

Regards,

Long

P.S, I am using TBB v2.0 (tbb20_020oss) commercial aligned release.

0 Kudos
Andrey_Marochko
New Contributor III
748 Views
At the first glance your TBB code looks OK (except a few tiny nits that are not important right now). Just to be sure, when you compiled OpenMP version of your app, have you added "/openmp" command line option to CL? Without it pragma omp is silently ignored and the code works serially. Thus you could just have overlooked some other problem in the transformation codesmiley [:-)]

0 Kudos
long_to
Beginner
748 Views

Yes, /openmp is enabled in the project. The application requires vcomp90.dll and definitely run in parallel mode. The interesting thing is thatthe application actually uses mostly TBB and only openmp for the parallel FFT code. I have also ported the application to use openmp solely, and it works just fine. I would like to find out what I do wrong with parallel_for. Any help will be greatly appreciated!

Regards,

Long

P.S. The 2 dimentional parallel FFT routine, my first thought, seems to be an easy program to write. Each FFT is independent from each other, and ease of implementation in openmp confirms it. There must be something I overlook in parallel_for code.

0 Kudos
Alexey-Kukanov
Employee
748 Views

Andrey and metried hard to find a problem in how you use TBB, without any success. Also I looked at four1 (as listed at http://www.ddj.com/cpp/199500857); the function uses only local variables and arguments, and, being called for a local array of data, is thread-safe.

The only weird suggestion I have is to explicitly define a copy constructor in your class used with parallel_for. There is nothing in the class that would cause troubles for a compiler to generate an implicit copy constructor, but who knows...

Also you could try a newer version of TBB, e.g. the most recent commercial-aligned release corresponding to just-released TBB 2.1. It's unlikely it will change anything for you, but again, who knows...

0 Kudos
long_to
Beginner
748 Views

Thanks. I'll try the suggestions and report back the results.

Regards,

Long

0 Kudos
long_to
Beginner
748 Views

Hello Alexey, Andrey

I've tried your suggestions of using the explicitly defined copy constructors in my classesand the newTBB 2.1 (commercial aligned release.) I could not get the application to function correctly. Thanks!

Regards,

Long

0 Kudos
Wooyoung_K_Intel
Employee
748 Views

Hi, Long,

Just for testing, could you try defining a separate out matricesand using them to get the results back instead of writing the results back into the input matrices?

-wooyoung

0 Kudos
long_to
Beginner
748 Views

I'll give that a try.

Long

0 Kudos
long_to
Beginner
748 Views

Hello wooyoung,

It's still not working. I have removed the copy constructior from the classes. The code I used is shown below. Sorry for the format, I don't know how to get it back. May be you can spot the errors. Thanks!

p.s. I make sure the matrix size is power of 2. :)

///////////////////////////////////////////////////////////////////////////////////////////////
void
four1(Doub *data, const Int n, const Int isign) { ...
}
void
four1(VecDoub_IO &data, const Int isign) {
four1(&data[0],data.size()/2,isign);
}
void
_apply_fft_column(int j, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
{
VecDoub data(2*N);
int k=0;
for(int i=0; i
{
data = Via_temp;
data[k+1] = Vqa_temp;
data[k+2] = -Via_temp[i+1];
data[k+3] = -Vqa_temp[i+1];
k+=4;
}
four1(data,-1);
int l=0;
for(int i=0; i
{
Via = data;
Vqa = data[l+1];
l+=2;
} 
}
void
_apply_fft_row(int i, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
{
VecDoub data(2*N);
int k=0;
for(int j=0; j
{
data = Via_temp;
data[k+1] = Vqa_temp;
data[k+2] = -Via_temp[j+1];
data[k+3] = -Vqa_temp[j+1];
k+=4;
}
four1(data,-1);
int l=0;
for(int j=0; j
{
Via = data;
Vqa = data[l+1];
l+=2;
} 
}
/////////////////////////////////////////////////////////////////////////////////////
class
parallel_forward_fft_column
{
const int my_N;
MatDoub_I &my_Via_temp;
MatDoub_I &my_Vqa_temp;
MatDoub_O &my_Via;
MatDoub_O &my_Vqa;
public:
void operator()(const blocked_range<int>& r) const
{
// const declaration for input data into functor
const int N = my_N;
MatDoub_I &Via_temp = my_Via_temp;
MatDoub_I &Vqa_temp = my_Vqa_temp;
// non-con st reference for output data back to the caller
MatDoub_O &Via = my_Via;
MatDoub_O &Vqa = my_Vqa;
for(int j=r.begin(); j!=r.end(); ++j)
_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);
}
// default constructor
parallel_forward_fft_column(
const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,
MatDoub_O &Via, MatDoub_O &Vqa) :
my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),my_Via(Via),my_Vqa(Vqa)
{
}
};
// auto grain size
void
Parallel_forward_fft_column(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
{
parallel_forward_fft_column parallel_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);
parallel_for(blocked_range<
int>(0,N),parallel_fft_column,auto_partitioner());
}
class
parallel_forward_fft_row
{
const int my_N;
MatDoub_I &my_Via_temp;
MatDoub_I &my_Vqa_temp;
MatDoub_O &my_Via;
MatDoub_O &my_Vqa;
public
:
void operator()(const blocked_ra nge<int>& r) const
{
// const declaration for input data into functor
const int N = my_N;
MatDoub_I &Via_temp = my_Via_temp;
MatDoub_I &Vqa_temp = my_Vqa_temp;
// non-const reference for output data back to the caller
MatDoub_O &Via = my_Via;
MatDoub_O &Vqa = my_Vqa;
for(int j=r.begin(); j!=r.end(); ++j)
_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);
}
// default constructor
parallel_forward_fft_row(
const int N,MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,
MatDoub_O &Via, MatDoub_O &Vqa) : 
my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),
my_Via(Via),my_Vqa(Vqa)
{
}
};
// auto grain size
void
Parallel_forward_fft_row(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
{
parallel_forward_fft_row parallel_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);
parallel_for(blocked_range<
int>(0,N),parallel_fft_row,auto_partitioner());
}
int main(...)
{
   ...

task_scheduler_init init;

MatDoub Via(N,N),Vqa(N,N);
...
MatDoub Via_temp(N,N),Vqa_temp(N,N);
for(int j=0;j
for(int i=0;i
{
Via_temp = Via; 
Vqa_temp = Vqa;
}

Parallel_forward_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);

Parallel_forward_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);

...

}

0 Kudos
Wooyoung_K_Intel
Employee
748 Views

Hi,

Was it failing on both _row and _column versions?

I noticed that _apply_fft_column() is called in parallel_forward_fft_row::().

class
parallel_forward_fft_row
{
const int my_N;
MatDoub_I &my_Via_temp;
MatDoub_I &my_Vqa_temp;
MatDoub_O &my_Via;
MatDoub_O &my_Vqa;
public
:
void operator()(const blocked_range<int>& r) const
{
// const declaration for input data into functor
const int N = my_N;
MatDoub_I &Via_temp = my_Via_temp;
MatDoub_I &Vqa_temp = my_Vqa_temp;
// non-const reference for output data back to the caller
MatDoub_O &Via = my_Via;
MatDoub_O &Vqa = my_Vqa;
for(int j=r.begin(); j!=r.end(); ++j)
_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);
}

And, I suspected that since the same matrices were used for inputas well asoutput, when the last row of a subrage (in the column-based version)was read,the output of the first rowof the next subrange was read instead of the inputrow.

Other than the two,the code looks o.k... Maybe you want toput the TBB versionnext to the OMPversion and compareeach rowto see where thetwo versions generate different results?

best regards,

-wooyoung

0 Kudos
long_to
Beginner
748 Views

Thanks for the good catch! I knew I was doing something wrong.

Yes, it was my mistake when copying and pasting the source codes. I made another error by not re-initializing the Via_temp and Vqa_temp to new values of Via and Vqa After the function _apply_fft_column() call. After these two corrections, the program produces the correct result. The application generates 2D image as final output. So I can tell whether the FFTs are doing their job.

I guess the parallel_for, as currently implemented, CAN NOT be used to process in-place matrices. The input and output matrices or arrays can not be the same. This requirement will double the memory requirements of the application. The reason I try to parallel the FFT codes is for working with large matrices. With large matrices(ex, 20,000 x 20,000), it becomes impractical to use parallel_for. I guess, in this case, openMP is the way to go.

Thanks again for your helps!

Regards,

Long

The final working code is shown below:

///////////////////////////////////////////////////////////////////////////////////////////////

void

four1(Doub *data, const Int n, const Int isign)

{

}

void

four1(VecDoub_IO &data, const Int isign)

{

four1(&data[0],data.size()/2,isign);

}

void

_apply_fft_column(int j, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)

{

VecDoub data(2*N);

int k=0;

for(int i=0; i

{

data = Via_temp;

data[k+1] = Vqa_temp;

data[k+2] = -Via_temp[i+1];

data[k+3] = -Vqa_temp[i+1];

k+=4;

}

four1(data,-1);

int l=0;

for(int i=0; i

{

Via = data;

Vqa = data[l+1];

l+=2;

}

}

void

_apply_fft_row(int i, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)

{

VecDoub data(2*N);

int k=0;

for(int j=0; j

{

data = Via_temp;

data[k+1] = Vqa_temp;

data[k+2] = -Via_temp[j+1];

data[k+3] = -Vqa_temp[j+1];

k+=4;

}

four1(data,-1);

int l=0;

for(int j=0; j

{

Via = data;

Vqa = data[l+1];

l+=2;

}

}

/////////////////////////////////////////////////////////////////////////////////////

class

parallel_forward_fft_column

{

const int my_N;

MatDoub_I &my_Via_temp;

MatDoub_I &my_Vqa_temp;

MatDoub_O &my_Via;

MatDoub_O &my_Vqa;

public

:

void operator()(const blocked_range<int>& r) const

{

  ;// const declaration for input data into functor

const int N = my_N;

MatDoub_I &Via_temp = my_Via_temp;

MatDoub_I &Vqa_temp = my_Vqa_temp;

// non-const reference for output data back to the caller

MatDoub_O &Via = my_Via;

MatDoub_O &Vqa = my_Vqa;

for(int j=r.begin(); j!=r.end(); ++j)

_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);

}

// default constructor

parallel_forward_fft_column(

const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,

MatDoub &Via, MatDoub &Vqa) :

my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),my_Via(Via),my_Vqa(Vqa)

{

}

};

// auto grain size

void

Parallel_forward_fft_column(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)

{

parallel_forward_fft_column parallel_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);

parallel_for(blocked_range<

int>(0,N),parallel_fft_column,auto_partitioner());

}

class

parallel_forward_fft_row

{

const int my_N;

MatDoub_I &my_Via_temp;

MatDoub_I &my_Vqa_temp;

MatDoub_O &my_Via;

MatDoub_O &my_Vqa;

public

:

void operator()(const blocked_range<int>& r) const

{

// const declaration for input data into functor

const int N = my_N;

MatDoub_I &Via_temp = my_Via_temp;

MatDoub_I &Vqa_temp = my_Vqa_temp;

// non-const reference for output data back to the caller

MatDoub_O &Via = my_Via;

MatDoub_O &Vqa = my_Vqa;

for(int j=r.begin(); j!=r.end(); ++j)

_apply_fft_row(j,N,Via_temp,Vqa_temp,Via,Vqa);

}

// default constructor

parallel_forward_fft_row(

const int N,MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,

MatDoub_O &Via, MatDoub_O &Vqa) :

my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),

my_Via(Via),my_Vqa(Vqa)

{

}

};

// auto grain size

void

Parallel_forward_fft_row(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)

{

parallel_forward_fft_row parallel_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);

parallel_for(blocked_range<

int>(0,N),parallel_fft_row,auto_partitioner());

}

int

main(int argc, char *argv[])

{

MatDoub Via(N,N),Vqa(N,N);

...

// initialize parallel code here

task_scheduler_init init;

MatDoub Via_temp(N,N),Vqa_temp(N,N);

for(int j=0;j

for(int i=0;i

{

Via_temp = Via;

Vqa_temp = Vqa;

}

cout <<

"2D FFT to spatial domain..." << endl;

Parallel_forward_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);

for(int j=0;j

for(int i=0;i

{

Via_temp = Via;

Vqa_temp = Vqa;

}

Parallel_forward_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);

...

}

0 Kudos
Alexey-Kukanov
Employee
748 Views

long.to@navy.mil:
I guess the parallel_for, as currently implemented, CAN NOT be used to process in-place matrices. The input and output matrices or arrays can not be the same. This requirement will double the memory requirements of the application. The reason I try to parallel the FFT codes is for working with large matrices. With large matrices(ex, 20,000 x 20,000), it becomes impractical to use parallel_for.

Could you test your guess? It seems to me the actual problem is unrelated to in-place processing. In fact, TBB does nothing with your arrays except passing them by reference into Body objects and across threads.I think this should not prevent in-place processing as long as there is no data dependency between loop iterations. But in the latter case any parallelization wouldn't work correctly.

0 Kudos
long_to
Beginner
748 Views

Hi,

Each FFT(thread safe version)is doneindependently for each loop iteration. The row operations will not start until all column operations are finished. There should not be any data dependency between loop iterations, unless parallel_for breaks eachFFT loop down to many smaller ones.

Regards,

Long

0 Kudos
Alexey-Kukanov
Employee
748 Views

The work in a user-defined body on the given range of iterations is not broken down further, unless explicitly specified by a nested parallel algortihm.

Thus it seems to me your assumption about in-place processing being impossible with tbb::parallel_for is not correct. You might give it another try now, as the real problem has been identified and fixed.

0 Kudos
Reply