- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi All,
I am having problem with parallel_for. Basically, I am trying toparallelize the two dimensional FFT of a squared matrix.the function, apply_fft_column( ), implements column-wiseFFT of two NxN square matrices, Via and Vqa. A similar function is used row-wise. The functions work fine serially. When I use it with parallel_for as shown in the codebelow, the matrices return with wrong results. MatDoub and VecDoubare matrix andvector classes,implemented by the new Numerical Recipe. I have used parallel_for successfully before, with separate input and output matrices. In this case, the matrices Via and Vqa are being used both for input and output. Any suggestion? Thanks!
Regards,
Long
void
apply_fft_column(int j, int N, MatDoub &Via, MatDoub &Vqa ){
VecDoub fft_real_in(N);
VecDoub fft_imag_in(N);
VecDoub fft_real_out(N);
VecDoub fft_imag_out(N);
for(int i=0; ifor(int i=0; i {
fft_real_in = Via
; fft_imag_in = Vqa
; fft_real_in[i+1] = -Via
[i+1]; fft_imag_in[i+1] = -Vqa
[i+1]; }
fft(N,&fft_real_in[0],&fft_imag_in[0],&fft_real_out[0],&fft_imag_out[0]);
{
Via
= fft_real_out; Vqa
= fft_imag_out; }
}
class
parallel_forward_fft_column{
const int my_N;MatDoub &my_Via;
MatDoub &my_Vqa;
public
: void operator()(const blocked_range<int>& r) const{
// const declaration for input data into functor const int N = my_N; // non-const reference for output data back to the callerMatDoub &Via = my_Via;
MatDoub &Vqa = my_Vqa;
for(int j=r.begin(); j!=r.end(); ++j)apply_fft_column(j,N,Via,Vqa);
}
// default constructorparallel_forward_fft_column(
const int N,MatDoub &Via,MatDoub &Vqa) :
my_N(N),
my_Via(Via),my_Vqa(Vqa)
{
}
};
// auto grain size
void
Parallel_forward_fft_column(const int N, MatDoub &Via, MatDoub &Vqa){
parallel_forward_fft_column parallel_fft_column(N,Via,Vqa);
parallel_for(blocked_range<
int>(0,N),parallel_fft_column,auto_partitioner());}
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Some new info on the parallel_for problem. Since the last post,
I've discovered that the fft() routine that I used is not thread safe.
It uses global variables. I've replaced it with the routine, four1(),
from Numerical Recipe. The function to be parallel is now:
void
apply_fft_column(int j, int N, MatDoub &Via, MatDoub &Vqa){
VecDoub data(2*N);
int k=0;for(int i=0; iint l=0;for(int i=0; i {
data
= Via ; data[k+1] = Vqa
; data[k+2] = -Via
[i+1]; data[k+3] = -Vqa
[i+1]; k+=4;
}
four1(data,-1);
{
Via
= data ; Vqa
= data[l+1]; l+=2;
}
}
With the new routine, the parallel_for from TBB is still giving wrong results. However,
if I compiled the code using OpenMP library as shown below provided by Microsoft Visual
studio 2008 professional compiler, the routine gives correct answer.
#pragma
omp parallel for for(int j=0; j{
apply_fft_column(j,N,Via,Vqa);
}
Am I doing something wrong with parallel_for? Perhaps parallel_for can not be used this way?
Any suggestions, Thanks!
Regards,
Long
P.S, I am using TBB v2.0 (tbb20_020oss) commercial aligned release.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes, /openmp is enabled in the project. The application requires vcomp90.dll and definitely run in parallel mode. The interesting thing is thatthe application actually uses mostly TBB and only openmp for the parallel FFT code. I have also ported the application to use openmp solely, and it works just fine. I would like to find out what I do wrong with parallel_for. Any help will be greatly appreciated!
Regards,
Long
P.S. The 2 dimentional parallel FFT routine, my first thought, seems to be an easy program to write. Each FFT is independent from each other, and ease of implementation in openmp confirms it. There must be something I overlook in parallel_for code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Andrey and metried hard to find a problem in how you use TBB, without any success. Also I looked at four1 (as listed at http://www.ddj.com/cpp/199500857); the function uses only local variables and arguments, and, being called for a local array of data, is thread-safe.
The only weird suggestion I have is to explicitly define a copy constructor in your class used with parallel_for. There is nothing in the class that would cause troubles for a compiler to generate an implicit copy constructor, but who knows...
Also you could try a newer version of TBB, e.g. the most recent commercial-aligned release corresponding to just-released TBB 2.1. It's unlikely it will change anything for you, but again, who knows...
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks. I'll try the suggestions and report back the results.
Regards,
Long
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello Alexey, Andrey
I've tried your suggestions of using the explicitly defined copy constructors in my classesand the newTBB 2.1 (commercial aligned release.) I could not get the application to function correctly. Thanks!
Regards,
Long
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi, Long,
Just for testing, could you try defining a separate out matricesand using them to get the results back instead of writing the results back into the input matrices?
-wooyoung
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'll give that a try.
Long
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello wooyoung,
It's still not working. I have removed the copy constructior from the classes. The code I used is shown below. Sorry for the format, I don't know how to get it back. May be you can spot the errors. Thanks!
p.s. I make sure the matrix size is power of 2. :)
///////////////////////////////////////////////////////////////////////////////////////////////
voidfour1(Doub *data, const Int n, const Int isign) { ...
}
voidfour1(VecDoub_IO &data, const Int isign) {
four1(&data[0],data.size()/2,isign);
}
void_apply_fft_column(int j, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
{
VecDoub data(2*N);int k=0;for(int i=0; i
{
data= Via_temp ;
data[k+1] = Vqa_temp;
data[k+2] = -Via_temp[i+1];
data[k+3] = -Vqa_temp[i+1];
k+=4;
}
four1(data,-1);
{
Via= data ;
Vqa= data[l+1];
l+=2;
}
}
void_apply_fft_row(int i, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
{
VecDoub data(2*N);
int k=0;for(int j=0; j
{
data= Via_temp ;
data[k+1] = Vqa_temp;
data[k+2] = -Via_temp[j+1];
data[k+3] = -Vqa_temp[j+1];
k+=4;
}
four1(data,-1);
{
Via= data ;
Vqa= data[l+1];
l+=2;
}
}
/////////////////////////////////////////////////////////////////////////////////////
classparallel_forward_fft_column
{const int my_N;
MatDoub_I &my_Via_temp;
MatDoub_I &my_Vqa_temp;
MatDoub_O &my_Via;
MatDoub_O &my_Vqa;public:void operator()(const blocked_range<int>& r) const
{// const declaration for input data into functorconst int N = my_N;
MatDoub_I &Via_temp = my_Via_temp;
MatDoub_I &Vqa_temp = my_Vqa_temp;// non-con st reference for output data back to the caller
MatDoub_O &Via = my_Via;
MatDoub_O &Vqa = my_Vqa;for(int j=r.begin(); j!=r.end(); ++j)
_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);
}// default constructor
parallel_forward_fft_column(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,
MatDoub_O &Via, MatDoub_O &Vqa) :
my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),my_Via(Via),my_Vqa(Vqa)
{
}
};
// auto grain size
voidParallel_forward_fft_column(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
{
parallel_forward_fft_column parallel_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);
parallel_for(blocked_range<int>(0,N),parallel_fft_column,auto_partitioner());
}
classparallel_forward_fft_row
{const int my_N;
MatDoub_I &my_Via_temp;
MatDoub_I &my_Vqa_temp;
MatDoub_O &my_Via;
MatDoub_O &my_Vqa;
public:void operator()(const blocked_ra nge<int>& r) const
{// const declaration for input data into functorconst int N = my_N;
MatDoub_I &Via_temp = my_Via_temp;
MatDoub_I &Vqa_temp = my_Vqa_temp;// non-const reference for output data back to the caller
MatDoub_O &Via = my_Via;
MatDoub_O &Vqa = my_Vqa;for(int j=r.begin(); j!=r.end(); ++j)
_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);
}// default constructor
parallel_forward_fft_row(const int N,MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,
MatDoub_O &Via, MatDoub_O &Vqa) :
my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),
my_Via(Via),my_Vqa(Vqa)
{
}
};
// auto grain size
voidParallel_forward_fft_row(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa)
{
parallel_forward_fft_row parallel_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);
parallel_for(blocked_range<int>(0,N),parallel_fft_row,auto_partitioner());
}
int main(...)
{
...
task_scheduler_init init;
MatDoub Via(N,N),Vqa(N,N);
...
MatDoub Via_temp(N,N),Vqa_temp(N,N);for(int j=0;j
{
Via_temp= Via ;
Vqa_temp= Vqa ;
}
Parallel_forward_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);
Parallel_forward_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);
...
}
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Was it failing on both _row and _column versions?
I noticed that _apply_fft_column() is called in parallel_forward_fft_row::().
classparallel_forward_fft_row
{
const int my_N;
MatDoub_I &my_Via_temp;
MatDoub_I &my_Vqa_temp;
MatDoub_O &my_Via;
MatDoub_O &my_Vqa;
public:void operator()(const blocked_range<int>& r) const
{// const declaration for input data into functorconst int N = my_N;
MatDoub_I &Via_temp = my_Via_temp;
MatDoub_I &Vqa_temp = my_Vqa_temp;// non-const reference for output data back to the caller
MatDoub_O &Via = my_Via;
MatDoub_O &Vqa = my_Vqa;for(int j=r.begin(); j!=r.end(); ++j)
_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);
}
And, I suspected that since the same matrices were used for inputas well asoutput, when the last row of a subrage (in the column-based version)was read,the output of the first rowof the next subrange was read instead of the inputrow.
Other than the two,the code looks o.k... Maybe you want toput the TBB versionnext to the OMPversion and compareeach rowto see where thetwo versions generate different results?
best regards,
-wooyoung
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the good catch! I knew I was doing something wrong.
Yes, it was my mistake when copying and pasting the source codes. I made another error by not re-initializing the Via_temp and Vqa_temp to new values of Via and Vqa After the function _apply_fft_column() call. After these two corrections, the program produces the correct result. The application generates 2D image as final output. So I can tell whether the FFTs are doing their job.
I guess the parallel_for, as currently implemented, CAN NOT be used to process in-place matrices. The input and output matrices or arrays can not be the same. This requirement will double the memory requirements of the application. The reason I try to parallel the FFT codes is for working with large matrices. With large matrices(ex, 20,000 x 20,000), it becomes impractical to use parallel_for. I guess, in this case, openMP is the way to go.
Thanks again for your helps!
Regards,
Long
The final working code is shown below:
///////////////////////////////////////////////////////////////////////////////////////////////
void
four1(Doub *data, const Int n, const Int isign){
}
void
four1(VecDoub_IO &data, const Int isign){
four1(&data[0],data.size()/2,isign);
}
void
_apply_fft_column(int j, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa){
VecDoub data(2*N);
int k=0;for(int i=0; iint l=0;for(int i=0; i {
data
= Via_temp ; data[k+1] = Vqa_temp
; data[k+2] = -Via_temp
[i+1]; data[k+3] = -Vqa_temp
[i+1]; k+=4;
}
four1(data,-1);
{
Via
= data ; Vqa
= data[l+1]; l+=2;
}
}
void
_apply_fft_row(int i, int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa){
VecDoub data(2*N);
int k=0;for(int j=0; jint l=0;for(int j=0; j {
data
= Via_temp ; data[k+1] = Vqa_temp
; data[k+2] = -Via_temp[j+1];
data[k+3] = -Vqa_temp[j+1];
k+=4;
}
four1(data,-1);
{
Via
= data ; Vqa
= data[l+1]; l+=2;
}
}
/////////////////////////////////////////////////////////////////////////////////////
class
parallel_forward_fft_column{
const int my_N;MatDoub_I &my_Via_temp;
MatDoub_I &my_Vqa_temp;
MatDoub_O &my_Via;
MatDoub_O &my_Vqa;
public
:void operator()(const blocked_range<int>& r) const{
;// const declaration for input data into functorconst int N = my_N;MatDoub_I &Via_temp = my_Via_temp;
MatDoub_I &Vqa_temp = my_Vqa_temp;
// non-const reference for output data back to the callerMatDoub_O &Via = my_Via;
MatDoub_O &Vqa = my_Vqa;
for(int j=r.begin(); j!=r.end(); ++j)_apply_fft_column(j,N,Via_temp,Vqa_temp,Via,Vqa);
}
// default constructorparallel_forward_fft_column(
const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,MatDoub &Via, MatDoub &Vqa) :
my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),my_Via(Via),my_Vqa(Vqa)
{
}
};
// auto grain size
void
Parallel_forward_fft_column(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa){
parallel_forward_fft_column parallel_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);
parallel_for(blocked_range<
int>(0,N),parallel_fft_column,auto_partitioner());}
class
parallel_forward_fft_row{
const int my_N;MatDoub_I &my_Via_temp;
MatDoub_I &my_Vqa_temp;
MatDoub_O &my_Via;
MatDoub_O &my_Vqa;
public
:void operator()(const blocked_range<int>& r) const{
// const declaration for input data into functorconst int N = my_N;MatDoub_I &Via_temp = my_Via_temp;
MatDoub_I &Vqa_temp = my_Vqa_temp;
// non-const reference for output data back to the callerMatDoub_O &Via = my_Via;
MatDoub_O &Vqa = my_Vqa;
for(int j=r.begin(); j!=r.end(); ++j)_apply_fft_row(j,N,Via_temp,Vqa_temp,Via,Vqa);
}
// default constructorparallel_forward_fft_row(
const int N,MatDoub_I &Via_temp, MatDoub_I &Vqa_temp,MatDoub_O &Via, MatDoub_O &Vqa) :
my_N(N),my_Via_temp(Via_temp),my_Vqa_temp(Vqa_temp),
my_Via(Via),my_Vqa(Vqa)
{
}
};
// auto grain size
void
Parallel_forward_fft_row(const int N, MatDoub_I &Via_temp, MatDoub_I &Vqa_temp, MatDoub_O &Via, MatDoub_O &Vqa){
parallel_forward_fft_row parallel_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);
parallel_for(blocked_range<
int>(0,N),parallel_fft_row,auto_partitioner());}
int
main(int argc, char *argv[]){
MatDoub Via(N,N),Vqa(N,N);
...
// initialize parallel code heretask_scheduler_init init;
MatDoub Via_temp(N,N),Vqa_temp(N,N);
for(int j=0;jfor(int i=0;i "2D FFT to spatial domain..." << endl; {
Via_temp
= Via ; Vqa_temp
= Vqa ; }
cout <<
Parallel_forward_fft_column(N,Via_temp,Vqa_temp,Via,Vqa);
for(int j=0;jfor(int i=0;i {
Via_temp
= Via ; Vqa_temp
= Vqa ; }
Parallel_forward_fft_row(N,Via_temp,Vqa_temp,Via,Vqa);
...
}
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
long.to@navy.mil:
I guess the parallel_for, as currently implemented, CAN NOT be used to process in-place matrices. The input and output matrices or arrays can not be the same. This requirement will double the memory requirements of the application. The reason I try to parallel the FFT codes is for working with large matrices. With large matrices(ex, 20,000 x 20,000), it becomes impractical to use parallel_for.
Could you test your guess? It seems to me the actual problem is unrelated to in-place processing. In fact, TBB does nothing with your arrays except passing them by reference into Body objects and across threads.I think this should not prevent in-place processing as long as there is no data dependency between loop iterations. But in the latter case any parallelization wouldn't work correctly.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Each FFT(thread safe version)is doneindependently for each loop iteration. The row operations will not start until all column operations are finished. There should not be any data dependency between loop iterations, unless parallel_for breaks eachFFT loop down to many smaller ones.
Regards,
Long
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The work in a user-defined body on the given range of iterations is not broken down further, unless explicitly specified by a nested parallel algortihm.
Thus it seems to me your assumption about in-place processing being impossible with tbb::parallel_for is not correct. You might give it another try now, as the real problem has been identified and fixed.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page