for (stage1 = 0; stage1 < lim; stage1++)
{
    butterfly_index1 = butterfly_index2;
    butterfly_index2 = butterfly_index2 + butterfly_index2;
    k = -1 * 2 * PI / butterfly_index2;
    k_next = 0.0;

    for (stage2 = 0; stage2 < butterfly_index1; stage2++)
    {
        sine = sin(k_next);
        cosine = cos(k_next);
        k_next = k_next + k;

        FFT.sine = sine;
        FFT.cosine = cosine;
        FFT.n2 = butterfly_index2;
        FFT.loop_init = stage2;
        FFT.n1 = butterfly_index1;

        parallel_for(blocked_range<...>(...), FFT); // (range arguments lost in the original post)

        // for (int k = stage2; k < SIZE; k += butterfly_index2)
        // {
        //     temp_real = (cosine) * X[k + butterfly_index1].real - (sine) * X[k + butterfly_index1].img;
        //     temp_img  = (sine)   * X[k + butterfly_index1].real + (cosine) * X[k + butterfly_index1].img;
        //     X[k + butterfly_index1].real = X[k].real - temp_real;
        //     X[k + butterfly_index1].img  = X[k].img  - temp_img;
        //     X[k].real = X[k].real + temp_real;
        //     X[k].img  = X[k].img  + temp_img;
        // }
    }
}

parallel_for body:

for (int k = r.begin(); k != r.end(); k++)
{
    if (k != loop_init)
    {
        if ((k - loop_init) % n2 != 0)
            continue;
    }
    temp_real = (cosine) * X[k + n1].real - (sine) * X[k + n1].img;
    temp_img  = (sine)   * X[k + n1].real + (cosine) * X[k + n1].img;
    X[k + n1].real = X[k].real - temp_real;
    X[k + n1].img  = X[k].img  - temp_img;
    X[k].real = X[k].real + temp_real;
    X[k].img  = X[k].img  + temp_img;
}

I am using TBB version 3.0 update 5 on Fedora Core 13 (64-bit). If I use the commented-out serial code instead of parallel_for, the performance is thousands of times higher (the execution time is lower).

Please suggest improvements to my code.

Thanks.

My suggestion was to look for parallelism *above* the outermost loop, while avoiding breaking dependencies. I was surprised to see that one application of FFT is... large-number multiplication!

(Added) Perhaps a sizeable part of the slowdown can be explained by the comparison with r.end() in the for-loop header; instead, read its value into a local variable before the loop and compare against that, which may allow important optimisations that the compiler will not attempt now.

(Added) It would probably be possible to execute a small-enough range automatically without spawning any task, by querying range.is_divisible() in both run() overloads (a question/suggestion to Intel). But that wouldn't be of any use here, with the given code asking for the evaluation of perhaps 4 subranges of at most 256 iterations each (which seems too small). And even if the grainsize is set to prevent subdivision, the gain (or non-loss) from not spawning a single task may well be insignificant until the evaluation of r.end() is hoisted out of the loop as suggested above.

But the question is why the performance is so poor when parallelizing the innermost loop, in spite of controlling the chunk size of the workload.

thanks for your assistance.

When you say "concurrent vector of data", did you mean a plain C array or a tbb::concurrent_vector?

Writing a fast parallel FFT on a modern shared-memory machine is surprisingly tricky. A straightforward parallelization of the loop form of Cooley-Tukey has relatively poor cache behavior and high synchronization overhead; divide-and-conquer strategies generally perform much better. For example, FFTW (http://www.fftw.org/) uses divide-and-conquer strategies. See also Figure 2.9 of http://www.cs.berkeley.edu/~vazirani/algorithms/chap2.pdf .

I believe it is possible to write a parallel FFT using only two parallel_for invocations, but I do not know the details. See http://www.fftw.org/parallel/parallel-fftw.html , the subsection titled "One-dimensional Parallel FFT", for a hint of how to do this.

if (butterfly_index2 < threshold)
    parallel_for(blocked_range<...>(...), FFT); // (range arguments lost in the original post)
else
{
    for (int k = stage2; k < SIZE; k += butterfly_index2)
    {
        temp_real = (cosine) * X[k + butterfly_index1].real - (sine) * X[k + butterfly_index1].img;
        temp_img  = (sine)   * X[k + butterfly_index1].real + (cosine) * X[k + butterfly_index1].img;
        X[k + butterfly_index1].real = X[k].real - temp_real;
        X[k + butterfly_index1].img  = X[k].img  - temp_img;
        X[k].real = X[k].real + temp_real;
        X[k].img  = X[k].img  + temp_img;
    }
}

This switches between the serial and parallel versions depending on a threshold value determined experimentally. As this is not computationally intensive code, the maximum performance improvement I was able to get is about 10%. However, if I add a redundant compute-intensive loop (i.e. for(int test = 0; test < 10000; test++);) just as an experiment, the improvement reaches up to 60% on a dual-core system and 600% on an 8-core machine.

Forcing just one thread to evaluate the FFT (by giving a chunk size greater than the original iteration space) slightly degrades performance (5 to 6%).

Though for a 1024-point FFT, it may be difficult to get much speedup anyway. Each subproblem will have the flow structure of a 32-point FFT (albeit with different twiddle factors). There are O(N lg N) operations in each of these subproblems. With N = 32, the number of operations is going to fall far short of the rule-of-thumb 10,000-100,000 operations per task needed to get good speedup. It might be better to look at a higher level in the code to parallelize.

[bash]#include "tbb/tbb.h"
#include <complex>

// Do decimation on x[0], x[stride], x[stride*2], ... x[stride*(n-1)]
// for all stages s in [sFirst,sLast).
void Decimate( int sFirst, int sLast, std::complex<double> x[], size_t n, size_t stride );

// In-place FFT of array x that has length 1<<lgN.
void FFT( std::complex<double> x[], int lgN )
{
    // Chop into k subproblems of size m and solve them
    int lgM = lgN/2;
    size_t m = size_t(1)<<lgM;
    // ... (the rest of the listing was lost in the forum formatting)
}
[/bash]

I think it will be better to allow TBB to decide the chunk size rather than statically assigning it; it will use the hardware resources efficiently, I guess.
