for( stage1 = 0; stage1 < lim; stage1++ )
{
    butterfly_index1 = butterfly_index2;
    butterfly_index2 = butterfly_index2 + butterfly_index2;
    k = -1 * 2 * PI / butterfly_index2;
    k_next = 0.0;
    for( stage2 = 0; stage2 < butterfly_index1; stage2++ )
    {
        sine = sin(k_next);
        cosine = cos(k_next);
        k_next = k_next + k;
        FFT.sine = sine;
        FFT.cosine = cosine;
        FFT.n2 = butterfly_index2;
        FFT.loop_init = stage2;
        FFT.n1 = butterfly_index1;
        parallel_for( blocked_range<int>(stage2, SIZE), FFT );
        // Serial equivalent of the parallel_for call above:
        // for( int k = stage2; k < SIZE; k += butterfly_index2 )
        // {
        //     temp_real = cosine * X[k + butterfly_index1].real - sine * X[k + butterfly_index1].img;
        //     temp_img  = sine   * X[k + butterfly_index1].real + cosine * X[k + butterfly_index1].img;
        //     X[k + butterfly_index1].real = X[k].real - temp_real;
        //     X[k + butterfly_index1].img  = X[k].img  - temp_img;
        //     X[k].real = X[k].real + temp_real;
        //     X[k].img  = X[k].img  + temp_img;
        // }
    }
}
parallel_for body:

for( int k = r.begin(); k != r.end(); k++ )
{
    if( k != loop_init )
    {
        if( (k - loop_init) % n2 != 0 )
            continue;
    }
    temp_real = cosine * X[k + n1].real - sine * X[k + n1].img;
    temp_img  = sine   * X[k + n1].real + cosine * X[k + n1].img;
    X[k + n1].real = X[k].real - temp_real;
    X[k + n1].img  = X[k].img  - temp_img;
    X[k].real = X[k].real + temp_real;
    X[k].img  = X[k].img  + temp_img;
}
I am using TBB version 3.0 update 5 on Fedora Core 13 (64-bit). If I use the commented serial code instead of parallel_for, the performance is thousands of times higher (the execution time is lower).
Please suggest improvements to my code.
Thanks.
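For context, a minimal sketch of how the body object passed to parallel_for above might be declared; the class name FFTBody, the Complex struct, and the global X are assumptions, since the post shows only the operator() loop:

#include "tbb/tbb.h"

struct Complex { double real, img; };
extern Complex X[];   // assumed global array, as used in the code above

class FFTBody {       // hypothetical name for the FFT object set up above
public:
    double sine, cosine;
    int n1, n2, loop_init;
    void operator()( const tbb::blocked_range<int>& r ) const {
        for( int k = r.begin(); k != r.end(); k++ ) {
            // Only touch indices loop_init, loop_init+n2, loop_init+2*n2, ...
            if( k != loop_init && (k - loop_init) % n2 != 0 )
                continue;
            double temp_real = cosine * X[k + n1].real - sine * X[k + n1].img;
            double temp_img  = sine   * X[k + n1].real + cosine * X[k + n1].img;
            X[k + n1].real = X[k].real - temp_real;
            X[k + n1].img  = X[k].img  - temp_img;
            X[k].real = X[k].real + temp_real;
            X[k].img  = X[k].img  + temp_img;
        }
    }
};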
But the question is: why is the performance poor when parallelizing the innermost loop, in spite of controlling the chunk size of the workload?
My suggestion was to look for parallelism above the outermost loop, without breaking dependencies. I was surprised to see that one application of FFT is... number multiplication!
(Added) Perhaps a sizeable part of the slowdown can be explained by the comparison with r.end() in the for-loop header; instead, read its value into a local variable and use that in the comparison, which may allow important optimisations that the compiler will not attempt now.
(Added) It would probably be possible to execute a small-enough range automatically without spawning any task, by querying range.is_divisible() in both run() overloads (question/suggestion to Intel), but that wouldn't be of any use here, with the given code asking for the evaluation of maybe 4 subranges of at most 256 iterations each (which seems too small). And even if the grainsize is set to prevent subdivision, the gain (or non-loss) from not spawning a single task may well be insignificant until the evaluation of r.end() is hoisted out of the loop as suggested above.
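For illustration, a minimal sketch of the suggested hoist on a trivial body (the Scale functor and its array are placeholders, not code from this thread):

#include "tbb/tbb.h"

struct Scale {
    double* a;
    void operator()( const tbb::blocked_range<int>& r ) const {
        const int end = r.end();                 // read r.end() once into a local
        for( int k = r.begin(); k != end; k++ )  // compare against the local,
            a[k] *= 2.0;                         // not against r.end() every iteration
    }
};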
Thanks for your assistance.
When you say "concurrent vector of data", did you mean a plain C array or a tbb::concurrent_vector?
Writing a fast parallel FFT on a modern shared-memory machine is surprisingly tricky. A straightforward parallelization of the loop form of Cooley-Tukey has relatively poor cache behavior and high synchronization overhead. Divide-and-conquer strategies generally perform much better; e.g., FFTW (http://www.fftw.org/) uses divide-and-conquer strategies. See also Figure 2.9 of http://www.cs.berkeley.edu/~vazirani/algorithms/chap2.pdf .
I believe it is possible to write a parallel FFT using only two parallel_for invocations, but I do not know the details. See http://www.fftw.org/parallel/parallel-fftw.html , the subsection titled "One-dimensional Parallel FFT", for a hint of how to do this.
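A structural sketch of that two-pass idea, under stated assumptions: fft_small and twiddle below are hypothetical serial helpers (declared but not implemented here), the length n1*n2 input is viewed as an n1 x n2 row-major matrix, and the result is left in transposed digit order (no final transpose):

#include <complex>
#include "tbb/tbb.h"

typedef std::complex<double> C;

// Assumed serial helpers, not shown here:
void fft_small( C* x, size_t n, size_t stride );   // in-place strided FFT of n points
void twiddle( C* x, size_t n1, size_t n2 );        // multiply by exp(-2*pi*i*j1*j2/(n1*n2))

struct ColumnPass {        // n2 independent size-n1 FFTs down the columns
    C* x; size_t n1, n2;
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t c = r.begin(); c != r.end(); ++c )
            fft_small( x + c, n1, n2 );            // column c has stride n2
    }
};

struct RowPass {           // n1 independent size-n2 FFTs along the rows
    C* x; size_t n1, n2;
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t row = r.begin(); row != r.end(); ++row )
            fft_small( x + row*n2, n2, 1 );        // each row is contiguous
    }
};

void fft_two_pass( C* x, size_t n1, size_t n2 ) {
    ColumnPass cp = { x, n1, n2 };
    tbb::parallel_for( tbb::blocked_range<size_t>(0, n2), cp );  // first parallel_for
    twiddle( x, n1, n2 );
    RowPass rp = { x, n1, n2 };
    tbb::parallel_for( tbb::blocked_range<size_t>(0, n1), rp );  // second parallel_for
}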
if( butterfly_index2 < threshold )
    parallel_for( blocked_range<int>(stage2, SIZE), FFT );
else
{
    for( int k = stage2; k < SIZE; k += butterfly_index2 )
    {
        temp_real = cosine * X[k + butterfly_index1].real - sine * X[k + butterfly_index1].img;
        temp_img  = sine   * X[k + butterfly_index1].real + cosine * X[k + butterfly_index1].img;
        X[k + butterfly_index1].real = X[k].real - temp_real;
        X[k + butterfly_index1].img  = X[k].img  - temp_img;
        X[k].real = X[k].real + temp_real;
        X[k].img  = X[k].img  + temp_img;
    }
}
This switches between the serial and parallel versions depending on the value of threshold, which is determined experimentally. As this is not computationally intensive code, the maximum performance improvement I was able to get was about 10%. However, if I add redundant computationally intensive code (e.g. for(int test=0;test<10000;test++);) just as an experiment, then the performance improvement reaches up to 60% on a dual-core system and 600% on an 8-core machine.
Using just one thread to evaluate the FFT (by giving a chunk size greater than the original iteration space) slightly degrades performance (by 5 to 6%), as sketched below.
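For reference, a sketch of that one-thread experiment, assuming the FFT body and SIZE from the earlier code: a grainsize as large as the whole range makes it indivisible, and simple_partitioner never splits below the grainsize.

// Entire range is one indivisible chunk, so a single thread executes it.
parallel_for( blocked_range<int>(0, SIZE, SIZE), FFT, simple_partitioner() );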
[bash]#include <complex>
#include "tbb/tbb.h"

// Do decimation on x[0], x[stride], x[stride*2], ... x[stride*(n-1)]
// for all stages s in [sFirst,sLast).
void Decimate( int sFirst, int sLast, std::complex<double> x[], size_t n, size_t stride );

// In-place FFT of array x that has length 1<<lgN.
void FFT( std::complex<double> x[], int lgN ) {
    // Chop into k subproblems of size m and solve them
    int lgM = lgN/2;
    size_t m = size_t(1)<<lgM;
    ...
}[/bash]
Though for a 1024-point FFT, it may be difficult to get much speedup anyway. Each subproblem will have the flow structure of a 32-point FFT (albeit with different twiddle factors). There are O(N lg N) operations on each of these subproblems. With N=32, the number of operations is going to fall far short of the rule-of-thumb 10,000-100,000 operations per task needed to get good speedup. It might be better to look at a higher level in the code to parallelize.
I think it will be better to allow TBB to decide the chunk size rather than assigning it statically; I guess it will use the hardware resources more efficiently.
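A minimal sketch of that, assuming the same FFT body and SIZE as earlier: leave the grainsize at its default of 1 and pass auto_partitioner, which lets the TBB runtime choose the chunk sizes.

// No explicit grainsize; auto_partitioner picks chunk sizes at run time.
parallel_for( blocked_range<int>(0, SIZE), FFT, auto_partitioner() );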
