Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

parallel_for for nested loops

Ahmed_Numan
Beginner
897 Views
I am trying to encode the FFT (Cooley-Tukey algorithm) in TBB. In this algorithm I can parallelize only the innermost loop, but parallelizing that loop degrades the performance of the code by a factor of thousands. I have tried long arrays (up to 2^20) and different partitioning options, but they have no effect on performance. Here is my code. SIZE = 1024 and X is a concurrent vector of data (a structure with two float members) of size 1024.

for (stage1 = 0; stage1 < lim; stage1++)
{
    butterfly_index1 = butterfly_index2;
    butterfly_index2 = butterfly_index2 + butterfly_index2;
    k = -1 * 2 * PI / butterfly_index2;
    k_next = 0.0;
    for (stage2 = 0; stage2 < butterfly_index1; stage2++)
    {
        sine = sin(k_next);
        cosine = cos(k_next);
        k_next = k_next + k;
        FFT.sine = sine;
        FFT.cosine = cosine;
        FFT.n2 = butterfly_index2;
        FFT.loop_init = stage2;
        FFT.n1 = butterfly_index1;
        parallel_for(blocked_range<int>(stage2, SIZE, 512), FFT, simple_partitioner());
        // for (int k = stage2; k < SIZE; k += butterfly_index2)
        // {
        //     temp_real = (cosine) * X[k + butterfly_index1].real - (sine) * X[k + butterfly_index1].img;
        //     temp_img  = (sine) * X[k + butterfly_index1].real + (cosine) * X[k + butterfly_index1].img;
        //     X[k + butterfly_index1].real = X[k].real - temp_real;
        //     X[k + butterfly_index1].img  = X[k].img - temp_img;
        //     X[k].real = X[k].real + temp_real;
        //     X[k].img  = X[k].img + temp_img;
        // }
    }
}

parallel_for body:

for (int k = r.begin(); k != r.end(); k++)
{
    if (k != loop_init)
    {
        if ((k - loop_init) % n2 != 0)
            continue;
    }
    temp_real = (cosine) * X[k + n1].real - (sine) * X[k + n1].img;
    temp_img  = (sine) * X[k + n1].real + (cosine) * X[k + n1].img;
    X[k + n1].real = X[k].real - temp_real;
    X[k + n1].img  = X[k].img - temp_img;
    X[k].real = X[k].real + temp_real;
    X[k].img  = X[k].img + temp_img;
}

I am using TBB version 3.0 update 5 on Fedora Core 13 (64-bit). If, instead of parallel_for, I use the commented code, the performance is thousands of times higher (the execution time is lower).

Please suggest improvements to my code.
Thanks.


9 Replies
RafSchietekat
Valued Contributor III
You're not kidding... isn't there an opportunity for parallelism around the outermost loop, in the application domain?
Ahmed_Numan
Beginner
In the butterfly diagram of the FFT, the first stage must be computed before the second stage. This means the outer loop (which iterates over the stages) must remain sequential for the algorithm to be correct. Parallelizing this loop processes the stages in arbitrary order and produces a wrong answer; I have experimented with it and got a wrong answer. Parallelizing the outer loop improves performance, but with wrong results.
But the question is: why does innermost-loop parallelization perform so poorly, in spite of controlling the chunk size of the workload?
RafSchietekat
Valued Contributor III
TBB considerably lowers the cost of doing parallel business compared to direct use of threads, but not to zero, so you always have to observe a minimum amount of work that you shouldn't try to subdivide further. But even if you set a limit that prevents such subdivision, the very act of using tasks below a reasonable amortisation threshold kills performance. TBB's parallel_sort algorithm will start by evaluating whether it even makes sense to spawn a single task, but that's because it assumes that comparisons are cheap compared to other parts of the algorithm that it knows; parallel_for cannot make that assumption, so you are responsible for avoiding its use when there's insufficient parallelism. Would you agree that this explains your observations?

My suggestion was to look for parallelism above the outermost loop, while avoiding breaking dependencies. I was surprised to see that one application of FFT is... numbers multiplication!

(Added) Perhaps a sizeable part of the slowdown can be explained by the comparison with r.end() in the for loop header; instead, read its value into a local value and use that in the comparison, which may allow important optimisations that the compiler will not attempt now.

(Added) It would probably be possible to automatically execute a small-enough range without spawning any task, by querying range.is_divisible() in both run() overloads (question/suggestion to Intel), but that wouldn't be of any use here with the given code asking for the evaluation of maybe 4 subranges of at most 256 iterations (which seems too small), and, even if the grainsize is set to prevent subdivision, the gain (or non-loss) from not spawning a single task may well be insignificant until the evaluation of r.end() is hoisted out of the loop as suggested above.
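The hoisting suggested in the first addendum can be illustrated without TBB at all. In the sketch below, Range is a minimal stand-in for tbb::blocked_range<size_t>, and both function names are hypothetical; the point is only the shape of the loop header before and after the change.

```cpp
#include <cstddef>
#include <vector>

// Minimal stand-in for tbb::blocked_range<size_t> (illustrative only).
struct Range {
    size_t b, e;
    size_t begin() const { return b; }
    size_t end()   const { return e; }
};

// As originally written: r.end() is re-evaluated in every iteration,
// which can prevent the compiler from applying loop optimisations.
double sum_slow(const Range& r, const std::vector<double>& x) {
    double s = 0;
    for (size_t k = r.begin(); k != r.end(); ++k) s += x[k];
    return s;
}

// Suggested change: read the end once into a local and compare against that.
double sum_hoisted(const Range& r, const std::vector<double>& x) {
    double s = 0;
    const size_t end = r.end();              // hoisted out of the loop header
    for (size_t k = r.begin(); k != end; ++k) s += x[k];
    return s;
}
```

Both functions compute the same result; the second form merely makes the loop bound visibly invariant to the compiler.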
Ahmed_Numan
Beginner
Changing the header of the for loop in the body of parallel_for (reading r.end() into a local variable) improved the performance. Furthermore, switching between the serial version and parallel_for improved the performance a lot. I have now changed the code so that it selects the serial or parallel version depending on the number of multiplications in one iteration, and this has improved the performance.
Thanks for your assistance.
RafSchietekat
Valued Contributor III
Improved by how much, and for what sizes? Are you seeing a significant difference between using a single task (grainsize strictly larger than initial range size) and a serial loop?
ARCH_R_Intel
Employee

When you say "concurrent vector of data", did you mean a plain C array or a tbb::concurrent_vector? A tbb::concurrent_vector would make the code quite slow.

Writing a fast parallel FFT on a modern shared-memory machine is surprisingly tricky. A straightforward parallelization of the loop form of Cooley-Tukey has relatively poor cache behavior and high synchronization overhead. Divide-and-conquer strategies generally perform much better. E.g. FFTW (http://www.fftw.org/) uses divide-and-conquer strategies. See also Figure 2.9 of http://www.cs.berkeley.edu/~vazirani/algorithms/chap2.pdf .

I believe it is possible to write a parallel FFT using only two parallel_for invocations, but do not know the details. See http://www.fftw.org/parallel/parallel-fftw.html , the subsection titled "One-dimensional Parallel FFT", for a hint of how to do this.
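As a point of reference, the divide-and-conquer shape described above can be sketched as a textbook serial radix-2 recursion. This is illustrative code, not FFTW's; the function name is hypothetical, and the two recursive calls being independent is what makes this form natural to parallelize (e.g. with tbb::parallel_invoke).

```cpp
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Textbook recursive radix-2 Cooley-Tukey FFT; a.size() must be a power of two.
void fft(std::vector<cplx>& a) {
    const size_t n = a.size();
    if (n <= 1) return;
    // Split into even- and odd-indexed subsequences.
    std::vector<cplx> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even);  // independent subproblem 1
    fft(odd);   // independent subproblem 2
    const double PI = std::acos(-1.0);
    for (size_t i = 0; i < n / 2; ++i) {
        // Combine the halves with the twiddle factor exp(-2*pi*i/n * j).
        cplx t = std::polar(1.0, -2.0 * PI * double(i) / double(n)) * odd[i];
        a[i]         = even[i] + t;
        a[i + n / 2] = even[i] - t;
    }
}
```

Each level of the recursion does O(n) work, so the recursion tree exposes large independent subtrees near the root, unlike the stage-by-stage loop form where only the inner loop is parallel.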

Ahmed_Numan
Beginner
My new code is

if (butterfly_index2 < threshold)
    parallel_for(blocked_range<int>(stage2, SIZE), FFT);
else
{
    for (int k = stage2; k < SIZE; k += butterfly_index2)
    {
        temp_real = (cosine) * X[k + butterfly_index1].real - (sine) * X[k + butterfly_index1].img;
        temp_img  = (sine) * X[k + butterfly_index1].real + (cosine) * X[k + butterfly_index1].img;
        X[k + butterfly_index1].real = X[k].real - temp_real;
        X[k + butterfly_index1].img  = X[k].img - temp_img;
        X[k].real = X[k].real + temp_real;
        X[k].img  = X[k].img + temp_img;
    }
}

which switches between the serial and parallel versions depending on the value of threshold, which is determined experimentally. As this is not computationally intensive code, the maximum performance improvement I could get was about 10%. However, if I add redundant computationally intensive code (i.e. for(int test=0;test<10000;test++);) just as an experiment, then the performance improvement exceeds 60% on a dual-core system and 600% on an 8-core machine.
Using just one thread to evaluate the FFT (by giving a chunk size greater than the original iteration space) slightly degrades performance (5 to 6%).
ARCH_R_Intel
Employee
Here's an outline of how to do it in TBB efficiently with just two parallel_for invocations. It's been a long time since I've written an FFT, and there may be off-by-one sorts of errors and missing parameters. But I'm fairly sure the basic idea is correct.

[cpp]#include <complex>
#include "tbb/tbb.h"

// Do decimation on x[0], x[stride], x[stride*2], ... x[stride*(n-1)]
// for all stages s in [sFirst,sLast).
void Decimate( int sFirst, int sLast, std::complex<float> x[], size_t n, size_t stride );

// In-place FFT of array x that has length 1<<lgN.
void ParallelFFT( std::complex<float> x[], int lgN ) {
    // Chop into k subproblems of size m and solve them
    int lgM = lgN/2;
    size_t m = size_t(1)<<lgM;
    size_t k = size_t(1)<<(lgN-lgM);
    // First lgM stages: k interleaved subproblems of size m.
    tbb::parallel_for( size_t(0), k, [=]( size_t i ) {
        Decimate( 0, lgM, x+i, m, k );
    } );
    // Remaining stages: m contiguous subproblems of size k.
    tbb::parallel_for( size_t(0), m, [=]( size_t i ) {
        Decimate( lgM, lgN, x+i*k, k, 1 );
    } );
}[/cpp]
Though for a 1024-point FFT, it may be difficult to get much speedup anyway. Each subproblem will have the flow structure of a 32-point FFT (albeit with different twiddle factors). There are O(N lg N) operations on each of these subproblems. With N=32, the number of operations is going to fall far short of the rule-of-thumb 10,000-100,000 operations per task to get good speedup. It might be better to look at a higher level in the code to parallelize.
Ahmed_Numan
Beginner
It is a good idea to parallelize over the array on which the FFT is to be applied, rather than directly parallelizing the FFT algorithm: the FFT itself just requires some multiplications, is not computationally intensive, and does not give an appreciable performance improvement. Applying the FFT simultaneously to different chunks of the data, as in the example code by Robison, is more efficient than applying a parallelized FFT to the whole data.
I think it would be better to let TBB decide the stride size rather than statically assigning it; I guess it would use the hardware resources more efficiently.
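A minimal sketch of that chunk-level parallelism is below, using std::thread in place of tbb::parallel_for so it stands alone. chunk_fft is a hypothetical stand-in for a real serial FFT routine (here it just scales the data so the sketch is runnable); in the real code each chunk would get the serial Cooley-Tukey kernel.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

using cplx = std::complex<float>;

// Placeholder for a serial per-chunk FFT routine (hypothetical).
void chunk_fft(cplx* x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) x[i] *= 2.0f;
}

// Apply the kernel to independent chunks in parallel, one thread per chunk.
// With TBB this would be a tbb::parallel_for over the chunk indices instead.
void parallel_over_chunks(std::vector<cplx>& data, std::size_t chunk) {
    std::vector<std::thread> workers;
    for (std::size_t off = 0; off < data.size(); off += chunk)
        workers.emplace_back(chunk_fft, data.data() + off,
                             std::min(chunk, data.size() - off));
    for (auto& t : workers) t.join();  // wait for all chunks to finish
}
```

Because the chunks never overlap, no synchronization is needed inside the kernel, which is exactly why this decomposition scales better than parallelizing the butterfly loop.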