Intel Community › Software › Software Development SDKs and Libraries › Intel® oneAPI Threading Building Blocks › parallel_for for nested loops
Ahmed_Numan (Beginner), 04-02-2011 02:22 AM

parallel_for for nested loops

for(stage1 = 0; stage1 < lim; stage1++)
{
    butterfly_index1 = butterfly_index2;
    butterfly_index2 = butterfly_index2 + butterfly_index2;
    k = -1 * 2 * PI / butterfly_index2;
    k_next = 0.0;
    for(stage2 = 0; stage2 < butterfly_index1; stage2++)
    {
        sine = sin(k_next);
        cosine = cos(k_next);
        k_next = k_next + k;
        FFT.sine = sine;
        FFT.cosine = cosine;
        FFT.n2 = butterfly_index2;
        FFT.loop_init = stage2;
        FFT.n1 = butterfly_index1;
        parallel_for(blocked_range<int>(stage2, SIZE), FFT);
        // Serial version:
        // for(int k = stage2; k < SIZE; k += butterfly_index2)
        // {
        //     temp_real = cosine * X[k + butterfly_index1].real - sine * X[k + butterfly_index1].img;
        //     temp_img  = sine * X[k + butterfly_index1].real + cosine * X[k + butterfly_index1].img;
        //     X[k + butterfly_index1].real = X[k].real - temp_real;
        //     X[k + butterfly_index1].img  = X[k].img  - temp_img;
        //     X[k].real = X[k].real + temp_real;
        //     X[k].img  = X[k].img  + temp_img;
        // }
    }
}

The parallel_for body:

for(int k = r.begin(); k != r.end(); k++)
{
    if(k != loop_init)
    {
        if((k - loop_init) % n2 != 0)
            continue;
    }
    temp_real = cosine * X[k + n1].real - sine * X[k + n1].img;
    temp_img  = sine * X[k + n1].real + cosine * X[k + n1].img;
    X[k + n1].real = X[k].real - temp_real;
    X[k + n1].img  = X[k].img  - temp_img;
    X[k].real = X[k].real + temp_real;
    X[k].img  = X[k].img  + temp_img;
}

I am using TBB version 3.0 update 5 on Fedora Core 13 (64-bit). If I use the commented serial code instead of parallel_for, the performance is thousands of times higher (the execution time is far lower).

Please suggest improvements to my code.

Thanks.

1 Solution

RafSchietekat (Black Belt), 04-02-2011 02:57 PM

My suggestion was to look for parallelism.

(Added) Perhaps a sizeable part of the slowdown can be explained by the comparison with r.end() in the for-loop header; instead, read its value into a local variable before the loop and compare against that, which may allow important optimisations that the compiler will not attempt otherwise.
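A minimal sketch of the hoisting; `SimpleRange` here is a hypothetical stand-in for `tbb::blocked_range<int>` so the snippet is self-contained without TBB, and the trivial summing loop stands in for the butterfly work:

```cpp
#include <cassert>

// Hypothetical stand-in for tbb::blocked_range<int>; a real TBB body
// would take a const tbb::blocked_range<int>& instead.
struct SimpleRange {
    int b, e;
    int begin() const { return b; }
    int end() const { return e; }
};

long sum_range(const SimpleRange& r) {
    long s = 0;
    const int e = r.end();                // read the bound once, before the loop
    for (int k = r.begin(); k != e; ++k)  // compare against the local copy
        s += k;                           // placeholder for the real loop body
    return s;
}
```

Comparing against the local `e` lets the compiler keep the bound in a register instead of re-reading it through the range object on every iteration.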

(Added) It would probably be possible to execute a small-enough range automatically without spawning any task, by querying range.is_divisible() in both run() overloads (question/suggestion to Intel). That would not help here, though: the given code asks for the evaluation of perhaps 4 subranges of at most 256 iterations each, which seems too small, and even if the grainsize is set to prevent subdivision, the gain (or avoided loss) from not spawning a single task may well be insignificant until the evaluation of r.end() is hoisted out of the loop as suggested above.


9 Replies

RafSchietekat (Black Belt), 04-02-2011 11:23 AM

Ahmed_Numan (Beginner), 04-02-2011 02:02 PM

But the question is why the performance is poor when parallelizing the innermost loop, despite controlling the chunk size of the workload.

RafSchietekat (Black Belt), 04-02-2011 02:57 PM

My suggestion was to look for parallelism.

(Added) Perhaps a sizeable part of the slowdown can be explained by the comparison with r.end() in the for-loop header; instead, read its value into a local variable before the loop and compare against that, which may allow important optimisations that the compiler will not attempt otherwise.

(Added) It would probably be possible to execute a small-enough range automatically without spawning any task, by querying range.is_divisible() in both run() overloads (question/suggestion to Intel). That would not help here, though: the given code asks for the evaluation of perhaps 4 subranges of at most 256 iterations each, which seems too small, and even if the grainsize is set to prevent subdivision, the gain (or avoided loss) from not spawning a single task may well be insignificant until the evaluation of r.end() is hoisted out of the loop as suggested above.

Ahmed_Numan (Beginner), 04-03-2011 09:48 AM

Thanks for your assistance.

RafSchietekat (Black Belt), 04-03-2011 11:02 PM

ARCH_R_Intel (Employee), 04-04-2011 07:36 AM

Writing a fast parallel FFT on a modern shared-memory machine is surprisingly tricky. A straightforward parallelization of the loop form of Cooley-Tukey has relatively poor cache behavior and high synchronization overhead. Divide-and-conquer strategies generally perform much better; e.g., FFTW (http://www.fftw.org/) uses divide-and-conquer strategies. See also Figure 2.9 of http://www.cs.berkeley.edu/~vazirani/algorithms/chap2.pdf .

I believe it is possible to write a parallel FFT using only two parallel_for invocations, but I do not know the details. See http://www.fftw.org/parallel/parallel-fftw.html , the subsection titled "One-dimensional Parallel FFT", for a hint of how to do this.
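To make the divide-and-conquer structure concrete, here is a minimal serial radix-2 FFT sketch on `std::vector<std::complex<double>>` (not the poster's struct-of-real/img layout), with comments marking where a parallelization such as `tbb::parallel_invoke` could split the work:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Minimal serial radix-2 decimation-in-time FFT (length must be a power of two).
// The two recursive calls are independent: that is exactly where a
// divide-and-conquer parallelization (e.g. tbb::parallel_invoke, with a
// serial cutoff for small n) would attach.
void fft(std::vector<cd>& a) {
    const std::size_t n = a.size();
    if (n == 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even);  // independent subproblems: candidates for parallel_invoke
    fft(odd);
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n / 2; ++k) {
        // Butterfly: combine the halves with twiddle factor e^{-2*pi*i*k/n}.
        const cd t = std::polar(1.0, -2.0 * pi * double(k) / double(n)) * odd[k];
        a[k]         = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}
```

In a parallel version the recursion would fall back to this serial form below some experimentally chosen size, for the per-task operation-count reasons discussed later in the thread.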

Ahmed_Numan (Beginner), 04-05-2011 07:43 AM

if(butterfly_index2 < threshold)
    parallel_for(blocked_range<int>(stage2, SIZE), FFT);
else
{
    for(int k = stage2; k < SIZE; k += butterfly_index2)
    {
        temp_real = cosine * X[k + butterfly_index1].real - sine * X[k + butterfly_index1].img;
        temp_img  = sine * X[k + butterfly_index1].real + cosine * X[k + butterfly_index1].img;
        X[k + butterfly_index1].real = X[k].real - temp_real;
        X[k + butterfly_index1].img  = X[k].img  - temp_img;
        X[k].real = X[k].real + temp_real;
        X[k].img  = X[k].img  + temp_img;
    }
}

This switches between the serial and parallel versions depending on a threshold value that is determined experimentally. As this code is not computationally intensive, I was able to get a performance improvement of at most about 10%. However, if I add redundant computationally intensive code (e.g. for(int test=0;test<10000;test++);) just as an experiment, the performance improvement reaches 60% on a dual-core system and 600% on an 8-core machine.

Using just one thread to evaluate the FFT (by giving a chunk size greater than the original iteration space) slightly degrades performance (5 to 6%).
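The serial/parallel cutoff described above is a common pattern. A minimal generic sketch (no TBB; the helper name and signature are hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Dispatch to a parallel implementation only when there is enough work
// to amortize task-scheduling overhead; otherwise run serially.
// `threshold` is found experimentally, as in the post above.
template <typename Serial, typename Parallel>
void run_with_cutoff(std::size_t work, std::size_t threshold,
                     Serial serial, Parallel parallel) {
    if (work < threshold)
        serial();    // too little work: avoid spawning tasks at all
    else
        parallel();
}
```

Here `work` would be the inner iteration count (roughly SIZE / butterfly_index2 in the poster's loop), and the two callables would wrap the serial loop and the parallel_for call.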

ARCH_R_Intel (Employee), 04-05-2011 11:38 AM

Though for a 1024-point FFT, it may be difficult to get much speedup anyway. Each subproblem will have the flow structure of a 32-point FFT (albeit with different twiddle factors). There are O(N lg N) operations in each of these subproblems. With N=32, the number of operations falls far short of the rule-of-thumb 10,000-100,000 operations per task needed to get good speedup. It might be better to look at a higher level in the code to parallelize.

[bash]
#include "tbb/tbb.h"

// Do decimation on x[0], x[stride], x[stride*2], ... x[stride*(n-1)]
// for all stages s in [sFirst,sLast).
void Decimate( int sFirst, int sLast, std::complex x[], size_t n, size_t stride );

// In-place FFT of array x that has length 1<<lgN.
void FFT( std::complex x[], int lgN ) {
    // Chop into k subproblems of size m and solve them
    int lgM = lgN/2;
    size_t m = 1<<lgM;
    ...
}
[/bash]

Ahmed_Numan (Beginner), 04-05-2011 12:03 PM

I think it would be better to allow TBB to decide the stride size rather than assigning it statically; I guess that would use the hardware resources more efficiently.
