Hi
I would like to convert this code to TBB, but I have not found any similar terms in TBB for private and shared variables.
```cpp
#pragma omp parallel for default(none) private(i, Aux, pW) shared(stride, n, A, B, C, W)
for (i = 0; i <= n - 1; i++) {
    pW = W + i * stride;
    Aux.re = pW->re * C[i].re - pW->im * C[i].im;
    Aux.im = pW->re * C[i].im + pW->im * C[i].re;
    A[i].re = B[i].re + Aux.re;
    A[i].im = B[i].im + Aux.im;
    A[i+n].re = B[i].re - Aux.re;
    A[i+n].im = B[i].im - Aux.im;
}
```
This is the way I re-implemented it with a lambda function:
```cpp
tbb::parallel_for(tbb::blocked_range<int>(0, n),
    [&](const tbb::blocked_range<int> &r) {
        for (int i = r.begin(), i_end = r.end(); i < i_end; i++) {
            pW = W + i * stride;
            Aux.re = pW->re * C[i].re - pW->im * C[i].im;
            Aux.im = pW->re * C[i].im + pW->im * C[i].re;
            A[i].re = B[i].re + Aux.re;
            A[i].im = B[i].im + Aux.im;
            A[i+n].re = B[i].re - Aux.re;
            A[i+n].im = B[i].im - Aux.im;
        }
    });
```
---
Hi,
Why do we need private and shared variables in OpenMP? Because the compiler cannot know our algorithm and what we really want. Therefore, we need to tell the compiler which variables are local (private) to each iteration of the loop and which variables share their state across loop iterations.
Why don't we need similar terms in TBB? Because the same idea can be expressed with ordinary language constructs. Any variable captured by the body/lambda by reference (&) can be considered "shared" because it is not copied, so any change to it is visible in other iterations. If you need local variables, you can declare them inside the body/lambda. Note that variables captured by the body/lambda by value (=) cannot be modified inside the loop in TBB.
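For illustration, here is a minimal sketch of how the capture list maps onto the OpenMP clauses (the function and variable names are just placeholders):
```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void scale_and_shift(double *data, int n, double scale) {
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        // data and n are captured by reference: effectively "shared".
        // scale is captured by value: a read-only copy that cannot be modified here.
        [&data, &n, scale](const tbb::blocked_range<int> &r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                double tmp = data[i] * scale;  // declared inside the body: effectively "private"
                data[i] = tmp + 1.0;
            }
        });
}
```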
As for your example, the TBB version has races related to the pW variable and the Aux structure. It can be fixed if you declare these variables inside the loop:
```cpp
tbb::parallel_for(tbb::blocked_range<int>(0, n),
    [&](const tbb::blocked_range<int> &r) {
        for (int i = r.begin(), i_end = r.end(); i < i_end; i++) {
            auto pW = W + i * stride;
            decltype(Aux) aux;
            aux.re = pW->re * C[i].re - pW->im * C[i].im;
            aux.im = pW->re * C[i].im + pW->im * C[i].re;
            A[i].re = B[i].re + aux.re;
            A[i].im = B[i].im + aux.im;
            A[i+n].re = B[i].re - aux.re;
            A[i+n].im = B[i].im - aux.im;
        }
    });
```
---
If I declare pW and Aux outside the for loop, are they still private? Something like this:
```cpp
tbb::parallel_for(tbb::blocked_range<int>(0, n),
    [&](const tbb::blocked_range<int> &r) {
        Complex Aux, *pW;
        for (int i = r.begin(), i_end = r.end(); i < i_end; i++) {
            pW = W + i * stride;
            Aux.re = pW->re * C[i].re - pW->im * C[i].im;
            Aux.im = pW->re * C[i].im + pW->im * C[i].re;
            A[i].re = B[i].re + Aux.re;
            A[i].im = B[i].im + Aux.im;
            A[i+n].re = B[i].re - Aux.re;
            A[i+n].im = B[i].im - Aux.im;
        }
    });
```
---
There is a huge performance gap between OpenMP and TBB in this example; the TBB version is really slow.
---
Maybe because the compiler can vectorise the OpenMP code but not the TBB code? You have to take advantage of both levels of parallelism, so you may have to give the compiler a hint for the inner level, and reassure it that those i and i+n locations won't interfere with each other, perhaps with "#pragma simd" or somesuch. Start with that, and then you can start worrying about how to deal with alignment if you want.
BTW, bringing those declarations outside the for loop won't affect correctness, but you're not gaining any performance that way, I think. The compiler has many tricks up its sleeve that make apparent reuse of memory locations irrelevant.
(2016-01-29 Added) The declarations must be inside if vectorisation is added, though.
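For example, a minimal sketch of what such a hint could look like on the inner loop (this assumes the Intel compiler's #pragma simd, the Complex type and the arrays from the code above, and that the programmer has verified the loop really has no loop-carried dependences):
```cpp
tbb::parallel_for(tbb::blocked_range<int>(0, n),
    [&](const tbb::blocked_range<int> &r) {
        // Ask the compiler to vectorise this chunk; the programmer guarantees
        // that the [i] and [i+n] accesses do not interfere within the chunk.
        #pragma simd
        for (int i = r.begin(), i_end = r.end(); i < i_end; i++) {
            auto pW = W + i * stride;
            Complex aux;  // declared inside the loop, as required for vectorisation
            aux.re = pW->re * C[i].re - pW->im * C[i].im;
            aux.im = pW->re * C[i].im + pW->im * C[i].re;
            A[i].re = B[i].re + aux.re;
            A[i].im = B[i].im + aux.im;
            A[i+n].re = B[i].re - aux.re;
            A[i+n].im = B[i].im - aux.im;
        }
    });
```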
---
The OpenMP implementation might assume that pW is not aliased with A. The TBB implementation might not make that assumption. You might try attributing the pointer pW with restrict, _restrict, __restrict (whatever your compiler uses).
Lacking that, you could remove pW-> and use W[i * stride], which should benefit from common sub-expression elimination.
It is an observation of mine that the vectorization optimization process often loses opportunities when pointers are involved as opposed to using subscripts.
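A minimal sketch of that variant (the __restrict spelling, the Complex type, and the surrounding variables are assumptions carried over from the code above):
```cpp
tbb::parallel_for(tbb::blocked_range<int>(0, n),
    [&](const tbb::blocked_range<int> &r) {
        // Restrict-qualified local views tell the compiler the arrays do not alias;
        // subscripts replace the roaming pW pointer.
        Complex *__restrict pA = A;
        const Complex *__restrict pB = B;
        const Complex *__restrict pC = C;
        const Complex *__restrict pW = W;
        for (int i = r.begin(), i_end = r.end(); i < i_end; i++) {
            Complex aux;
            aux.re = pW[i * stride].re * pC[i].re - pW[i * stride].im * pC[i].im;
            aux.im = pW[i * stride].re * pC[i].im + pW[i * stride].im * pC[i].re;
            pA[i].re = pB[i].re + aux.re;
            pA[i].im = pB[i].im + aux.im;
            pA[i+n].re = pB[i].re - aux.re;
            pA[i+n].im = pB[i].im - aux.im;
        }
    });
```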
Jim Dempsey
---
Could you provide an example that demonstrates the slowdown of TBB against OpenMP, please? If that is not possible, could you provide more information about your experiments? How big is n? What data type is used inside Complex (float or double)? What machine do you use in your experiments (how many sockets/cores/threads, and is the application 32-bit or 64-bit)? Do you have other parallel regions (OpenMP or TBB) in your application?
As for vectorization, if you use the Intel compiler you can use the /Qopt-report /Qopt-report-phase:vec switches to check whether the loop is vectorized or not.
---
Yes, as I mentioned earlier, I have this Fast Fourier Transform code with OpenMP, which you can download from http://sourceforge.net/projects/ompscr/. It is located in \application\c_FFT.
I have pushed my re-implemented code to this repository: https://github.com/shayany/FFT
Thanks
---
It looks like there is an issue in FFT/tbb_cpp__fft.cpp:142:
```cpp
tbb::parallel_for(tbb::blocked_range<size_t>(0, 1), ApplySolver(A, a, W, n, stride, D), ap1);
```
The range is [0; 1), so TBB cannot know that index "1" is also processed inside ApplySolver. Only one task will be created and the code is serialized. With OpenMP, the compiler knows that
for ( i=0; i<=1; i++ ) ...
contains two iterations.
As for affinity_partitioner, I am not sure it makes sense here because the algorithm is not repeated many times. It is better not to specify any partitioner in this case.
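A minimal sketch of the corrected call, assuming ApplySolver is adjusted to process exactly the indices in the range it receives, and with no explicit partitioner as suggested above:
```cpp
// The half-open range [0, 2) contains two iterations, matching the
// OpenMP loop "for (i = 0; i <= 1; i++)", so TBB can create two tasks.
tbb::parallel_for(tbb::blocked_range<size_t>(0, 2),
                  ApplySolver(A, a, W, n, stride, D));
```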
---
So it was a trick question (I only looked at the code presented here)... Just to state the obvious, "tbb::blocked_range<size_t>(0, 2)" would solve that problem, so it's not a TBB limitation, or at most a trivial one.
I have a feeling that "#pragma simd" would also solve any inter-array aliasing issues, but I may well be mistaken about that. On the other hand, "restrict" would certainly not inform the compiler that n is larger than r.size() or that therefore intra-array aliasing is nothing to worry about. Let us know what combination(s) work(s); my guess is the pragma with or, quite likely, without the restrict.
---
Hi Raf,
I would rather suggest using #pragma ivdep, because #pragma simd should be used with care (it can break not only performance but also correctness). Consider the following example:
```cpp
#include <iostream>
#include <vector>

void prefixSumIvdep( int *a, int *b, int n ) {
    #pragma ivdep
    for ( int i = 0; i < n; ++i ) {
        b[i] += a[i];
    }
}

void prefixSumSimd( int *a, int *b, int n ) {
    #pragma simd
    for ( int i = 0; i < n; ++i ) {
        b[i] += a[i];
    }
}

int main() {
    const int N = 100;
    {
        std::vector<int> a( N, 1 );
        prefixSumIvdep( a.data(), a.data() + 1, N - 1 );
        std::cout << "ivdep: The last element of the prefix sum is " << a.back() << std::endl;
    }
    {
        std::vector<int> a( N, 1 );
        prefixSumSimd( a.data(), a.data() + 1, N - 1 );
        std::cout << "simd: The last element of the prefix sum is " << a.back() << std::endl;
    }
    return 0;
}
```
The output is
```
ivdep: The last element of the prefix sum is 100
simd: The last element of the prefix sum is 2
```
I agree that it is not a very good example because the loop cannot be parallelized (easily); however, it demonstrates the #pragma simd potential hazard. An interesting discussion can be found on StackOverflow.
---
Raf,
This is not a case of "it demonstrates the #pragma simd potential hazard".
Rather, this is a case of "misuse of #pragma simd is hazardous".
The simd pragma is a binding contract between the programmer and the compiler, where the programmer declares no vector dependency for the loop that follows.
Jim Dempsey
---
Jim, I presume you meant to address Alex, not me. And I concur.
Generally, the safer ivdep is indeed recommended (never mind a few extra instructions and some bloat for a fallback implementation that might never be executed, if it protects you against occasional silly mistakes in a large body of code), but my understanding is that sometimes the compiler still gets spooked unnecessarily, which you would only find out by inspecting the vectorisation report, or inserts a test that is too conservative, which you would only find out by comparative benchmarking (please correct me if I'm mistaken). However, if you're starting with a presumably correct and tested "omp parallel" for loop, and want to vectorise TBB-fabricated chunks of it, I think you have a good use case for using simd and thereby forcing the compiler's hand.
Besides, is this Java or what? No, it's C++, the language for real programmers, who eat undefined behaviour for breakfast! :-)