Load Balancing Between Two Versions of an Algorithm...

victor_d_1 · ‎12-17-2016

I'd like to use two versions of an algorithms to split the total work (such as a very larger array where each element needs to be processed by the algorithm). Is it possible to use TBB to split the work of all of the array elements in such a way that the faster version of the algorithm will process more elements of the array and the slower will process fewer elements of the array. For example, if the faster version is 2X faster then it will process 2/3 of all of the array elements and the slower version will process 1/3.

Alexei_K_Intel · ‎12-18-2016

Could you explain some details about your algorithm, please? Do you need to process different elements of the array depending on its value? E.g. if the element positive then run one algorithm, if negative than run another algorithm. How the algorithm decide if it should use the fast algorithm or the slow one?

Or, do you want to run two tasks: one for the first algorithm and another for the second algorithm? But why do you have different work in one array?

Regards, Alex

victor_d_1 · ‎12-18-2016

The processing of the data by both algorithms is exactly the same. Think of the two algorithms as workers that work at different speeds: one is faster than the other, but both do exactly the same work. Working together they process faster than each alone, because they can work in parallel. The work in the array is exactly the same. I just want to be able to use two workers in parallel that each work at different speed/rate.

Alexei_K_Intel · ‎01-26-2017

Could you clarify your idea? Is your algorithm initially serial? Is it some loop processing elements in the array, e.g.

for ( int i = 0; i<N; ++i )
    doSomething( arr );

Regards,
Alex

victor_d_1 · ‎02-19-2017

Let say one algorithm is written in C++ and another is optimized using SSE/SIMD instructions, both running on multi-core CPU. I'd like them to split the work in such a way where the faster version of the algorithm would do more work than the slower one, proportionally.

Vladimir_P_1234567890 · ‎02-20-2017

For example you can use token-based approach.

But why you don't want to use sse/simd kernel ("algorithm") for all elements in the iteration space? So you can use parallel_for to parallel the iteration space and simd kernel for the blocked range. It will match the VIPO approach (vectorize on innermost level and parallel on outermost one).

--Vladimir

jimdempseyatthecove · ‎02-22-2017

atomic<int> pick;
pick = 0;
parallel_invoke(
  [&](){ for(int i=pick++; i<N; i=pick++) doMethodOne( arr); },
  [&](){ for(int i=pick++; i<N; i=pick++) doMethodDwo( arr); }
);

.OR.

atomic<int> pick;
pick = 0;
int stride = 2;
parallel_invoke(
  [&](){ for(int i = pick.fetch_add(stride); i<N; i=pick.fetch_add(stride))
                  for(int j = i; j < std::min(j+stride, N); ++j)
                          doMethodOne( arr); },
  [&](){ for(int i = pick.fetch_add(stride); i<N; i=pick.fetch_add(stride))
                  for(int j = i; j < std::min(j+stride, N); ++j)
                          doMethodTwo( arr); }
);

Note, the above is using only two threads. You would process elements individually whenever the doMethod... cost is relatively high.
you may wish to pick multiple elements (batch size) when the doMethod... is relatively low. As to choce of batch size (stride), that is TBD.

Jim Dempsey