Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

"Unbalanced" parallel scan

I'm working on moving some legacy code that we have over to parallel algorithms. I've run across a weird one today that will work well with a parallel scan, however the "Initial Scan" is much simpler than the "Final Scan" and I'm wondering if TBB will handle that well. Generally, from what I've found in the docs, a the initial and final scan in a parallel scan do about the same amount of work/take the same amount of time. E.g., in a cumulative sum, the initial scan calculates the running sum and the final scan calculates the running sum and stores it. The only difference between the two is the extra store step. Here's a simplified version of the algorithm I'm trying to replicate (the resultant data structure CANNOT be changed, it's been in 30 years of legacy code and I have to live with it).

Let's say I'm given N arrays of various sizes. I need to do some complicated math on each element and add that element to another array. The format of my output array essentially starts with a table of contents that contain indices of where each of the input arrays start. It's probably best explained with an example.

Array #1 = a, b, c, d, e
Array #2 = f, g, h, i, j, k
Array #3 = l, m

Output = 4, 9, 15, 17, a, b, c, d, e, f, g, h, i, j, k, l, m

That means, that array #1 starts at index 4 in the output array, array #2 starts at index 9, and so on.

So for my parallel scan, my initial scan is to compute a cumulative sum of the starting index for each of the arrays, then the final scan will have the computation of the starting index, plus a copy of the data over to the new array (with some optional math to be done on each element as it's copied over). In this scenario, the final scan will take more time proportionally to the initial scan.

The reason that I'm worried that the final scan will take longer is that typically a parallel scan doesn't help unless you have more than 2 processors because it has to go through the list twice, so it's possible that TBB is smart enough to know that if only 2 processors are available, that it should not the scan in parallel mode. But in my case the parallel mode will help. I haven't fully debugged my algorithm yet, but this is in the back of my mind bothering me. I wanted to check with TBB experts to see if I really should worry, or if it will all work out in the end anyway.
0 Kudos
2 Replies
Valued Contributor III
TBB will revert to a single scan only if it has a single available thread, and I don't think it assumes equal work on each pass, so it should suit your purpose.
0 Kudos

This sounds like an interesting example of parallel_scan. I'd really like to see the final algorithm.

Idon't expect thework imbalancebetween the first andsecond passes to be a problem.For 1 processor, the current implementation does indeed do only the final scan. For 2 processors, it does the pre_scan pass for approximately the second half of the data, though the "approximately" may vary widely. I.e., on average it requires 1.5 passes with 2 processors.

0 Kudos