Threaded Programs with Built-in Load-Imbalance

I am looking for cases where there is built-in load imbalance between threads (unsolvable by such techniques as dynamic allocation).
Can anyone point me to programs/benchmarks that have this type of behavior?
(Functional or data-pipelined decompositions could provide examples. In these decompositions, each thread does different work in certain parallel regions, so one thread typically finishes its work for a region before the others do.)

Hi Avsha,
I don't know of any specific programs or benchmarks that contain the kind of load imbalance that you need. You might try SPEC OMP, but I don't know whether any of the applications in that suite have a built-in load imbalance.
It shouldn't be too difficult to write simple benchmarks that do what you need. As you already mentioned, in a functional decomposition, the following parallel region contains an inherent load imbalance:
#pragma omp parallel sections
{
    #pragma omp section
    small_func ();
    #pragma omp section
    big_func ();
}
The following parallel loop also defeats dynamic scheduling, because the heavy work is tied to a single iteration:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    small_func ();
    if (i == (N - 1)) big_func ();
}
What are you trying to demonstrate?
Best regards,

Avsha -
Anything that uses OpenMP sections would be a good candidate for what you're looking for. For example, assume there are several computations (function calls) that have some dependencies on order of execution, but the entire sequence can be divided into two or more independent segments. Then the set of independent segments can be parallelized with a sections pragma, but the workload between segments may be unbalanced.
I've recently looked at a game code that had this feature. Six function calls are used to update the positions of objects in motion, how those new positions affect the environment, and what is drawn in the next displayed frame. The first four functions could be done in one thread, while the sixth function relied on the results of the fifth, so the last two were put into a section together. Performance analysis with Thread Profiler showed that the time spent in the final two functions was almost 30X longer than the total time for the first four calls. There is no easy way to redistribute the work here for a better load balance.