How can I get the best grainsize?

kyaw0010 · 02-26-2008

Hi,

I'd like to ask some questions about grainsize. I'm trying to apply "parallel_for" in my simulation software. The problem is that I cannot use a fixed grainsize, because the size of the iteration space (i.e., the parallel_for loop) is not fixed. It depends on the model (i.e., the antenna model used in the simulation), so I tried "auto_partitioner", but the result is not what I expected. For example, if I use a fixed grainsize for one model (the best grainsize for it, I think), I get a better result (I mean a better speedup in simulation time) than with "auto_partitioner".

I've tried to read many articles about grainsize but still don't understand how to choose the best grainsize for my application. Since we can't predict the number of iterations in the loop, or the number of cores in the system, I have no idea how to get the best grainsize for all models and all operating systems. Can anyone help with my problem?

Thanks a lot!

Alexey-Kukanov · 02-27-2008

...The problem is that I cannot use a fixed grainsize, because the size of the iteration space (i.e., the parallel_for loop) is not fixed.

Grain size is largely independent of the iteration space size, so you might as well use the same grain size value if it serves well in, say, more than 90% of real use cases. Remember that there is usually a whole range of grain sizes that work reasonably well.
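To illustrate why one grain size can serve loops of very different lengths, here is a minimal stand-alone sketch. It is not TBB code; it only models, under my assumption about how tbb::blocked_range is split with a simple_partitioner, a range being halved while its size exceeds the grain size. The names split_range and chunk_sizes_for are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-alone model of recursive range splitting; this is NOT TBB itself.
// Assumption: a range is split in half while its size exceeds grainsize,
// which is the divisibility rule documented for tbb::blocked_range.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<std::size_t>& chunk_sizes) {
    assert(grainsize >= 1);
    const std::size_t size = end - begin;
    if (size <= grainsize) {
        // No longer divisible: this sub-range would be run as one chunk.
        if (size > 0) chunk_sizes.push_back(size);
        return;
    }
    const std::size_t mid = begin + size / 2;
    split_range(begin, mid, grainsize, chunk_sizes);
    split_range(mid, end, grainsize, chunk_sizes);
}

// Hypothetical helper: sizes of all chunks produced for a loop of n iterations.
std::vector<std::size_t> chunk_sizes_for(std::size_t n, std::size_t grainsize) {
    std::vector<std::size_t> sizes;
    split_range(0, n, grainsize, sizes);
    return sizes;
}
```

In this model, chunk_sizes_for(1000, 100) and chunk_sizes_for(123457, 100) both give chunks of roughly 50 to 100 iterations; only the number of chunks changes with the loop length. So the grain size mainly fixes the amount of work per chunk, not how the loop length is handled.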

...I tried "auto_partitioner", but the result is not what I expected. For example, if I use a fixed grainsize for one model (the best grainsize for it, I think), I get a better result (I mean a better speedup in simulation time) than with "auto_partitioner".

The auto_partitioner was not designed to hit the best time, but it should be reasonably good in most cases. How much did auto_partitioner lose to the best manually selected grain size, and what were your expectations? If you have a test case where the auto_partitioner loses the most, could you provide it?
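One possible way to put a number on that loss is a small timing sweep. The sketch below is only an assumption about how you might measure it; GrainTiming, sweep_grainsizes and relative_loss are hypothetical names, and the callable you pass in is expected to run your own loop with the given grain size.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical record: one candidate grain size and its wall-clock time.
struct GrainTiming {
    std::size_t grainsize;
    double seconds;
};

// Runs run_with_grain(g) once per candidate and records the elapsed time.
// With TBB, the callable would typically wrap something like
//   tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n, g), body,
//                     tbb::simple_partitioner());
// where n and body come from your application (not shown here).
std::vector<GrainTiming> sweep_grainsizes(
        const std::function<void(std::size_t)>& run_with_grain,
        const std::vector<std::size_t>& candidates) {
    std::vector<GrainTiming> results;
    for (std::size_t g : candidates) {
        const auto start = std::chrono::steady_clock::now();
        run_with_grain(g);
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        results.push_back({g, elapsed.count()});
    }
    return results;
}

// Relative loss of a measured time (e.g. the auto_partitioner run)
// against the best time in the sweep: 0.25 means 25% slower.
double relative_loss(double measured_seconds,
                     const std::vector<GrainTiming>& sweep) {
    assert(!sweep.empty());
    double best = sweep.front().seconds;
    for (const GrainTiming& t : sweep) {
        if (t.seconds < best) best = t.seconds;
    }
    assert(best > 0.0);  // timings too short to measure are not usable
    return (measured_seconds - best) / best;
}
```

Timing the auto_partitioner run the same way and passing it to relative_loss gives the figure asked about. A single run per candidate is noisy, so repeating each measurement a few times and taking the median would be a sensible refinement.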

I've tried to read many articles about grainsize but still don't understand how to choose the best grainsize for my application. Since we can't predict the number of iterations in the loop, or the number of cores in the system, I have no idea how to get the best grainsize for all models and all operating systems.

That is exactly what the auto_partitioner was designed for :) Flexibility usually comes at the cost of some performance. The question is how big the loss is and whether you can tolerate it.