How to compare parallel vs. serial implementations?

Petros · ‎05-05-2011

Hi,

I have a serial program for time-domain simulation which I'm parallelizing incrementally. I want to compare the performance of the serial and the parallel implementations. The system I use for the simulations is shared, so when I run the serial program several times on the same data (through VTune), I get different elapsed times. The same happens with the parallel implementation.

How can I compare the two implementations? Should I compare the CPU time (which seems to be the same on every run)? If so, should the CPU time of the parallel program be divided by the value from the CPU usage diagram (which is around 1.55) to be fair?

Any pointers to a scientific way of comparing the two (or even some keywords) would be much appreciated!

Thanks in advance,
Petros

TimP · ‎05-08-2011

If you don't have control over the shared workloads, you can't be "scientific" about this. If you are in fact incurring the same total CPU time while gaining effective parallelism, you are doing very well.

jimdempseyatthecove · ‎05-08-2011

Petros,

When the system is shared with other users/applications, some knowledge about the number of cores and the usage patterns may be helpful in designing your performance testing scenario.

One method might be to run your performance testing at 3:00 AM when (if) the system is mostly idle (assuming no nightly backup is running).

Another method, assuming a 2P system (2x 4 cores w/HT), is to affinity-pin your test application to the second processor (or to half the threads), on the assumption that your system will have periods where other users' applications occupy no more than half of its processing capacity. Then make 3-5 test runs and assume the lowest number reflects a run time representative of little or no load.
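The pin-and-take-the-minimum procedure described above could be scripted roughly as follows. This is only a sketch under some assumptions: it relies on `os.sched_setaffinity`, which Python exposes on Linux only, the helper name `best_of` is made up for illustration, and which logical CPU IDs belong to the second processor is system-dependent, so the caller has to supply them.

```python
import os
import subprocess
import time


def best_of(cmd, runs=5, cpus=None):
    """Run `cmd` `runs` times and return the smallest wall-clock time in seconds.

    `cpus` is an optional set of logical CPU IDs to pin to (Linux only).
    Which IDs map to which physical processor is an assumption the caller
    must check for their own machine.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")

    original = None
    if cpus is not None:
        # Pin this process; child processes inherit the affinity mask.
        original = os.sched_getaffinity(0)
        os.sched_setaffinity(0, cpus)
    try:
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            subprocess.run(cmd, check=True)
            timings.append(time.perf_counter() - start)
    finally:
        if original is not None:
            # Restore the affinity mask we started with.
            os.sched_setaffinity(0, original)

    # The fastest run is taken as the one least disturbed by other load.
    return min(timings)
```

A call such as `best_of(["./my_simulation", "input.dat"], runs=5, cpus={4, 5, 6, 7})` would then time five pinned runs and report the fastest; the program name, input file, and CPU IDs here are placeholders, not values from this thread.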

Note that tuning done with a reduced number of cores applies to that number of (pinned) cores, not to the full complement of cores; however, you will generally find that what works well for a few cores tends to work well for more cores.

After you get the reduced number of cores working to your satisfaction, experiment with using all cores, as well as with various numbers of cores and at various times of the day. What you are looking for is a tuning configuration that is not only beneficial to your application but also not too detrimental to the other applications. Note that if every program is written to use all of the resources, then running several such programs at once will make the system thrash more frequently (causing a higher degree of cache evictions), and this would be detrimental to your program.

If you have VTune, try to tune your program to reduce the number of cache evictions (at each cache level).

Jim Dempsey

TimP · ‎05-08-2011

I believe Jim is right to point out the increased importance of cache locality in your program when running on a shared system.

Christopher_K_ · ‎05-10-2011

"How can I compare the two implementations? Should I compare the cpu time (which seems to be the same at every run)? If yes, the cpu time of the parallel program, should be divided by the value from the cpu usage diagram to be fair? (which is around 1.55)"

Yes... thinking about multiprocessor theory and substituting software variables for hardware variables, I do believe this is how you can get your efficiency rating. I may be wrong, though: that comes from a formula for multiprocessor efficiency, and I suspect the number of threads should be in the formula somewhere, but I am no mathematician. Best wishes.

- CCK
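For reference, the usual textbook definitions are speedup = serial time / parallel time and efficiency = speedup / number of threads, which is where the thread count enters. The sketch below applies them to the quantities discussed in this thread. It assumes that the CPU-usage value (about 1.55) is the average number of cores kept busy by the process, so that CPU time divided by it approximates the unloaded elapsed time; that reading of the VTune diagram is an assumption, and the function names, the 100 s timings, and the thread count of 2 are made up for illustration.

```python
def estimated_elapsed(cpu_time, avg_cpu_usage):
    """Approximate elapsed time: CPU seconds spread over the average busy cores."""
    return cpu_time / avg_cpu_usage


def speedup(t_serial, t_parallel):
    """Classic speedup: how many times faster the parallel run finishes."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_threads):
    """Speedup per thread; 1.0 would mean perfect scaling."""
    return speedup(t_serial, t_parallel) / n_threads


# Illustrative numbers only: both versions use 100 s of CPU time,
# and the parallel version averages 1.55 busy cores on 2 threads.
t_serial = 100.0
t_parallel = estimated_elapsed(100.0, 1.55)
s = speedup(t_serial, t_parallel)
e = efficiency(t_serial, t_parallel, n_threads=2)
print(f"speedup ~ {s:.2f}, efficiency ~ {e:.3f}")
```

With equal CPU time in both versions, the estimated speedup simply equals the average CPU usage, which matches TimP's remark that incurring the same total CPU time while gaining parallelism is a good result.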