Poor parallelization for medium workloads - How does workload impact parallelization?

Ianir_Ideses · 02-07-2011

Hi,
I am currently working on highly optimized code designed to run in an HPC environment.
The code computes a sequence of simple image processing operations over pixel neighborhoods around selected points in an image. Typically, I have on the order of 1200-2000 points per image.

The code runs on an 8-core Intel Xeon CPU E5420 @ 2.50GHz. The OS is CentOS 5, 64-bit.
The code is written in C and compiled with the Intel compiler. Multithreading is done with an OpenMP for-loop pragma (guided schedule) over the points.
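
For readers who want something concrete, the loop structure described above might look roughly like the sketch below. This is not the actual code from the post: the `Point` type, the `neighborhood_mean` kernel (a 3x3 mean), and the `process_points` signature are all hypothetical stand-ins. Only the guided OpenMP for-loop over the points is taken from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical point type: pixel coordinates of one point of interest. */
typedef struct { int x, y; } Point;

/* Hypothetical per-point kernel: mean of the 3x3 neighborhood around (x, y).
   It stands in for the real image processing operations, which the post
   does not show. Pixels that fall outside the image are skipped. */
static float neighborhood_mean(const float *img, int width, int height,
                               int x, int y)
{
    float sum = 0.0f;
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int px = x + dx, py = y + dy;
            if (px >= 0 && px < width && py >= 0 && py < height) {
                sum += img[(size_t)py * width + px];
                ++count;
            }
        }
    }
    return count > 0 ? sum / (float)count : 0.0f;
}

/* One result per point. Iterations are independent, so the loop can be
   split across threads; schedule(guided) hands out contiguous chunks of
   iterations whose size shrinks as the loop progresses. */
void process_points(const float *img, int width, int height,
                    const Point *pts, int n, float *out)
{
    assert(img && pts && out && width > 0 && height > 0 && n >= 0);
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i) {
        out[i] = neighborhood_mean(img, width, height, pts[i].x, pts[i].y);
    }
}
```

The pragma only takes effect when OpenMP is enabled at compile time (for example `-fopenmp` with GCC); without it the loop simply runs serially and produces the same results.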

The problem I am facing is that I do not get the X8 (not even X7) performance boost I am expecting.
This problem gets worse as I decrease the number of points and is alleviated as I increase them.

For example, for 1200 points I get a relative speedup (compared to a single thread) of X4.3, for 2400 points X6.3, for 3600 points X7.1. So the typical speedup (1200 points) is relatively low.
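
To put these speedups in perspective, here is a small sketch that turns each one into a parallel efficiency (speedup divided by core count) and into the Karp-Flatt metric, e = (1/S - 1/p) / (1 - 1/p), which estimates the fraction of the run that behaves as if it were serial. The function names are made up for illustration, and the metric is not something measured in the post. It lumps together genuinely serial time and parallel overhead such as load imbalance or scheduling cost.

```c
#include <assert.h>

/* Parallel efficiency: achieved speedup divided by the number of cores. */
double parallel_efficiency(double speedup, int cores)
{
    assert(speedup > 0.0 && cores > 0);
    return speedup / (double)cores;
}

/* Karp-Flatt metric: experimentally determined serial fraction,
   e = (1/S - 1/p) / (1 - 1/p), defined for p > 1. */
double karp_flatt(double speedup, int cores)
{
    assert(speedup > 0.0 && cores > 1);
    double inv_p = 1.0 / (double)cores;
    return (1.0 / speedup - inv_p) / (1.0 - inv_p);
}
```

With 8 cores, the speedups above give efficiencies of about 0.54, 0.79 and 0.89, and Karp-Flatt values of about 0.12, 0.039 and 0.018 for 1200, 2400 and 3600 points respectively. A serial-like fraction that shrinks as the number of points grows would be consistent with a roughly fixed overhead being amortized over more work, though these numbers alone cannot say where that overhead comes from.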

I am using the latest VTune to analyze this issue, but so far I am not seeing any dominant parameters that explain this behaviour. I isolated the main parallelized for-loop from the serial code, and its speedup factors are the same as those detailed above. This suggests that it is not the serial parts that are holding the runtime back.

I used the article "http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/" to measure tuning ratios.

The ratios I measured look reasonable for both 1200 and 3600 points:

For 1200 points:

CPI = 0.80759
Parallelization_ratio = 0.91599
Modified_data_sharing_ratio = 0.00087244
L2_cache_miss = 324000
Branch_misprediction_ratio = 0.0077442
Bus_utilization_ratio = 0.18217

For 3600 points:

CPI = 0.78496
Parallelization_ratio = 0.99238
Modified_data_sharing_ratio = 0.00085696
L2_cache_miss = 1089000
Branch_misprediction_ratio = 0.0073767
Bus_utilization_ratio = 0.2105

According to the article above, both sets of ratios are acceptable; however, the speedup is not up to par.
Is there another important ratio or event that may indicate what the problem is?

Thank you in advance,
Ianir.

Ianir_Ideses · 02-10-2011

Thanks, that was my guess, but I wanted to make sure. Thanks for the link.

I investigated this issue very thoroughly over the last few days and managed to improve results by randomizing the data order (the order makes no difference to my application). It appears there was some sort of correlation in what I had assumed to be random. This reduced the overhead, and I now get better performance.
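
The post does not say how the order was randomized. One common way, shown here purely as an illustrative sketch, is to build an index array and apply a Fisher-Yates shuffle, then visit the points through that array. The names `uniform_below` and `make_shuffled_order` are hypothetical. A possible reason this helps is that a guided schedule hands out contiguous ranges of iterations, so if expensive points are clustered in the original order, some chunks cost far more than others. That explanation is an assumption, not something confirmed in the thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform integer in [0, n) for 1 <= n <= RAND_MAX, using rejection
   sampling so the result is not skewed by the modulo operation. */
static size_t uniform_below(size_t n)
{
    assert(n >= 1 && n <= (size_t)RAND_MAX);
    unsigned long range = (unsigned long)RAND_MAX + 1UL;
    unsigned long limit = range - (range % n);
    unsigned long r;
    do {
        r = (unsigned long)rand();
    } while (r >= limit);
    return (size_t)(r % n);
}

/* Fill order[0..n-1] with a random permutation of 0..n-1
   (Fisher-Yates shuffle). The seed makes runs reproducible. */
void make_shuffled_order(size_t *order, size_t n, unsigned seed)
{
    assert(order != NULL || n == 0);
    srand(seed);
    for (size_t i = 0; i < n; ++i) {
        order[i] = i;
    }
    for (size_t i = n; i > 1; --i) {
        size_t j = uniform_below(i);   /* pick from the unshuffled prefix */
        size_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}
```

The parallel loop would then process point `order[i]` instead of point `i`, so every point is still handled exactly once, just in a different sequence.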

I would like to thank everyone that contributed to this thread - I learned a lot from it.

So thanks again,
Ianir.