Hi,
I am currently working on highly optimized code designed to run in an HPC environment.
This code computes a sequence of simple image processing operations in pixel neighborhoods for points in an image. Typically, I have on the order of 1200-2000 points per image.
This code runs on an 8-core Intel Xeon CPU E5420 @ 2.50GHz. The OS is CentOS 5, 64-bit.
The code is written in C and compiled using the Intel compiler. Multithreading is done with an OpenMP for-loop pragma (guided schedule) over the points.
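For readers following along, the setup described above might look roughly like the sketch below. The post does not show the actual code, so `point_t`, `process_point`, and `process_points` are hypothetical names, and the window sum is only a stand-in for the real neighborhood operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-point record; the real data layout is not given in the post. */
typedef struct { int x, y; } point_t;

/* Stand-in for the neighborhood operations: sums the pixels in a
 * (2r+1) x (2r+1) window around p, skipping pixels outside the image. */
static long process_point(const unsigned char *img, int w, int h,
                          point_t p, int r)
{
    long sum = 0;
    for (int dy = -r; dy <= r; ++dy) {
        for (int dx = -r; dx <= r; ++dx) {
            int yy = p.y + dy, xx = p.x + dx;
            if (yy < 0 || yy >= h || xx < 0 || xx >= w)
                continue;
            sum += img[(size_t)yy * (size_t)w + (size_t)xx];
        }
    }
    return sum;
}

/* One loop iteration per point, distributed with a guided schedule.
 * Each iteration writes only out[i], so no synchronization is needed. */
void process_points(const unsigned char *img, int w, int h,
                    const point_t *pts, int n, int r, long *out)
{
    assert(img != NULL && pts != NULL && out != NULL);
#ifdef _OPENMP
#pragma omp parallel for schedule(guided)
#endif
    for (int i = 0; i < n; ++i)
        out[i] = process_point(img, w, h, pts[i], r);
}
```

With a guided schedule the chunks handed to each thread start large and shrink as the loop progresses, so with only about 1200 iterations the first few chunks already account for a large share of the work.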
The problem I am facing is that I do not get the X8 (not even X7) performance boost I am expecting.
This problem gets worse as I decrease the number of points and is alleviated as I increase them.
For example, for 1200 points I get a relative speedup (compared to a single thread) of X4.3, for 2400 points X6.3, for 3600 points X7.1. So the typical speedup (1200 points) is relatively low.
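One rough way to read these numbers is to convert each speedup into a parallel efficiency (speedup divided by thread count) and into the Karp-Flatt metric, e = (1/S - 1/p) / (1 - 1/p), which estimates the fraction of the run that behaves as if it were serial. The helper names below are my own, and p = 8 threads is assumed from the core count above. For the reported speedups this gives efficiencies of about 0.54, 0.79, and 0.89, and Karp-Flatt values of about 0.12, 0.039, and 0.018.

```c
#include <assert.h>

/* Parallel efficiency: measured speedup divided by the thread count. */
double parallel_efficiency(double speedup, int threads)
{
    assert(speedup > 0.0 && threads > 0);
    return speedup / threads;
}

/* Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p), defined for p > 1.
 * It lumps together true serial code, scheduling overhead, and load
 * imbalance into one "effectively serial" fraction. */
double karp_flatt(double speedup, int threads)
{
    assert(speedup > 0.0 && threads > 1);
    double inv_p = 1.0 / threads;
    return (1.0 / speedup - inv_p) / (1.0 - inv_p);
}
```

The fact that this fraction shrinks as the number of points grows would be consistent with a cost that does not scale with the point count (for example load imbalance or fixed per-run overhead), though it does not by itself identify the cause.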
I am using the latest VTune to analyze this issue, but so far I am not seeing any dominant parameters that explain this behaviour. I isolated the main parallelized for-loop from the serial code, and its speedup factors are the same as those detailed above. This suggests that it is not the serial parts that are holding the runtime back.
I used the article "http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/" to measure tuning ratios.
The ratios I measured look reasonable for both 1200 and 3600 points:
For 1200 points:
CPI = 0.80759
Parallelization_ratio = 0.91599
Modified_data_sharing_ratio = 0.00087244
L2_cache_miss = 324000
Branch_misprediction_ratio = 0.0077442
Bus_utilization_ratio = 0.18217
For 3600 points:
CPI = 0.78496
Parallelization_ratio = 0.99238
Modified_data_sharing_ratio = 0.00085696
L2_cache_miss = 1089000
Branch_misprediction_ratio = 0.0073767
Bus_utilization_ratio = 0.2105
According to the article above, both sets of ratios are acceptable; however, the speedup is not up to par.
Is there another important ratio or event that may indicate what the problem is?
Thank you in advance,
Ianir.
21 Replies
Thanks, that was my guess, but I wanted to make sure. Thanks for the link.
I investigated this issue very thoroughly over the last few days and managed to improve the results by randomizing the data order (the order makes no difference to my application). It appears there was some sort of correlation in what I had assumed to be random. This reduced the overhead, and now I get better performance.
I would like to thank everyone that contributed to this thread - I learned a lot from it.
So thanks again,
Ianir.
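The post does not say how the data order was randomized. One common way to do it, shown here only as a sketch, is a Fisher-Yates shuffle over an index array that the parallel loop then uses to visit the points. The function names and the small xorshift generator are assumptions chosen to keep the example self-contained and reproducible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal xorshift32 generator; the state must never be zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills idx[0..n-1] with a pseudo-random permutation of 0..n-1
 * using a Fisher-Yates shuffle. */
void shuffle_indices(int *idx, size_t n, uint32_t seed)
{
    assert(idx != NULL || n == 0);
    uint32_t state = seed ? seed : 1u;   /* avoid the all-zero state */

    for (size_t i = 0; i < n; ++i)
        idx[i] = (int)i;

    for (size_t i = n; i > 1; --i) {
        /* Pick j in [0, i); the slight modulo bias is harmless here. */
        size_t j = xorshift32(&state) % i;
        int tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}
```

The loop over the points would then process `pts[idx[i]]` instead of `pts[i]`, so that any correlation between a point's position in the array and its cost is spread across threads.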