- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I am currently working on a highly optimized code, designed to run in an HPC environment.

This code computes a sequence of simple image processing operations in pixel neighborhoods for points in an image. Typically, I have an order of 1200-2000 points per image.

This code runs on a 8 core Intel Xeon CPU E5420 @ 2.50GHz. The OS is CentOS 5, 64 bit.

The code is written in C and compiled using the Intel compiler. Multithreading is done by an openMP for-loop (guided) pragma on the points.

The problem I am facing is that I do not get the X8 (not even X7) performance boost I am expecting.

This problem gets worse as I decrease the number of points and is alleviated as I increase them.

For example, for 1200 points I get a relative speedup (compared to a single thread) of X4.3, for 2400 points X6.3, for 3600 points X7.1. So the typical speedup (1200 points) is relatively low.

I am using the latest VTune to analyze this issue, so far I am not seeing any dominant parameters that explain this behaviour. I isolated the serial code and the speedup factors are the same as detailed above for the main parallelized for-loop. This suggests that it is not the serial parts that are holding the runtime back.

I used the article "http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/" to measure tuning ratios.

The ratios I measured look reasonable for both 1200 and 3600 points:

For 1200 point:

CPI = 0.80759

Parallelization_ratio = 0.91599

Modified_data_sharing_ratio = 0.00087244

L2_cache_miss = 324000

Branch_misprediction_ratio = 0.0077442

Bus_utilization_ratio = 0.18217For 3600 points:

CPI = 0.78496

Parallelization_ratio = 0.99238

Modified_data_sharing_ratio = 0.00085696

L2_cache_miss = 1089000

Branch_misprediction_ratio = 0.0073767

Bus_utilization_ratio = 0.2105According to the artice above, both sets of ratios are acceptable, however, the speedup is not up to par.

Is there another important ratio or event that may indicate what the probem is ?

Thank you in advance,

Ianir.

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

In order to get more parallelization_ratio underproduction load your algorithm requires more work for each thread, about 100000 or more instructions per thread.

My suggestions is to make task based parallelization, where 1 task = 1 image in case you have large amount of images.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I am afraid I can not post the actual code.

I can show the for-loop pragma:

#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)

for(p = 0 ; p < K ; p++)

{

DOSOMETHING;

}

K is the number of points.

P_pt is the point information (x,y, and other data).

start_output_pt is a pointer to the ouput data for each point.

Can you elaborate on why parallelization only works efficiently for 100K+ instructions ?

I made an experiment and unrolled the for-loop to:

#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)

for(p = 0 ; p < K ; p=p+2)

{

DOSOMETHING;

DOSOMETHING;

}

This had no impact on running time, even though each thread now has twice the workload.

Any ideas ?

Thanks,

Ianir.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

please try:

#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided, X)

for(p = 0 ; p < K ; p++) {DOSOMETHING;}

#pragma omp parallel for shared(P_pt, start_output_pt) schedule(dynamic, X)

for(p = 0 ; p < K ; p++) {DOSOMETHING;}

where X= K/m/num_threads, m = from2to 10;

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

as i see, you used guided scheduling with default chunk size = 1, that means there are K simultaious tasks for 8 cores. Each task will be sheduled separately in order requested by threads. Scheduling is overhead which degradate your parallelization ratio.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

float * P_pt = (float *) malloc(K*4*sizeof(float));

(K is typically 1200-2000).

The operations I do are basically a form of histograms in a spatial neighborhood.

The guided schedule is indeed due to very different neighborhood sizes for different points, which results in varying runtime. I have tried setting the schedule to dynamic, but have not tried specifying chunk sizes, I will do so now. Thanks.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

lets try to clarify the rule of assigning of iterations in the guided scheduling:

P1 = C*K/(num_threads - 1);

P2 = C*(K-P1)/(num_threads - 1);

P3 = C*(K-P1-P2)/(num_threads - 1);

...

Pn = C*(K-P(n-1)...-P1)/(num_threads - 1);

... while Pn > chunksize

P(n+1) = chunksize

...

P(n+m-1) = chunksize

P(n+m) = rest iters <= chunksize

where C <= 1

in case default chunksize=1

for K=1200, C=0.5 we have about 81 schedules, 14,6 iterations in average

for K=2000, C=0.5 we have about 89 schedules, 22,5 iterations in average -- more work per thread, that's because you have better ratio for K=2000

in case chunksize=20

for K=1200, C=0.5 we have about34 schedules, 35,3 iterations in average

for K=2000, C=0.5 we have about41 schedules,48,8 iterations in average

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

pay attentionto false sharing issues

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I tried the dynamic scheduling with different chunk sizes. What I saw was a decrease of CPU time spent in libiomp5.so, which is good - the scheduling overhead was reduced, however, the runtime for my code has not changed, speedup is not up to spec.

As far as I understand, false sharing happens when writes invalidates the cache line. I am not sure this is the case here, since increasing the number of points (which means more for-loop iterations) led to the expected speedup. Moreover, the VTune params I measured do not seem to indicate to such a problem. Is there a hardware event in VTune that will measure such cases ?

Thanks again, I appreciate the effort.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

How about guided scheduling? I'm interested in results))

>>Is there a hardware event in VTune that will measure such cases ?

I don't know.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Ideally you want to partition the work evenly with as few as thread scheduling interactions occuring. When your neighborhood sizes are relatively large (and of various sizes) then you might want to use a small chunk size (of number ofneighborhoods). When the size of the neighborhood is relatively small, then a larger chunk size would be in order. Therefore, consider partitioning your work in groups by size of neighborhood. Then use varying chunk size inversely proportional to the relative neighborhood size.

large neighborhood sized... chunk = 1

medium neighborhood sized... chunk = n ! n tbd

small neighborhood sized... chunk =m !m tbd

As to if you use 2, 3, ..., x different sized groupings, this would depend on your application.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I think I will try to implement Jim Dempsey's suggestion, it looks like it may help with reducing scheduling overhead.

It still doesn't solve the main problem - my main functions seem to work better when the workload is higher (say, number of iterations, or neighborhood size inside the function), but it is a start.

Is there a recommended book that deals with such issues ?

Thanks.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

sorry, i have no any idea while i know nothing about calculations inside iterations ((

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Just one more thing, you mentioned before that parallelization will work well if I have more than 100,000 instructions. Do you know what is the reason for degradation for a lower number of instructions ?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

TBB documentation suggests at least 100K clock cycle. I just "a little bit" increased to instructions.

see "Tutorial", search for"

**Packaging Overhead Versus Grainsize**"

http://threadingbuildingblocks.org/documentation.php

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

You could look at the asm output file from your compiler and get an idea about the function calls implied by your OpenMP usage. Details under Intel OpenMP are intentionally hidden; with gnu compilers, you should be able to capture the full source code with expansion of OpenMP directives.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

The E5420 is a 4 core processor (no HT). Does this mean you have two such processors?

------------

>>This code computes a sequence of simple image processing operations in pixel neighborhoods for points in an image. Typically, I have an order of 1200-2000 points per image.

Does this mean your code is somewhat along the line of

for(image=0; image .lt. nImages; ++image)

{

ReadImage(image);

#pragma omp parallel for

for(point=0; point .lt. nPoints; ++point)

processPoint(point);

}

IOW the app is a high frequency of:

serial (with I/O)

parallel

serial (with I/O)

parallel

...

If so, then I suggest you rewrite the code to use a parallel pipeline technique.

The QuickThread threading toolkit (which I produce by the way) excells at parallel pipelines containing a mixture ofI/O and compute class pipes. We can discuss parallel further if you think this would be benificial to your application.

Jim

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page