Poor parallelization for medium workloads - How does workload impact parallelization ?

Ianir_Ideses · ‎02-07-2011

Hi,
I am currently working on a highly optimized code, designed to run in an HPC environment.
This code computes a sequence of simple image processing operations in pixel neighborhoods for points in an image. Typically, I have an order of 1200-2000 points per image.

This code runs on a 8 core Intel Xeon CPU E5420 @ 2.50GHz. The OS is CentOS 5, 64 bit.
The code is written in C and compiled using the Intel compiler. Multithreading is done by an openMP for-loop (guided) pragma on the points.

The problem I am facing is that I do not get the X8 (not even X7) performance boost I am expecting.
This problem gets worse as I decrease the number of points and is alleviated as I increase them.

For example, for 1200 points I get a relative speedup (compared to a single thread) of X4.3, for 2400 points X6.3, for 3600 points X7.1. So the typical speedup (1200 points) is relatively low.

I am using the latest VTune to analyze this issue, so far I am not seeing any dominant parameters that explain this behaviour. I isolated the serial code and the speedup factors are the same as detailed above for the main parallelized for-loop. This suggests that it is not the serial parts that are holding the runtime back.

I used the article "http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/" to measure tuning ratios.

The ratios I measured look reasonable for both 1200 and 3600 points:

For 1200 point:

CPI = 0.80759
Parallelization_ratio = 0.91599

Modified_data_sharing_ratio = 0.00087244

L2_cache_miss = 324000

Branch_misprediction_ratio = 0.0077442

Bus_utilization_ratio = 0.18217

For 3600 points:

CPI = 0.78496

Parallelization_ratio = 0.99238

Modified_data_sharing_ratio = 0.00085696

L2_cache_miss = 1089000

Branch_misprediction_ratio = 0.0073767

Bus_utilization_ratio = 0.2105

According to the artice above, both sets of ratios are acceptable, however, the speedup is not up to par.
Is there another important ratio or event that may indicate what the probem is ?

Thank you in advance,
Ianir.

Ilnar · ‎02-07-2011

I suppose, there are high parallelization_ratio only under vtune.
In order to get more parallelization_ratio underproduction load your algorithm requires more work for each thread, about 100000 or more instructions per thread.
My suggestions is to make task based parallelization, where 1 task = 1 image in case you have large amount of images.

Ilnar · ‎02-07-2011

could you please show whole code or some pices of code: image declaration (like char image[1200]) and parallelized for cycle including pragma line?

Ianir_Ideses · ‎02-07-2011

Thank you for the quick reply.
I am afraid I can not post the actual code.
I can show the for-loop pragma:

#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)
for(p = 0 ; p < K ; p++)
{

DOSOMETHING;
}

K is the number of points.
P_pt is the point information (x,y, and other data).
start_output_pt is a pointer to the ouput data for each point.

Can you elaborate on why parallelization only works efficiently for 100K+ instructions ?
I made an experiment and unrolled the for-loop to:

#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)
for(p = 0 ; p < K ; p=p+2)
{

DOSOMETHING;
DOSOMETHING;
}

This had no impact on running time, even though each thread now has twice the workload.

Any ideas ?

Thanks,
Ianir.

timintel · ‎02-07-2011

That's excellent parallel speedup for E54xx with schedule(guided). I suppose you set guided because you expect some loop iterations to do significantly more work than others. Possibly, particularly when there is just enough work to use all the cores, you will get better performance without schedule(guided), as setting affinity (e.g. KMP_AFFINITY=compact) works more effectively without (guided).

Gennady_F_Intel · ‎02-07-2011

I'mjustinterested-whatspecific operationswith the images are you doing?

Ilnar · ‎02-07-2011

Could you please get something about point and P_pt declaration? at least sizeof

Ilnar · ‎02-07-2011

did you try other types of schedule? it seems you can get more performance using dynamic or or guieded WITH specified chunk size. it's like your unrolling (for(p = 0 ; p < K ; p=p+2)), but more simpler way
please try:

#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided, X)
for(p = 0 ; p < K ; p++) {DOSOMETHING;}

#pragma omp parallel for shared(P_pt, start_output_pt) schedule(dynamic, X)
for(p = 0 ; p < K ; p++) {DOSOMETHING;}

where X= K/m/num_threads, m = from2to 10;

Ilnar · ‎02-07-2011

about 100K+ instructions:
as i see, you used guided scheduling with default chunk size = 1, that means there are K simultaious tasks for 8 cores. Each task will be sheduled separately in order requested by threads. Scheduling is overhead which degradate your parallelization ratio.

Ianir_Ideses · ‎02-07-2011

Yes:
float * P_pt = (float *) malloc(K*4*sizeof(float));
(K is typically 1200-2000).

The operations I do are basically a form of histograms in a spatial neighborhood.

The guided schedule is indeed due to very different neighborhood sizes for different points, which results in varying runtime. I have tried setting the schedule to dynamic, but have not tried specifying chunk sizes, I will do so now. Thanks.

Ilnar · ‎02-07-2011

lets try to clarify the rule of assigning of iterations in the guided scheduling:
P1 = C*K/(num_threads - 1);
P2 = C*(K-P1)/(num_threads - 1);
P3 = C*(K-P1-P2)/(num_threads - 1);
...
Pn = C*(K-P(n-1)...-P1)/(num_threads - 1);
... while Pn > chunksize
P(n+1) = chunksize
...
P(n+m-1) = chunksize
P(n+m) = rest iters <= chunksize
where C <= 1

in case default chunksize=1
for K=1200, C=0.5 we have about 81 schedules, 14,6 iterations in average
for K=2000, C=0.5 we have about 89 schedules, 22,5 iterations in average -- more work per thread, that's because you have better ratio for K=2000

in case chunksize=20
for K=1200, C=0.5 we have about34 schedules, 35,3 iterations in average
for K=2000, C=0.5 we have about41 schedules,48,8 iterations in average

Ilnar · ‎02-07-2011

What means magic 4 in "float * P_pt = (float *) malloc(K*4*sizeof(float));"?
pay attentionto false sharing issues

Ianir_Ideses · ‎02-07-2011

4 is the number of data elements inside each P_pt point.

I tried the dynamic scheduling with different chunk sizes. What I saw was a decrease of CPU time spent in libiomp5.so, which is good - the scheduling overhead was reduced, however, the runtime for my code has not changed, speedup is not up to spec.

As far as I understand, false sharing happens when writes invalidates the cache line. I am not sure this is the case here, since increasing the number of points (which means more for-loop iterations) led to the expected speedup. Moreover, the VTune params I measured do not seem to indicate to such a problem. Is there a hardware event in VTune that will measure such cases ?

Thanks again, I appreciate the effort.

Ilnar · ‎02-07-2011

>> I tried the dynamic scheduling with different chunk sizes
How about guided scheduling? I'm interested in results))

>>Is there a hardware event in VTune that will measure such cases ?
I don't know.

jimdempseyatthecove · ‎02-07-2011

>>The guided schedule is indeed due to very different neighborhood sizes for different points, which results in varying runtime. I have tried setting the schedule to dynamic, but have not tried specifying chunk sizes, I will do so now. Thanks

Ideally you want to partition the work evenly with as few as thread scheduling interactions occuring. When your neighborhood sizes are relatively large (and of various sizes) then you might want to use a small chunk size (of number ofneighborhoods). When the size of the neighborhood is relatively small, then a larger chunk size would be in order. Therefore, consider partitioning your work in groups by size of neighborhood. Then use varying chunk size inversely proportional to the relative neighborhood size.

large neighborhood sized... chunk = 1
medium neighborhood sized... chunk = n ! n tbd
small neighborhood sized... chunk =m !m tbd

As to if you use 2, 3, ..., x different sized groupings, this would depend on your application.

Jim Dempsey

Ianir_Ideses · ‎02-07-2011

I tried several values for guided chunks, did not see any improvement, but setting it too high did cause the runtime to increase (it also shows in the VTune, as I see threads waiting).

I think I will try to implement Jim Dempsey's suggestion, it looks like it may help with reducing scheduling overhead.

It still doesn't solve the main problem - my main functions seem to work better when the workload is higher (say, number of iterations, or neighborhood size inside the function), but it is a start.

Is there a recommended book that deals with such issues ?

Thanks.

Ilnar · ‎02-08-2011

>> Is there a recommended book that deals with such issues ?

sorry, i have no any idea while i know nothing about calculations inside iterations ((

Ianir_Ideses · ‎02-08-2011

OK.
Just one more thing, you mentioned before that parallelization will work well if I have more than 100,000 instructions. Do you know what is the reason for degradation for a lower number of instructions ?

Ilnar · ‎02-08-2011

answer: scheduling overhead.
TBB documentation suggests at least 100K clock cycle. I just "a little bit" increased to instructions.
see "Tutorial", search for"Packaging Overhead Versus Grainsize"
http://threadingbuildingblocks.org/documentation.php

TimP · ‎02-08-2011

Chapman, Jost, van der Pas "Using OpenMP" is among the few good textbooks on OpenMP. They have a brief discussion of typical overheads in OpenMP, not including some important ones, such as the influence of affinity and cache issues.
You could look at the asm output file from your compiler and get an idea about the function calls implied by your OpenMP usage. Details under Intel OpenMP are intentionally hidden; with gnu compilers, you should be able to capture the full source code with expansion of OpenMP directives.

jimdempseyatthecove · ‎02-08-2011

>>This code runs on a 8 core Intel Xeon CPU E5420 @ 2.50GHz. The OS is CentOS 5, 64 bit

The E5420 is a 4 core processor (no HT). Does this mean you have two such processors?

------------

>>This code computes a sequence of simple image processing operations in pixel neighborhoods for points in an image. Typically, I have an order of 1200-2000 points per image.

Does this mean your code is somewhat along the line of

for(image=0; image .lt. nImages; ++image)
{
ReadImage(image);
#pragma omp parallel for
for(point=0; point .lt. nPoints; ++point)
processPoint(point);
}

IOW the app is a high frequency of:

serial (with I/O)
parallel
serial (with I/O)
parallel
...

If so, then I suggest you rewrite the code to use a parallel pipeline technique.
The QuickThread threading toolkit (which I produce by the way) excells at parallel pipelines containing a mixture ofI/O and compute class pipes. We can discuss parallel further if you think this would be benificial to your application.

Jim