I am currently working on highly optimized code designed to run in an HPC environment.
The code computes a sequence of simple image processing operations on pixel neighborhoods around points in an image. Typically, I have on the order of 1200-2000 points per image.
The code runs on an 8-core Intel Xeon CPU E5420 @ 2.50 GHz. The OS is CentOS 5, 64-bit.
The code is written in C and compiled with the Intel compiler. Multithreading is done with an OpenMP parallel for pragma (guided schedule) over the points.
The problem I am facing is that I do not get the X8 (not even X7) speedup I am expecting.
This problem gets worse as I decrease the number of points and is alleviated as I increase them.
For example, for 1200 points I get a relative speedup (compared to a single thread) of X4.3, for 2400 points X6.3, for 3600 points X7.1. So the typical speedup (1200 points) is relatively low.
I am using the latest VTune to analyze this issue; so far I am not seeing any dominant metric that explains this behaviour. I isolated the serial code, and the speedup factors for the main parallelized for-loop are the same as detailed above. This suggests that it is not the serial parts that are holding the runtime back.
I used the article "http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/" to measure tuning ratios.
The ratios I measured look reasonable for both 1200 and 3600 points:
For 1200 points:
CPI = 0.80759
Parallelization_ratio = 0.91599
Modified_data_sharing_ratio = 0.00087244
L2_cache_miss = 324000
Branch_misprediction_ratio = 0.0077442
Bus_utilization_ratio = 0.18217

For 3600 points:
CPI = 0.78496
Parallelization_ratio = 0.99238
Modified_data_sharing_ratio = 0.00085696
L2_cache_miss = 1089000
Branch_misprediction_ratio = 0.0073767
Bus_utilization_ratio = 0.2105

According to the article above, both sets of ratios are acceptable; however, the speedup is not up to par.
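For scale, my back-of-envelope Amdahl's-law bound (assuming the parallelization ratio maps directly onto the parallel fraction P): maximum speedup = 1/((1 - P) + P/8), which gives about 1/(0.084 + 0.916/8) = 5.0 for 1200 points and about 1/(0.0076 + 0.9924/8) = 7.6 for 3600 points.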
Is there another important ratio or event that may indicate what the problem is?
Thank you in advance,
Ianir.
------------
To get a higher parallelization ratio under production load, your algorithm needs more work per thread, about 100,000 or more instructions per thread.
My suggestion is to use task-based parallelization, where 1 task = 1 image, in case you have a large number of images.
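A minimal sketch of the one-task-per-image idea, assuming OpenMP 3.0 tasking is available; image_t, load_image and process_points are placeholders, not your actual functions:

    #include <omp.h>

    typedef struct image image_t;      /* hypothetical image type   */
    image_t *load_image(int i);        /* hypothetical image loader */
    void process_points(image_t *im);  /* the per-point kernel      */

    void process_all(int n_images)
    {
        #pragma omp parallel
        #pragma omp single
        {
            int i;
            for (i = 0; i < n_images; i++) {
                #pragma omp task firstprivate(i)
                {
                    image_t *im = load_image(i);
                    process_points(im);
                }
            }
        } /* implicit barrier: all tasks finish before the region ends */
    }

Each task is a whole image's worth of work, so the per-task instruction count easily clears the threshold above.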
------------
I am afraid I cannot post the actual code.
I can show the for-loop pragma:
#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)
for (p = 0; p < K; p++)
{
    DOSOMETHING;
}
K is the number of points.
P_pt is the point information (x,y, and other data).
start_output_pt is a pointer to the output data for each point.
Can you elaborate on why parallelization only works efficiently for 100K+ instructions?
I made an experiment and unrolled the for-loop to:
#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)
for (p = 0; p < K; p = p + 2)
{
    DOSOMETHING;  /* point p     */
    DOSOMETHING;  /* point p + 1 */
}
This had no impact on running time, even though each iteration now does twice the work (and there are half as many iterations).
Any ideas ?
Thanks,
Ianir.
------------
Please try:
#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided, X)
for (p = 0; p < K; p++) { DOSOMETHING; }
and
#pragma omp parallel for shared(P_pt, start_output_pt) schedule(dynamic, X)
for (p = 0; p < K; p++) { DOSOMETHING; }
where X = K / m / num_threads, with m from 2 to 10.
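For concreteness, a sketch of that formula with this thread's numbers (m = 4 is just one point in the suggested 2-10 range; K, p, P_pt, start_output_pt and DOSOMETHING are from the post above):

    #include <omp.h>

    int num_threads = omp_get_max_threads();
    int m = 4;                    /* try values from 2 to 10            */
    int X = K / m / num_threads;  /* K = 1200, 8 threads, m = 4 -> X = 37 */
    if (X < 1) X = 1;             /* guard against very small K         */

    #pragma omp parallel for shared(P_pt, start_output_pt) schedule(dynamic, X)
    for (p = 0; p < K; p++) { DOSOMETHING; }

With X = 37 each thread picks up roughly 4 chunks over the whole loop, instead of up to 150 single-iteration requests.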
------------
As I see it, you used guided scheduling with the default minimum chunk size of 1, which means up to K separately scheduled chunks for 8 cores. Each chunk is handed out separately, in the order threads request work. That scheduling is overhead, which degrades your parallelization ratio.
------------
float * P_pt = (float *) malloc(K*4*sizeof(float));
(K is typically 1200-2000).
The operations I do are basically a form of histograms in a spatial neighborhood.
The guided schedule is indeed due to very different neighborhood sizes for different points, which results in varying runtime per iteration. I have tried setting the schedule to dynamic, but I have not tried specifying chunk sizes; I will do so now. Thanks.
------------
Let's try to clarify the rule for assigning iterations under guided scheduling:
P1 = C*K/(num_threads - 1);
P2 = C*(K - P1)/(num_threads - 1);
P3 = C*(K - P1 - P2)/(num_threads - 1);
...
Pn = C*(K - P(n-1) - ... - P1)/(num_threads - 1);
... while Pn > chunksize, then
P(n+1) = chunksize
...
P(n+m-1) = chunksize
P(n+m) = remaining iterations <= chunksize
where C <= 1.
With the default chunksize = 1:
for K = 1200, C = 0.5 we have about 81 schedules, 14.6 iterations on average;
for K = 2000, C = 0.5 we have about 89 schedules, 22.5 iterations on average: more work per scheduling event, which is why you get a better ratio for K = 2000.
With chunksize = 20:
for K = 1200, C = 0.5 we have about 34 schedules, 35.3 iterations on average;
for K = 2000, C = 0.5 we have about 41 schedules, 48.8 iterations on average.
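If you want to see the decay for yourself, here is a toy simulation of the rule above (the real libiomp5 formula is implementation-specific; C = 0.5 is an assumption):

    #include <stdio.h>

    int main(void)
    {
        int K = 1200, num_threads = 8, chunksize = 1;
        double C = 0.5;
        int remaining = K, n = 0;

        while (remaining > 0) {
            /* next chunk per the rule above, capped below by chunksize */
            int chunk = (int)(C * remaining / (num_threads - 1));
            if (chunk < chunksize) chunk = chunksize;
            if (chunk > remaining) chunk = remaining;
            remaining -= chunk;
            n++;
        }
        printf("K=%d: %d schedules, %.1f iterations on average\n",
               K, n, (double)K / n);
        return 0;
    }

Vary K and chunksize to reproduce the figures above.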
------------
Pay attention to false sharing issues.
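For illustration, one way to rule out false sharing on the output side: if adjacent points' outputs share a 64-byte cache line, writes from different threads ping-pong that line, so pad each per-point output to a full line. The four-float layout below mirrors the P_pt allocation later in this thread, but the struct is my assumption, not the actual data:

    #include <stdlib.h>

    #define CACHE_LINE 64

    typedef struct {
        float result[4];                           /* assumed 4 floats/point */
        char pad[CACHE_LINE - 4 * sizeof(float)];  /* pad to one cache line  */
    } padded_out_t;

    /* cache-line-aligned output array, one padded slot per point */
    padded_out_t *alloc_outputs(size_t K)
    {
        void *p = NULL;
        if (posix_memalign(&p, CACHE_LINE, K * sizeof(padded_out_t)) != 0)
            return NULL;
        return (padded_out_t *)p;
    }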
------------
I tried dynamic scheduling with different chunk sizes. What I saw was a decrease in CPU time spent in libiomp5.so, which is good: the scheduling overhead was reduced. However, the runtime of my code has not changed, and the speedup is still not up to spec.
As far as I understand, false sharing happens when a write by one core invalidates a cache line shared with another core. I am not sure this is the case here, since increasing the number of points (which means more for-loop iterations) led to the expected speedup. Moreover, the VTune parameters I measured do not seem to indicate such a problem. Is there a hardware event in VTune that will measure such cases?
Thanks again, I appreciate the effort.
------------
How about guided scheduling with a chunk size? I'm interested in the results ))
>> Is there a hardware event in VTune that will measure such cases?
I don't know.
------------
Ideally you want to partition the work evenly, with as few thread-scheduling interactions as possible. When your neighborhood sizes are relatively large (and of various sizes), you might want a small chunk size (a small number of neighborhoods per chunk). When the neighborhoods are relatively small, a larger chunk size is in order. Therefore, consider partitioning your work into groups by neighborhood size, then using a chunk size inversely proportional to the relative neighborhood size:
large neighborhoods... chunk = 1
medium neighborhoods... chunk = n ! n tbd
small neighborhoods... chunk = m ! m tbd
Whether you use 2, 3, ..., x differently sized groupings would depend on your application. A sketch of the idea follows below.
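Something along these lines, reusing the DOSOMETHING placeholder from the posts above; large_end, medium_end and the chunk values are placeholders to be tuned, and the points are assumed pre-sorted by descending neighborhood size:

    /* three parallel loops over index ranges pre-sorted by neighborhood
       size; each range gets a chunk size matched to its work per point */
    void process_groups(int K, int large_end, int medium_end)
    {
        int p;

        /* large neighborhoods: fine-grained chunks for load balance */
        #pragma omp parallel for schedule(dynamic, 1)
        for (p = 0; p < large_end; p++)
            DOSOMETHING;

        /* medium neighborhoods */
        #pragma omp parallel for schedule(dynamic, 8)
        for (p = large_end; p < medium_end; p++)
            DOSOMETHING;

        /* small neighborhoods: coarse chunks amortize scheduling overhead */
        #pragma omp parallel for schedule(dynamic, 32)
        for (p = medium_end; p < K; p++)
            DOSOMETHING;
    }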
Jim Dempsey
------------
I think I will try to implement Jim Dempsey's suggestion; it looks like it may help reduce the scheduling overhead.
It still doesn't solve the main problem: my main functions seem to work better when the workload is higher (say, more iterations, or a larger neighborhood size inside the function). But it is a start.
Is there a recommended book that deals with such issues?
Thanks.
------------
Sorry, I have no further ideas while I know nothing about the calculations inside the iterations ((
------------
Just one more thing: you mentioned before that parallelization works well if I have more than 100,000 instructions. Do you know the reason for the degradation with a lower number of instructions?
------------
The TBB documentation suggests at least 100K clock cycles per parallel task; I just increased that "a little bit" and stated it in instructions.
See the "Tutorial" and search for "Packaging Overhead Versus Grainsize":
http://threadingbuildingblocks.org/documentation.php
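For scale (my rough arithmetic, assuming the 100K-cycle guideline applies per scheduled chunk): at 2.50 GHz, 100K cycles is about 40 microseconds, so with K = 1200 and chunk size 1 each iteration would need roughly 40 microseconds of work before scheduling overhead becomes negligible by that rule; with chunks of 37 iterations, about 1 microsecond per iteration suffices.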
------------
You could look at the asm output file from your compiler and get an idea of the function calls implied by your OpenMP usage. Details under Intel OpenMP are intentionally hidden; with GNU compilers, you should be able to capture the full source code with the expansion of the OpenMP directives.
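For example (the exact flag spellings depend on your compiler versions, and mycode.c is a placeholder file name):

    icc -S -openmp mycode.c
        (writes mycode.s, where the calls into the OpenMP runtime are visible)
    gcc -fopenmp -fdump-tree-ompexp mycode.c
        (dumps the compiler's expanded form of the OpenMP directives)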
------------
The E5420 is a 4-core processor (no HT). Does this mean you have two such processors?
------------
>>This code computes a sequence of simple image processing operations in pixel neighborhoods for points in an image. Typically, I have an order of 1200-2000 points per image.
Does this mean your code is somewhat along the lines of:
for (image = 0; image < nImages; ++image)
{
    ReadImage(image);
    #pragma omp parallel for
    for (point = 0; point < nPoints; ++point)
        processPoint(point);
}
IOW the app is a high frequency of:
serial (with I/O)
parallel
serial (with I/O)
parallel
...
If so, then I suggest you rewrite the code to use a parallel pipeline technique.
The QuickThread threading toolkit (which I produce, by the way) excels at parallel pipelines containing a mixture of I/O and compute class pipes. We can discuss parallel pipelines further if you think this would be beneficial to your application. A rough flavor of the overlap is sketched below.
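A minimal double-buffered sketch of that overlap using plain OpenMP rather than QuickThread; image_t and the two-argument processPoint signature are my assumptions:

    #include <omp.h>

    typedef struct image image_t;               /* hypothetical image type   */
    image_t *ReadImage(int index);              /* hypothetical image load   */
    void processPoint(image_t *im, int point);  /* hypothetical point kernel */

    void run_pipeline(int nImages, int nPoints)
    {
        image_t *cur = ReadImage(0);  /* prime the pipeline */
        image_t *next = NULL;
        int image, point;

        for (image = 0; image < nImages; ++image) {
            #pragma omp parallel private(point)
            {
                /* one thread prefetches the next image (I/O stage)... */
                #pragma omp master
                if (image + 1 < nImages)
                    next = ReadImage(image + 1);

                /* ...while all threads share the compute stage; nowait lets
                   workers proceed without waiting at the loop's end */
                #pragma omp for schedule(dynamic, 8) nowait
                for (point = 0; point < nPoints; ++point)
                    processPoint(cur, point);
            } /* implicit barrier: the next image is ready here */
            cur = next;
        }
    }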
Jim