luca_l_ · ‎05-04-2017

I'm trying to parallelize my application using OpenMP. OpenCV (built with IPP for best efficiency) is used as an external library.

These are the Google Drive links to my VTune results:

c755823 basic KMP_BLOCKTIME=0 30 runs : basic hotspot with environment variable KMP_BLOCKTIME set to 0 on 30 runs of the same input

c755823 basic 30 runs : same as above, but with default KMP_BLOCKTIME=200

c755823 advanced KMP_BLOCKTIME=0 30 runs : same as first, but advanced hotspot

On my Intel i7-4700MQ the actual wall-clock time of the application, averaged over 10 runs, is around 0.73 seconds. I compile the code with icpc 2017 update 3 with the following compiler flags:

INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions    
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl

In addition, I set KMP_BLOCKTIME=0 because the default value (200) was generating a huge overhead.

We can divide the code into 3 parallel regions (wrapped in a single #pragma omp parallel for efficiency) preceded by a serial one, which accounts for around 25% of the algorithm (and can't be parallelized).
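As a rough sanity check on what that serial part implies, here is a minimal sketch of Amdahl's law. It assumes the 25% figure is a fraction of the single-threaded runtime, which the post does not state explicitly; the function name amdahlSpeedup is only illustrative.

```cpp
#include <cassert>

// Upper bound on speedup from Amdahl's law:
//   S(p) = 1 / (s + (1 - s) / p)
// where s is the serial fraction of the single-threaded runtime
// (assumed here to be the ~25% quoted above) and p is the thread count.
double amdahlSpeedup(double serialFraction, int threads) {
    assert(serialFraction >= 0.0 && serialFraction <= 1.0 && threads > 0);
    return 1.0 / (serialFraction + (1.0 - serialFraction) / threads);
}
```

Under that assumption, amdahlSpeedup(0.25, 8) is about 2.9 and amdahlSpeedup(0.25, 64) is about 3.8; the bound can never exceed 4 however many threads are added.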

I'll try to describe them (or you can skip to the code structure directly):

We create a single parallel region in order to avoid the overhead of creating a new one for each loop. The end goal is to populate the rows of a matrix object, cv::Mat descriptors. The shared data are three std::vector objects plus the final matrix:
(a) blurs, a chain of blurs (not parallelizable) computed with OpenCV's GaussianBlur (which uses the IPP implementation of Gaussian blur)
(b) hessResps (size known, say 32)
(c) findAffineShapeArgs (unknown size, but on the order of thousands of elements, say 2.3k)
(d) cv::Mat descriptors (unknown size, final result)
In the serial part, we populate blurs, which is read-only from then on.
In the first parallel region, hessResps is populated using blurs without any synchronization mechanism.
In the second parallel region, findLevelKeypoints is called using hessResps as read-only. Since the size of findAffineShapeArgs is unknown, each thread fills a local vector, localfindAffineShapeArgs, which is appended to findAffineShapeArgs in the next step.
Since findAffineShapeArgs is shared and its size is unknown, we need a critical section where each localfindAffineShapeArgs is appended to it.
In the third parallel region, each element of findAffineShapeArgs is used to generate rows of the final cv::Mat descriptors. Again, since descriptors is shared, we need a local version, cv::Mat localDescriptors.
A final critical section push_backs each localDescriptors to descriptors. Notice that this is extremely fast, since cv::Mat is "kinda" a smart pointer, so we push_back pointers.

This is the code structure:

cv::Mat descriptors;
std::vector<cv::Mat> blurs(blursSize);
std::vector<cv::Mat> hessResps(32);
std::vector<FindAffineShapeArgs> findAffineShapeArgs; // we don't know its size in advance

#pragma omp parallel
{
    // compute all the Hessian responses
    #pragma omp for collapse(2) schedule(dynamic)
    for (int i = 0; i < levels; i++)
        for (int j = 1; j <= scaleCycles; j++)
        {
            hessResps[/*...*/] = hessianResponse(/*...*/);
        }

    // each thread collects its keypoints in a private vector
    std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
    #pragma omp for collapse(2) schedule(dynamic) nowait
    for (int i = 0; i < levels; i++)
        for (int j = 2; j < scaleCycles; j++)
        {
            findLevelKeypoints(localfindAffineShapeArgs, hessResps[/*...*/], /*...*/); // populate localfindAffineShapeArgs with push_back
        }

    // append the private vector to the shared one
    #pragma omp critical
    {
        findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
    }

    // wait until every thread has appended its results
    #pragma omp barrier
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < findAffineShapeArgs.size(); i++)
    {
        findAffineShape(findAffineShapeArgs[i]);
    }

    // localRes holds the per-thread results (its declaration is omitted in this snippet)
    #pragma omp critical
    {
        for (size_t i = 0; i < localRes.size(); i++)
            descriptors.push_back(localRes[i].descriptor);
    }
}
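
For readers who want to try the "thread-local vector, then merge in a critical section" pattern in isolation, here is a minimal self-contained sketch. It is not the code from the application: produceItems is a hypothetical stand-in for findLevelKeypoints, and plain int values stand in for FindAffineShapeArgs.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for findLevelKeypoints: appends a variable
// number of values per work item, so the total size is not known upfront.
static void produceItems(std::vector<int>& out, int workItem) {
    for (int k = 0; k <= workItem % 3; ++k)
        out.push_back(workItem * 10 + k);
}

// Collects results of unknown total size from a parallel loop.
std::vector<int> collectParallel(int workItems) {
    assert(workItems >= 0);
    std::vector<int> shared;
    #pragma omp parallel
    {
        std::vector<int> local;  // private to each thread, no locking needed
        #pragma omp for schedule(dynamic) nowait
        for (int i = 0; i < workItems; ++i)
            produceItems(local, i);

        // One short critical section per thread instead of one per element.
        #pragma omp critical
        {
            shared.insert(shared.end(), local.begin(), local.end());
        }
    }
    // The merge order depends on thread timing, so sort for a stable result.
    std::sort(shared.begin(), shared.end());
    return shared;
}
```

The sketch also compiles without OpenMP support, in which case the pragmas are ignored and the loop simply runs serially. Each thread enters the critical section once, so the lock cost should scale with the number of threads rather than the number of elements.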

At the end of the question, you can find FindAffineShapeArgs.

I'm using Intel VTune Amplifier to see hotspots and evaluate my application.

The OpenMP Potential Gain analysis says that the Potential Gain with perfect load balancing would be 5.8%, so we can say that the workload is well balanced across the CPUs.

This is the CPU usage histogram for the OpenMP region (remember that this is the result of 10 consecutive runs):

So as you can see, the Average CPU Usage is 7 cores, which is good.

This OpenMP Region Duration Histogram shows that across these 10 runs the parallel region always takes about the same time (with a spread of around 4 milliseconds):

This is the Caller/Callee tab:

For your information:

interpolate is called in the last parallel region
l9_ownFilter* functions are all called in the last parallel region
samplePatch is called in the last parallel region.
hessianResponse is called in the second parallel region

Now, my first question is: how should I interpret the data above? As you can see, for many of the functions, half of the "Effective Time by Utilization" is "Ok", which would probably become "Poor" with more cores (for example on a KNL machine, where I'll test the application next).

Finally, this is the Wait and Lock analysis result:

Now, this is the first weird thing: the Join Barrier at line 276 (which corresponds to the most expensive wait object) is #pragma omp parallel, so the beginning of the parallel region. So it seems that someone spawned threads before. Am I wrong? In addition, the wait time is longer than the program itself (0.827s for the program vs 1.253s for the Join Barrier that I'm talking about)! But maybe that figure is the waiting time summed over all threads (and not wall-clock time, which would clearly be impossible since it's longer than the program itself).

Then, the Explicit Barrier at line 312 is #pragma omp barrier of the code above, and its duration is 0.183s.

Looking at the Caller/Callee tab:

As you can see, most of the wait time is "Poor", so it refers to one thread. But I'm not sure that I'm understanding this correctly. My second question is: can we interpret this as "all the threads are waiting for just one thread that is lagging behind"?

FindAffineShapeArgs definition:

struct FindAffineShapeArgs
{
    FindAffineShapeArgs(float x, float y, float s, float pixelDistance, float type, float response, const Wrapper &wrapper) :
        x(x), y(y), s(s), pixelDistance(pixelDistance), type(type), response(response), wrapper(std::cref(wrapper)) {}

    float x, y, s;
    float pixelDistance, type, response;
    std::reference_wrapper<Wrapper const> wrapper;
};

Dmitry_P_Intel1 · ‎05-05-2017

Hello Luca,

Could you please publish a screenshot of "Top 5 Parallel Regions by Potential Gain" from the summary view? Then click on any region name there, which should lead you to a grid view with the "/OpenMP Region/OpenMP Barrier-to-Barrier" grouping. Expand the Potential Gain column to see the reasons for the gain, expand the OpenMP region rows with significant potential gain time to see the metrics by barrier-to-barrier segment, and publish a screenshot of the grid view as well. With this you will see the normalized imbalance, lock and scheduling overhead cost per segment.

Also it would be good to increase the workload or choose advanced hotspots with less sampling granularity to make the results more statistically representative.

BTW - you wrote that the default KMP_BLOCKTIME generated huge overhead - did you compare runs of the original application (without profiling) with the default KMP_BLOCKTIME and with KMP_BLOCKTIME=0? Usually, forcing the library to go to sleep on each synchronization can lead to additional overhead.
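
One way to make that comparison outside the profiler is a small timing helper built into the application. This is only a sketch: meanWallSeconds is a hypothetical name, and the callable passed in stands for whatever function runs the whole algorithm on one input.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Mean wall-clock seconds of `work` over `runs` consecutive calls.
// A monotonic clock is used so system clock adjustments do not skew the result.
double meanWallSeconds(const std::function<void()>& work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < runs; ++r)
        work();
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / runs;
}
```

Running the same binary once with KMP_BLOCKTIME unset and once with KMP_BLOCKTIME=0 in the environment, and comparing the two means, would show whether the overhead is also present without the profiler attached.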

Thanks & Regards, Dmitry

luca_l_ · ‎05-05-2017

Hello Dmitry,

First of all, thanks for your answer, I really appreciate that.

As you can see from the code structure that I've published in my question, there is only one parallel region with 4 parallel for loops and 2 critical sections. For this reason, the "Top 5 Parallel Regions by Potential Gain" in the summary view shows only one region (the only one).

Here's a screenshot:

As you can see from the screenshot above, the Potential Gain is very low (only 5.8%), so I guess that this may not be meaningful. Please correct me if I'm wrong.

Looking at the "/OpenMP Region/OpenMP Barrier-to-Barrier" grouping, this is the order of the most expensive loops:

- The 3rd loop:

#pragma omp for schedule(dynamic) nowait
for (int i = 0; i < findAffineShapeArgs.size(); i++)
{
    findAffineShape(findAffineShapeArgs[i]);
}

is the most expensive one (as I already knew), and here's a screenshot of the expanded view:

As you can see, many functions are from OpenCV, which exploits IPP and is (should be) already optimized. Expanding the two other functions (interpolate and samplePatch) shows a [No call stack information]. Same for all the other functions (in other regions too).

The 2nd most expensive region is the second parallel for:

#pragma omp for collapse(2) schedule(dynamic) nowait
for (int i = 0; i < levels; i++)
    for (int j = 2; j < scaleCycles; j++)
    {
        findLevelKeypoints(localfindAffineShapeArgs, hessResps[/*...*/], /*...*/); // populate localfindAffineShapeArgs with push_back
    }

Here's the expanded view:

And finally, the 3rd most expensive is the first loop:

#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
    for (int j = 1; j <= scaleCycles; j++)
    {
       hessResps[/**/] = hessianResponse(/*...*/);
    }

Here's the expanded view:

I'll run an Advanced Hotspot and also a bigger workload, hoping to see more accurate results (I'll post it in my next comment).

If by "original application" you mean the original code written by the original author, it was not parallelized; parallelizing it is the job I'm trying to do. Using the default KMP_BLOCKTIME resulted in longer execution, as I said in my question.

I would like to post directly the VTune files so you can see all the information that you like, but I can't attach them. If you have any solution, I'll gladly accept it.

For anything else, just ask.

TimP · ‎05-08-2017

If IPP uses the TBB library, that would be a reason for needing to set KMP_BLOCKTIME=0. I don't understand whether Dmitry refers to the transition between TBB and OpenMP or to the extra overhead within OpenMP when the blocktime is too small. If hyperthreading is active, the problem may sometimes be alleviated by limiting TBB and OpenMP each to 1 thread per core with OMP_PLACES=cores, so that OpenMP may block but still permit TBB some use of each core. Then it may be difficult under VTune to see which apparent OpenMP overhead is actually blocking progress of the application.

luca_l_ · ‎05-08-2017

Hello Tim,

Thanks for your time and help! I don't know if IPP uses TBB internally; I installed it along with Parallel Studio 2017 (I don't remember which update). The only thing that I can do is build OpenCV with both TBB and OpenMP disabled (leaving pthreads only), but this probably isn't going to help much, since parallelism in OpenCV usually means simple parallel fors. Btw, I don't know OpenMP affinity, but I'm already using thread affinity: 2 threads per core with KMP_AFFINITY=compact and KMP_HW_SUBSETS=4c,2t on my Intel Core i7, and one thread per core with KMP_HW_SUBSETS=64c,1t on the Intel KNL.

If you're interested, I updated the question including a link to download the VTune results (if you want to play with them) and I can send you the original code in private (just let me know in that case).

How should I interpret these VTune results?