Software Archive
Read-only legacy content

OpenMP spinning time

Ashkan_T_
Beginner
2,362 Views

Hello,

I am using a simple Merge Sort benchmark on the Xeon Phi. 78% of the total CPU time is consumed by "libiomp5.so".

I tried to reduce the time wasted by the OpenMP runtime library by setting "export KMP_BLOCKTIME=0". Please note that the application is running natively on the MIC. I have also tried "export OMP_WAIT_POLICY=passive". No effect!

Why does this not have any effect on the execution time or on the wasted CPU time?

Thank you.

0 Kudos
14 Replies
jimdempseyatthecove
Honored Contributor III

As you complete merge phases, you reduce the number of merge streams. At some point the number of streams is less than the number of worker threads.

You could display the number of merge streams on each major iteration.

Also, are you performing an internal sort (fits in memory) or external sort (from disk)?

Jim Dempsey

TimP
Honored Contributor III

It would seem a rather poor algorithm if you expect most of the OpenMP time to be spent in BLOCKTIME. Did you determine that this is the case?

Perhaps you are using too many threads for the data set size.

Ashkan_T_
Beginner

Hello Jim,

Thank you for the answer. I have implemented the same algorithm with TBB. The runtime library overhead in that case was less than 5%. I am not looking at the running time. I just want to see if I can reduce the overhead of OpenMP runtime library by removing the spinning.

To answer your question, it is an internal sort. I could get a total CPU time of ~250s with TBB, while this number is ~1700s for OpenMP.

What besides spinning could cause this difference for such a simple benchmark?

---------

Dear Tim,

It actually is a poor algorithm for scaling, but the data size is not small for the number of threads: 80 million integers. It takes about 7 seconds on the Xeon Phi with 240 threads. The question is why I cannot reduce the total CPU time (not the running time) by changing the BLOCKTIME?

jimdempseyatthecove
Honored Contributor III

Generally, setting the block time to 0 (or reducing it) is detrimental to the parallel application, while being beneficial to other applications on your system. If you are observing excessive block time numbers (together with poor performance), then your workload partitioning is not balanced. Can you post your parallel loops?

Jim Dempsey

James_C_Intel2
Employee

As Jim D is implicitly pointing out, your objective (reducing time in the OpenMP runtime while not being concerned with application performance) seems rather perverse. (If that is really all you want to optimize, then you could compile without OpenMP; that will certainly eliminate all OpenMP runtime overhead!).

More seriously, as others have pointed out, if threads are waiting, that is because they don't have work to do. That's an algorithmic problem which can't be fixed by changing settings in the OpenMP runtime, which affect how the threads wait, but not the fact that they have to wait.

An up-to-date VTune can show you load imbalance assigned to each OpenMP parallel region, which will let you see where the problems are. This is described in this article.

Here is the kind of information it generates; note that it shows the serial time, the parallel time, and the load imbalance.

[Attached image: ompkb3.PNG — VTune summary showing per-region serial time, parallel time, and load imbalance]

Ashkan_T_
Beginner

Hi James,

I am concerned about the performance, but what I am trying to do is find the effect of different parameters. As I said, the difference between TBB and OpenMP shows that the algorithm can reach the same performance without the extra overhead OpenMP has (the wasted CPU cycles). This is not an algorithmic problem. If they have to wait, they should wait without spinning (again, TBB total CPU time: ~250s, OpenMP: ~1700s).

I should definitely get the newer version of VTune to see what exactly is happening in the system; however, I am still puzzled why changing the BLOCKTIME does not have any effect on the total CPU time.

@Jim,

This is the code I use. "cut" is the cutoff point; once it reaches 1, the computation is serial. The whole paper and the comparison with TBB and Cilk Plus is here:

http://goo.gl/ZOKmAJ 

void mergesort_parallel_omp (int* input, int* tmp, int size, int cut) {
    int half = size/2;
    int *A, *B, *tmpA, *tmpB;

    if (cut == 1) {
        mergesort_serial(input, tmp, size);
        return;
    }

    A = input;
    tmpA = tmp;
    B = A + half;
    tmpB = tmpA + half;

#pragma omp task
    {
        mergesort_parallel_omp(A, tmpA, half, cut/2);
    }

#pragma omp task
    {
        mergesort_parallel_omp(B, tmpB, size-half, cut/2);
    }

#pragma omp taskwait

#pragma omp task
    {
        merge(A, B, size, tmp);
    }
#pragma omp taskwait
}
 
     
James_C_Intel2
Employee

Are you completely confident that the changes you are making to the envirables are visible to the program?

I suggest you set KMP_SETTINGS=1 which will show you the settings seen by the OpenMP runtime. (And, if it doesn't, that shows that you're not propagating the envirables :-)).

If they have to wait they should wait without spinning

Why? Assuming that you're not trying to time-slice KNC hardware threads (which is generally a bad idea), the hardware resources are allocated to your process from start to finish, and how the runtime chooses to use them when you can't seems a minor detail. The choice of spinning is made because it helps many codes perform better, by reducing the time needed to fork. (The one reasonable argument here would be that not spinning may reduce energy consumption; but if that's really true, then you have some serious load-imbalance problems anyway!)

Once you're executing parallel programs, the interesting metric is time to solution, not CPU time used. (Your aim is to change a 1-CPU-for-n-seconds run into an M-CPUs-for-n/M-seconds run (achieving perfect scaling), which still consumes M*(n/M) == n CPU seconds. As soon as your parallel efficiency drops below 100%, you are guaranteed that the parallel run will consume more CPU time than the serial one, but we accept that because we want the solution sooner.)

jimdempseyatthecove
Honored Contributor III

Ashkan,

You have an apples-and-oranges comparison. In your PDF paper, page 6, MergeSort chart (a), the average speedup for OpenMP is greater than that for Cilk Plus and TBB, yet your metric (the one you prefer to use) is the reported CPU time. Your chart (b) chose the 240-thread scenario instead of the optimal 60- or 120-thread scenario. Also note that you are observing between a 9x and 12x speedup; you should see at least 5x this with algorithmic changes.

On page 8, MatMul, OpenMP also shows faster speedup, yet your focus again is on CPU time. Looking at the 2048 numbers for (b) and (c), we see for CPU time / speedup approximately 800/52 (15.38) for OpenMP and 625/40 (15.63) for TBB. Not much different.

On page 10, turnaround time with multiple applications running concurrently: for this test you would want to use KMP_BLOCKTIME=0 (such that you relieve any wait time to other applications) .AND. you would choose 120 threads for Cilk Plus and OpenMP. There is little sense in not taking your learning experience from the single-application-at-a-time case and applying it to multiple applications at a time.

Jim Dempsey

Ashkan_T_
Beginner

@James: Thanks for your answer. Yes. I have checked it once again:

Effective settings:

   KMP_ABORT_DELAY=0
   KMP_ABORT_IF_NO_IRML=false
   KMP_ADAPTIVE_LOCK_PROPS='1,1024'
   KMP_ALIGN_ALLOC=64
   KMP_ALL_THREADPRIVATE=960
   KMP_ALL_THREADS=2147483647
   KMP_ASAT_DEC=1
   KMP_ASAT_FAVOR=0
   KMP_ASAT_INC=4
   KMP_ASAT_INTERVAL=1000
   KMP_ASAT_TRIGGER=5000
   KMP_ATOMIC_MODE=2
   KMP_BLOCKTIME=0

Obviously, the OpenMP committee has decided to choose the best option (considering all the factors we are discussing here) as the default wait policy. I didn't mean that they "should wait without spinning". I mean that now that we want to change the default, there should be a way to force the threads to go to sleep in order to measure the performance in that case. What we can see here is that KMP_BLOCKTIME does not do the job.

As I said, I am not saying that the wasted CPU cycles are more important than "time to solution". I want to investigate the total CPU time as another performance metric, which might or might not be very important in uniprogramming, but will become important for multiprogramming.

------------------------------------------------------------

@Jim,

First of all, thanks for your comments on the paper.

I should clarify the purpose of the paper. We are trying to look at different performance metrics. Speedup is one of them. CPU time is important (as we have shown). Cutoff is important. The paper does not say TBB is better than OpenMP; it only considers different factors. The MergeSort algorithm is the one that has been used in many other papers; it is also available in the BOTS benchmark suite. The efficiency of the algorithm, however, is not our concern. Finding the effect of different factors is. (Here we target parallelism in a general-purpose system. Although the Xeon Phi is not designed for that, it provides us with what we need: a shared-memory architecture with many cores.)

Why has the number of threads not changed? Basically, the whole idea was to measure the performance of all approaches with the default configuration, which is 240 threads on the Xeon Phi. That's why we designed a new programming approach. We achieved top performance amongst all for the same benchmarks with 240 threads (no need to tune this number). This simply means that the drop in performance we observe with the other approaches is avoidable. So it is not about the algorithm anymore; it is more about the runtime system. If you want to see our results: http://goo.gl/E3p1I5

It is not an apples-and-oranges comparison. Quite the contrary: I think if we choose the best results and compare them together, the comparison is not fair. Why should we compare the total CPU time using, for example, 100 threads with OpenMP vs. 240 threads with TBB?

Without a cost model, how should users find the optimal number of threads based on the programming model and the nature of the program itself (assuming the cases where tasks are large enough to cover the overhead of parallelization)? Even if they do, what about the input size? If we do all the experiments based on a specific number of threads, what would happen if someone wants to change the data size? What if the OpenMP performance for the mergesort benchmark for 100M integers is better with 180 threads? For all these reasons, we decided to work with 240 threads (although we have been fair and showed the speedup as well).

The discussion about MatMul could be different, because it uses the traditional "omp for" rather than "omp task"s. I agree that CPU time / speedup in this case is fairly similar, but for MergeSort it was not (in the case of 240 threads, for example).

Most importantly, the purpose was not to achieve the best performance for multiprogramming, either. Instead of learning from the previous cases, we wanted to see the effect of the previous factors in this new situation, and learn from all the cases. Moreover, there is no guarantee that if an application achieves its best speedup with X threads, the same thing happens in the presence of other applications (multiprogramming), right? So, in GENERAL cases, why should we compare different configurations (for example, 100 threads for each of the 3 OpenMP programs vs. 240 threads for each of the 3 TBB programs) if we are not even sure that they would result in the best performance? Also, in a general-purpose system, no one can predict what the next state of the system will be (applications come and go randomly). So the ideal is that we start the programs with the default number of threads and rely on the OS and/or runtime systems to adapt based on runtime observations. Isn't it?

I have also checked the OpenMP case with thread affinity. Thread migration was not wasting CPU cycles in this case.

All in all, I still think that there is a problem here. It seems that the spin-wait does not get disabled!

Thank you again :)

jimdempseyatthecove
Honored Contributor III

In light of the Xeon Phi having 60 cores and 4 hardware threads per core, where the optimal number of threads per core greatly depends on the cacheability of the algorithm and the problem size in relation to the memory bandwidth, you cannot always use 4 threads per core on all cores under the assumption that this will give you meaningful results. This would be similar to evaluating a Formula 1 race where the cars have one gear.

If you want your paper to be an academic exercise it is one thing, but it clearly does not present practical advice (other than by inference).

What is the entirety of your environment variable settings? I do not see the settings relating to the number of threads, or whether nested parallelism is enabled.

In your post #7 sample code, remove the solitary task around the merge phase (and the taskwait after it). There is no reason to issue a solitary task for the merge; it can be done inline.

Jim Dempsey

Ashkan_T_
Beginner

Thanks Jim,

 "Optimal threads per core greatly depends on cacheability of the algorithm and problem size in relationship to the memory bandwidth". Absolutely. 

But another point of view: we assume that every software thread is pinned to a hardware thread. So, the problems of task scheduling on threads and thread scheduling on cores become the problem of task scheduling on cores. That's how our programming model works. All the issues regarding the problem size and efficient chunking for caching benefits remain the same, i.e. they depend on how efficiently one implements the algorithm.

OmpSs, for example, proposed a similar idea to shift some of the overhead of dealing with task dependencies to compile time. Those features were added to OpenMP 4.0 later on. That is, to my mind, the practical conclusion: there is some overhead around that could be reduced.

But what are the benefits? We pre-schedule the tasks on threads (hence on cores) based on their dependencies, and balance the load (only if needed) at runtime. We got a speedup of 13x with 240 threads, with a total CPU time of less than 200 seconds (less energy consumption), with the same algorithm (including the solitary merge task). TBB, e.g., cannot even go beyond 10x in any case.

I will apply your tips to see if I get any new results. Probably, I should also install VTune 2015 ASAP to see what is happening inside the omp runtime system.

 

James_C_Intel2
Employee

Obviously, OpenMP committee have decided to choose the best option (considering all the factors we are discussing here) as the default wait policy.

This has nothing much to do with the OpenMP committee. It's about implementation choices in the Intel OpenMP runtime. (You can tell, because the envirable doesn't start with OMP_)

 I didn't mean that they "should wait without spinning". 

OK, though that is exactly what you wrote in the post dated Tue, 04/28/2015 - 04:13 :-).

I mean that now that we want to change the default, there should be a way to force them to go sleep in order to measure the performance in that case. What we can see here is that KMP_BLOCKTIME does not do the job.

Without any code we can run it's impossible to know if that's true or not. We don't have any outstanding bug reports of KMP_BLOCKTIME=0 not working. (But maybe there's a bug related to waiting for tasks as against waiting for a new parallel region).

Probably, I should also install VTune 2015 ASAP to see what is happening inside the omp runtime system.

If you really want to know what's happening in the OpenMP runtime, you can read the source code, should that be of interest. (You can find it at either http://openmp.llvm.org or http://openmprtl.org )

jimdempseyatthecove
Honored Contributor III

>>But another point of view: we assume that every software thread is pinned to a hardware thread. So, the problems of task scheduling on threads and thread scheduling on cores become the problem of task scheduling on cores. That's how our programming model works...

The basis for that could be founded on one software thread per core, but this is not the case. On MIC you can use anywhere from 1 to 4 threads within a core. These threads share the L1 and L2 cache; the more threads you use per core, the less effective capacity of each cache level each thread has to work with. This results in requiring finer granularity (per thread) in the parallel regions, which in turn creates more barriers (region entry/exit points). Additionally, the threads within a core share the fetch/store resources. Therefore your task scheduling might benefit from splitting first by core, then by thread within the core.

>>We got the speedup of 13x with 240 threads (OpenMP)... TBB e.g. cannot even go beyond 10x at any cases.

If you are looking at TCO (total cost of ownership), then you have to look at more than just the watts per bytes sorted. The time to complete affects many other cost concerns (salaries, heat/AC, GSA, opportunity, lost billing, ...). 30% is a lot to give up.

FWIW, I seem to recall some posts on the TBB forum some time ago about a concern regarding the reported "spinwait" time in TBB... the corrective measure was to remove that information from the VTune report. As to whether this is (still) the case, or whether the information is listed elsewhere, I cannot say.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Can you post your complete MergeSort program? There may be some additional suggestions to be made.

Jim Dempsey
