Software Archive
Read-only legacy content

What does __kmp_wait_yield_4 imply in hot spot analysis of VTune

YW
Beginner
2,923 Views

Hi,

I am rewriting part of the LibSVM code for vectorization, in order to accelerate it on Xeon Phi. Unfortunately, the performance of my rewritten code is the same as, if not slightly worse than, that of the original on Xeon Phi. To figure out why, I profiled both programs using VTune hot spot analysis. According to the results, the target code segment takes significantly less time in the rewritten version; however, most of its time goes to __kmp_wait_yield_4, which occupies only a tiny amount of time in the original version.

Does anyone know what it means? I tried to Google it but got very little information.

BTW, my vectorized code gets about a 20% improvement overall when running on the host.

Thanks in advance!

0 Kudos
22 Replies
Dmitry_P_Intel1
Employee
2,467 Views

Hello,

The function you mentioned is from the OpenMP runtime. It means that your application uses OpenMP and the runtime is spinning (busy waiting) on some synchronization - either user synchronization such as OpenMP critical sections, or thread synchronization using the pragma "#pragma omp ordered".

It is difficult to understand why it becomes worse without seeing the code. It would be helpful if you could send a code snippet or provide the VTune result for the vectorized case (I will contact you directly about this).

BTW: starting from VTune Amplifier XE 2015 Update 1, if you use the /OpenMP Region/... groupings in the bottom-up grid, you can see CPU Spin and Overhead time classification per region.

Thanks & Regards, Dmitry

0 Kudos
YW
Beginner
2,467 Views

dmitry-prohorov (Intel) wrote:

The function you mentioned is from OpenMP runtime and it means that your application uses OpenMP and you have spinning (busy wait) of OpenMP runtime on some synchronization [...]

Thanks for your reply! I didn't use "#pragma omp ordered" anywhere. I understand that __kmp_wait_sleep is about busy waiting. Does __kmp_wait_yield_4 indicate the same thing?

0 Kudos
James_C_Intel2
Employee
2,467 Views

 I understand that __kmp_wait_sleep is about busy waiting.Does __kmp_wait_yield_4 indicate the same thing?

Both of these are places where the runtime is waiting. Your problem is very likely that you have load imbalance or (what can be considered a really bad case of imbalance) just insufficient parallel work.

If you're running with 240 threads, then clearly you need at least 240 parallel chunks of work (and ideally something nearer 2,400 if you can't ensure that the number of pieces of work is divisible by the number of threads). Since you're also vectorizing, the effective number of chunks of work from a loop is divided by 16 (for SP) or 8 (for DP), so you need 16x more iterations; that gives you 16 x 2,400 = 38,400 iterations for an SP loop, and 19,200 for a DP one.

 

0 Kudos
YW
Beginner
2,467 Views

James Cownie (Intel) wrote:

 I understand that __kmp_wait_sleep is about busy waiting. Does __kmp_wait_yield_4 indicate the same thing?

Both of these are places where the runtime is waiting. Your problem is very likely to be that you have load imbalance [...]

It's expected to have some __kmp_wait_sleep in my program, because I indeed don't have sufficient parallel work due to the on-board memory limitation (8 GB). But it is the first time I have observed __kmp_wait_yield_4 soaring in VTune after vectorizing LibSVM: numerically, from 22s to 906s. I have vectorized other components of my code (e.g. the dot product) but never saw so much __kmp_wait_yield_4 before.

0 Kudos
YW
Beginner
2,467 Views

I think I have kind of figured out what happened. If I decrease the number of threads to 60, i.e. one thread per core (I am using a 5110P, BTW), I get a speedup after vectorization and __kmp_wait_yield_4 no longer stands out in VTune. I think the reason is that, due to the unpredictable iteration counts in the SVM solver, the compiler cannot schedule the use of the VPU efficiently, so contention happens and causes a lot of busy waiting, which appears as __kmp_wait_yield_4. If I limit it to one thread per core, each thread can use the VPU exclusively, so we see the expected speedup.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,467 Views

Your end goal is not to reduce __kmp_wait_yield_4 to 0; rather, it is to reduce the runtime of the application. The amount of time spent in __kmp_wait_yield_4 is an indicator of load imbalance.

Load imbalance is due to several factors, not simply to being given a different amount of work. The __kmp_wait_yield_4 time (per thread) is an indicator of the thread completion skew of a parallel region. The skew is affected by:

un-equal work
skew in thread start time
differing degree of cache hit ratios
false sharing evictions
turbo-boost or lack thereof for core
un-equal latencies through critical sections
... other...

Additionally, the skew is not always consistent per run or per major iteration. When thread completion skew is suspected, I often find it beneficial to instrument the code using RDTSC to determine the completion time of each thread, build up statistics over several runs, then see if a pattern shows up. When the skew pattern is consistent, it can be used to infer how to re-balance loads, so that the otherwise wasted __kmp_wait_yield_4 time can be put to productive use. You might find this blog post of interest: https://software.intel.com/en-us/blogs/2014/02/22/the-chronicles-of-phi-part-5-plesiochronous-phasing-barrier-tiled-ht3

I suggest you read the earlier parts too. While your problem might not be addressable by the final solution in that article, the article addresses issues of how to detect the per-thread completion skew, and then this may be insightful to you as to how you can recover lost computational capability.

Jim Dempsey

0 Kudos
YW
Beginner
2,467 Views

Hi Jim,

Thanks for your suggestions and your blog posts. They are all very helpful and insightful. Back to my case: there are not so many parallel chunks, due to the on-board memory limitation. Typically I only have 120 parallel chunks, and in my experience the performance gain from 120 to 240 threads is relatively marginal, so it's OK with me to leave some threads busy waiting. Because of this, your plesiochronous technique is hard to apply here (do you agree?)

Also, can you comment on my post at #6 https://software.intel.com/en-us/forums/topic/537768#comment-1809508?

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,467 Views

I am not suggesting the plesiochronous phasing barrier is suitable for your application. Instead, what I am suggesting is that the relatively large __kmp_wait_yield_4 time may (stress on "may") be indicative of a lost opportunity for computation. I say "may" because the yield time can also be observed between parallel regions (during the KMP_BLOCKTIME interval). It is hard to determine from the VTune reports whether the yield time is recoverable. You can find this out for yourself by instrumenting your code, for example by changing:

#pragma omp parallel for ...
for(...) {
...
}

into

// requires <x86intrin.h> for __rdtsc(); size finishTime[] to the thread count
#pragma omp parallel
{
#pragma omp for nowait // nowait: record before the loop's implicit barrier
for(...) {
...
}
// still in parallel region, but after this thread's share of the work
finishTime[omp_get_thread_num()] = __rdtsc(); // mark time this thread finished
} // end of parallel region
// accumulate skew: max(finishTime) - min(finishTime)
If there is substantial completion skew amongst threads, then investigate why, and fix it

As to your #6

One of the problems your program may be encountering (hard to say without seeing the code) is, as Cownie suggests, that your loop (slice) count may be getting too low as you add more threads. What may also be at issue is that the number of threads used, and the loop partitioning, do not lend themselves to each thread (except possibly the last) having a whole number of vectors to process. Are you using the OpenMP SIMD directive/construct? You may also need to specify an optimal chunk size in a schedule clause, i.e. each chunk is a multiple of the vector size AND at least a minimal iteration count. On smaller problem sizes, a larger multiple-of-vector chunk will indirectly reduce the thread count. Experiment with dynamic scheduling, as this may avoid waiting for a thread that has been preempted by something else.

Jim Dempsey

0 Kudos
YW
Beginner
2,467 Views

Hi Jim,

Thanks for your explanation. I used the __rdtsc() function as you suggested and then printed the results using printf("%llu\n", *std::max_element(ft, ft+NUM_THREADS_USED) - *std::min_element(ft, ft+NUM_THREADS_USED)); (using %llu, since __rdtsc() returns an unsigned 64-bit value).

If NUM_THREADS_USED=60 (one thread per core, the faster version), the average result is about 200,000. If NUM_THREADS_USED=120 (two threads per core, the slower version in which I guess VPU contention happens), the result varies a lot, from 10,000 to 150,000, across multiple runs, although the running time is consistently slower.

How do you interpret the results? It looks weird to me that my faster version has more completion skew.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,467 Views

There are various numbers of interest for the analysis. My code sketch captured the rdtsc at completion. You would also like to capture the rdtsc at the start (between omp parallel and omp for); also of interest is the main thread's rdtsc prior to the parallel region. The main thread's rdtsc following the parallel region might be of interest too.

With these numbers you can determine

total time
thread start skew time
thread compute time
thread completion skew time
master thread wait time between completing its own compute and the time the last thread finishes.

Going from 60 threads, 1 thread per core with 200,000 ticks per core
Perfect scaling would produce 120 threads, 2 threads per core at ~100,000 ticks per thread.
Of course, you won't have perfect scaling, but on Phi 2 threads per core can get close, depending on organization of program.

Seeing 10,000 to 150,000 strongly indicates that the loads are not in balance; had they been balanced, the number would have been significantly better than 150,000.

Your task now is to determine why some threads take 150,000 ticks, and shove some of their work onto those taking 10,000 ticks.

Note, the 10,000-tick threads may be those partnered with the other thread of the same core, fortuitously improving each other's cache hit ratio. Assuming the number of statements executed by each thread was roughly equal, a 15x runtime difference between threads is strong motivation to find out why the difference exists, and exploit it.

I think you are on your way to finding your better solution. Once you get this figured out for 120 threads, re-do it for 180 threads.

Jim Dempsey

0 Kudos
YW
Beginner
2,467 Views

Sorry, I might not have been clear in my #10 post. When I mentioned 10,000 to 150,000, I meant that the *difference* between the maximum and minimum ticks in a particular run varies from 10,000 to 150,000; i.e. in some runs the threads appear to be balanced, but in other runs they appear significantly imbalanced. It looks like you understood it as some of my threads taking 10,000 ticks while others take 150,000 ticks.

0 Kudos
YW
Beginner
2,467 Views

Also, I measured the number of ticks before and after the "parallel omp for" and found that when NUM_THREADS_USED=60, the average ticks per thread is about 1,400,000,000; when NUM_THREADS_USED=120, the average largely increases to 3,500,000,000, although it is more balanced in terms of the tick difference between the maximal- and minimal-tick threads.

So there must be a lot of additional things happening when more than one thread uses the same VPU in my case...

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,467 Views

>>10,000 to 150,000, I meant the *difference* between maximum and minimum

Then the runtimes are the obverse of the above differences, but relativistically the same. IOW, a disparity in work of 140,000 ticks (though not uniformly distributed).

Can you show a code sketch of your loops, including the OpenMP clauses, how you specify KMP_AFFINITY or OMP_PLACE_THREADS, and comments as to the limits in your for loops and the nature of the computation?

#13 comments

The increased time (~2x) is likely due to cache line evictions. With 120 threads, presumably 2 threads per core, it is beneficial to have them work in concert instead of in competition (cache-wise). Your code should know its affinity association (thread to core) and take advantage of the shared L1 and L2 (getting the most re-use of data fetched from RAM or a remote L2).

Jim Dempsey

0 Kudos
YW
Beginner
2,467 Views

It's part of the LibSVM code (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). I use one thread to run one classifier, and the hot spot is as follows

for (int j = 0; j < active_size; j++)
{
	double grad_diff = Gmax + G[j]*y[j];
	if (y[j] == +1)
	{
		if (!is_lower_bound(j))
		{
			if (G[j] >= Gmax2)
				Gmax2 = G[j];
			if (grad_diff > 0)
			{
				assert(Q_i);
				double quad_coef = QD[i] + QD[j] - 2.0*y[i]*Q_i[j];
				double obj_diff = quad_coef > 0 ? -(grad_diff*grad_diff)/quad_coef : -(grad_diff*grad_diff)/TAU;

				if (obj_diff <= obj_diff_min)
				{
					Gmin_idx = j;
					obj_diff_min = obj_diff;
				}
			}
		}
	}
	else
	{
		if (!is_upper_bound(j))
		{
			if (-G[j] >= Gmax2)
				Gmax2 = -G[j];
			if (grad_diff > 0)
			{
				assert(Q_i);
				double quad_coef = QD[i] + QD[j] + 2.0*y[i]*Q_i[j];
				double obj_diff = quad_coef > 0 ? -(grad_diff*grad_diff)/quad_coef : -(grad_diff*grad_diff)/TAU;

				if (obj_diff <= obj_diff_min)
				{
					Gmin_idx = j;
					obj_diff_min = obj_diff;
				}
			}
		}
	}
}

This is scalar code, so I rewrote it to be

double grad_diff[active_size];
double obj_diff[active_size];
#pragma loop count(200)
for (int j = 0; j < active_size; j++)
{
	grad_diff[j] = Gmax + G[j]*y[j];
	double quad_coef = QD[i] + QD[j] - 2.0*y[i]*Q_i[j];
	obj_diff[j] = quad_coef > 0 ? -(grad_diff[j]*grad_diff[j])/quad_coef : -(grad_diff[j]*grad_diff[j])/TAU;
}
for (int j = 0; j < active_size; j++)
{
	if (y[j] == +1)
	{
		if (!is_lower_bound(j))
		{
			if (G[j] >= Gmax2)
				Gmax2 = G[j];
			if (grad_diff[j] > 0)
			{
				if (obj_diff[j] <= obj_diff_min)
				{
					Gmin_idx = j;
					obj_diff_min = obj_diff[j];
				}
			}
		}
	}
	else
	{
		if (!is_upper_bound(j))
		{
			if (-G[j] >= Gmax2)
				Gmax2 = -G[j];
			if (grad_diff[j] > 0)
			{
				if (obj_diff[j] <= obj_diff_min)
				{
					Gmin_idx = j;
					obj_diff_min = obj_diff[j];
				}
			}
		}
	}
}

KMP_AFFINITY=scatter.

0 Kudos
YW
Beginner
2,467 Views

I ran VTune's General Exploration analysis to profile my 120-thread program, but didn't see significant cache misses in the hot spot function. Also, when running the scalar code, 120 threads work better than 60 threads, which indicates that no cache contention happens in the scalar code.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,467 Views

Where do you place the parallelization?

Presumably on an outer for(i=...) loop.

Can you show the is_lower_bound and is_upper_bound functions?

Is y[] an integer array, with, say, values of 1 and 0? Or 1 and -1?

If so, consider making y[] an array of double holding 1.0 or -1.0

Note, if the bounds are always at j==0 and j==active_size-1 then change your loop instead.

If not, then change your is_lower_bound and is_upper_bound into a single expression that generates a 0.0 if either is true.

Then use the following for your compute intensive loop:

for (int j = 0; j < active_size; j++)
{
  if (!is_lower_or_upper_bound(j)) // place expression here
  {
    if (G[j]*y[j] >= Gmax2)
      Gmax2 = G[j]*y[j];
    if (grad_diff[j] > 0.0)
    {
      if (obj_diff[j] <= obj_diff_min)
      {
        Gmin_idx = j;
        obj_diff_min = obj_diff[j];
      }
    }
  }
}

Jim Dempsey

 

0 Kudos
YW
Beginner
2,467 Views

Thanks for your quick comments! I will try accordingly.

y[] is a char array containing only +1 or -1. One quick question: why do you think y[] would be better as double?

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,467 Views

An alternative to the above is to have y[] be a double containing -1.0, 0.0, or 1.0. When at the upper or lower bound, it contains 0.0.

With this setup, the bounds if-test can be removed, as G[j]*y[j] will produce 0.0.
Note, you may need to adjust grad_diff too, or use if (grad_diff[j]*fabs(y[j]) > 0.0)

Jim Dempsey

0 Kudos
daflippers
Beginner
2,467 Views

My guess would be that it is more efficient to have the arrays G[] and y[] be the same type, i.e. double in this case.

David

Ok so it wasn't that - thanks for the explanation Jim.

0 Kudos
YW
Beginner
2,359 Views

Hi Jim,

Your suggestion doesn't really work in my case. Although I can combine the lower-bound and upper-bound tests into one expression

  bool is_not_upper_or_lower_bound(int i)
  {
    return (y[i] == -1 && alpha_status[i] != UPPER_BOUND) || (y[i] == 1 && alpha_status[i] != LOWER_BOUND);
  }

I still get a lot of __kmp_wait_yield_4 in my program, and the running time doesn't change (if anything it is slightly worse) when using 120 threads. I think the second loop in my code cannot be vectorized any more, correct?

0 Kudos
Reply