Software Archive

How to find the reason for poor scalability?

Surya_Narayanan_N_

Hello,

I am running some multithreaded benchmark programs on the MIC. Some programs don't scale beyond 32 threads and some beyond 64. I am trying to find out why they do not scale beyond a certain number of threads. The poor scaling is definitely not a result of a lack of computing resources (i.e., we can run 244 hardware threads without context-switching problems).

I am trying to analyze this using VTune but am still not sure how to study the issue.

1. VTune's Locks and Waits analysis doesn't work on KNC (MIC), so I don't know how to determine whether locks are the issue (see the sketch after this list).

2. Bandwidth? As more threads are spawned, if they use a lot of shared data, cache coherence traffic can eat up bandwidth; this can be studied with VTune's core/uncore bandwidth measurements.
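
For the locks question, the best fallback I can think of is to time the lock waits by hand, something like this sketch (OpenMP assumed; the per-thread bookkeeping is my own instrumentation, not a VTune feature):

[cpp]
// Sketch: accumulate, per thread, the time spent blocked on a lock.
// Caveat: adjacent wait_time[] entries share cache lines; pad them in real use.
#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 256   // assumed upper bound on OMP threads

int main(void)
{
    omp_lock_t lock;
    double wait_time[MAX_THREADS] = {0};
    int nthreads = 1;
    omp_init_lock(&lock);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();
        for (int i = 0; i < 1000; i++) {
            double t0 = omp_get_wtime();
            omp_set_lock(&lock);                     // blocks while contended
            wait_time[tid] += omp_get_wtime() - t0;  // time spent waiting
            /* ... critical-section work would go here ... */
            omp_unset_lock(&lock);
        }
    }

    for (int t = 0; t < nthreads; t++)
        printf("thread %3d waited %10.6f s\n", t, wait_time[t]);
    omp_destroy_lock(&lock);
    return 0;
}
[/cpp]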

I am not sure what else might contribute to the poor scaling. I would appreciate your suggestions on this study.

Thank you.

 

 

jimdempseyatthecove
Honored Contributor III

CPUID is a processor instruction that returns information about the processor. You pass one argument in EAX and, for some calls, a second argument in ECX (use the 32-bit registers in 64-bit mode too). The instruction returns one to four values in EAX, EBX, ECX, and EDX. The compiler has an intrinsic for the one-input-argument call, __cpuid, but not for the two-input call, typically called __cpuidex when using the MS compiler.

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-2a-manual.pdf

The above link has information on CPUID (though there are other sources). Your OS may provide a different API that gives you the mapping. Note that the calls I mentioned bring back information in bit fields; you can write an overlay struct to clarify what is being extracted, and/or place a shell function around the call to return one piece of information.
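
As a hedged illustration of the shell-function idea, here is a minimal sketch using the MS-style intrinsics, querying leaf 0x0B (the x2APIC topology leaf documented in the SDM linked above) for the calling thread's x2APIC ID:

[cpp]
// Minimal sketch (MS-style intrinsics from <intrin.h>; assumes an Intel CPU
// that supports CPUID leaf 0x0B). Wraps the two-input call in a shell
// function that returns one piece of information: the x2APIC ID.
#include <intrin.h>
#include <stdio.h>

unsigned int GetX2ApicId(void)
{
    int regs[4];                   // EAX, EBX, ECX, EDX
    __cpuidex(regs, 0x0B, 0);      // leaf 0x0B, sub-leaf 0 (SMT level)
    return (unsigned int)regs[3];  // EDX = x2APIC ID of this logical processor
}

int main(void)
{
    printf("x2APIC ID = %u\n", GetX2ApicId());
    return 0;
}
[/cpp]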

Also note that AMD is partly compatible and partly incompatible with the placement of this information, so you may need to write two variants of the extraction methods.

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

I hope you read Tim's comment carefully. Scaling is not equivalent to raw system performance.

A program using scalar instructions (single float per floating point operation) can show better scaling than a program using small vector instructions (8 floats per floating point operation with AVX). Yet the throughput of the latter is better.

In an automobile race, it is the car that finishes first that wins, not the one with the highest engine RPM. Though there are different CPU races, e.g., FLOPS/watt.

Jim Dempsey

Surya_Narayanan_N_

 

TimP (Intel) wrote:

Vectorization typically cuts into threaded scaling, as it usually boosts the performance of a single thread more than that of multiple threads.

 

I take your statement to mean that vectorization is something to do with the microarchitecture and not the application. But I have a small doubt: can vectorization make any difference in cache behavior, and will it have any impact on cache coherence? I checked my benchmarks with -vec-report, and most of the loops are vectorized, some partially.

 

TimP (Intel) wrote:

Evident reasons are that bandwidth and cache capacity saturation occur sooner, and in some cases the granularity of the workload is reduced.

 

Yes, now I feel that I should look at bandwidth and cache capacity saturation. I am thinking of using VTune's core-based bandwidth analysis to collect the L2 events and compute bandwidth, though I am not sure whether I can get the time-varying, per-thread behaviour on the command line (the GUI reads the binary result file and displays that information).

Another doubt of mine: the Phi has a distributed cache controller (tag directory) for all the L2 caches. So does this aggregate L2 behave like an LLC?

TimP (Intel) wrote:

If the only goal is to increase the ratio of multi-threaded performance to single thread, an obvious method is to minimize the performance and memory consumption of each thread.  Intel had internal goals along this line when Hyperthreading and multi-core were first introduced.

 

 

 

TimP
Honored Contributor III

Stride-1, aligned vectorization on MIC reads entire cache lines, so it tends to promote effective cache use in some situations.  I don't entirely understand your question.

The L2 caches of all the KNC cores communicate over a ring, which is somewhat like a partial LLC, but evidently without any ability to cache an extra copy of the data, and with no extended-size cache.  Obviously, I'm not using technically correct terminology.

VTune general events report, when L2 is missed on a core, whether the miss is served from memory or from another core's L2.  It's mostly left to the user to interpret such results.
 

Surya_Narayanan_N_

jimdempseyatthecove wrote:

A program using scalar instructions (single float per floating point operation) can show better scaling than a program using small vector instructions (8 floats per floating point operation with AVX). Yet the throughput of the latter is better.

Yes, what you said makes sense. But on the Xeon Phi the computing elements are not the bottleneck for scaling, since we still have lots of cores. So my question: is it really certain that vectorized code (which scales poorly) outperforms scalar code (which scales better)? If bandwidth or cache is the bottleneck for vector scaling, are we ignoring the other route (i.e., trying to utilize more cores while reducing bandwidth/communication pressure) to push performance, or is that practically infeasible?

McCalpinJohn
Honored Contributor III

For unit-stride data accesses, vectorized code generates much better sustained memory bandwidth than non-vectorized code, and non-vectorized code shows significantly better sustained memory bandwidth than x87 code.  So it is quite common for vector code to have poorer scaling ratios but better absolute performance than non-vectorized code.

In the training exercise from the Xeon Phi tutorial that I recently helped teach at SC13, updating a 512x512x512 array with the product of three 512-element vectors (i.e., M += x1*x2*x3) delivered about 67 GB/s from memory using vectorized code, 20 GB/s using non-vectorized code, and under 8 GB/s using x87 code.
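
For reference, my reading of that exercise's kernel is roughly the following (a sketch only; the tutorial's actual loop structure and tuning may differ):

[cpp]
// Sketch of the exercise kernel as described above:
// M(i,j,k) += x1(i)*x2(j)*x3(k) over a 512^3 float array (indexing assumed).
#include <stdio.h>
#include <stdlib.h>
#define N 512

void update(float *M, const float *x1, const float *x2, const float *x3)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float s = x1[i] * x2[j];
            float *row = &M[((long)i * N + j) * N];
            for (int k = 0; k < N; k++)   // unit stride in k: vectorizes cleanly
                row[k] += s * x3[k];
        }
}

int main(void)
{
    // ~512 MB array, as in the exercise described above
    float *M = (float *)calloc((long)N * N * N, sizeof(float));
    float x1[N], x2[N], x3[N];
    for (int i = 0; i < N; i++) x1[i] = x2[i] = x3[i] = 1.0f;
    update(M, x1, x2, x3);
    printf("M[0] = %f\n", M[0]);
    free(M);
    return 0;
}
[/cpp]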

It may seem non-intuitive, but vector code allows the core to generate more outstanding cache misses than non-vector code, and this concurrency is what controls the sustained memory bandwidth.
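
To make that concurrency argument concrete with a back-of-the-envelope number (Little's Law; the latency and miss count below are illustrative assumptions, not measurements):

[cpp]
// Back-of-the-envelope Little's Law: BW = outstanding misses * line size / latency.
// Both input numbers are illustrative assumptions, not measurements.
#include <stdio.h>

int main(void)
{
    const double line_bytes  = 64.0;    // cache line size
    const double latency_s   = 300e-9;  // assumed ~300 ns memory latency
    const double outstanding = 8.0;     // assumed misses in flight per core
    double bw = outstanding * line_bytes / latency_s;        // bytes/second
    printf("per-core sustained BW ~ %.2f GB/s\n", bw / 1e9); // ~1.71 GB/s
    return 0;
}
[/cpp]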

jimdempseyatthecove
Honored Contributor III

On a single Xeon Phi you have a limited number of cores (~60). A scalar implementation of code that scales better than a vectorized implementation may never catch up to, let alone surpass, the vectorized code. This condition is typically observed. Here is a sketch of what is possibly skewing the results data to fool you into thinking otherwise.

Scalar code processing streamable data, consisting of floats:

fetch(0)/miss/op, fetch(1)/hit/op, fetch(2)/hit/op, fetch(3)/hit/op, fetch(4)/hit/op, fetch(5)/hit/op, fetch(6)/hit/op, fetch(7)/hit/op, fetch(8)/hit/op, fetch(9)/hit/op, fetch(10)/hit/op, fetch(11)/hit/op, fetch(12)/hit/op, fetch(13)/hit/op, fetch(14)/hit/op, fetch(15)/hit/op, fetch(16)/miss/op, fetch(17)/hit/op, ...

IOW you observe a 15:16 L1 cache hit ratio (93.75%), and you observe the same on the other 59 (or 60) cores. Additionally, the memory fetches of some cores can occur during the L1-hit periods of the other cores, thus contributing to the better scalability of the scalar implementation.

The (well-written) vector code, consuming 16 floats at a time, looks like this:

fetch(0:15)/miss/op, fetch(16:31)/miss/op, fetch(32:47)/miss/op, ...

IOW you observe a 0:16 L1 cache hit ratio (0%), and the same on the other 59 (or 60) cores. However, note that the throughput of the vector code is higher than that of the scalar code. Further, on large problems the limiting factor may be memory bandwidth. However, with proper coding strategies one can increase the L1, L2, and/or neighboring-L2 cache hit ratios, thus improving throughput (see the sketch below).
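
A generic sketch of what such a blocking strategy looks like (the tile size is an assumed, untuned value):

[cpp]
// Generic blocking sketch: make multiple passes over a tile small enough to
// stay in L1, so each cache line is reused before eviction.
// TILE = 4096 floats = 16 KB is an assumed, untuned size.
#include <stdio.h>
#include <stdlib.h>
#define TILE 4096L

void two_pass_tiled(float *a, const float *b, long n, float s)
{
    #pragma omp parallel for
    for (long t = 0; t < n; t += TILE) {
        long end = (t + TILE < n) ? t + TILE : n;
        for (long i = t; i < end; i++)   // pass 1 pulls the tile into L1
            a[i] = s * b[i];
        for (long i = t; i < end; i++)   // pass 2 reuses it while resident
            a[i] = a[i] * a[i] + 1.0f;
    }
}

int main(void)
{
    long n = 1L << 20;
    float *a = (float *)malloc(sizeof(float) * n);
    float *b = (float *)malloc(sizeof(float) * n);
    for (long i = 0; i < n; i++) b[i] = 1.0f;
    two_pass_tiled(a, b, n, 2.0f);
    printf("a[0] = %f\n", a[0]);   // (2*1)^2 + 1 = 5
    free(a); free(b);
    return 0;
}
[/cpp]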

Comparing the two cases, one can choose whichever statistics tout the results one wants.

Also note that once the problem requires multiple Xeon Phis, you have a whole different set of issues with respect to scaling as well as system throughput.

For someone using the code, throughput (on their system) is the only important metric. (Well, for some it may be throughput/watt.)

An additional note w/rt the Xeon Phi: as you use the coprocessor, and code well, you will find that for some (many) types of problems, using three of the four hardware threads per core yields better throughput. Now then, if you were to compute scaling, would you include or exclude the unutilized threads?

[cpp]
REAL=float, array size 512x512x512

program   GFlops  Throughput  x base  scale/core  scale/tu  scale/ta
                    (GB/s)
base       0.243    0.22        1.00     1           1         1
omp       10.637    9.82       43.86     0.73        0.18      0.18
ompvect   78.816   72.75      325.00     5.42        1.81      1.35
peel      82.584   76.23      340.54     5.68        1.89      1.49
tiled    105.521   97.40      435.12     7.25        2.42      1.81
tiledHT  119.264  110.09      491.80     8.20        2.73      2.05
tiledHT2 128.614  118.72      530.35     8.84        2.95      2.21
[/cpp]

Assuming the above tabulates correctly:
base is single-thread scalar
omp is 240 SW threads (all HW threads)
ompvect is 180 SW threads, 3 per core
peel is a further-optimized version of ompvect, 180 threads, 3 per core
tiled is peel with tiling (blocking) to improve L1 hit ratios, 180 threads, 3 per core
tiledHT is 3 threads per core, all 60 cores, ganged to improve L1 hit ratios
tiledHT2 is similar to tiledHT with additional tricks to improve L1 hit ratios

*** I am not claiming these scaling factors, because they are (IMHO) erroneously based on the single-thread scalar version.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

scale/tu == scaling for threads used
scale/ta == scaling for threads available

A proper (IMHO) way to factor scaling is to fully optimize the code, including vectorization, peeling, blocking, and aligning, for the single-thread base, and then compare the effects as you add cores, producing values for 1, 2, 3, and 4 threads per core. In addition to scaling factors, also include throughput.
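
A sketch of such a measurement loop (the triad kernel here is only a placeholder; substitute your fully optimized kernel):

[cpp]
// Sketch: hold the fully optimized kernel fixed and vary the thread count.
// The triad kernel below is a stand-in for your real, optimized kernel.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)

static void kernel(float *a, const float *b, const float *c)
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0f * c[i];
}

int main(void)
{
    float *a = (float *)malloc(sizeof(float) * N);
    float *b = (float *)malloc(sizeof(float) * N);
    float *c = (float *)malloc(sizeof(float) * N);
    int counts[] = {1, 2, 4, 8, 16, 30, 60, 120, 180, 240};
    double t1 = 0.0;

    for (long i = 0; i < N; i++) { b[i] = 1.0f; c[i] = 2.0f; }

    for (int j = 0; j < 10; j++) {
        omp_set_num_threads(counts[j]);
        double t0 = omp_get_wtime();
        kernel(a, b, c);
        double dt = omp_get_wtime() - t0;
        if (counts[j] == 1) t1 = dt;   // single-thread baseline
        printf("threads=%3d  time=%8.4f s  scaling=%6.2f\n",
               counts[j], dt, t1 / dt);
    }
    free(a); free(b); free(c);
    return 0;
}
[/cpp]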

Jim Dempsey

Surya_Narayanan_N_

@Jim and John: I appreciate the time you took to explain things so clearly. I also did some experiments with my benchmarks to see the effect of vectorization on scalability and performance, and I could relate your comments to my results very well.

INFO:

I compiled in the following modes:

1. -O3 -no-vec
2. -O3 (vectorized)
3. -O0

There was not much performance or scalability difference between 1 and 2, but 3 proved to be totally different: the programs which were only scaling to 32 or 64 threads scaled to 192 and 234 threads, yet peak performance was still only around 75-80%.

My main intention behind all these experiments was to find out whether bandwidth is really the bottleneck that saturates scaling and performance, rather than synchronization overhead. I think I can now conclude that these benchmarks do have higher scaling potential and that bandwidth is not the real bottleneck (ignoring the peak performance for the moment).

I also wish to know whether there has been any study on predicting bandwidth. I am not sure where to start a bandwidth study for the Xeon Phi, and I would like to seek your advice on this.

Surya_Narayanan_N_

*predicting bandwidth utilization as the number of threads is scaled for a particular application.

jimdempseyatthecove
Honored Contributor III

You may be able to get the metrics you want (predicted bandwidth utilization) using VTune. What you want to look at is the ratio of instruction cycles versus memory cycles; this may indirectly tell you the memory bandwidth. Please note that in your -O0 test with "higher scalability", you are inserting more non-memory-cycle code into your program (most of the additional code will hit cache, and all of it is unproductive except as diagnostics).

The technique to improve scalability, then, is to reduce the number of memory cycles needed to process the data, as opposed to increasing the number of non-memory cycles (which merely inflates the scalability figures).

Jim Dempsey

McCalpinJohn
Honored Contributor III

For relatively simple applications it is often possible to predict the required memory *traffic* by inspection and analysis.    One example is available at http://www.cs.virginia.edu/stream/Analyses/SWIM-BW.html
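
For instance, for a STREAM-triad-like kernel the traffic prediction is just counting loads and stores per iteration (a sketch; the array size and the write-allocate assumption are illustrative):

[cpp]
// Sketch: predicted memory traffic for one sweep of a STREAM-triad-like
// kernel a[i] = b[i] + s*c[i], double precision, write-allocate caches.
#include <stdio.h>

int main(void)
{
    const double N = 20e6;   // elements per array (assumed)
    // read b, read c, write-allocate read of a, store of a = 4 transfers/element
    const double bytes = N * 8.0 * 4.0;
    printf("predicted traffic = %.2f GB per sweep\n", bytes / 1e9);
    return 0;
}
[/cpp]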

Predicting the resulting memory *bandwidth* is much harder, since that requires understanding the interactions among all the components involved in data motion through the memory hierarchy along with the components involved with maintaining cache coherence on that data. 

The Xeon Phi is only moderately complex in its core design, but the extended cache coherence protocol creates new interactions and the sheer number of interacting components leads to behavior that is quite difficult to understand in detail, and therefore even harder to predict in advance.

I have found that the best performance for memory-bandwidth-limited codes comes when you run on as many cores as possible while not generating more than 128 independent read streams (which is the effective number of DRAM banks on the Xeon Phi SE10P).   This is a relatively big topic, so I will put more details on my blog.

Surya_Narayanan_N_

@John

I would like to correlate or verify the VTune bandwidth analysis against the bandwidth reported by your STREAM benchmark. I tried doing so by collecting the HW events and applying the formula shown in

http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding

I really find the numbers misleading. According to the formula I get a STREAM process bandwidth of around 4 GB/s, and the bandwidth decreases with the number of threads. I collect the samples with SA:100000.

On the other hand, according to the STREAM benchmark, Copy has a sustained bandwidth of 10 GB/s for 2 threads, 20 GB/s for 4 threads, and almost 140 GB/s for 128 threads.
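
The calculation I am applying has roughly this shape, as I understand that article (event names are from the KNC event list as I read it; all counts below are made-up placeholders, not my measurements):

[cpp]
// Shape of the bandwidth calculation from the article (as I understand it):
// each counted L2 fill or hardware-prefetch miss moves one 64-byte line.
// All counts below are placeholders, not actual measurements.
#include <stdio.h>

int main(void)
{
    double l2_read_fill  = 1.5e9;  // L2_DATA_READ_MISS_MEM_FILL  (placeholder)
    double l2_write_fill = 0.5e9;  // L2_DATA_WRITE_MISS_MEM_FILL (placeholder)
    double hwp_l2miss    = 0.8e9;  // HWP_L2MISS                  (placeholder)
    double elapsed_s     = 2.0;    // wall-clock time of the region

    double bw = (l2_read_fill + l2_write_fill + hwp_l2miss) * 64.0 / elapsed_s;
    printf("estimated bandwidth = %.2f GB/s\n", bw / 1e9);
    return 0;
}
[/cpp]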

 
