Software Archive
Read-only legacy content

How to find the reason for poor scalability?

Surya_Narayanan_N_
4,755 Views

Hello,

     I am running some multithreaded benchmark programs on the MIC. Some programs don't scale beyond 32 threads, and some don't scale beyond 64. I am trying to find out why they are not scalable beyond a certain number of threads. The poor scaling is definitely not a result of a lack of computing resources (i.e., we can run 244 HW threads without the problem of context switching).

I am trying to analyze this using VTune but am still not sure how to study this issue.

1. VTune Locks and Waits analysis doesn't work on KNC (MIC), so I don't know how to find out whether locks are the issue.

2. Bandwidth? As more threads are spawned, if they use a lot of shared data there can be cache-coherence traffic eating up the bandwidth, which can be studied using the core/uncore bandwidth measurements in VTune.

I am not sure what else might contribute to the poor scaling. I would welcome your suggestions for this study.

Thank you.

 

 

0 Kudos
33 Replies
jimdempseyatthecove
Honored Contributor III
3,567 Views

While you have 244 hardware threads, these run on 61 cores (depending on the Phi model, this may be 240 threads on 60 cores).

For a simple test, try setting KMP_AFFINITY=scatter.

Then, for thread counts in the range 1-61, you will be using one HT per core.

Note, one HT per core is not optimal for Xeon Phi; however, give the above a try.

Next try

KMP_AFFINITY=scatter
KMP_PLACE_THREADS=61c,2t,0o

This will use two threads per core. Also experiment with 3t, as well as with KMP_AFFINITY=compact.

Depending on how you partner your threads up, you can improve cache hit ratios.

Jim Dempsey

0 Kudos
TimP
Honored Contributor III
3,567 Views

As already mentioned, pinning is usually required to see effective performance scaling beyond one thread per available core, particularly when there is data sharing between threads.  KMP_PLACE_THREADS automatically defaults num_threads to match the designated resources; I typically get best performance with KMP_PLACE_THREADS=58C,3t  KMP_AFFINITY=compact

leaving at least 1 core to the MPSS.

If you don't use pinning, omp teams offers the possibility of grouping threads together on individual cores:

#pragma omp teams num_teams(60) thread_limit(4)

When you allude to cache coherence issues among cores, the main issue would be false sharing, where a cache line is updated on one core and then read on another.  Such problems ought to show up in VTune KNC general analysis as L2 write cache miss cache fill.  But that might be expected to show up as soon as you run multiple cores.

0 Kudos
robert-reed
Valued Contributor II
3,567 Views

Do your threads employ a lot of shared data?  Does your code contain critical sections and other synchronizers for handling shared data?  Critical sections impose serialization, which can be deadly on the coprocessor.  VTune Amplifier XE does not support Locks and Waits analysis on the coprocessor, as you mention, but you should be able to perform the same analysis on the same code on the host side, where such a collector is available.  There you can diagnose potential locking problems and get a sense, at least for the smaller thread counts, of whether there is the potential for more serious problems as the thread count rises.  Or you could try Intel Advisor on your code, which has a projection mode that, used in combination with data collected from trials of your code, can give a scalability estimate for expanding to the number of available HW threads on the Intel Xeon Phi coprocessor.  Does your code employ offloading (which might seem to make it harder to run just on the host)?  Then compile your offload code with the -no-offload compiler switch, which will turn those offload directives into no-ops so that you can do a completely host-side analysis for locks and waits.

And you could share more of what your code is doing, which might trigger insights from this community about lessons learned in similar code.

0 Kudos
Surya_Narayanan_N_
3,567 Views

Sorry for the long delay; I had some filter issues and was not able to post here in the forum.

Hello Jim,

  I think KMP_AFFINITY works only for programs that use OpenMP. Most of my programs use Pthreads, so in this case is pthread_setaffinity_np() the only solution?

If I understand it right, the bottleneck here is the bandwidth, which can be caused by a higher L2 cache miss rate per core and by data locality. I have one more question: can synchronization overhead be overlapped with the bandwidth issue? (I mean, can a bandwidth study subsume the synchronization-overhead study, or should both be considered separate components of scaling?)

0 Kudos
Surya_Narayanan_N_
3,567 Views

@reed and tim

I am doing a coarse-level study without finding out what each benchmark does. They are basically the PARSEC benchmarks (regular benchmarks with ordinary data structures) and the Lonestar benchmarks (irregular benchmarks with pointer-based algorithms that use graph- or tree-based data structures).

 

robert-reed (Intel) wrote:

Do your threads employ a lot of shared data?  Does your code contain critical sections and other synchronizers for handling shared data?  Critical sections impose serialization, which can be deadly on the coprocessor.

Any shared data structures in these benchmarks will create a lot of data transfer, and bandwidth will become the bottleneck (and not the processing cores) as we increase the number of threads. But can that be the only reason for poor scaling on Xeon Phi?

Should I consider synchronization overhead and the bandwidth issue separately, or can a bandwidth study from the core (with the bandwidth formula given in the Xeon Phi book or tutorials) reveal the synchronization effect?

robert-reed (Intel) wrote:

 VTune Amplifier XE does not support Locks and Waits analysis on the coprocessor, as you mention, but you should be able to perform the same analysis on the same code on the host side, where such a collector is available, and you can diagnose potential locking problems there and get a sense, at least for the smaller thread counts, of whether there is the potential for more serious problems as the thread count rises.

The above benchmarks scale up to 32 threads on Xeon Phi, from what I have seen. The host has multi-socket processors with an LLC, but Phi is totally different. Do you think that the Locks and Waits (algorithmic) study with small thread counts on the host will show a similar trend (the same effect) when run on Xeon Phi?

robert-reed (Intel) wrote:

 Or you could try Intel Advisor on your code, which has a projection mode that, used in combination with the data collection from trials with your code, can give a scalability estimate for expanding to the number of available HW threads on the Intel Xeon Phi coprocessor.  

I would like to have more information on this. I will try to download the software and check it.

robert-reed (Intel) wrote:

 Does your code employ offloading (which might seem to make it harder to run just on the host)?  Then compile your offload code using the -no-offload compiler switch, which will turn those offload directives into noops so that you can do a completely host-side analysis for locks and waits.  

 

No, I am not using offload at all. I work in native mode, i.e., on the coprocessor.

 

 

 

 

0 Kudos
TimP
Honored Contributor III
3,567 Views

Surya Narayanan N. wrote:

  I think KMP_AFFINITY works only for programs that use OpenMP. Most of my programs use Pthreads, so in this case is pthread_setaffinity_np() the only solution?

If I understand it right, the bottleneck here is the bandwidth, which can be caused by a higher L2 cache miss rate per core and by data locality. I have one more question: can synchronization overhead be overlapped with the bandwidth issue? (I mean, can a bandwidth study subsume the synchronization-overhead study, or should both be considered separate components of scaling?)

It should be possible to have OpenMP capture the threads you start up with pthreads and thus apply the OpenMP affinity machinery (which is implemented on pthreads).  If you prefer the low-level way, you may be re-inventing things by dealing directly with pthread_setaffinity_np.

If your application is limited by memory bandwidth, performance will peak somewhere below the point where you are using 4 threads on nearly all cores.  You would want to check that software prefetch is working effectively.

0 Kudos
Surya_Narayanan_N_
3,567 Views

TimP (Intel) wrote:

 

It should be possible to have OpenMP capture the threads you start up with pthreads and thus apply the OpenMP affinity machinery (which is implemented on pthreads).  If you prefer the low-level way, you may be re-inventing things by dealing directly with pthread_setaffinity_np.

 

I don't understand how to do this. Should I build a wrapper, or add an omp_get_num_threads() call in every benchmark program, and then use the KMP_AFFINITY environment variable to control the thread affinity in the program?

0 Kudos
TimP
Honored Contributor III
3,567 Views

Yes. Adding the -openmp compile and link option and the omp.h header, and calling omp_get_num_threads(), should enable the OpenMP affinity controls (KMP_AFFINITY, KMP_PLACE_THREADS, OMP_PROC_BIND, OMP_PLACES) when the OpenMP runtime is using the same pthreads library.

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,567 Views

RE: pthread_setaffinity_np()

However, you need to do more than this.

__cpuid is an intrinsic available in the Intel C++ compiler (Linux and Windows) and in the MS compiler (on Windows). __cpuidEX has the same format as __cpuid except for the additional func_c argument.

Add code to use __cpuid, and add code to provide __cpuidEX:
[cpp]
inline void __cpuidEX(int cpuinfo[4], int func_a, int func_c)
{
    int eax, ebx, ecx, edx;
    __asm__ __volatile__ ("cpuid"
        : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
        : "a" (func_a), "c" (func_c));
    cpuinfo[0] = eax;
    cpuinfo[1] = ebx;
    cpuinfo[2] = ecx;
    cpuinfo[3] = edx;
}
[/cpp]

Have an init function that uses:

[cpp]
__int32 cpuinfo[4];
__cpuid(cpuinfo, 1);                  // CPUID leaf 1: feature information
int MaxLogicalProcessorsInPackage = (cpuinfo[1] >> 16) & 0xFF;
__cpuidEX(cpuinfo, 4, 0);             // CPUID leaf 4: cache parameters
if ((cpuinfo[0] & 15) != 1)           // cache type 1 == data cache
    __cpuidEX(cpuinfo, 4, 1);
// We should now have the 1st data cache (L1).
// Note: you can expand the code if you want sanity checks.
int MaxThreadsInCache = ((cpuinfo[0] >> 14) & 0xFFF) + 1;
[/cpp]

Then have a loop with pthread_setaffinity_np() to test each (O/S) logical processor using:

[cpp]
__cpuid(cpuinfo, 1);                  // CPUID leaf 1: APIC ID in EBX[31:24]
yourLogicalProcessorInfo.APICid = (cpuinfo[1] >> 24) & 0xFF;
yourLogicalProcessorInfo.Core = yourLogicalProcessorInfo.APICid / MaxThreadsInCache;
yourLogicalProcessorInfo.HT = yourLogicalProcessorInfo.APICid % MaxThreadsInCache;
[/cpp]

You will have to declare a struct for yourLogicalProcessorInfo.

With this information at hand, you can now use the core number and HT number within the core, to aid you in thread placement.

For Xeon Phi, the general rule of thumb is to use at least 2 HTs from each core (roughly 2x more work done); 3 HTs is often better, though not necessarily another 1x; and 4 HTs is most often slower than 3 HTs, except when there is a large degree of last-level cache miss.

Jim Dempsey

0 Kudos
Surya_Narayanan_N_
3,567 Views

I want to confirm this: /proc/cpuinfo shows 240 threads and 60 cores. Is the OS/data transfer/housekeeping running on any of these?

Can I use all these 60 cores (i.e., core IDs 0:59) for my application threads?

0 Kudos
TimP
Honored Contributor III
3,567 Views

Core 0 (thread IDs 0, 237-239 on a 60-core KNC) will have MPSS activities running, and many applications will do better to avoid that core.  Host data transfer is specialized to core 0.

0 Kudos
Surya_Narayanan_N_
3,567 Views

Sorry for such elementary doubts, but certain things confuse me about KNC.

TimP (Intel) wrote:

Core 0 (thread IDs 0,237-239 on a 60-core KNC)

Are these thread IDs the same as the apicid numbers in /proc/cpuinfo?

If so, what do the other 4 logical cores, i.e., 241-244, which are not mentioned in /proc/cpuinfo, do?

So out of 61 cores only 59 are for the application? I use htop to see where the application threads are running. By default when I spawn threads I see 1 thread running per core till

0 Kudos
TimP
Honored Contributor III
3,567 Views

On a 61-core MIC, that would be 0,241-243 which are likely to be occupied by MPSS functions.

These differences among models certainly are annoying.  In the future 14nm MIC generation, there is expectation that a minimum of 64 cores are fully available to applications.

Some applications may be able to use some of those threads when not being profiled, but the VTune profiler puts more load on core 0.

I don't have an explanation for why many of my tests show performance saturation when using 59 or 58 cores, but this is somewhat dependent on array sizes, loop counts, etc.
 

0 Kudos
Surya_Narayanan_N_
3,567 Views

@tim

The numbers 0, 237-239 (on a 60-core machine) or 0, 241-243 (on a 61-core machine) refer to the processor field of /proc/cpuinfo, right? If yes, they belong to physical cores 59 and 60 respectively, and it doesn't seem to be core 0. The OS and MPSS run on core 59; this is why I think your tests show performance saturation when using 59 or 60 cores.

If you were referring those numbers to the apicid, then it is a bit confusing, as 0 falls in core 0 and 237-239 fall in core 59.

0 Kudos
TimP
Honored Contributor III
3,567 Views

No; they tell us that this strange numbering is a consequence of the way Linux works. Core 0 gets thread 0, but core 1 gets threads 1-4, leaving the remaining threads of core 0 to the end of the list.  Some of the OpenMP tools help you with resolution of this; others expose it.

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,567 Views

Surya,

The APIC IDs held by the processor are not necessarily assured to be all cores, nor all used. Some APIC IDs can be used by other functional units within the chip, while other APIC IDs may be reserved for expansion or for convenience of internal routing. Though in practice the first core is at the base APIC ID for the processor, you are not assured that the processor is based at APICID==0. This may be true today; tomorrow you may find a card with multiple Xeon Phi successor chips with a wider APIC ID bit field (or one in two parts).

Additionally, the O/S may currently assign its logical processor 0 to APICID 0, or, more correctly, to the lowest APIC ID of the processor, which need not be 0. And you cannot be assured that the O/S will map what you see as logical processor 0 to the lowest APIC ID of the system.

In Linux, the O/S should provide you with a bitmap of logical processors permitted to be used by your process. The O/S may restrict you to a subset of the system's processors/cores. So do not write your code with any assumptions about the logical processors or the APIC IDs - use discovery or library functions if these are available (as CPUID tables may change or differ among hardware).

TimP,

>>On a 61-core MIC, that would be 0,241-243 which are likely to be occupied by MPSS functions.

Are these Linux logical processor numbers .OR. are these APICID's? (or both)
Note, if these are APICID's then the MPSS functions are using two cores, not one.

The reason I ask is that on a 60-core MIC (5110P), using compact, omp thread_num == APICID (assuming not oversubscribed).
This would place last core at 236-239 (and first core at 0-3).

The document writers need some 'splain'n to do.

Personally, if you run in offload mode, where MPSS is used heavily, I'd prefer either:
a) it pins to the last core, or
b) it does not affinity pin

This means applications do not have to be written with "funny business" code.

Jim Dempsey

0 Kudos
McCalpinJohn
Honored Contributor III
3,567 Views

The best way to understand the core mapping is to write a trivial OpenMP program and then run it with the environment variable KMP_AFFINITY set to include the "verbose" option (in addition to the "compact", "scatter", or "balanced" thread placement option), and the variable OMP_NUM_THREADS set to whatever number of threads you think is interesting.  When you run the code the OpenMP runtime will print out the entire table of "logical processor" to "core number plus thread context" mappings, and will then print out the mapping of the OpenMP threads to "logical processors".

This prints a lot of lines, but the topology information and OpenMP thread placement information is clear and unambiguous.   I leave this enabled on most of my jobs despite the verbosity so that I can use it later to verify the placement for jobs that have anomalous performance characteristics.

The KMP_PLACE_THREADS variable can be used to limit the cores and thread contexts available to the threads of an OpenMP job, while the   KMP_AFFINITY variable controls the binding of the threads to that set of resources.  KMP_PLACE_THREADS provides essentially the same capability as the inline sched_setaffinity() call, but without requiring source code modification.  

Once you have studied the "verbose" output from "KMP_AFFINITY", you will know which physical core contains logical processor 0 (zero).  In a "native" application there is nothing to prevent you from using all the cores -- including logical processor 0.   Whether using them all is a good idea depends on how much OS activity your job generates.   My experience has been that some jobs get a speedup when going from 60 cores to 61, but when contention occurs the slowdown is much larger than potential gain. 

If I understand correctly, Intel's runtime prevents you from using the physical core containing logical processor 0 when you are running in "offload" mode.  This seems like a good idea, since the "offload" model benefits from having a processor devoted to controlling data transfers and synchronization between the host and the coprocessor.   In "native" or "symmetric MPI" modes, the user gets to choose whether to use the core containing logical processor zero.

0 Kudos
Surya_Narayanan_N_
3,567 Views

Thank you for your help. I have found a tool called likwid-pin, which can pin any pthreaded application the way you want without modifying the program. Now I find that certain benchmarks scale almost up to 64 cores, though there is no major difference between 48 and 64 cores for most of them.

My next question is will vectorization improve scaling of these benchmarks further?

@jim

I did not understand the pthread modification you mentioned some time back, with the extra code that uses all those registers. What do they signify?

0 Kudos
TimP
Honored Contributor III
3,459 Views

Vectorization typically cuts into threaded scaling, as it usually boosts the performance of a single thread more than that of multiple threads.  Evident reasons are that bandwidth and cache-capacity saturation occur sooner, and in some cases the granularity of the workload is reduced.

If the only goal is to increase the ratio of multi-threaded performance to single thread, an obvious method is to minimize the performance and memory consumption of each thread.  Intel had internal goals along this line when Hyperthreading and multi-core were first introduced.

0 Kudos