I am running some multithreaded benchmark programs on a MIC (Xeon Phi) coprocessor. Some programs don't scale beyond 32 threads and some don't scale beyond 64. I am trying to find out why they stop scaling beyond a certain number of threads. The poor scaling is definitely not due to a lack of computing resources (i.e., we can run 244 hardware threads without context switching).
I am trying to analyze this using VTune but am still not sure how to study the issue.
1. VTune's Locks and Waits analysis doesn't work on KNC (MIC), so I don't know how to find out whether locks are the issue.
2. Bandwidth? As more threads are spawned, if they use a lot of shared data, cache coherence traffic can eat up the bandwidth. This can be studied with core/uncore bandwidth measurements in VTune.
I am not sure what else might contribute to the poor scaling. I would appreciate your suggestions for this study.
I am doing a coarse-level study without looking into what each benchmark does. The benchmarks are PARSEC (regular benchmarks, which use ordinary data structures) and Lonestar (irregular benchmarks: pointer-based algorithms on graph or tree data structures).
I am trying to work out how to measure synchronization overhead on Xeon Phi. Any shared data structures in these benchmarks will create a lot of data transfer, and bandwidth (rather than the processing cores) will become the bottleneck as we increase the number of threads. But can that be the only reason for poor scaling on Xeon Phi? Should I consider synchronization overhead and the bandwidth issue separately, or can a bandwidth study from the core (with the bandwidth formula given in the Xeon Phi book or tutorials) reveal the synchronization effect?
For OpenMP codes, the usual approach to estimating synchronization overhead is to run the EPCC OpenMP benchmarks. These don't reveal the details, but they do provide a common starting point for comparison to other systems.
For Xeon Phi, understanding synchronization overhead is quite difficult because the cache-to-cache intervention latencies for cores that are "close" on the ring vary by a factor of 3 -- from about 130 cycles to almost 400 cycles -- depending on the address being used (which controls the location of the distributed tag directory used to manage the coherence transaction). The address mapping is not published, but the very low overhead of the RDTSC instruction on Xeon Phi (about 5 cycles) allows one to directly measure the latency of each load independently. E.g., for any pair of cores one can easily measure the latency for cache-to-cache interventions using a range of addresses to look for "good" ones.
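Once per-address latencies have been collected for a core pair, picking out the "good" addresses is a small post-processing step. Here is a minimal Python sketch of that step only. It assumes the RDTSC-timed loads were done separately in native code (not shown) and dumped as cycle counts per address. The function name `rank_addresses` and the `keep` parameter are made up for illustration, and using the median to damp occasional outliers (interrupts and the like) is my own choice, not something the hardware documentation prescribes.

```python
from statistics import median


def rank_addresses(samples, keep=0.25):
    """Return the lowest-latency addresses for one core pair.

    samples: dict mapping an address (int) to a list of measured load
             latencies in cycles (hypothetical input format).
    keep:    fraction of addresses to keep, at least one.
    Returns a list of (address, median_latency) sorted by latency.
    """
    # Median per address, so a few disturbed samples do not dominate.
    medians = {addr: median(vals) for addr, vals in samples.items() if vals}
    ranked = sorted(medians.items(), key=lambda item: item[1])
    n_keep = max(1, int(len(ranked) * keep))
    return ranked[:n_keep]
```

The addresses returned here would then be the ones to use for flags or locks shared between that particular pair of cores. A different pair needs its own measurement.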
Because of this variability in the coherence protocol (and just to keep the methodology as clean as possible), I recommend studying memory bandwidth and synchronization issues independently.
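Before going into either study, a coarse check on the raw timings can show whether overhead is growing with thread count at all. The sketch below is one way to do that, using the Karp-Flatt "experimentally determined serial fraction". That metric is not something mentioned earlier in this thread, it is just a standard diagnostic. The function name and the input format (thread count mapped to wall-clock seconds, including a 1-thread run) are assumptions for illustration.

```python
def scaling_table(times):
    """Compute speedup, efficiency and Karp-Flatt serial fraction.

    times: dict mapping thread count -> measured wall-clock seconds.
           Must contain an entry for 1 thread (the baseline).
    Returns a list of (threads, speedup, efficiency, serial_fraction);
    serial_fraction is None for the 1-thread row, where it is undefined.
    """
    t1 = times[1]
    rows = []
    for p in sorted(times):
        speedup = t1 / times[p]
        efficiency = speedup / p
        if p == 1:
            serial_fraction = None
        else:
            # Karp-Flatt: e = (1/S - 1/p) / (1 - 1/p)
            serial_fraction = (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)
        rows.append((p, speedup, efficiency, serial_fraction))
    return rows
```

If the serial fraction stays roughly constant as threads are added, the limit looks like a fixed serial portion of the code. If it rises with thread count, some parallel overhead is growing. The metric alone does not tell you whether that overhead is synchronization or bandwidth, so the two still need to be measured separately.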
Maybe this will be useful.
In this article, synchronization is treated as a set of elementary overhead states of the cache coherence protocol.