I would appreciate some help with the following. I am running a General Exploration analysis on a small test case using 24 cores on Linux. In the Hardware Issues view I see CPI above 1, and quite high on some particular threads (2.57, etc.), yet LLC, Contested Access, Branch Prediction and Data Sharing all show 0. Since each thread works on an array whose size is the number of threads, I would actually expect to see data sharing. Instead I see nothing, so there is no explanation for why CPI is high. My run is quite small. Is it possible that the run does not reach a hardware counter threshold, so nothing is reported? Is it possible for me to adjust this?
> ...I see CPI above 1, and quite high on some particular threads (2.57, etc.), yet LLC, Contested Access, Branch Prediction and Data Sharing all show 0.
I can think of two possible reasons:
1. Did you use SSE/AVX instructions? Vectorized code often shows a CPI above 1, since each SIMD instruction does more work.
2. Did you have I/O waits, or threads stalling or being suspended? You can use the Locks and Waits analysis to inspect this.
If the result directory is not sensitive, could you share it?
As Peter hinted, a CPI like the one you quoted is normal for efficiently vectorized code. It doesn't make sense to sacrifice SIMD performance, or to emphasize spin-wait loops, for the sake of a lower CPI.
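To illustrate why a higher CPI does not by itself mean slower code, here is a worked example with made-up numbers. The instruction counts and CPI values below are assumptions chosen for illustration, not measurements from any real run:

```latex
\begin{align*}
\text{cycles} &= \text{CPI} \times \text{instructions} \\
\text{scalar loop:} \quad 0.5 \times 4000 &= 2000 \text{ cycles} \\
\text{8-wide SIMD loop:} \quad 2.0 \times 500 &= 1000 \text{ cycles}
\end{align*}
```

In this sketch the vectorized loop has four times the CPI but retires one eighth of the instructions, so it finishes in half the cycles.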
The type of cache sharing you want to avoid is the one where write misses hit in the cache of another core, which is characteristic of false sharing. You would also want to keep threads local to one CPU as much as possible, to avoid duplicating cache lines on both CPUs. In many cases, ordinary use of thread affinity is sufficient to keep these from causing difficulties.
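If false sharing does turn out to be a factor for a per-thread array like the one described in the question, one common remedy is to pad each thread's slot to its own cache line. The following is only a minimal sketch: the names `PaddedSlot` and `run_counters` are hypothetical, the 64-byte cache line size is an assumption (typical of current x86 CPUs), and it assumes C++17 so that `std::vector` honors the over-aligned type. Whether it helps in your application would need to be confirmed by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, typical of current x86 CPUs.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread slot. alignas rounds sizeof(PaddedSlot) up to
// kCacheLine, so neighbouring slots never share a cache line.
struct alignas(kCacheLine) PaddedSlot {
    long value = 0;
};

// Each thread updates only its own slot; returns the sum of all slots.
long run_counters(unsigned num_threads, long iterations) {
    assert(num_threads > 0 && iterations >= 0);
    std::vector<PaddedSlot> slots(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&slots, t, iterations] {
            for (long i = 0; i < iterations; ++i) {
                slots[t].value += 1;  // touches only this thread's line
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    long total = 0;
    for (const auto& s : slots) {
        total += s.value;
    }
    return total;
}
```

Replacing `PaddedSlot` with a plain `long` would pack up to eight slots into one cache line, which is the layout where writes from different cores would be expected to contend.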