I am comparing different shared-memory programming models on the Xeon Phi, and the results of the comparison are somewhat hard to justify.
I am running the same computation (say, matrix multiplication) with each model, and ..
1- For the approaches in Group 1, I get a smaller CPI rate, but a somewhat larger elapsed time.
2- For the programming models in Group 2, I get a higher CPI rate, but a slightly better elapsed time.
When I looked into the details, I realised that the number of instructions executed is another big difference. So I interpret the results like this:
The Group 1 models try to utilise the caches. To do that they execute more instructions, and as a result the average CPI rate improves a bit, but this sometimes results in a larger elapsed time as well.
The Group 2 models, which execute fewer instructions, have a higher CPI rate (possibly as a result of more cache misses), but can sometimes run the program faster than (or at least as fast as) the Group 1 approaches.
Do you think this is a valid conclusion? What else can cause a higher number of instructions executed?
Thanks in advance!
Hmm... I was asked the same question before. Please note that if the instruction counts are different, it is meaningless to compare CPI values. For example, Group 1 may have a low CPI but a high instruction count, so its elapsed time is large, while the code in Group 2 might be better optimized - e.g. a loop compiled down to fewer instructions (e.g. array processing), or some FP operations compiled to SSE or AVX instructions. Then the instruction count is reduced but the CPI (avg. cycles/instructions retired) is high. In that case Group 2 is not a bad result...
So compare CPI values only for results with (almost) the same instruction counts... then investigate.
@ Ashkan T
My recommended steps are:
1) Compare CPU cycles on hot functions or runtime libraries from the two approaches. Using different systems may impact performance... so start by working on one system.
2) Observe cycles and instructions retired for each approach; if the instruction counts are similar but the CPI differs, use the CPI value to decide which approach is better.
3) Usually the CPI of an SSE/AVX approach is high; it doesn't imply that performance is low... look at CPU cycles.
4) Investigate the results of (2) & (3) to see whether there is any opportunity to optimize the code at the microarchitecture level - for example, the memory layout, to best utilize the cache, etc.
Hope it helps.
Many Thanks Peter!
Just some clarifications:
1) I meant different runtime systems (e.g. OpenMP vs Cilk Plus). I do all the experiments on one system.
2), 3) & 4): I will. Thanks! Although I am not sure whether it is possible to compare them function by function on the Xeon Phi - VTune does not support that, right? In that case I will only look at the total numbers (but again, some differences could be due to the different behaviours of the runtime systems).
For repeatable, meaningful results with OpenMP you will set affinity, an option you don't have with Cilk(tm) Plus. I don't believe there are currently any matrix libraries using the Cilk threading model. It's very difficult to approach the performance of the libraries with your own source code if the matrices are big enough to make MIC interesting. The professionally built libraries use techniques for improved cache locality and reduced memory access and prefetch requirements, but may also spend a significant number of instructions shuffling data.
If you're not seeing results by function in VTune, it may be because the source and binary search paths are not set in the VTune analyzer, or because you didn't build with -g.
I was thinking of the host when suggesting no software prefetch. On MIC, properly implemented prefetch may help performance more than it increases the instruction count. Introducing extra ineffective prefetches is a way of increasing the instruction count, so it is a way of reducing CPI without improving performance.
@ Ashkan T
It is hard to compare them function by function, since they are implemented in different libraries (for example, OpenMP*, Cilk++, TBB or native pthreads, etc.). The idea is to view the top-down report, which shows the call sequence with performance data; there you may observe the Total CPU time of each node. Unfortunately the "knc-hotspots with call stack collection" feature for Xeon Phi is not ready in Update 17.
Currently you can only compare performance at the module level.
Thanks for your answers.
*You are right. We do not plan to beat the performance of the famous libraries; we are just trying to observe the behaviour of the different models.
*Thanks for the information on prefetching. But as I said, it is a naive multiplication code. Probably some runtime systems do something inefficient in the background.
You are absolutely right. Maybe I have to try them on some other machines to see whether I can find the reason, since "call stack collection" still does not support the Xeon Phi.