Sai_Sudheer_G_ · ‎02-08-2015

Hello All,

I'm trying to predict the performance of OpenMP programs on Intel Xeon Phi using machine learning techniques. I wanted to know whether any work has been done on this (I could not find any) and, if not, whether there is a tool for static analysis of the code. More specifically, I'd like a region-wise summary that indicates the number and type of operations executed in each region.

Thank you,
Sudheer

Sunny_G_Intel · ‎02-09-2015

Hello Sudheer,

Would you please elaborate on the type of analysis you intend to do? Have you tried Intel® VTune™ Amplifier for performing static analysis of your code? You can evaluate and purchase Intel® VTune™ Amplifier 2015 from this website.

Thanks

jimdempseyatthecove · ‎02-09-2015

To yield better performance than a sizeable host (2 or 4 Xeon processors), the code run inside the MIC (generally) must have two attributes:

a) it must scale well to 120 to 240 threads, .AND.
b) it must vectorize well using 512-bit wide vectors.

Unless both are true, you will not attain the full performance of the Xeon Phi.

It is difficult to predict the performance other than by detailed analysis (black art) or educated guess. On the educated guess side, use VTune as Sunny suggests to:

a) determine how much time is spent outside the parallel regions versus inside the parallel regions
b) determine the average trip time through each parallel region (i.e., how much computation there is per entry, per region)
c) estimate the effect of changing the n-way partitioning of your work on your host to 120- to 240-way partitioning on Xeon Phi.

With these statistics you can roughly approximate the performance gain (or loss). You also have to factor in the cost of getting the data into and out of the Xeon Phi.
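One rough way to combine those statistics is an Amdahl-style estimate. The sketch below is only an illustration, not a method given in this thread: the function name, the per-thread slowdown factor, the parallel efficiency, and all numbers are hypothetical, and it assumes the host's parallel regions keep every host thread busy.

```python
def estimate_phi_time(serial_s, parallel_s, host_threads, phi_threads,
                      per_thread_slowdown, parallel_efficiency, transfer_s):
    """Rough Amdahl-style estimate of run time on the coprocessor, in seconds.

    serial_s / parallel_s: host time outside / inside parallel regions.
    per_thread_slowdown: assumed ratio of one coprocessor thread's speed
        to one host thread's speed (greater than 1 means slower).
    parallel_efficiency: assumed fraction of ideal scaling actually reached.
    transfer_s: assumed time to move data to and from the coprocessor.
    """
    # Serial code runs on a single, slower coprocessor thread.
    serial_phi = serial_s * per_thread_slowdown
    # Parallel work in host thread-seconds, spread over more but slower threads.
    work = parallel_s * host_threads
    parallel_phi = work * per_thread_slowdown / (phi_threads * parallel_efficiency)
    return serial_phi + parallel_phi + transfer_s


# Made-up measurements: 2 s serial and 20 s parallel on a 16-thread host.
host_time = 2.0 + 20.0
phi_time = estimate_phi_time(
    serial_s=2.0, parallel_s=20.0, host_threads=16, phi_threads=240,
    per_thread_slowdown=8.0, parallel_efficiency=0.8, transfer_s=1.5)
print(f"host {host_time:.1f} s, estimated Phi {phi_time:.1f} s")
```

With these invented inputs the serial portion dominates the estimate, which is why the time spent outside the parallel regions matters so much.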

Jim Dempsey

Charles_C_Intel1 · ‎02-10-2015

I would add a third requirement: it must make effective use of memory on all those vectorized threads. If 240 threads are all asking for memory that isn't in cache most of the time, don't expect great performance. If the problem is blocked so that caches are used effectively, then those 240 vectorized threads of execution can sing! Note: this is not easy.

Charles

Sai_Sudheer_G_ · ‎02-10-2015

Thank you for your response. However, here is my approach. The idea is to run and profile a set of benchmark applications on, say, a Xeon processor and collect the metrics (using VTune), then run the same applications on Xeon Phi and collect the metrics there. For the target application, I would correlate it with one of the benchmark applications and try to predict its performance from the information gained from that benchmark. I remember the same being done on GPUs, where performance on, say, a K80 is predicted from the performance on a K40. In this case, however, we don't have two models of Xeon Phi, so I'm not sure whether this approach is valid. Any suggestions in this direction would be of great help.

Thank you,
Sudheer
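The correlation step described above could be sketched as a nearest-neighbour lookup. This is a minimal illustration under stated assumptions, not an established method: the benchmark names, the three normalized metrics (imagined here as parallel fraction, vectorized fraction, and memory-bound fraction), and all timings are made up, and a real study would need many more benchmarks and a properly validated model.

```python
import math

# Hypothetical profiles: normalized host metrics per benchmark, plus measured
# host and Xeon Phi run times in seconds. All values are invented.
BENCHMARKS = {
    "stencil": {"metrics": [0.90, 0.80, 0.20], "host_s": 10.0, "phi_s": 4.0},
    "sparse":  {"metrics": [0.30, 0.20, 0.90], "host_s": 12.0, "phi_s": 30.0},
    "nbody":   {"metrics": [0.95, 0.90, 0.10], "host_s": 8.0, "phi_s": 2.0},
}


def predict_phi_time(target_metrics, target_host_s, benchmarks=BENCHMARKS):
    """Return (nearest benchmark name, predicted Phi time in seconds)."""
    # Pick the benchmark whose host metrics are closest (Euclidean distance).
    nearest = min(
        benchmarks,
        key=lambda n: math.dist(benchmarks[n]["metrics"], target_metrics))
    # Assume the target scales from host to Phi like that benchmark did.
    ratio = benchmarks[nearest]["phi_s"] / benchmarks[nearest]["host_s"]
    return nearest, target_host_s * ratio


name, predicted = predict_phi_time([0.85, 0.75, 0.25], target_host_s=20.0)
print(f"closest benchmark: {name}, predicted Phi time: {predicted:.1f} s")
```

The prediction is only as good as the assumption that the target behaves like its nearest benchmark on both machines.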

jimdempseyatthecove · ‎02-11-2015

The approach you outline is valid... provided you do not intend to tweak your program to take advantage of the architectural differences between the host processor(s) and Xeon Phi processors. While it may be true that you do not need to change your program to get it to run, most programs will run better, in many cases much better, with some tweaking. Note that the tweaking you do to improve performance on the Xeon Phi almost always improves performance on the host as well. The attributes of the Xeon Phi motivate you to think in terms of parallelization and vectorization.

To give you some overview, look at some white papers on Colfax: http://www.colfax-intl.com/nd/resources/whitepapers.aspx

If you are into quality books, look at:

http://www.amazon.com/High-Performance-Parallelism-Pearls-Programming/dp/0128021187

http://www.amazon.com/Intel-Xeon-Coprocessor-High-Performance-Programming/dp/0124104142/ref=pd_sim_b_5?ie=UTF8&refRID=1ET22XF1PSS7TCEMQHBS

The pearls will give you an in-depth perspective of the impact of converting real-world applications to Xeon Phi (as well as some fundamental conceptual programs).

The second book is more of a programmer's guidebook. It is not a programmer's reference manual for Xeon Phi; rather, it is a very good teaching tool for learning how to use, and use well, the Xeon Phi.

Jim Dempsey

Predict the performance on Intel Xeon Phi