You may have received an email inviting you to the Intel® Parallel Studio XE 2016 Beta. VTune Amplifier XE 2016 beta is part of the studio and adds OpenMP* parallelization inefficiency, imbalance and work sharing analysis to tune for more efficient use of parallel regions. It also now supports multi-rank analysis of MPI* compute nodes with or without OpenMP use. Various ease-of-use enhancements include confidence indicators in General Exploration analysis results, "super tiny" bird's-eye view timeline, and "Platform" tab replacing "Tasks and Frames" tab.
According to the release notes, it should support VS2015 as well as my Haswell laptop (a 2-year-old model). However, I haven't been able to find a selection where "cpu specific" will run. "general" looks good. If it's necessary to search Intel web sites for hints, it will take a while to get up to speed.
We have confirmation that questions are supposed to be asked on premier.intel.com, but that site is still broken. I tried to open a case on Amplifier 2016 beta, which is on the menu, but a selection further down stays on 2013, and the Next button doesn't respond.
Some of the new videos may be needed to get an idea where to look for useful new features, or enhancements of recent ones.
Of interest to me is the extent to which OpenMP performance problems (work imbalance, etc.) can be diagnosed. It looks like a serious effort was made to replace the long-lost OpenMP profile feature. I'm seeing remarkably high Clocks/Instruction in parallel regions, particularly those using vgather, for which there's probably an architectural reason.
Annoying to Fortran users may be the use of C OpenMP terminology like parallel for in place of parallel do.
Due to my customers' interest, I would be interested in any tips about profiling boost threaded engineering applications. It seems the special hooks for parallelism in VTune apply only to OpenMP (better that than Cilk(tm) Plus). The basic timelines for boost master and worker threads have been informative and do shed light on work imbalance and multi-thread scaling. We concluded that 2015 update 2 did offer improvements over 2013 beyond the hope of supporting Haswell.
It was disappointing that Haswell support took so long to be released. The customers also had trouble with Red Hat making it very difficult to install the debug libraries needed for a full g++ -g build.
Due to the restrictions we signed up for in the license agreement, I don't think it's possible to write up much about 2016 until the release occurs. All the experts on publishing have said there's no market or interest in publishing on VTune nor any programming language other than C++, although there are Intel people active on IDZ who offer to shortcut the normal multi-year approval process for white papers on IDZ.
The new Inspector flags data type conversions inside omp parallel regions (but not necessarily inside parallel for) as a potential problem. It claims there are differences in data width which I can't find, although there are intentional integer to and from real conversions. This doesn't appear to be so annoying in VTune.
The analysis with advanced hotspots looks quite favorable for the cases using explicit division of work among omp threads. Actually, there's a tradeoff with guided schedule having less setup overhead but showing some imbalance, even on a single cpu.
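For readers who want to see the two approaches side by side, here is a minimal C sketch, not my actual code. The names thread_range, sum_explicit and sum_guided are made up for illustration, and the summation is only a stand-in workload. The OpenMP calls are guarded so the file also builds and runs serially without an OpenMP compiler flag.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Half-open range [begin, end) of iterations owned by one thread. */
typedef struct { size_t begin, end; } range_t;

/* Split n iterations as evenly as possible across nthreads;
   the first (n % nthreads) threads each get one extra iteration. */
range_t thread_range(size_t n, int nthreads, int tid)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads);
    size_t base  = n / (size_t)nthreads;
    size_t extra = n % (size_t)nthreads;
    size_t t     = (size_t)tid;
    range_t r;
    r.begin = t * base + (t < extra ? t : extra);
    r.end   = r.begin + base + (t < extra ? 1 : 0);
    return r;
}

/* Variant A: explicit division of work. Each thread computes its own
   bounds from the thread count and runs a plain loop over them. */
double sum_explicit(const double *x, size_t n)
{
    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        int nthreads = 1, tid = 0;   /* serial fallback values */
#ifdef _OPENMP
        nthreads = omp_get_num_threads();
        tid      = omp_get_thread_num();
#endif
        range_t r = thread_range(n, nthreads, tid);
        for (size_t i = r.begin; i < r.end; ++i)
            total += x[i];
    }
    return total;
}

/* Variant B: let the runtime hand out chunks with a guided schedule. */
double sum_guided(const double *x, size_t n)
{
    double total = 0.0;
    #pragma omp parallel for schedule(guided) reduction(+:total)
    for (long i = 0; i < (long)n; ++i)
        total += x[i];
    return total;
}
```

Both variants return the same sum; what differs is who decides the iteration bounds, which is where the setup overhead versus imbalance tradeoff comes from.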
If I try to add loop counts, it asks me to copy into a custom analysis, but it doesn't enable the Start buttons.
I was wondering what accounts for the cases where an omp parallel appears twice in the openmp region cpu usage histogram. They have omp do nested inside omp parallel either with some extra book-keeping (real arithmetic on result of omp_get_num_threads) for each thread before the omp do or with 2 separate omp do loops. The 2 entries don't happen for a parallel including omp do and omp single with little else.
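To make the two shapes concrete, here is a rough C sketch (the real code is Fortran with omp do; the function names and the loop bodies below are invented purely for illustration). It only shows the structure of the regions and does not claim to reproduce the double entry in the histogram.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Shape 1: every thread does some scalar book-keeping (real arithmetic
   on the thread count) before a single work-shared loop. */
void bookkeeping_then_loop(const double *x, double *y, int n)
{
    #pragma omp parallel
    {
        int nthreads = 1;            /* serial fallback value */
#ifdef _OPENMP
        nthreads = omp_get_num_threads();
#endif
        /* placeholder for the per-thread scalar block */
        double avg_trips = (double)n / (double)nthreads;
        assert(avg_trips >= 0.0);

        #pragma omp for
        for (int i = 0; i < n; ++i)
            y[i] = 2.0 * x[i];
    }
}

/* Shape 2: two separate work-shared loops inside one parallel region. */
void two_loops(const double *x, double *y, double *z, int n)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i)
            y[i] = 2.0 * x[i];
        /* implicit barrier at the end of the first loop,
           so y is complete before the second loop reads it */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            z[i] = y[i] + 1.0;
    }
}
```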
Again, have you checked out the What's New doc for explanations of the new OpenMP analysis features? One explanation might be the following paragraph:
Please note that the same lexical loop constructs with different schedule types or chunk sizes will be displayed separately in different rows. For example, if one instance had a chunk size of 1000 and another had a chunk size of 1563, there would be two entries for the construct with the same name but different sizes in the OpenMP Loop Chunk column.
A screenshot would be helpful, though, to understand exactly what you are referring to.
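As an illustration of the situation that paragraph describes, here is a small C sketch in which one lexical loop is reached with different chunk sizes. The function name fill_squares and the loop body are hypothetical; only the schedule clause with a run-time chunk value matters.

```c
#include <assert.h>

/* One lexical loop; the chunk size is chosen by the caller at run time. */
void fill_squares(long *out, int n, int chunk)
{
    assert(chunk > 0);
    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < n; ++i)
        out[i] = (long)i * (long)i;
}
```

Calling fill_squares(a, n, 1000) and later fill_squares(a, n, 1563) runs the same source-level construct with two chunk sizes, which, as I read the paragraph above, would be reported as two rows distinguished by the OpenMP Loop Chunk column.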
Yes, that's why I wanted to see whether it considers the loops to have multiple lengths. Where there are 2 omp do loops, one is vectorized at the inner level and the other could be vectorized at the outer level, so it doesn't have as many physical trips. In the other cases there is no loop with a different count, but there is the scalar block prior to the loop.
iliyapolak, could you provide more information about integration of Parallel Studio 2016 Beta with VS 2015 Preview?
So, at what stage did you get 'The error message stated that unrecognized version of Visual Studio was detected.'? Was it during the VS2015 registration step of the Parallel Studio 2016 Beta installation procedure, or at runtime after VS2015 Preview was started with Parallel Studio 2016 Beta integrated? Please post the exact error message. Do you have any observations about which tool (VTune/Inspector/Advisor/Composer) raised the message when only a single tool (for example Composer (the compiler) or VTune) is integrated into VS2015? You may also check whether the folder %temp%\pset_tmp_amplifier_xe_2016 contains VTune installation log files. We can continue in private communication to investigate the issue.
Your input is important to us!
I am not in front of my development PC now. Later I will try to integrate Parallel Studio 2016 with VS 2015 Preview and I will collect any information you need.
Btw, the integration was attempted on VMWare 10 Workstation.
A short summary of the private communication:
The Intel Debugger extension from Intel Parallel Studio 2016 Beta does not support VS2015 Preview. However, there is no need to worry about the installation messages 'The Intel(R) Debugger Extension for Intel(R) MIC Architecture cannot be installed' and/or 'The Integration(s) in Microsoft Visual Studio* components cannot be installed'. All tools of Intel Parallel Studio 2016 Beta (excluding 'Intel Debugger') will be integrated with VS2015 Preview once the installation wizard completes (at least that is the expected behavior).