Can VTune 9.1 Update 2 be used to instrument code running concurrently on multiple nodes? If so, which documentation should I read to figure out how to install and configure things?
Or, is this a question best left to the Intel support team?
Intel VTune Performance Analyzer is not designed for cluster systems. However, you can simulate distributed computing on a single node; see http://software.intel.com/en-us/articles/performance-tools-for-software-developers-does-vtune-analyz...
One possible way to profile on multiple nodes is to use PTU, a relative of VTune (see the WhatIf forum), to generate an SEP batch command that can be run across a cluster under MPI, saving a tb5 file for each node.
As you're probably aware, specialized MPI profilers, such as Jumpshot or Intel Trace Collector/Analyzer, are best for profiling the messaging paths and latencies. If you have Intel Trace Collector installed (basically, a profiling MPI library) and Intel MPI dynamically linked, you can activate profiling simply by adding -trace to the mpiexec command.
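As a rough illustration of the -trace suggestion above, here is a minimal Python sketch that assembles such an mpiexec command line. The helper name build_traced_mpiexec, the application path ./my_app, and the process count of 4 are placeholders I made up for the example; only the -trace option comes from the post above, and -n is the usual mpiexec process-count option. It only builds and prints the command, it does not launch anything.

```python
import shlex


def build_traced_mpiexec(app, nprocs, app_args=()):
    """Return an mpiexec argument list with the -trace option added.

    `app`, `nprocs` and `app_args` are placeholders for your own job;
    this helper is hypothetical and not part of any Intel tool.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    # -trace is the option mentioned in the post; -n sets the process count.
    return ["mpiexec", "-trace", "-n", str(nprocs), app, *app_args]


if __name__ == "__main__":
    cmd = build_traced_mpiexec("./my_app", 4)
    # Print the command so it can be inspected or pasted into a job script.
    print(shlex.join(cmd))
```

Building the command as a list keeps quoting safe if it is later handed to subprocess.run, assuming Intel MPI and Intel Trace Collector are set up on the nodes as described above.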
For hybrid MPI/OpenMP applications, the Intel profiling OpenMP library is useful for profiling the OpenMP portion.