If collecting events on a single shared memory machine is enough, you can set up a script that does the entire job, including the environment settings, and run that script under VTune. Alternatively, you can name mpirun itself as the application VTune launches, and put the entire remainder of the command line in the corresponding entry in the VTune setup.

If you wish to collect events across a cluster, the usual recommendation is to run one SEP job per node under mpirun, generating a .tb5 file for each node. If you can make an X window or VNC connection to the cluster, you can use PTU to generate the command string and to analyze the results. PTU and SEP are available for download and discussion in their section of the WhatIf forum; they use your VTune license.

As with any VTune event sampling, you would build the application with debug symbols (assuming you are analyzing the application rather than the MPI itself) and choose that application as the module of interest. If you are using an open source MPI, you can build its libraries with symbols so that they can be included in your analysis.

You may also want a strategy for the MPI waits, such as setting the spin wait time-out to a high value so that the CPU stays busy during waits. The waits will vary greatly among ranks, even if the work is well balanced.
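As a rough illustration of the first option (a script that sets the environment and then launches the MPI job, so that the script is what you run under VTune), here is a minimal Python sketch. Everything specific in it is an assumption for illustration only: the application name ./my_app, the rank count, and the use of I_MPI_SPIN_COUNT as the spin wait setting (that is an Intel MPI variable; other MPIs use different controls, so check your MPI's documentation). The per_node_output helper only shows one way to keep the per-node .tb5 files apart by host name; it does not invoke SEP.

```python
import os
import socket
import subprocess


def build_launch(nprocs, app, app_args=(), extra_env=None):
    """Return (command, environment) for an mpirun launch.

    extra_env holds settings that must be in place before the job
    starts, e.g. an MPI spin wait control (name depends on your MPI).
    """
    env = dict(os.environ)          # start from the current environment
    env.update(extra_env or {})     # add or override job-specific settings
    cmd = ["mpirun", "-np", str(nprocs), app, *app_args]
    return cmd, env


def per_node_output(basename, hostname=None):
    """Build a distinct sampling file name per node, e.g. run_node01.tb5."""
    host = hostname or socket.gethostname()
    return f"{basename}_{host}.tb5"


def run(nprocs, app, app_args=(), extra_env=None):
    """Launch the job and return mpirun's exit status."""
    cmd, env = build_launch(nprocs, app, app_args, extra_env)
    return subprocess.call(cmd, env=env)
```

To use it, you would call something like run(8, "./my_app", ["input.dat"], {"I_MPI_SPIN_COUNT": "100000"}) from a small driver script and point VTune at that script. The values shown are placeholders, not recommendations.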