Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.
2159 Discussions

Good tool for performance evaluation of MPI+OMP code

drMikeT
New Contributor I
958 Views
Hello,

I was wondering if there are Intel tools which allow the performance monitoring and tuning of hybrid MPI + OMP (or multi-threaded) code.

Suppose I have MPI code which also makes calls to multi-threaded MKL routines so that each MPI task really consists of a number of MKL threads.

What would be a good way of investigating the performance of this hybrid code? Can I collect performance data with h/w perf. counters for each task and then combine it to get the complete picture afterwards?

I am familiar with the Intel MPI Trace Analyzer and VTune tools, but it is not clear to me how I could combine thread+task performance observations for an entire hybrid MPI+OMP/MKL code.

May I, for instance, use mpirun to start the VTune command-line binary, which in turn launches the regular MPI code, and then combine the results?

It would be most useful to be able to combine h/w perf counter data per thread with those of all threads in a hybrid MPI code ...


thanks -- michael
0 Kudos
1 Solution
10 Replies
TimP
Honored Contributor III
958 Views
You have several options.
Often, a good place to start is by running the job under VTune on a single node. If you run mpirun underneath VTune, the default module of interest will be mpirun, but you can adjust that to suit your requirements. You will need to set a reasonable affinity, preserving OpenMP thread cache locality. Single-MPI-rank runs with one and two OpenMP threads may show you useful information on where your serial and threaded hot spots are.
For running VTune across a cluster, you should set up VTune command line to run under mpirun, to collect a .tb6 file for each rank.
0 Kudos
drMikeT
New Contributor I
958 Views
Hi Tim, I was wondering if I have to install the entire VTune suite on each cluster node in order to do the perf. data collection at the command line.

What should be the VTune command line to run mpi code ?

mpirun -np N .... vtl .... mpibinary ...

Do you have any suggestions about the command line options above? Does VTune process together all perf collection files belonging to the same MPI run?

thanks ...
Michael
0 Kudos
James_T_Intel
Moderator
958 Views
Hi Michael,

Either tool is suitable for collecting performance data from a hybrid parallel program.

In order to run Intel VTune on an MPI job, here is what you will need to use:

mpirun -genvall -n <num_procs> -l amplxe-cl -r my_result -quiet -collect <analysis_type> my_app [my_app_options]

This will run each process through VTune, and you can then analyze each one. Each process will be collected into a different folder by appending .<rank> onto the result name. Full details are available in the documentation files (/opt/intel/vtune_amplifier_xe/documentation/en/help/index.htm; in the contents, go to the section titled MPI Analysis Support).

For the Intel Trace Analyzer and Collector, results will automatically be sorted by process and thread. If you are not directly instrumenting your code, there should be no change needed in how you compile or run. If you are instrumenting, there are additional functions to control how the threads are handled.

  • VT_registerthread - Registers a thread with a specified index
  • VT_registernamed - Registers a thread with a specified name and index
  • VT_getthrank - Gets the thread index
  • VT_getthreadid - Gets a global ID for a specific thread within a specific rank

These are described in detail in the Intel Trace Collector Reference Guide, which should be in your installation folder (/opt/intel/itac/<version>/doc/ITC_Reference_Guide.pdf by default on Linux*).

Hopefully this will help you get started. Please let me know if you have any further questions or concerns.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

0 Kudos
drMikeT
New Contributor I
958 Views
Hi James,

your answer is very clear and concise. I guess "amplxe-cl" is part of the latest Intel VTune Amplifier XE version of VTune, correct?

At the moment we only have the older VTune v9.0 installed. Could I do something similar with this older VTune version?

thanks ... Michael
0 Kudos
drMikeT
New Contributor I
958 Views
Hi James,

could we use the Intel tools (such as Trace Collector/Analyzer and VTune) with a non-Intel MPI? Requirements of the MPI application code force us to use other MPI stacks, but we still use all the other Intel s/w components: compilers, MKL/OMP, etc.

Could I get some hints on how to use ITAC and VTune with non-Intel MPI?

thanks ... Michael
0 Kudos
James_T_Intel
Moderator
959 Views
Hi Michael,

Using VTune with a non-Intel MPI should not cause an issue, as VTune does not directly profile MPI calls differently from any other function. For Trace Collector, you should be able to use another MPI implementation by linking to the appropriate libraries. Generally, this is done (in Linux*) by using:

[bash]mpicc ctest.o -L$VT_LIB_DIR -lVT $VT_ADD_LIBS -o ctest
mpif77 ftest.o -L$VT_LIB_DIR -lVT $VT_ADD_LIBS -o ftest[/bash]
If you are using the C++ MPI API (rather than the C API), you will need to map the C++ calls to the C calls before linking the VT library. For the Intel MPI Library, this is done using:


[bash]Intel MPI Library and gcc* < 3.0: -lmpigc
Intel MPI Library and gcc >= 3.0 and < 3.4: -lmpigc3
Intel MPI Library and gcc >= 3.4: -lmpigc4[/bash]
Collecting from a different MPI implementation requires binary compatibility with the Intel MPI Library. To check this, a program is provided, <install dir>/examples/mpiconstants.c, that displays certain key parameters. Compare the output from this program (compiled using the desired MPI implementation) to Table 1.1 in the Intel Trace Collector Reference Guide to determine whether the implementation is compatible.

I am currently checking on what is supported in your version of VTune, and I'll update this when I have more information.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
0 Kudos
drMikeT
New Contributor I
958 Views
James, this was a very useful answer for me.

I am trying to install VTune Amplifier for Linux, but this will take place after we update our iDP cluster to a more recent kernel.

BTW, does any of the s/w in the "What-If" section provide any more useful or in-depth information than VTune Amplifier and ITAC?

thanks again ... Michael

0 Kudos
James_T_Intel
Moderator
958 Views
Hi Michael,

I have not gone through the offerings in the What If section, so I cannot say whether what is there would suit what you are seeking. At a glance, it looks like the Intel Performance Tuning Utility would provide some features beyond VTune, but I see nothing related to MPI or hybrid programming.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
0 Kudos
drMikeT
New Contributor I
958 Views
Thanks, I am evaluating the IPTU ...
Michael
0 Kudos
James_T_Intel
Moderator
958 Views
Hi Michael,

I've tested using Intel VTune Performance Analyzer 9.0 along with an MPI program. It does not appear to work (MPI_Init failed). While it is very likely possible to use it with MPI, it will be much easier with the current version.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
0 Kudos
Reply