I am using the Trace Analyzer for an MPI job running on 4 nodes (80 physical cores total, 80 MPI ranks).
When I run 'mpirun -trace ...' the job takes roughly 10 times longer than the same job running without tracing because processes are being suspended when trace routines dump data from memory to disk.
My goal is to identify the relative share of time spent in the most active MPI routines, but in this situation, how much can one trust the timing statistics displayed by the Trace Analyzer? Isn't it possible that a process starts an MPI_SEND while the target process is suspended by the tracer, so that the transfer does not start until the tracer on the target process has finished its dump, and all this artificial wait time is then attributed by the Trace Analyzer to the MPI_SEND call?
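To make the concern concrete, here is a minimal two-rank sketch (hypothetical C code, not taken from my application): assuming the message is large enough that the library uses the rendezvous protocol, the blocking send does not complete until the receiver posts its matching receive, so a delay on the receiving side, such as a tracer flushing its buffers to disk, shows up directly in the sender's measured MPI_Send time.

```c
/* Hypothetical sketch: rank 1 delays posting its receive, rank 0 times its
 * blocking MPI_Send.  For messages above the eager threshold the send
 * typically blocks until the receive is posted, so the receiver's delay
 * (here a sleep standing in for a tracer flush) appears in the send time. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    enum { N = 131072 };          /* 1 MiB of doubles: usually rendezvous */
    static double buf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        printf("MPI_Send took %.3f s (includes receiver's delay)\n",
               MPI_Wtime() - t0);
    } else if (rank == 1) {
        sleep(2);                 /* stand-in for a tracer dumping to disk */
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```

Run with two ranks, this typically reports a send time close to the receiver's 2-second delay, whereas a small (eager) message would return almost immediately.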
Thank you for your attention.
Hey Michael,
If all you're looking for is the most active MPI routines and the time spent in them, I would recommend using the MPI Performance Snapshot, which is much more lightweight. Here's a quick getting-started guide. Just make sure you're using the latest versions of our tools.
Once the info is collected, running with "mps -f <stats_file> <app_stats_file>" will give you all the "hotspot" MPI routines in the code.
Once you've narrowed down where the problem areas are, I would use the Intel Trace Analyzer and Collector with some filtering applied to reduce the amount of data collected and any potential performance impacts.
Regards,
~Gergana
Thank you, Gergana, I'll try MPS.
Hi Gergana,
I am trying to use MPS, but I have run into a problem. Here is how I run the job:
source /opt/lic/intel16U2/vtune_amplifier_xe/amplxe-vars.sh
source /opt/lic/intel16U2/itac/9.1.2.024/intel64/bin/mpsvars.sh
source /opt/lic/intel16U2/parallel_studio_xe_2016.2.062/bin/psxevars.sh
source /opt/lic/intel16U2/bin/compilervars.sh intel64
source /opt/lic/intel16U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/bin/mpivars.sh
mpirun -mps \
    -genv I_MPI_STATS 10 \
    -genv I_MPI_STATS_SCOPE all \
    -genv I_MPI_STATS_FILE mpi_stats_file_2_1.txt \
    -genv I_MPI_PIN 1 \
    -genv I_MPI_PIN_PROCS allcores:map=scatter \
    -genv I_MPI_PIN_MODE mpd \
    -genv I_MPI_DEBUG 0 \
    -np 80 \
    --rsh=ssh \
    --file=mpd.hosts \
    /home/ashv/draco_revs/draco_wiindigoo_blizzard_maxMagXBTeffect/draco_cl5_16
When the program tries to execute MPI_Allgatherv(), I receive multiple error messages like the following:
Fatal error in PMPI_Allgatherv: Invalid datatype, error stack:
PMPI_Allgatherv(1483): MPI_Allgatherv(sbuf=0x1040d44, scount=0, INVALID DATATYPE, rbuf=0x7fff8e4b2010, rcounts=0x7fff8e5163b0, displs=0x7fff8e5164f0, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD) failed
PMPI_Allgatherv(1393): Invalid datatype
When I run the same job without the -mps parameter, the job runs without errors and produces correct results.
The program is compiled in exactly the same environment shown above. All compilers and tools are the latest available versions.
Thank you for your attention.
Hey Michael,
It seems like MPS doesn't like the datatype you're passing through your MPI_Allgatherv(). Can you tell me what it is? I don't know of a particular problem with datatypes but I'll check with the team.
How large is this application? Might be good to have a local reproducer so I can run it on one of our machines. Would it be possible for you to send the code over? Or a small sample that exhibits the problem?
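In case it helps, here is the kind of small standalone sample I have in mind (purely a hypothetical C sketch; your code looks like Fortran given MPI_DOUBLE_PRECISION in the error stack): every rank contributes zero elements on the send side, and the send datatype is one an argument-checking layer could legitimately flag, for example a derived type that was never committed. Whether this particular pattern reproduces your error is only a guess, based on the invalid send datatype appearing together with scount=0 in the message.

```c
/* Hypothetical reproducer sketch (a guess, not the user's code): the send
 * count is zero, but the send datatype is a derived type that was never
 * committed.  Some code paths skip checking the send type when the count is
 * zero, while an argument-checking layer may still reject it. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double sendbuf[1], recvbuf[1];            /* nothing is actually gathered */
    int *recvcounts = calloc(size, sizeof *recvcounts);
    int *displs     = calloc(size, sizeof *displs);

    MPI_Datatype uncommitted;                 /* created but never committed */
    MPI_Type_contiguous(1, MPI_DOUBLE, &uncommitted);

    MPI_Allgatherv(sendbuf, 0, uncommitted,   /* scount = 0, suspect type */
                   recvbuf, recvcounts, displs, MPI_DOUBLE,
                   MPI_COMM_WORLD);

    MPI_Type_free(&uncommitted);
    free(recvcounts);
    free(displs);
    MPI_Finalize();
    return 0;
}
```

Something along these lines, trimmed down from your actual call site, would let us check quickly whether the statistics collection is the trigger.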
Thanks,
~Gergana