Solved: Question on Trace Analyzer when applied to an MPI program

Abelard · ‎02-29-2024

I have a coupled model that runs in parallel on more than 1000 cores on a supercomputer. I tried to use Trace Analyzer to collect performance analysis data. With fewer than 1000 cores, the whole run finishes: it takes less than 20 minutes to get the result with 400 cores, and about 40 minutes with 800 cores. With 1080 cores, however, the model itself finishes quickly, but the Analyzer keeps collecting data and has still not stopped after more than 3 hours!

Does anybody have an idea what could cause this?

Abelard · ‎02-29-2024

By the way, I found an option "STOPFILE-NAME" in the user guide of the Intel Trace Analyzer. Would this be helpful? Maybe I could trigger it under a certain condition to stop the whole program and write a trace file?

Does anybody have experience with it?

TobiasK · ‎03-01-2024

@Abelard
Without a reproducer, an error message, or something else we can work with, it's hard to figure out what is going on on your side.

Abelard · ‎03-03-2024

Hi Tobias,

There is no error message. We think it is just because the analyzer takes much more time to collect information from more than 1000 cores.

Abelard · ‎03-03-2024

We'd like to see whether the "STOPFILE-NAME" option would work.

Heinrich_B_Intel · ‎03-04-2024

Hi Abelard,

You may use the STOPFILE option in the following way:

1. $ export VT_LOGFILE_FORMAT=stfsingle
This is just for convenience: all data will be written into a single file.

2. $ export LD_PRELOAD=libVTfs.so
This is needed to use the fail-safe version of ITAC.
The fail-safe library uses an alternative network, e.g. TCP, when the original program runs over InfiniBand.

3. $ export VT_STOPFILE_NAME=~/stop
ITAC will look for a stop file named "stop" in your home directory (you can choose another path/name).

4. $ mpirun -n <N> your_program
Run your program as usual.

5. While your program is running, touch your stop file:
$ touch ~/stop

6. When the ITAC fail-safe library detects the stop file, it will write the ITAC trace file prematurely and stop your program.
Look for the *.stf file! It can now be inspected with the Trace Analyzer GUI.

best,

Heinrich

PS: there might be issues with the fail-safe library because libnsl.so.1 is not installed by default on RHEL 8.0.
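
The steps above can be sketched as two small shell helpers. This is only an illustration: the function names itac_failsafe_env and itac_request_stop are hypothetical, and "$HOME/stop" is used in place of "~/stop" on the assumption that a fully expanded path is safer inside scripts. Only the three variables named in the steps are set.

```shell
# Sketch only: steps 1-3 and 5 wrapped in hypothetical helper functions.

# Steps 1-3: set up the environment for the fail-safe tracing library.
# $1 (optional): stop file path; defaults to "$HOME/stop".
itac_failsafe_env() {
    # Step 1: write all trace data into a single .stf file.
    export VT_LOGFILE_FORMAT=stfsingle
    # Step 3: the stop file path. "$HOME" replaces "~" so the value is
    # stored fully expanded.
    export VT_STOPFILE_NAME="${1:-$HOME/stop}"
    # Step 2: preload the fail-safe library. Exported last so that it only
    # affects commands started after this function returns.
    export LD_PRELOAD=libVTfs.so
}

# Step 5: create the stop file. Meant to be called from a separate shell
# (one without LD_PRELOAD set) while the traced program is running.
# $1 (optional): must be the same path that was given to itac_failsafe_env.
itac_request_stop() {
    touch "${1:-$HOME/stop}"
}
```

A possible use: call itac_failsafe_env in the shell (or job script) that starts the program, then run mpirun -n <N> your_program as in step 4, and call itac_request_stop from another shell once the trace should be written.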

Abelard · ‎03-04-2024

Thanks! I will try it on my program.

Abelard · ‎03-06-2024

Hi Heinrich,

I have tried the steps you suggested. Everything works fine until step 5: I ran touch ~/STOP, but nothing happened, and the analyzer continues to collect data. Maybe that path is not being checked by the Analyzer?

The only difference is that I submit my job through sbatch xxx.sh.

A stupid question: is the STOP file an empty file?

Abelard
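
One thing that can be checked independently of ITAC: file names on typical Linux file systems are case-sensitive, so ~/STOP and ~/stop are two different files, and touch on a non-existent path creates an empty file. The sketch below compares the configured stop file path with the path that was actually touched. The helper name check_stopfile is hypothetical, and whether a case mismatch is the cause in this thread is not confirmed. For a job submitted with sbatch, it is also an assumption that the path must be visible from the compute nodes.

```shell
# Sketch only: compare the configured stop file with the file that was touched.
# $1: path given in VT_STOPFILE_NAME, $2: path passed to touch.
check_stopfile() {
    configured=$1
    touched=$2
    # String comparison is case-sensitive, like the file names themselves.
    if [ "$configured" != "$touched" ]; then
        echo "mismatch: configured '$configured' but touched '$touched'"
        return 1
    fi
    if [ ! -e "$touched" ]; then
        echo "missing: '$touched' does not exist"
        return 1
    fi
    echo "ok: '$touched' exists"
}
```

For example, check_stopfile "$HOME/stop" "$HOME/STOP" reports a mismatch, while check_stopfile "$HOME/stop" "$HOME/stop" succeeds once the file has been touched.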
