Intel® MPI Library

Question on Trace Analyzer when applied to an MPI program

Abelard
Novice
880 Views

I have a coupled model that runs in parallel on more than 1000 cores of a supercomputer. I tried to use Trace Analyzer to collect performance analysis data. With fewer than 1000 cores, the whole run completes: for example, it takes less than 20 minutes to get the result with 400 cores and about 40 minutes with 800 cores. With 1080 cores, however, the model itself finishes quickly, but the analyzer keeps collecting data and has still not stopped after more than 3 hours!

Does anybody have any idea about this?

1 Solution
Heinrich_B_Intel
Employee
794 Views

Hi Abelard, 

 

You can use the STOPFILE option as follows:

1. $ export VT_LOGFILE_FORMAT=stfsingle
This is only for convenience: all trace data is written into a single file.

2. $ export LD_PRELOAD=libVTfs.so
This is needed to use the fail-safe version of the ITAC library.
The fail-safe library uses an alternative network (e.g. TCP) when the original program runs over InfiniBand.

3. $ export VT_STOPFILE_NAME=~/stop
ITAC will look for a stop file named "stop" in your home directory (you can choose another path or name).

4. $ mpirun -n <N> your_program
Run your program as usual.

5. While your program is running, touch your stop file:
$ touch ~/stop

6. When the ITAC fail-safe library detects the stop file, it writes the data collected so far to a trace file and stops your program.
Look for the *.stf file. It can then be inspected in the Trace Analyzer GUI.

best,

Heinrich

PS: There may be issues with the fail-safe library because libnsl.so.1 is not installed by default on RHEL 8.0.
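
The steps above can be collected into a small shell sketch. The names STOPFILE, run_traced and request_stop are made up for illustration and are not part of ITAC; the environment variables and libVTfs.so are the ones given in the steps. Setting LD_PRELOAD only for the mpirun command, instead of exporting it, is a choice of this sketch so that other commands in the same shell are not preloaded. Removing an old stop file before the launch is also an assumption, not something the steps require.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the steps above in two helper functions.
# STOPFILE, run_traced and request_stop are illustrative names, not ITAC features.

# One variable for the stop file path, so the launch and the touch cannot disagree.
STOPFILE="${STOPFILE:-$HOME/stop}"

# Steps 1-4: launch the program with the fail-safe tracing library preloaded.
# Usage: run_traced <nprocs> <program> [args...]
run_traced() {
    local nprocs="$1"
    shift
    # Assumption: a stop file left over from an earlier run should not be reused.
    rm -f "$STOPFILE"
    VT_LOGFILE_FORMAT=stfsingle \
    VT_STOPFILE_NAME="$STOPFILE" \
    LD_PRELOAD=libVTfs.so \
        mpirun -n "$nprocs" "$@"
}

# Step 5: create the stop file while the job is running.
# touch creates an empty file if none exists.
request_stop() {
    touch "$STOPFILE"
}
```

A possible use would be to start `run_traced 1080 ./your_program` in one terminal and call `request_stop` from another shell that has sourced the same file, then look for the *.stf file as described in step 6.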

7 Replies
Abelard
Novice
879 Views

By the way, I found an option "STOPFILE-NAME" in the user guide of the Intel Trace Analyzer. Would this be helpful? Maybe I could use it under a certain condition to stop the whole program and write a trace file?

Does anybody have experience with it?

TobiasK
Moderator
860 Views

@Abelard 
Without a reproducer, an error message, or something else we can work with, it's hard to figure out what is going on on your side.

Abelard
Novice
813 Views

Hi Tobias,

 

There is no error message. We think it is simply because the analyzer takes much more time to collect information from more than 1000 cores.

Abelard
Novice
813 Views

We'd like to find out whether the "STOPFILE-NAME" option would work.

Abelard
Novice
737 Views

Hi Heinrich,

 

I have tried the steps you suggested. Everything works fine until step 5. I tried touch ~/STOP, but nothing happened. The analyzer continued to collect data. Maybe that path is not being searched by the analyzer?

The only difference is that I submit my job through sbatch xxx.sh

A stupid question: is the STOP file an empty file?


 

Abelard
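
A few things may be worth checking here, shown as a hedged sketch of a batch script. Nothing in it comes from the actual xxx.sh; the task count, time limit and program name are placeholders. The example earlier in the thread sets the stop file to `~/stop`, while the command tried was `touch ~/STOP`; Linux file names are normally case-sensitive, so these may be two different files. A file created by `touch` is empty, so an empty stop file is what the steps produce. The sketch sets the variables inside the script and prints the path to the job log, so the exact name to touch can be read from the log. It assumes the home directory is visible from the compute nodes.

```shell
#!/bin/bash
#SBATCH --ntasks=1080        # placeholder task count
#SBATCH --time=01:00:00      # placeholder time limit

# One variable for the stop file path; the same spelling must be used for touch.
STOPFILE="$HOME/stop"

export VT_LOGFILE_FORMAT=stfsingle
export VT_STOPFILE_NAME="$STOPFILE"

# Record the settings in the job log so they can be checked later.
echo "ITAC stop file: $VT_STOPFILE_NAME"
echo "To request a stop, run: touch $VT_STOPFILE_NAME"

# Only launch when actually running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # ./your_program is a placeholder for the real executable.
    LD_PRELOAD=libVTfs.so mpirun -n "$SLURM_NTASKS" ./your_program
else
    echo "Not inside a Slurm job; skipping launch."
fi
```

After submitting with sbatch, the stop request would then be `touch ~/stop` (lower case) from a login node, provided that node sees the same home directory.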
