Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.
2272 Discussions

Question on Trace Analyzer when applied to an MPI program

Abelard
Novice
2,242 Views

I have a coupled model that runs in parallel on more than 1000 cores on a supercomputer. I tried to use Trace Analyzer to collect performance analysis data. With fewer than 1000 cores, the whole run completes: it takes less than 20 minutes to get the result with 400 cores, and about 40 minutes with 800 cores. However, with 1080 cores, the model itself finishes quickly, but the Analyzer keeps collecting data and has still not stopped after more than 3 hours!

Does anybody have any idea what causes this?

1 Solution
Heinrich_B_Intel
Employee
2,156 Views

Hi Abelard, 

 

you can use the STOPFILE option as follows:

1. $ export VT_LOGFILE_FORMAT=stfsingle
This is just for convenience: all trace data is written into a single file.
2. $ export LD_PRELOAD=libVTfs.so
This is needed to use the fail-safe version of ITAC.
The fail-safe library uses an alternative network, e.g. TCP, when the original program runs over InfiniBand.

3. $ export VT_STOPFILE_NAME=~/stop
ITAC will look for a stop file named "stop" in your home directory (you can choose another path/name).

4. $ mpirun -n <N> your_program
Run your program as usual.

5. While your program is running, touch your stop file:
$ touch ~/stop

6. When the ITAC fail-safe library detects the stop file, it writes a premature ITAC trace file and stops your program.
    Look for the *.stf file! It can now be inspected in the Trace Analyzer GUI.

best,

Heinrich

PS: There might be issues with the fail-safe library because libnsl.so.1 is not installed by default on RHEL 8.0.
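
The steps above can be collected into one shell sketch. The helper names (`setup_itac_stopfile`, `run_traced`, `request_stop`) are illustrative and not part of ITAC; only the `VT_*` variables and `libVTfs.so` come from the post. Two details are assumptions rather than steps from the post: a leftover stop file is removed before the run, and `LD_PRELOAD` is set for the `mpirun` launch only instead of being exported into the whole shell.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the VT_* variables and
# libVTfs.so are taken from the steps in the post.

# Steps 1 and 3: single-file trace output and the stop-file location.
setup_itac_stopfile() {
    local stop_file="$1"
    export VT_LOGFILE_FORMAT=stfsingle
    export VT_STOPFILE_NAME="$stop_file"
    # Added precaution (assumption): clear a stop file left from an earlier run.
    rm -f -- "$stop_file"
}

# Steps 2 and 4: preload the fail-safe library for this launch only.
run_traced() {
    local nranks="$1"
    shift
    LD_PRELOAD=libVTfs.so mpirun -n "$nranks" "$@"
}

# Step 5: create the (empty) stop file at the configured path.
request_stop() {
    touch -- "$VT_STOPFILE_NAME"
}

# Usage (not executed here):
#   setup_itac_stopfile "$HOME/stop"
#   run_traced 1080 ./your_program &
#   request_stop    # later, from the same shell
```

Keeping the stop-file path in one variable means the path that is configured and the path that is touched cannot drift apart.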


7 Replies
Abelard
Novice
2,241 Views

By the way, I found an option called "STOPFILE-NAME" in the Intel Trace Analyzer user guide. Would this be helpful? Maybe I can use it to kill the whole program under a certain condition and still write a trace file?

Does anybody have experience with it?

 

 

TobiasK
Moderator
2,222 Views

@Abelard 
Without a reproducer, an error message, or something else we can work with, it's hard to figure out what is going on on your side.

Abelard
Novice
2,175 Views

Hi Tobias,

 

There is no error message. We think it is simply because the Analyzer takes much more time to collect information from more than 1000 cores.

 

 

Abelard
Novice
2,175 Views

We'd like to check whether the "STOPFILE-NAME" option would work.

Abelard
Novice
2,143 Views

Thanks! I will try it on my program.

Abelard
Novice
2,099 Views

Hi Heinrich,

I have tried the steps you suggested. Everything works fine until step 5. I ran touch ~/STOP, but nothing happened: the Analyzer continues to collect data. Maybe the Analyzer does not search that path?

The only difference is that I submit my job through sbatch xxx.sh.

A stupid question: is the STOP file an empty file?

 

Abelard
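
One thing that may be worth checking here, offered as a sketch rather than a confirmed diagnosis: the suggested steps set VT_STOPFILE_NAME=~/stop (lowercase), while the file touched in this post is ~/STOP. On a case-sensitive file system these are two different files. As for the last question, `touch` on a path that does not yet exist creates an empty file, so an empty file is what step 5 produces. The helper below, with the hypothetical name `stop_file_matches`, only compares the touched path with the configured one. It does not verify that the variable actually reaches a job submitted with sbatch, which is a further assumption to check separately.

```shell
#!/usr/bin/env bash
# Sketch only: stop_file_matches is a hypothetical helper, not an ITAC tool.
# Exit status: 0 = path matches VT_STOPFILE_NAME and the file exists,
#              1 = mismatch or file missing, 2 = VT_STOPFILE_NAME not set.
stop_file_matches() {
    local touched="$1"
    if [ -z "${VT_STOPFILE_NAME:-}" ]; then
        echo "VT_STOPFILE_NAME is not set in this shell" >&2
        return 2
    fi
    # String comparison is case-sensitive, so ~/STOP and ~/stop differ.
    if [ "$touched" != "$VT_STOPFILE_NAME" ]; then
        echo "touched '$touched' but VT_STOPFILE_NAME is '$VT_STOPFILE_NAME'" >&2
        return 1
    fi
    [ -e "$touched" ]
}

# Usage (not executed here):
#   stop_file_matches "$HOME/STOP" || echo "stop file would not be seen"
```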
