I have a coupled model that runs in parallel on more than 1000 cores on a supercomputer. I tried to use Trace Analyzer to trace the run and collect performance analysis data. With fewer than 1000 cores, the whole run completes: it takes less than 20 minutes to get the result with 400 cores, and about 40 minutes with 800 cores. However, with 1080 cores the model itself finished quickly, but the Analyzer kept collecting data and had still not stopped after more than 3 hours!
Does anybody have any idea what is going on?
By the way, I found an option called "STOPFILE-NAME" in the user guide of the Intel Trace Analyzer. Would this be helpful? Maybe I can use it under a certain condition to stop the whole program and write a trace file?
Does anybody have experience with it?
@Abelard
Without a reproducer, an error message, or something else we can work with, it's hard to figure out what is going on on your side.
Hi Tobias,
There is no error message. We think it is just because the analyzer takes much more time to collect information from more than 1000 cores.
We'd like to find out whether the "STOPFILE-NAME" option would work.
Hi Abelard,
You can use the STOPFILE option as follows:
1. $ export VT_LOGFILE_FORMAT=stfsingle
   This is only for convenience: all trace data is written into a single file.
2. $ export LD_PRELOAD=libVTfs.so
   This is needed to use the fail-safe version of ITAC. The fail-safe library uses an alternative network, e.g. TCP, when the original program runs over InfiniBand.
3. $ export VT_STOPFILE_NAME=~/stop
   ITAC will look for a stop file named "stop" in your home directory (you can choose another path or name).
4. $ mpirun -n <N> your_program
   Run your program as usual.
5. While your program is running, touch your stop file:
   $ touch ~/stop
6. When the ITAC fail-safe library detects the stop file, it writes a premature ITAC trace file and stops your program. Look for the *.stf file; it can then be inspected in the Trace Analyzer GUI.
Best,
Heinrich
PS: There might be issues with the fail-safe library because libnsl.so.1 is not installed by default on RHEL 8.0.
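For a batch submission, the same steps can be collected into the job script. The following is only a sketch under stated assumptions: the helper names run_traced and request_stop are hypothetical, the home directory is assumed to be visible from the compute nodes, and scheduler directives are left out. Unlike the steps above, LD_PRELOAD is set only for the launch command rather than exported for the whole shell, which is a variation and not part of the original recipe.

```shell
#!/bin/bash
# Sketch of a job script body; scheduler directives are omitted.

# Absolute path to the stop file. Assumption: $HOME is on a filesystem
# that both the login node and the compute nodes can see.
export VT_STOPFILE_NAME="${VT_STOPFILE_NAME:-$HOME/stop}"

# Write all trace data into a single file (convenience only).
export VT_LOGFILE_FORMAT=stfsingle

# Remove a leftover stop file from an earlier run. Assumption: a stale
# file would otherwise be detected as soon as tracing starts.
rm -f "$VT_STOPFILE_NAME"

# Hypothetical helper: launch the program with the fail-safe library
# preloaded. $1 is the number of ranks, $2 is the program to run.
run_traced() {
    LD_PRELOAD=libVTfs.so mpirun -n "$1" "$2"
}

# Hypothetical helper: create the stop file. Run this from a login
# shell while the job is running.
request_stop() {
    touch "$VT_STOPFILE_NAME"
}

# Example launch, commented out so the sketch can be sourced safely:
# run_traced 1080 ./your_program
```

Using one variable for both the export and the touch command keeps the two paths identical, including upper and lower case.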
Thanks! I will try it on my program.
Hi Heinrich,
I have tried the steps you suggested. Everything works fine until step 5. I tried touch ~/STOP, but nothing happened: the Analyzer continues to collect data. Maybe the Analyzer does not search that path?
The only difference is that I submit my job through sbatch xxx.sh.
A stupid question: is the STOP file an empty file?
Abelard
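
One thing that can be checked here is whether the file that was touched has exactly the name given in VT_STOPFILE_NAME. File names on Linux are normally case-sensitive, so ~/stop and ~/STOP are different files. The sketch below uses a hypothetical helper, check_stop_file, that compares the two paths as plain strings; it does not query ITAC in any way. The touch command itself creates an empty file if none exists.

```shell
# Hypothetical helper: compare the path the job watches with the path
# that was touched. The comparison is a plain, case-sensitive string test.
check_stop_file() {
    expected="$1"   # value of VT_STOPFILE_NAME used by the job
    touched="$2"    # path that was passed to touch
    if [ "$expected" = "$touched" ]; then
        echo "match"
    else
        echo "mismatch: job watches '$expected' but '$touched' was touched"
    fi
}

# Example with the two spellings that appear in this thread.
check_stop_file "$HOME/stop" "$HOME/STOP"
```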
