I have a coupled model that runs in parallel on more than 1000 cores of a supercomputer. I tried to use Trace Analyzer to trace the run and collect performance analysis data. With fewer than 1000 cores, the whole run completes: it takes less than 20 minutes to get the result with 400 cores and about 40 minutes with 800 cores. With 1080 cores, however, the model itself finishes quickly, but the Analyzer keeps collecting data and has not stopped after more than 3 hours!
Does anybody have an idea what is going on?
By the way, I found the option "STOPFILE-NAME" in the user guide of the Intel Trace Analyzer. Would it be helpful here? Maybe I can trigger it under a certain condition to stop the whole program and write a trace file?
Does anybody have experience with it?
@Abelard
Without a reproducer, an error message, or something else we can work with, it's hard to figure out what is going on on your side.
Hi Tobias,
There is no error message. We think it is just because the analyzer takes much more time to collect information from more than 1000 cores.
We'd like to find out whether the "STOPFILE-NAME" option works.
Hi Abelard,
you can use the STOPFILE option in the following way:
1. $ export VT_LOGFILE_FORMAT=stfsingle
This is just for convenience: all data will be written into a single file.
2. $ export LD_PRELOAD=libVTfs.so
This is needed to use the fail-safe version of ITAC. The fail-safe library uses an alternative network, e.g. TCP, when the original program runs over InfiniBand.
3. $ export VT_STOPFILE_NAME=~/stop
ITAC will look for a stop file named "stop" in your home directory (you can choose another path/name).
4. $ mpirun -n <N> your_program
Run your program as usual.
5. While your program is running, touch your stop file:
$ touch ~/stop
6. When the ITAC fail-safe library detects the stop file, it writes a premature ITAC trace file and stops your program.
Look for the *.stf file! It can now be inspected in the Trace Analyzer GUI.
Best,
Heinrich
PS: There might be issues with the fail-safe library because libnsl.so.1 is not installed by default on RHEL 8.0.
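For use in a job script, the steps above could be wrapped roughly as follows. This is only a sketch: the function names are hypothetical, and only the `VT_*` variables, `libVTfs.so`, and the `mpirun` call are taken from the post. Two choices here are assumptions rather than part of the original steps: `LD_PRELOAD` is set for the `mpirun` command only, so other commands in the script are not preloaded, and a stop file left over from an earlier run is deleted first.

```shell
#!/bin/sh
# Sketch of steps 1-5 as shell helpers (function names are hypothetical).

# Steps 1 and 3: choose the single-file trace format and the stop-file path.
itac_configure() {
    stopfile="${1:-$HOME/stop}"
    export VT_LOGFILE_FORMAT=stfsingle
    export VT_STOPFILE_NAME="$stopfile"
    # Assumption: a stale stop file from an earlier run is unwanted, so remove it.
    rm -f "$stopfile"
}

# Steps 2 and 4: preload the fail-safe library for the mpirun command only.
itac_run() {
    nprocs="$1"
    shift
    LD_PRELOAD=libVTfs.so mpirun -n "$nprocs" "$@"
}

# Step 5: create the stop file at the configured path.
itac_request_stop() {
    touch "${VT_STOPFILE_NAME:?run itac_configure first}"
}
```

In a job script this would be called as `itac_configure` followed by `itac_run <N> your_program`. The stop request can come from any shell that sees the same file system, either through `itac_request_stop` after the same `itac_configure` call or simply with `touch ~/stop` as in step 5.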
Thanks! I will try it on my program.
Hi Heinrich,
I have tried the steps you suggested. Everything works fine until step 5: I ran touch ~/STOP, but nothing happened and the Analyzer continued to collect data. Maybe the Analyzer does not search that path?
The only difference is that I submit my job through sbatch xxx.sh.
A stupid question: is the STOP file an empty file?
Abelard
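One thing can be checked on the shell side, independent of ITAC. The steps set `VT_STOPFILE_NAME=~/stop`, while the command in this post touches `~/STOP`. File names on Linux file systems are normally case-sensitive, so these would be two different files. `touch` does create an empty file, but whether ITAC needs anything beyond the file existing is not stated in this thread. A minimal sketch of such a check, with a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the touched path is exactly the
# path stored in VT_STOPFILE_NAME (the string comparison is case-sensitive).
stopfile_matches() {
    touched="$1"
    [ -n "${VT_STOPFILE_NAME:-}" ] && [ "$touched" = "$VT_STOPFILE_NAME" ]
}
```

For example, `stopfile_matches "$HOME/STOP" || echo "path differs from VT_STOPFILE_NAME"` would print the message when the variable holds `$HOME/stop`.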