Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

traceanalyzer out of memory

Patrick_Mulrooney
3,149 Views

When I try to launch traceanalyzer on a large trace (~200 GB), it quits immediately. I see the following:

$ ~/intel/oneapi/itac/latest/bin/traceanalyzer.bin --cli ./cesm.exe.single.stf 

 

>>>>> Welcome to the Intel(R) Trace Analyzer command line interface. <<<<

 

STF ERROR: out of memory (-153063873 byte) [tracing/stf/stf_io.c:499], aborting.

 

It is not obvious if it is actually trying to allocate more memory than available or hitting some other limit.

8 Replies
HemanthCH_Intel
Moderator
3,126 Views

Hi,


Thank you for posting in Intel Communities.


Could you please provide the sample reproducer code and steps to reproduce your issue?

Could you please confirm whether you get the "out of memory" error only with "200 GB", or with less than 200 GB too?

Could you please provide the MPI library version and OS version?


Thanks & Regards,

Hemanth


Patrick_Mulrooney
3,099 Views

Hemanth,

 

Thanks, and sorry for the slow response.

 


Could you please provide the sample reproducer code and steps to reproduce your issue?

The code being run was CESM2. I do not think it would be possible to provide the steps to reproduce the run (it would take a long time and requires a 100-node cluster). I can provide a link to the trace file if you would like to pull it down for testing. The only step needed to reproduce the issue is to open the trace file with the TraceAnalyzer tool. It crashes immediately.

 


Could you please confirm whether you get the "out of memory" error only with "200 GB", or with less than 200 GB too?


Only this code, but the only other trace I tested the installs with came from a simple HelloWorld-style run. I tested this on three different systems, including one that had 1.5 TB of RAM. I also tested on two systems with significantly different OS installs.


Could you please provide the MPI library version and OS version?

Intel MPI 2020u4

$ cat /etc/redhat-release 

CentOS Linux release 7.9.2009 (Core)

$ uname -a

Linux cluster-ln3 3.10.0-1160.41.1.el7.x86_64 #1 SMP Tue Aug 31 14:52:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

HemanthCH_Intel
Moderator
3,077 Views

Hi,

 

To reduce the amount of memory used by ITAC, please use the filtering options to restrict the analysis to a specific function or set of functions.

Could you please compile the MPI application and run it with the command below:

mpirun -trace -n <number of processes> ./a.out

The above command generates the *.stf file.

traceanalyzer a.out.stf

The above command opens the trace in the GUI.

Now follow the instructions from the below URL:

https://www.intel.com/content/www/us/en/develop/documentation/ita-user-and-reference-guide/graphical-user-interface/dialogs/filtering-dialog-box/building-filter-expressions-using-graphical-interface.html

To use the CLI filtering method, refer to the link below:

https://www.intel.com/content/www/us/en/develop/documentation/ita-user-and-reference-guide/intel-trace-analyzer-reference/command-line-interface-cli.html

 

Thanks & Regards,

Hemanth 

 

Patrick_Mulrooney
3,061 Views

Hemanth,

 

Thanks for the reply. That is how I generated the trace files I am attempting to look at.

 

I tried using the filtering options, but no matter which ones I try, I still get an out-of-memory error before it does anything (judging by the output of strace).

I tried...

 

~/intel/oneapi/itac/2021.5.0/bin/traceanalyzer --cli --messageprofile --filter="p2pfilter(sender(0))" ./cesm.exe.single.stf

~/intel/oneapi/itac/2021.5.0/bin/traceanalyzer --cli --messageprofile --filter="funcfilter(NONE),p2pfilter(NONE),collfilter(NONE)"  ./cesm.exe.single.stf

~/intel/oneapi/itac/2021.5.0/bin/traceanalyzer --cli --functionprofile --filter="funcfilter(sender(0))" -o messages.txt  ./cesm.exe.single.stf

~/intel/oneapi/itac/2021.5.0/bin/traceanalyzer --cli --collopprofile --filter="funcfilter(NONE),p2pfilter(NONE),collfilter(NONE)"  ./cesm.exe.single.stf

~/intel/oneapi/itac/2021.5.0/bin/traceanalyzer --cli --filter="funcfilter(NONE),p2pfilter(NONE),collfilter(NONE)"  ./cesm.exe.single.stf

 

I get the following from strace, which shows the OOM:

 

mmap(NULL, 18446742497303490560, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 18446742497303621632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x2b0e5a69e000
munmap(0x2b0e5a69e000, 26615808)        = 0
munmap(0x2b0e60000000, 40493056)        = 0
mprotect(0x2b0e5c000000, 135168, PROT_READ|PROT_WRITE) = 0
mmap(NULL, 18446742497303490560, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
fstat(5, {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b0e5703b000
read(5, "TZif2\0\0\0\0\0\0\0...\f\20\2x\v \3q("..., 4096) = 2819
lseek(5, -1802, SEEK_CUR)               = 1017
read(5, "TZif2\0\0\0\0\0\0\0\0\0\0\0\...\20\377\377\377"..., 4096) = 1802
close(5)                                = 0
munmap(0x2b0e5703b000, 4096)            = 0
write(2, "STF ERROR: out of memory (-153063873 byte) [tracing/stf/stf_io.c:499], aborting.\n", 81) = 81
HemanthCH_Intel
Moderator
3,046 Views

Hi,

 

Thanks for the update.

 

Could you please try any one of the below methods to reduce the size of the trace file?

 

Method 1:

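By calling VT_traceon() and VT_traceoff() from the VT.h header file, we can restrict tracing to the regions of interest, which reduces the amount of trace data collected while the application runs.

#include <mpi.h>
#include "VT.h"
MPI_Init(&argc, &argv);
VT_traceoff();  // switch off tracing right after initialization
…
…
…
VT_traceon();  // switch on tracing
…
…
VT_traceoff();  // switch off tracing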

Steps for using method 1:

1) Add the VT_traceon() and VT_traceoff() calls in the source code around the regions where tracing needs to be done.

2) Initialize the oneAPI environment using the command below:

source /opt/intel/oneapi/setvars.sh

3) Compile the application again using the command below:

mpiicc -trace <file.c/c++> -I $VT_ROOT/include/ -L $VT_ROOT/lib/

4) Run the binary to generate the .stf file:

mpirun -n <number of processes> ./<executable>

5) Open the trace:

traceanalyzer <executable>.stf

 

Method 2:

1) Initialize the oneAPI environment:

source /opt/intel/oneapi/setvars.sh

2) Use a config file to limit tracing to certain events/functions.

# Switch on all MPI events
STATE MPI:* ON
# Do not collect data about MPI_COMM_SIZE
STATE MPI:MPI_COMM_SIZE OFF

Create a config file like the one above and adjust, per function, whether data is collected (STATE MPI:xxxxxx ON/OFF).

3) Recompile the application using the additional flags (-tcollect -tcollect-filter /path/to/itac_config):

mpiicc -tcollect -tcollect-filter /path/to/itac_config <file.c/c++>

4) Run the binary; it will generate the .stf file.

5) Open the trace:

traceanalyzer <executable>.stf

 

Thanks & Regards,

Hemanth

 

HemanthCH_Intel
Moderator
3,026 Views

Hi,


We haven't heard back from you. Could you please provide any updates on your issue?


Thanks & Regards,

Hemanth.


Patrick_Mulrooney
3,004 Views

Hemanth,

 

Sorry for the slow reply. Unfortunately, we will not be able to test either of those methods due to the complexity of what we are doing. We will have another opportunity to test this in a couple of weeks with a different but similar code. Please consider this "closed" for the time being; we can revisit it if the other code has the same issue.

 

Thanks for all the help, and sorry that we could not test your proposed solutions.

Pat

HemanthCH_Intel
Moderator
2,991 Views

Hi,

 

Thanks for the update.

>>" Please consider this "closed" for the time being"

We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

 

Thanks & Regards,

Hemanth

 
