Estimating elapsed time for a VTune analysis (knob sampling-interval)

psing51 · 04-13-2020

Hi,

I ran an HPC Performance analysis (VTune 2020u0) on an Intel 8280 (RHEL 7.6) with default settings:

time mpirun -np $SLURM_NPROCS -ppn $SLURM_NTASKS_PER_NODE  $OPTS  amplxe-cl -collect hpc-performance -data-limit 0 -result-dir result_hpcperf -- ${APP_INSTALL_ROOT}/appname.exe

The analysis part

vtune: Executing actions  0 %
........
vtune: Executing actions 100 % done

took around 45 minutes, and the "result_hpcperf.nodeXX" directory held around 20G of data.

Q1: If my Linux kernel version is 3.10.0-957.el7.x86_64, what is the default sampling interval?

Q2: If I reduce the sampling interval by half, roughly how much elapsed time and output data should I expect for the VTune analysis and report generation?

- I was expecting that if the sampling interval is halved (default 1 ms -> 0.5 ms), the analysis and result generation would take around 90 minutes and produce around 40-50 GB of data. Please let me know if my assumptions are incorrect.

Q3: Also, if I reduce the sampling interval by half, how much accuracy in the output metrics can I expect (in general, based on your experience with this tool)?

As per this article (CPU sampling interval, ms field), I assumed the default sampling interval was 1 ms, so I reran the HPC Performance analysis with sampling-interval set to 0.5 ms:

time mpirun -np $SLURM_NPROCS -ppn $SLURM_NTASKS_PER_NODE  $OPTS  amplxe-cl -collect hpc-performance -data-limit 0 -result-dir result_hpcperf -knob sampling-interval=0.5  -- ${APP_INSTALL_ROOT}/appname.exe

The last statement to appear in stdout was:

vtune: Executing actions  0 %

Around 11 hours have elapsed since then, and around 150G of data has been generated in the results directory.

Within the results directory (find . -printf "%T+\t%p\n" | sort), I saw that the last file was changed around 11 hours ago. That file has the following contents:

[user@headnode01 hpcperf_char_00003]$ cat result_hpcperf.node3/config/log.cfg
<?xml version='1.0' encoding='UTF-8'?>

<bag xmlns:int="http://www.w3.org/2001/XMLSchema#int" xmlns:long="http://www.w3.org/2001/XMLSchema#long">
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803953480"/>
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803953542"/>
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803953687"/>
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803953748"/>
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803954281"/>
 <message_entry_t int:status="1" cap="Data collection completed with warnings" msg="Please see warning messages for details. " long:timeStamp="1586809230671">
  <message msg="Analyzing data in the node-wide mode. The hostname (node61) will be added to the result path/name." int:severity="1"/>
  <message msg="Peak bandwidth measurement started." int:severity="1"/>
  <message msg="Peak bandwidth measurement finished." int:severity="1"/>
  <message msg="To enable hardware event-base sampling, VTune Profiler has disabled the NMI watchdog timer. The watchdog timer will be re-enabled after collection completes." int:severity="2"/>
  <message msg="Collection started." int:severity="1"/>
  <message msg="Collection stopped." int:severity="1"/>
 </message_entry_t>
</bag>
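To make the timeStamp values in a log like this easier to read, they can be converted to dates. The following is a minimal sketch, not a VTune tool: it assumes the values are milliseconds since the Unix epoch (the log itself does not say so), the helper name read_entries is made up for illustration, and SAMPLE is just two entries trimmed from the excerpt above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace URIs copied from the <bag> element of the log excerpt.
NS_INT = "http://www.w3.org/2001/XMLSchema#int"
NS_LONG = "http://www.w3.org/2001/XMLSchema#long"

# Two entries trimmed from the excerpt (the last success and the warning).
SAMPLE = """<bag xmlns:int="http://www.w3.org/2001/XMLSchema#int" xmlns:long="http://www.w3.org/2001/XMLSchema#long">
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803954281"/>
 <message_entry_t int:status="1" cap="Data collection completed with warnings" msg="Please see warning messages for details. " long:timeStamp="1586809230671"/>
</bag>"""


def read_entries(xml_text):
    """Return a list of (millis, utc_datetime, status, caption) tuples."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("message_entry_t"):
        # Assumption: timeStamp is milliseconds since the Unix epoch.
        millis = int(entry.get("{%s}timeStamp" % NS_LONG))
        when = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
        status = int(entry.get("{%s}status" % NS_INT))
        entries.append((millis, when, status, entry.get("cap")))
    return entries


entries = read_entries(SAMPLE)
for millis, when, status, cap in entries:
    print(when.isoformat(), status, cap)

# Gap between the two entries, computed on the integer milliseconds.
gap_minutes = (entries[-1][0] - entries[0][0]) / 60000
print("gap: %.1f minutes" % gap_minutes)
```

Under the epoch-milliseconds assumption, both entries fall on 2020-04-13 (UTC) and are about 88 minutes apart.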

Also, on the compute node (node3), I checked the running processes with the top command:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
127588 root      20   0 4128520  82480   3308 R 100.0  0.0 563:13.50 sep
    10 root      20   0       0      0      0 S   6.2  0.0   0:22.52 rcu_sched
     1 root      20   0   56068   8276   2620 S   0.0  0.0   0:26.51 systemd

Here too, it seems that the sep command (driver) has been running for ~9 hours with no memory utilization. I am not sure whether the application and the sep driver are running fine. Is there a way to confirm this (via system logs or sep driver logs)?
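The find ... | sort check used earlier can be repeated from a script to see whether the result directory is still being written. This is only a rough heuristic and not a VTune or sep feature: the function names are hypothetical, and a stale directory does not by itself prove that the collector is hung.

```python
import os
import time


def newest_file(root):
    """Return (path, mtime) of the most recently modified file under root, or None."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if newest is None or mtime > newest[1]:
                newest = (path, mtime)
    return newest


def hours_since_last_write(root, now=None):
    """Hours since the newest file under root was modified, or None if no files."""
    found = newest_file(root)
    if found is None:
        return None
    now = time.time() if now is None else now
    return (now - found[1]) / 3600.0


if __name__ == "__main__":
    # Directory name taken from the post; adjust to the actual result path.
    print(hours_since_last_write("result_hpcperf.node3"))
```

Running it periodically from the job script and comparing the values shows whether any file in the directory has changed between checks.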

It would be very helpful if I could get an estimate of how long this analysis will take to finish in my scenario.

- I ask because I will adjust the "walltime" for my VTune jobs on the cluster accordingly.

Please let me know if I can provide more information to help answer my queries.

Dmitry_R_Intel1 · 04-14-2020

1. The default sampling interval for servers is 5 ms.

2. Halving the sampling interval should produce roughly twice as much sampling data. Tracing data (e.g. parallel region instances) is not affected by this option. The collection elapsed time should not be affected by the sampling interval at all. The only exception is when sampling is so frequent that its overhead slows down your application, which should of course be avoided. Post-processing time (when VTune reads the collected raw traces and puts them into its internal database, which happens right after the collection phase completes) will increase with the size of the collected data. It is impossible to predict how much more time it will take; the dependency may or may not be linear.

3. For a 45-minute experiment, a sampling interval of ~1 ms looks like overkill. I would use at least 10 ms, or even more.

As I said, the application execution time should not be affected by the sampling interval. If you still see the sep process (it is the collector process) after many hours, this probably means there was a problem in the collection phase. You had better abort it and start from scratch. I would really avoid running long collections with small sampling intervals: it stresses the system too much and significantly increases the risk of errors.
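The scaling rule from point 2 can be turned into a back-of-the-envelope size estimate. The sketch below is only an upper-bound guess under two assumptions that are not confirmed in this thread: that the whole 20G baseline result is sampling data (tracing data would not scale), and that size grows linearly with the sampling rate. The function name is hypothetical. It deliberately does not estimate post-processing time, since that dependency may not be linear.

```python
def estimate_sampling_data_gb(baseline_gb, baseline_interval_ms, new_interval_ms):
    """Scale a baseline result size by the ratio of sampling rates.

    Assumption: all of baseline_gb is sampling data and grows linearly
    with the number of samples, so this is an upper-bound guess.
    """
    if baseline_interval_ms <= 0 or new_interval_ms <= 0:
        raise ValueError("sampling intervals must be positive")
    return baseline_gb * (baseline_interval_ms / new_interval_ms)


# Baseline from the question: ~20G at the 5 ms server default.
print(estimate_sampling_data_gb(20, 5, 2.5))  # halving the interval -> 40.0
print(estimate_sampling_data_gb(20, 5, 0.5))  # the 0.5 ms rerun -> 200.0
```

Going from 5 ms to 0.5 ms is a tenfold increase in sampling rate rather than a doubling, which is at least in the same range as the ~150G reported for the unfinished rerun.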

JananiC_Intel · 04-17-2020

Hi,

Has your query been resolved? Could you please give us an update?

Thanks.

JananiC_Intel · 04-24-2020

Hi,

We are closing this case on the assumption that your issue has been resolved. Please feel free to start a new thread if you have further issues.

Thanks.