We are trying to profile an application with VTune (Intel(R) VTune(TM) Amplifier XE 2013 (build 353306)). It is an MPI application, and for now we are running it as a single-process MPI job.
We tried the snb-access-contention analysis with call stacks (11 GB of data) and without call stacks (18 GB).
I ran them as
Without call stack: amplxe-cl -r snb-access-contention -collect snb-access-contention -data-limit=0
With call stack: amplxe-cl -r snb-access-contention_cs -collect snb-access-contention -knob enable-stack-collection=true -data-limit=0
The log shows it uses the performance counters as follows with sampling rate in brackets
I also used HPCToolkit (http://hpctoolkit.org/) with similar sampling rates, e.g. CPU_CLK_UNHALTED:REF_P(2000000).
The data it collects is only around 1.5 MB; if I enable tracing, which gives a timeline view, it grows to 15 MB.
I then need to create a program structure file of around 25 MB, which can be kept as a common file for different counters.
Is there any hint as to why the data collected by VTune is so much larger by comparison? Ours is a Sandy Bridge machine.
I cannot use the pause/resume API because I cannot change the source code. https://software.intel.com/en-us/articles/how-to-call-resume-and-pause-api-from-fortran-code
I would like to clarify a few points:
1. I think that you are using VTune Amplifier XE 2013 Update 17. I suppose that you have to specify a target application (launch or attach); note that system-wide profiling doesn't support stack sampling.
# amplxe-cl -collect snb-access-contention -knob enable-stack-collection=true -data-limit=0 -- /bin/ls
2. Are you sure that the two sessions ran for the same duration? If so, the result directory with call stacks should be the bigger one. Please check the elapsed time in each summary report.
> We tried snb-access-contention profile with-call-stack(11 GB) and without-call-stack(18GB).
3. If you cannot insert VTune API calls into the code, you can control the collection from another console with "amplxe-cl -command xxxx".
4. I am not familiar with HPCToolkit, so I cannot comment on its result size. But VTune(TM) Amplifier XE collects richer data, so developers should not specify too long a duration.
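As a rough sketch of point 3, something like the following could be run from a second console while the collection is active. The helper name vtune_ctl and the DRY_RUN switch are made up for illustration, and I am assuming that your amplxe-cl build accepts pause, resume, stop and status as -command values and that -r points at the result directory of the running collection; please check "amplxe-cl -help command" on your installation.

```shell
#!/bin/sh
# Hypothetical helper: control a running collection from another console.
# RESULT_DIR must match the -r directory of the running collection.
RESULT_DIR=${RESULT_DIR:-snb-access-contention_cs}

# With DRY_RUN=1 (the default in this sketch) the command is only printed,
# so the script is safe to try on a machine without VTune installed.
vtune_ctl() {
    action=$1   # assumed values: pause | resume | stop | status
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "amplxe-cl -command $action -r $RESULT_DIR"
    else
        amplxe-cl -command "$action" -r "$RESULT_DIR"
    fi
}

vtune_ctl pause    # stop recording samples while the uninteresting phase runs
vtune_ctl resume   # start recording again
```

Set DRY_RUN=0 to actually issue the commands. Pausing during initialization or other uninteresting phases is one way to keep the result directory smaller without touching the source code.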
>> 2. Are you sure that two sessions ran same duration time? If so, the size of result directory with call stack should be bigger. Please check their elapsed time in summary report.
It looks like the stack collection knob selects a different sampling driver, named VTSS or VTSS++. The usual sampling driver is SEP (sampling enabling product). Maybe that is causing the difference in size. The elapsed time is almost the same in both cases (about 1650 sec).
It seems that I can reproduce this "problem". Both sessions had the same elapsed time, and it should not be driver-related. I would expect collection with stacks to gather more data, yet its result directory is smaller. But it does make sense!
A. Sampling with stack collection = regular sampling + stack walk; more time is spent in the ISR, so fewer samples are captured
B. Sampling without stack collection = regular sampling
If you do more work per sample in case A, then within the same elapsed time you get fewer regular samples. That is why the result directory of A is smaller than B's.
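To make the A/B argument concrete, here is a toy model in shell arithmetic. It is only a sketch: all the numbers (a 60 s window, 1000 us between samples, 10 us and 100 us of handler time) are made up for illustration and are not measured VTune values, and the function name samples_in_window is hypothetical. The model assumes each sample costs the sampling interval plus the time spent in the handler.

```shell
#!/bin/sh
# Toy model: samples captured in a fixed wall-clock window when every sample
# costs (interval + handler time). All values are in microseconds.
# usage: samples_in_window WINDOW_US INTERVAL_US HANDLER_US
samples_in_window() {
    echo $(( $1 / ($2 + $3) ))
}

WINDOW_US=60000000   # hypothetical 60 s collection
INTERVAL_US=1000     # hypothetical time between samples

# B: regular sampling only (hypothetical 10 us in the handler)
echo "without stacks: $(samples_in_window "$WINDOW_US" "$INTERVAL_US" 10) samples"
# A: regular sampling + stack walk (hypothetical 100 us in the handler)
echo "with stacks:    $(samples_in_window "$WINDOW_US" "$INTERVAL_US" 100) samples"
```

Under these assumed numbers, case A captures fewer samples than case B in the same window, which is the effect described above. The model says nothing about how many bytes each sample occupies on disk.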
# source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh
Copyright (C) 2009-2014 Intel Corporation. All rights reserved.
Intel(R) VTune(TM) Amplifier XE 2013 (build 350583)
# lsmod | grep vtsspp
vtsspp 356583 0
The vtsspp driver is used only for stack sampling, together with the sep3 driver.
# lsmod | grep sep
sep3_15 520122 0
The sep3 driver alone handles general sampling, without stack collection.
# amplxe-cl -collect snb-access-contention -knob enable-stack-collection=true -r snb-access-contention_cs -duration 60 -- ./mem_demandon
# amplxe-cl -collect snb-access-contention -knob enable-stack-collection=false -r snb-access-contention -duration 60 -- ./mem_demandon
# du -sh snb-access-contention
# du -sh snb-access-contention_cs
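If you want to repeat this comparison on your side, a small script along these lines could print both sizes next to each other and then show the summary reports, where the elapsed time appears. The function names dir_kib and compare_results are made up for this sketch; the directory names are the ones used in the commands above, and I am assuming "amplxe-cl -report summary -r <dir>" is available in your version.

```shell
#!/bin/sh
# Size of a result directory in KiB (du -sk is POSIX).
dir_kib() {
    du -sk "$1" | awk '{print $1}'
}

# Print the size of two result directories and say which one is larger.
compare_results() {
    a=$(dir_kib "$1")
    b=$(dir_kib "$2")
    echo "$1: ${a} KiB"
    echo "$2: ${b} KiB"
    if [ "$a" -gt "$b" ]; then
        echo "larger: $1"
    elif [ "$b" -gt "$a" ]; then
        echo "larger: $2"
    else
        echo "same size"
    fi
}

# Run only when both result directories from the commands above exist.
if [ -d snb-access-contention ] && [ -d snb-access-contention_cs ]; then
    compare_results snb-access-contention snb-access-contention_cs
    # The elapsed time of each run is listed in its summary report.
    if command -v amplxe-cl >/dev/null 2>&1; then
        amplxe-cl -report summary -r snb-access-contention
        amplxe-cl -report summary -r snb-access-contention_cs
    fi
fi
```

Comparing sizes only makes sense when the two elapsed times in the summary reports are close to each other.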