Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Function execution time is wrong!?

S__Sergey
Beginner
725 Views

Hi. I'm very new to VTune. 
I'm trying to examine a simple program with two functions, executed one after another, with the same "for" loop in each (which does some calculations to take more time).

#include <cmath> // pow
#include <omp.h> // omp_set_num_threads

void single(int N, double *a)
{
 double tmp;
 for( int i = 1; i<N; ++i )
 {
  tmp = (i*12+i/5+i/10)/i+i*2;
  for( int j = 0; j<10; ++j )
   tmp = (i*12+i/5+i/10)/i+i*2+pow(j,1.35);
  a[i] = tmp; // keep the result so the loop is not optimized away
 }
}

void multi(int N, double *a)
{
 omp_set_num_threads(4);
 double tmp;
 #pragma omp parallel for private(tmp)
 for( int i = 1; i<N; ++i )
 {
  tmp = (i*12+i/5+i/10)/i+i*2;
  for( int j = 0; j<10; ++j )
   tmp = (i*12+i/5+i/10)/i+i*2+pow(j,1.35);
  a[i] = tmp; // keep the result so the loop is not optimized away
 }
}




According to VTune's Basic Hotspots analysis,
the one with the OpenMP directives (multi) took 2.5 s
and the other one (single) took 1.96 s.

This is strange, because my simple std::clock_t begin, end; measurement shows that the parallel one runs about 2 times faster,
and even VTune's core usage graph shows the same proportion (the moment 4 threads are in use is the moment the multi function executes). But the times say something else!

Is it possible to see the real times (or the real proportion between these two functions)?
9 Replies
S__Sergey
Beginner

Here is the main function
 

int main()
{
	int N = 10000000;
	double *a = new double[N];

	std::clock_t begin, end;

	begin = clock();
	single(N, a);
	end = clock();
	std::cout << "single = " << double(end-begin)/CLOCKS_PER_SEC << " s" << std::endl;

	begin = clock();
	multi (N, a);
	end = clock();
	std::cout << "multi = "  << double(end-begin)/CLOCKS_PER_SEC << " s" << std::endl;

	delete[] a;
	return 0;
}

 

McCalpinJohn
Honored Contributor III

I think that the "CPU Time" that VTune is reporting is (an estimate of) the sum of the CPU times across all cores, not the elapsed wall clock time.   If this is correct, then it would make sense that the OpenMP loop would use more aggregate CPU time than the corresponding serial code.

Another important consideration is that VTune uses sampling rather than interval measurement, so most of the reported values are estimates and are subject to some level of statistical uncertainty. (Presumably the count of samples at each program counter is exact, but going from the number of samples to the fraction of execution time is an estimate.)

VTune does support APIs that allow inline control of monitoring, including the definition of new events.  Presumably this could include wall-clock time measured at specific places in the program.  Documentation of the APIs is at https://software.intel.com/en-us/vtune-amplifier-help-instrumentation-and-tracing-technology-apis
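For reference, marking a function as an ITT task looks roughly like this. This is only a sketch: it assumes the ittnotify header and library that ship with VTune are on the include/link path, and the domain and handle names are made up for illustration:

```cpp
#include <ittnotify.h>  // ships with VTune; link against libittnotify

// Hypothetical domain and task names for this example.
static __itt_domain *domain = __itt_domain_create("Example.Timing");
static __itt_string_handle *hSingle = __itt_string_handle_create("single");

void single_instrumented()
{
    __itt_task_begin(domain, __itt_null, __itt_null, hSingle);
    // ... the original single() loop goes here ...
    __itt_task_end(domain);
}
```

With this in place, VTune can attribute time to the named task in its timeline and grid views.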

 

Dmitry_P_Intel1
Employee

Hello,

John is correct: CPU time aggregates the time spent in all worker threads, while your measurement captures the elapsed time of the code execution.

Since you use Intel OpenMP, one simple way is to switch on the "Analyze OpenMP Regions" knob in the analysis configuration (-knob analyze-openmp=true on the command line). This switches on parallel region instrumentation, and you will be able to see the elapsed time of the OpenMP parallel region and compare it with the single-threaded execution.
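On the command line of that VTune Amplifier generation, the collection Dmitry describes would look something like the following (amplxe-cl was the command-line driver name at the time; the result directory name is arbitrary):

```shell
# Collect Basic Hotspots with OpenMP region analysis enabled
amplxe-cl -collect basic-hotspots -knob analyze-openmp=true \
          -result-dir r001hs -- ./my_app

# Summary report, which includes OpenMP region elapsed time
amplxe-cl -report summary -result-dir r001hs
```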

Thanks & Regards, Dmitry

 

S__Sergey
Beginner

Thanks for your reply. 
I tried to use "Analyze OpenMP Regions".
http://www.picshare.ru/view/8839968/ - screenshot of the analysis settings (the sampling interval is the lowest possible).
Here is the result (unfortunately I couldn't add screenshots directly):
http://www.picshare.ru/view/8839978/

Nothing has changed. 
I suppose it is possible to understand something from the colored view of the functions' execution, but it is still impossible to compare the results of the OpenMP code by execution time.




Dmitry_P_Intel1
Employee

Could you please switch the grouping from "Function/Call Stack" to "OpenMP Region/Function/Call Stack" (this is a drop-down box control above the grid)?

It will give you timing and other metrics by parallel regions.

Thanks & Regards, Dmitry

 

S__Sergey
Beginner

Thanks, I didn't see this option.
Its result is better, but still strange (if I organize the functions in a special way it would be possible to compare them, but not in real situations):
http://www.picshare.ru/view/8840037/
It clearly says that the CPU time for the single function is 2.103 s,
and the elapsed time for the parallel part of the multi function is 3.72 s.
(Should I treat these as two different kinds of time? Then how do I compare them?)

Yes, I could compare the elapsed time inside and outside the OpenMP region, but that is not a comparison between the simple for loop and the parallel one (am I wrong?).

Is this a limitation of parallel programs (that it is impossible to estimate a function's execution time)? (I'm very new to parallelism.)

Thanks


S__Sergey
Beginner

I ran every test in debug mode. Could that be the reason for such results?

Dmitry_P_Intel1
Employee

The reason is that the bottom-up view shows the CPU self-time for the single function (not including the CPU time of pow, etc.).

To see the total CPU time for "single", you can go to the Caller/Callee tab and look at the CPU Time: Total column.

This will not be exactly the elapsed time of the single function, but given that it should have no waits inside, I expect it to be statistically close.

Thanks & Regards, Dmitry

 

S__Sergey
Beginner

Thanks!
