Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Profiling with optimization enabled

petersen__Kenni_dine
2,377 Views

I have encountered a problem when using VTune to find hotspots in a large program I am developing. It appears that VTune is not capable of properly identifying names of functions comprising hotspots and their associated call stacks when the code is compiled with optimization enabled (e.g. /O2).

To illustrate the problem, I have generated a simple example:

#include <stdio.h>
#include <omp.h>

double largesum(int Ncores, long long int Nterms, double* result) {
	double sum = 0;
	double time;
	omp_set_num_threads(Ncores);
	time = omp_get_wtime();
#pragma omp parallel for reduction(+ : sum)
	for (long long int i = 1; i <= Nterms; i++) {
		double id=(double)i;
		sum += 1.0 / (id * id);
	}
	time = omp_get_wtime() - time;
	*result = sum;
	return time;
}

int main()
{
	double result, time;
	int Ncores;	/* largesum() takes the thread count as an int */
	long long int Nterms;
	Ncores = 1;
	Nterms = (long long int)1e8;
	time=largesum(Ncores, Nterms, &result);
	printf("Result: %f\nTime: %f\n", result, time);
}

I.e. the program calculates the value of a large convergent sum with 1e8 terms. The sum is calculated in the function largesum().
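For reference, the sum in the sample is a partial sum of the Basel series, so the printed result can be sanity-checked against a known limit. The remainder estimate below is the standard integral bound, given here as an approximation:

```latex
\sum_{i=1}^{N} \frac{1}{i^{2}} \;\xrightarrow[N \to \infty]{}\; \frac{\pi^{2}}{6} \approx 1.644934,
\qquad
\frac{\pi^{2}}{6} - \sum_{i=1}^{N} \frac{1}{i^{2}} \approx \frac{1}{N}
```

With N = 1e8 the truncation error is therefore on the order of 1e-8.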

I can first compile the code (using Windows 10 and version 19.0/19.1 of the Intel Compiler) without optimization, e.g.:

icl Speedtestconsole.cpp /Od /Qopenmp /Zi

When analysing the result in VTune "Hotspots", it can be seen that largesum() is recognized as a CPU-intensive function:

Udklip.PNG

However, compiling the program with optimization:

icl Speedtestconsole.cpp /O2 /Qopenmp /Zi

results in this:

petersen__Kenni_dine_0-1601635325057.png

 

In this case, the function largesum() is never identified.

I also tried adding more of the recommended compiler settings (/MD, /debug:full or /debug:inline-debug-info), but without any luck. I would really like to be able to profile with optimizations enabled.

Does anybody know what the problem might be?

0 Kudos
1 Solution
18 Replies
RaeesaM_Intel
Moderator
2,352 Views

Hi ,


Thank you for posting in Intel Forum.

We are trying to reproduce the issue from our end. We will get back to you soon.


Raeesa


0 Kudos
Kirill_U_Intel
Employee
2,347 Views

Hi.

It looks like the compiler inlined this function.

I added

__declspec(noinline)
double largesum(int Ncores, long long int Nterms, double* result) {

....

to your sample and got the stack:

Kirill_U_Intel_0-1601906202393.png

 

petersen__Kenni_dine
2,331 Views

Dear Kirill,

 

Thank you for your response. This works and makes good sense.

 

In a more complicated setting, where a program might contain calls to thousands of different functions, what would then be a good profiling approach?

 

I can see that it is possible to disable all inlining with /Ob0 (or -fno-inline), and that does enable me to see the workload of individual functions. However, I imagine that inlining could affect performance quite a bit, and that results from profiling with or without inlining would likely be similarly affected.

Therefore, I think it would be quite meaningful to somehow be able to see the (virtual) workload of inlined functions. Profiling a program with inlining disabled might give a wrong picture of where the workload is located.

 

Does that make sense, or am I asking for something that is simply not possible?

 

Thanks,

Kenni

0 Kudos
Kirill_U_Intel
Employee
2,325 Views
0 Kudos
petersen__Kenni_dine
2,319 Views

Dear Kirill,

I gave it a shot, but I was unable to find the inlined function in VTune. I used the following compiler command:

icl Speedtestconsole.cpp /O2 /debug:inline-debug-info /Qopenmp /Zi

Should this be done differently?

Cheers,

Kenni

0 Kudos
Kirill_U_Intel
Employee
2,314 Views

Hm, I used the same 'icl Speedtestconsole.cpp /O2 /Qopenmp /Zi /debug:inline-debug-info'

Kirill_U_Intel_0-1602071840250.png

 

0 Kudos
petersen__Kenni_dine
2,308 Views

Strange. I tried the exact same compiler command, and VTune shows:

petersen__Kenni_dine_0-1602072750198.png

If I invoke the "__declspec(noinline)" directive and use the same compilation option, I do see the function:

petersen__Kenni_dine_1-1602072869481.png

VTune version 2019 Update 6.

 

 

0 Kudos
Kirill_U_Intel
Employee
2,302 Views

I tried the same VTune version, and the inlined function was in the stack.

It looks like that depends on the compiler.

What is your compiler version?

icl on my side
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.4.228 Build 20190417
Copyright (C) 1985-2019 Intel Corporation. All rights reserved.

 

I think 

0 Kudos
petersen__Kenni_dine
2,291 Views

Hi Kirill,

I have tested this on two different machines with slightly different versions:

#1:

Intel compiler: 19.1.2.254

VTune: 2020 Update 2

#2:

Intel compiler: 19.0.5.281

VTune: 2019 Update 6

 

In both cases, I was unable to see the inlined function with the suggested compiler/linker settings.

 

Cheers,

Kenni

 

0 Kudos
Kirill_U_Intel
Employee
2,279 Views

Hi, Kenni.

Could you attach your binaries (exe + pdb)? I'll try on my side.

Thanks, Kirill

0 Kudos
petersen__Kenni_dine
2,275 Views

Hi Kirill,

Thank you for looking into this.

Binaries are attached.

Cheers,

Kenni

0 Kudos
Kirill_U_Intel
Employee
2,263 Views

The difference is in the VS version.

I tried VS12 and VS14.

It looks like the VS14 linker does not save debug info for inlined functions, so the stack does not include largesum.

Kirill

0 Kudos
petersen__Kenni_dine
2,248 Views

Thanks Kirill,

 

That makes sense. I did not realize until now that the command-line I invoke from the shortcut built by the Intel Parallel Studio sets up the compiler environment to use the VS linker.

 

Since my original problem (unlike the small example considered here) involves a Windows application that I develop using VS, I suppose I will have to wait and see whether outputting inline debug information is something MS decides to fix at some point.

But thank you very much. I still learned a lot.

 

Cheers,

Kenni

Denis_M_Intel
Employee
2,254 Views

The /debug:inline-debug-info option is deprecated; it is only useful with old versions of Visual Studio and the Intel Compiler. The debugging information generated for inline functions by modern versions of VS and the Intel Compilers is not supported by VTune.

BTW, largesum is not a CPU-intensive function in this sample; it has 0 CPU Time (see the CPU Time:Self column). Most of the work is done in the OpenMP parallel region.

0 Kudos
petersen__Kenni_dine
2,238 Views

Hi Denis,

 

Ok, thank you for pointing that out. Is there some other recommended approach than using VTune then?

 

Regarding the function being CPU-intensive: Right, it is not in the presented version, but increase Nterms to, say, (long long int)1e10 or even higher, and it should start to be CPU-intensive, right?

 

BTW, I just tried to run with 1e11 terms in the sum and got this result:

 
 

Unavngivet.png

It is interesting that the type cast is so CPU-intensive compared to the sum-increment line, which involves an addition, a division and a multiplication. However, when optimization is disabled, most of the workload is indeed in the sum-increment line.

 

0 Kudos
Denis_M_Intel
Employee
2,210 Views

I think a separate internal function will be created for the OpenMP loop, largesum$omp$parallel_for, and most of the CPU time will be attributed to it. Increasing Nterms shouldn't affect largesum because the loop with all the computations will be moved to largesum$omp$parallel_for.

0 Kudos
Bernard
Valued Contributor I
2,187 Views

>>>It is interesting that the type cast is so CPU-intensive compared to the sum-increment line which both involves an addition, a division and a multiplication. However, when optimization is disabled, most of the workload is indeed in the sum-increment-line.>>>

You should look at the ICC-generated assembly for those two optimization levels. With such a high loop trip count (1e11), the sampling results should have converged well, so you can look at the assembly and see which machine instructions were marked as hotspot "contributors".

 

0 Kudos
RaeesaM_Intel
Moderator
2,151 Views

Hi,


Glad that your issue got resolved. We are discontinuing monitoring this thread. Please raise a new thread if you have any further queries.


Raeesa


0 Kudos