Profiling with optimization enabled

I have encountered a problem when using VTune to find hotspots in a large program I am developing. It appears that VTune is not capable of properly identifying names of functions comprising hotspots and their associated call stacks when the code is compiled with optimization enabled (e.g. /O2).

To illustrate the problem, I have generated a simple example:

#include <stdio.h>
#include <omp.h>

double largesum(int Ncores, long long int Nterms, double* result) {
	double sum = 0;
	double time;
	omp_set_num_threads(Ncores);
	time = omp_get_wtime();
#pragma omp parallel for reduction(+ : sum)
	for (long long int i = 1; i <= Nterms; i++) {
		double id=(double)i;
		sum += 1.0 / (id * id);
	}
	time = omp_get_wtime() - time;
	*result = sum;
	return time;
}

int main()
{
	double result, time;
	int Ncores = 1;                            // number of OpenMP threads (largesum expects an int)
	long long int Nterms = (long long int)1e8; // number of terms in the sum
	time = largesum(Ncores, Nterms, &result);
	printf("Result: %f\nTime: %f\n", result, time);
	return 0;
}

That is, the program computes a large convergent sum with 1e8 terms. The sum is computed in the function largesum().
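For reference, the loop accumulates a partial sum of the well-known Basel series. The limit is a standard result; the tail estimate is only an approximation for large N:

```latex
S_N = \sum_{i=1}^{N} \frac{1}{i^2},
\qquad
\lim_{N \to \infty} S_N = \frac{\pi^2}{6} \approx 1.644934,
\qquad
\frac{\pi^2}{6} - S_N \approx \frac{1}{N}.
```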

I first compile the code without optimization (on Windows 10, with version 19.0/19.1 of the Intel Compiler), e.g.:

icl Speedtestconsole.cpp /Od /Qopenmp /Zi

When analysing the result in VTune "Hotspots", it can be seen that largesum() is recognized as a CPU-intensive function:

Udklip.PNG

However, compiling the program with optimization:

icl Speedtestconsole.cpp /O2 /Qopenmp /Zi

results in this:

petersen__Kenni_dine_0-1601635325057.png

 

In this case, the function largesum() is never identified.

I also tried adding more of the recommended compiler settings (/MD, /debug:full or /debug:inline-debug-info), but without any luck. I would really like to be able to profile with optimizations enabled.

Does anybody know what the problem might be?

 

 

 

 


18 Replies
Moderator

Hi ,


Thank you for posting in Intel Forum.

We are trying to reproduce the issue from our end. We will get back to you soon.


Raeesa


Employee

Hi.

It looks like the compiler inlined this function.

I added

__declspec(noinline)
double largesum(int Ncores, long long int Nterms, double* result) {

....

to your sample and got the stack:

Kirill_U_Intel_0-1601906202393.png

 


Dear Kirill,

 

Thank you for your response. This works and makes good sense.

 

In a more complicated setting, where a program might contain calls to thousands of different functions, what would then be a good profiling approach?

 

I can see that it is possible to disable all inlining with /Ob0 (or -fno-inline), and that does enable me to see the workload of individual functions. However, I imagine that inlining could affect performance quite a bit, and that profiling results with and without inlining would differ accordingly.

 

Therefore, I think it would be quite meaningful to somehow be able to see the (virtual) workload of inlined functions. Profiling a program with inlining disabled might give a wrong picture of where the workload is located.

 

Does that make sense, or am I asking for something that is simply not possible?

 

Thanks,

Kenni
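One possible middle ground between marking a single function and disabling all inlining with /Ob0 is to force only a handful of suspected hot functions out of line, and only in builds meant for profiling. The following is a minimal sketch under assumptions, not something proposed in the thread: the macro names PROFILE_BUILD and PROFILE_VISIBLE and the two helper functions are hypothetical. `__declspec(noinline)` is the spelling used above for icl/MSVC, and `__attribute__((noinline))` is the GCC/Clang equivalent.

```cpp
// Hypothetical build switch (not from the thread): define PROFILE_BUILD only
// for the binaries you intend to profile.
#if defined(PROFILE_BUILD) && defined(_MSC_VER)
#  define PROFILE_VISIBLE __declspec(noinline)
#elif defined(PROFILE_BUILD) && (defined(__GNUC__) || defined(__clang__))
#  define PROFILE_VISIBLE __attribute__((noinline))
#else
#  define PROFILE_VISIBLE  // regular builds: the optimizer decides about inlining
#endif

// A small helper that an optimizing compiler would normally inline.
PROFILE_VISIBLE double inverse_square(long long i) {
    const double id = static_cast<double>(i);
    return 1.0 / (id * id);
}

// Serial version of the sum, kept out of line in profiling builds so that it
// appears under its own name in the call stack.
PROFILE_VISIBLE double sum_inverse_squares(long long Nterms) {
    double sum = 0.0;
    for (long long i = 1; i <= Nterms; ++i) {
        sum += inverse_square(i);
    }
    return sum;
}
```

The switch would be turned on with /DPROFILE_BUILD (icl, cl) or -DPROFILE_BUILD (GCC, Clang). Forcing functions out of line still changes the generated code, so the timings only approximate those of the fully inlined build.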

Employee

Hi,

To keep optimization enabled and still see inlined functions in the stacks, I suggest using /debug:inline-debug-info to generate debug info for inlined functions.

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performanc...

Kirill


Dear Kirill,

I gave it a shot, but I was unable to find the inlined function in VTune. I used the following compiler command:

icl Speedtestconsole.cpp /O2 /debug:inline-debug-info /Qopenmp /Zi

Should this be done differently?

Cheers,

Kenni

Employee

Hm, I used the same command: 'icl Speedtestconsole.cpp /O2 /Qopenmp /Zi /debug:inline-debug-info'

Kirill_U_Intel_0-1602071840250.png

 


Strange. I used the exact same compiler command, and VTune shows:

petersen__Kenni_dine_0-1602072750198.png

If I add the __declspec(noinline) specifier and use the same compiler options, I do see the function:

petersen__Kenni_dine_1-1602072869481.png

VTune version: 2019 Update 6.

 

 

Employee

I tried the same VTune version, and the inlined function was in the stack.

It looks like this depends on the compiler.

What is your compiler version?

icl on my side:
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.4.228 Build 20190417
Copyright (C) 1985-2019 Intel Corporation. All rights reserved.

 



Hi Kirill,

I have tested this on two different machines with slightly different versions:

#1:
Intel compiler: 19.1.2.254
VTune: 2020 Update 2

#2:
Intel compiler: 19.0.5.281
VTune: 2019 Update 6

 

In both cases, I was unable to see the inlined function with the suggested compiler/linker settings.

 

Cheers,

Kenni

 

Employee

Hi, Kenni.

Could you attach your binaries (exe + pdb)? I'll try on my side.

Thanks, Kirill


Hi Kirill,

Thank you for looking into this.

Binaries are attached.

Cheers,

Kenni

Employee (accepted solution)

The difference is in the Visual Studio version.

I tried VS12 and VS14.

It looks like the VS14 linker does not save debug info for inlined functions, so the stack does not contain largesum.

Kirill

Employee

The /debug:inline-debug-info option is deprecated; it is only useful with old versions of Visual Studio and the Intel Compiler. VTune does not support the debugging information that modern versions of VS and the Intel Compilers generate for inline functions.


BTW, largesum is not a CPU-intensive function in this sample; it has 0 CPU Time (see the CPU Time:Self column). Most of the work is done in the OpenMP parallel region.


Thanks Kirill,

 

That makes sense. I did not realize until now that the command-line I invoke from the shortcut built by the Intel Parallel Studio sets up the compiler environment to use the VS linker.

 

Since my original problem (not the small example considered here) involves a Windows application that I develop using VS, I suppose I will have to wait and see whether emitting inline debug information is something MS decides to fix at some point.

 

But thank you very much. I still learned a lot.

 

Cheers,

Kenni


Hi Denis,

 

Ok, thank you for pointing that out. Is there some other recommended approach than using VTune then?

 

Regarding the function being CPU-intensive: Right, it is not in the presented version, but increase Nterms to, say, (long long int)1e10 or even higher, and it should start to be CPU-intensive, right?

 

BTW, I just tried to run with 1e11 terms in the sum and got this result:

 
 

Unavngivet.png

It is interesting that the type cast is so CPU-intensive compared to the sum-increment line, which involves an addition, a division, and a multiplication. However, when optimization is disabled, most of the workload is indeed in the sum-increment line.

 

Employee

I think a separate internal function will be created for the OpenMP loop, largesum$omp$parallel_for, and most of the CPU time will be attributed to it. Increasing Nterms shouldn't affect largesum itself, because the loop with all the computations is moved into largesum$omp$parallel_for.
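To make the outlining idea concrete, here is a rough hand-written sketch of what such a transformation amounts to. It is an illustration under assumptions, not the code the Intel compiler actually generates: the names largesum_outlined_body and largesum_sketch are hypothetical, and the chunks are processed one after another here, whereas an OpenMP runtime would hand them to different threads.

```cpp
#include <algorithm>

// Hypothetical stand-in for the compiler-generated routine (the post above
// names it largesum$omp$parallel_for). It processes one share [first, last]
// of the iteration space and returns that share's partial sum.
static double largesum_outlined_body(long long first, long long last) {
    double partial = 0.0;
    for (long long i = first; i <= last; ++i) {
        const double id = static_cast<double>(i);
        partial += 1.0 / (id * id);
    }
    return partial;
}

// What is left in the original function: split the range into chunks, pass
// each chunk to the outlined body, and combine the partial sums (the
// reduction). Assumes nchunks >= 1. No arithmetic on the terms happens here,
// which is why a profiler attributes almost no self time to this function.
double largesum_sketch(int nchunks, long long Nterms) {
    double sum = 0.0;
    const long long chunk = (Nterms + nchunks - 1) / nchunks;  // ceiling division
    for (int c = 0; c < nchunks; ++c) {
        const long long first = 1 + c * chunk;
        const long long last = std::min(Nterms, first + chunk - 1);
        if (first > last) break;  // more chunks than terms
        sum += largesum_outlined_body(first, last);
    }
    return sum;
}
```

In a profile of code shaped like this, the samples land in the body routine, which matches the 0 self time reported for largesum.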

Black Belt

>>>It is interesting that the type cast is so CPU-intensive compared to the sum-increment line which both involves an addition, a division and a multiplication. However, when optimization is disabled, most of the workload is indeed in the sum-increment-line.>>>

You should look at the ICC-generated assembly for those two optimization levels. With such a high loop trip count (1e11), the sampling results should be statistically reliable, so you can look at the assembly and see which machine instructions were marked as hotspot "contributors".

 

Moderator

Hi,


Glad that your issue got resolved. We are discontinuing monitoring this thread. Please raise a new thread if you have any further queries.


Raeesa

