Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)
5248 Discussions

Profiling with optimization enabled

petersen__Kenni_dine
5,505 Views

I have encountered a problem when using VTune to find hotspots in a large program I am developing. It appears that VTune is not capable of properly identifying names of functions comprising hotspots and their associated call stacks when the code is compiled with optimization enabled (e.g. /O2).

To illustrate the problem, I have generated a simple example:

#include <stdio.h>
#include <omp.h>

double largesum(int Ncores, long long int Nterms, double* result) {
	double sum = 0;
	double time;
	omp_set_num_threads(Ncores);
	time = omp_get_wtime();
#pragma omp parallel for reduction(+ : sum)
	for (long long int i = 1; i <= Nterms; i++) {
		double id=(double)i;
		sum += 1.0 / (id * id);
	}
	time = omp_get_wtime() - time;
	*result = sum;
	return time;
}

int main()
{
	double result, time;
	int Ncores;  // largesum() takes the thread count as an int
	long long int Nterms;
	Ncores = 1;
	Nterms = (long long int)1e8;
	time=largesum(Ncores, Nterms, &result);
	printf("Result: %f\nTime: %f\n", result, time);
}

That is, the program computes a large convergent sum with 1e8 terms. The sum is calculated in the function largesum().
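
For reference, a note on the expected output (not part of the original post): the loop computes a partial sum of the Basel series, so with 1e8 terms the printed result should be close to pi^2/6, leaving floating-point rounding aside. Here N stands for Nterms, and the tail bound comes from comparing the remaining terms with an integral:

```latex
S_N = \sum_{i=1}^{N} \frac{1}{i^2}, \qquad
\lim_{N\to\infty} S_N = \frac{\pi^2}{6} \approx 1.644934, \qquad
\frac{1}{N+1} < \frac{\pi^2}{6} - S_N < \frac{1}{N}
```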

I can first compile the code (using Windows 10 and version 19.0/19.1 of the Intel Compiler) without optimization, e.g.:

icl Speedtestconsole.cpp /Od /Qopenmp /Zi

When analysing the result in VTune "Hotspots", it can be seen that largesum() is recognized as a CPU-intensive function:

Udklip.PNG

However, compiling the program with optimization:

icl Speedtestconsole.cpp /O2 /Qopenmp /Zi

results in this:

petersen__Kenni_dine_0-1601635325057.png

 

In this case, the function largesum() is never identified.

I also tried adding more of the recommended compiler settings (/MD, /debug:full or /debug:inline-debug-info), but without any luck. I would really like to be able to profile with optimizations enabled.

Does anybody know what the problem might be?

 

 

 

 

0 Points
18 Replies
RaeesaM_Intel
Moderator
5,480 Views

Hi ,


Thank you for posting in Intel Forum.

We are trying to reproduce the issue from our end. We will get back to you soon.


Raeesa


0 Points
Kirill_U_Intel
5,475 Views

Hi.

It looks like the compiler inlined this function.

I added __declspec(noinline) to your sample:

__declspec(noinline)
double largesum(int Ncores, long long int Nterms, double* result) {

....

and got the stack:

Kirill_U_Intel_0-1601906202393.png

 

petersen__Kenni_dine
5,459 Views

Dear Kirill,

 

Thank you for your response. This works and makes good sense.

 

In a more complicated setting, where a program might contain calls to thousands of different functions, what would then be a good profiling approach?

 

I can see that it is possible to disable all inlining with /Ob0 (or -fno-inline), and that does enable me to see the workload of individual functions. However, I imagine that inlining could affect performance quite a bit, and that results from profiling with or without inlining would likely be similarly affected.

 

Therefore, I think it would be quite meaningful to somehow be able to see the (virtual) workload of inlined functions. Profiling a program with inlining disabled might give a wrong picture of where the workload is located.

 

Does that make sense, or am I asking for something that is simply not possible?

 

Thanks,

Kenni

0 Points
Kirill_U_Intel
5,453 Views
0 Points
petersen__Kenni_dine
5,447 Views

Dear Kirill,

I gave it a shot, but I was unable to find the inlined function in VTune. I used the following compiler command:

icl Speedtestconsole.cpp /O2 /debug:inline-debug-info /Qopenmp /Zi

Should this be done differently?

Cheers,

Kenni

0 Points
Kirill_U_Intel
5,442 Views

Hm, I used the same command: 'icl Speedtestconsole.cpp /O2 /Qopenmp /Zi /debug:inline-debug-info'

Kirill_U_Intel_0-1602071840250.png

 

0 Points
petersen__Kenni_dine
5,436 Views

Strange. I tried the exact same compiler command, and VTune shows:

petersen__Kenni_dine_0-1602072750198.png

If I add the "__declspec(noinline)" specifier and use the same compiler options, I do see the function:

petersen__Kenni_dine_1-1602072869481.png

VTune version: 2019 Update 6.

 

 

0 Points
Kirill_U_Intel
5,430 Views

I tried the same VTune version, and the inlined function was in the stack.

It looks like this depends on the compiler.

What is your compiler version?

icl on my side:
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.4.228 Build 20190417
Copyright (C) 1985-2019 Intel Corporation. All rights reserved.

 

I think 

0 Points
petersen__Kenni_dine
5,419 Views

Hi Kirill,

I have tried testing this on two different machines with slightly different versions:

#1:

Intel compiler: 19.1.2.254

VTune: 2020 Update 2

 

#2:

Intel compiler: 19.0.5.281

VTune: 2019 Update 6

 

In both cases, I was unable to see the inlined function with the suggested compiler/linker settings.

 

Cheers,

Kenni

 

0 Points
Kirill_U_Intel
5,407 Views

Hi, Kenni.

Could you attach your binaries (exe + pdb)? I'll try on my side.

Thanks, Kirill

0 Points
petersen__Kenni_dine
5,403 Views

Hi Kirill,

Thank you for looking into this.

Binaries are attached.

Cheers,

Kenni

0 Points
Kirill_U_Intel
5,391 Views

The difference is in the VS version.

I tried VS12 and VS14.

It looks like the VS14 linker does not save debug info for inlined functions, so the stack does not include largesum.

Kirill

0 Points
petersen__Kenni_dine
5,376 Views

Thanks Kirill,

 

That makes sense. I did not realize until now that the command-line I invoke from the shortcut built by the Intel Parallel Studio sets up the compiler environment to use the VS linker.

 

Since my original problem (and not the small example considered here) involves a Windows application that I develop using VS, I suppose I will have to wait and see if outputting inline debug information is something MS decides to fix at some point.

 

But thank you very much. I still learned a lot.

 

Cheers,

Kenni

Denis_M_Intel
Employee
5,382 Views

The /debug:inline-debug-info option is deprecated; it is only valuable with old versions of Visual Studio and the Intel Compiler. The debugging information generated for inline functions by modern versions of VS and the Intel Compilers is not supported by VTune.


BTW, largesum is not a CPU-intensive function in this sample; it has 0 CPU Time (see the CPU Time:Self column); most of the work is done in the OpenMP parallel region.

0 Points
petersen__Kenni_dine
5,366 Views

Hi Denis,

 

Ok, thank you for pointing that out. Is there some other recommended approach than using VTune then?

 

Regarding the function being CPU-intensive: Right, it is not in the presented version, but increase Nterms to, say, (long long int)1e10 or even higher, and it should start to be CPU-intensive, right?

 

BTW, I just tried to run with 1e11 terms in the sum and got this result:

 
 

Unavngivet.png

It is interesting that the type cast is so CPU-intensive compared to the sum-increment line, which involves an addition, a division and a multiplication. However, when optimization is disabled, most of the workload is indeed in the sum-increment line.

 

0 Points
Denis_M_Intel
Employee
5,338 Views

I think a separate internal function will be created for the OpenMP loop, largesum$omp$parallel_for, and most of the CPU time will be attributed to it. Increasing Nterms shouldn't affect largesum because the loop with all the computations will be moved to largesum$omp$parallel_for.

0 Points
Bernard
Valued Contributor I
5,315 Views

>>>It is interesting that the type cast is so CPU-intensive compared to the sum-increment line which both involves an addition, a division and a multiplication. However, when optimization is disabled, most of the workload is indeed in the sum-increment-line.>>>

You should look at the ICC-generated assembly for those two optimization levels. With such a high loop trip count (1e11), the sampling statistics should be precise enough that you can look at the assembly and see which machine instructions were marked as hotspot "contributors".

 

0 Points
RaeesaM_Intel
Moderator
5,279 Views

Hi,


Glad that your issue got resolved. We are discontinuing monitoring this thread. Please raise a new thread if you have any further queries.


Raeesa


0 Points