Incoherent results from HPC Characterization analysis

cisternino__marco · ‎10-11-2018

Hi everybody,

I'm having some trouble understanding the VTune HPC Characterization analysis results for my CFD code. Because the code is large, I isolated one of its parallel regions and ran the same analysis on this small portion, looping over it to get enough wall time.

The small toy code follows (the code does nothing useful, but its analysis looks much like that of the full code):

#include <omp.h>
#include <iostream>
#include <stdio.h>

int main() {
    size_t size = 1000000000;
    double * __restrict__ fluxesDataPtr = new double[size];
    double * __restrict__ advFluxesDataPtr = new double[size];
    double * __restrict__ difFluxesDataPtr = new double[size];
    double * __restrict__ coefficientsDataPtr = new double[size];

    for (int k = 0; k < 10; ++k) {
        #pragma omp parallel //num_threads(4)
        {
            //double wtime = omp_get_wtime();
            #pragma omp for
            for (size_t i = 0; i < size; ++i) {
                //std::cout << "thread " << omp_get_thread_num() << " " << i << std::endl;
                fluxesDataPtr[i] = coefficientsDataPtr[i] * (advFluxesDataPtr[i] - difFluxesDataPtr[i]);
            }
            //wtime = omp_get_wtime() - wtime;
            //printf( "Time taken by thread %d is %f\n", omp_get_thread_num(), wtime );
        }
        printf( "Useless print %f\n", fluxesDataPtr[0]);
    }
    return 0;
}

Note also that when I measure wall time "by hand", the code scales almost perfectly from 1 to 4 threads.

But the VTune HPC Characterization analysis gives what you can see in the attached image.

Just to be clear, the analysis ran on 4 threads of a Haswell i7-4700HQ with Ubuntu 14.04 and kernel 4.4.0-137, and the code was compiled with icpc (ICC) 18.0.3 20180410 using the -qopenmp flag.

Several things look weird: Why, given that the code scales, does the fourth thread do nothing? Why does no thread do anything after 1.5 sec? What is the relationship between elapsed time and the times in the CPU Time column? Why is there no Spin Time even though all threads are idle for most of the elapsed time?

Finally, please consider this small code as part of a much bigger one, and feel free to ask for any information you need to better understand my problem.

Any help is really appreciated.

Thanks,

Marco

Vladimir_R_Intel · ‎10-12-2018

Hi Marco,

It looks like an issue with the collector. Could you please send us your result, or at least your analysis settings?

Also, did you install the VTune collector drivers, or are you using driverless analysis?

If you didn't install the drivers, I suggest installing them and repeating your experiment with the HPC analysis without stacks.

BR,

Vladimir

TimP · ‎10-12-2018

After dealing with Vladimir's hints, you may try to examine each thread individually. In view of the large time spent in vmlinux, some of the threads may be incurring large spin times, possibly with the critical path being on one thread which does useful work for the entire interval. That leads to more investigation of work balance and trying to determine if your toy case represents the real workload. If you are looping over the same memory region, you might expect cache to be more effective in the toy version or even that not all threads are doing their share of the reduced memory access work.

cisternino__marco · ‎10-12-2018

Hi Vladimir and Tim and thank you for your quick reply.

First of all, the results are attached to this comment. I am posting both the HPC analysis (results.tar.gz) and the Advanced Hotspots one (resultsAH.tar.gz).

Secondly, I installed the System Studio suite without explicitly installing the VTune Sampling Driver. To check whether the System Studio installer did the job for me, I ran the amplxe-self-checker.sh script; its output is attached to this comment. Just two remarks on this output:

- I cannot disable Hyper-Threading from the BIOS, but I emulated that by taking 4 of my 8 logical CPUs offline, and the analysis results are almost the same.

- I have neither a debug kernel nor kernel symbol tables, as you can see from the selfCheckerLog.txt file.

Finally, I can't tell whether the sampling drivers are installed, but the check script doesn't complain about them being missing. What I can see is that the "system_studio_2018/vtune_amplifier_2018.3.0.566015/sepdk" folder has 4 subfolders: include, prebuilt, src and vtune-layer. I would guess the drivers are not installed, but I cannot be sure.

@Tim: as you can see from the Effective CPU Utilization plot in the Advanced Hotspots analysis, the code uses 4 threads almost all the time and spin time is low. But in the HPC analysis the same code seems to do nothing on 4 threads for the last 80% of the CPU time, with the 4th thread doing nothing at all. This looks quite weird to me.

Thanks again to everybody. If you need more, please don't hesitate to ask.

Marco

Dmitry_P_Intel1 · ‎10-12-2018

Hello Marco,

Thank you for the results - they were helpful.

The Intel sampling driver is loaded, and we can see that the Advanced Hotspots result is reasonable.

In the HPC Performance result we can see that the sampling collector stopped for some reason after 2.4 seconds. The remaining User API collector continued to collect OpenMP regions etc. That is why we don't see any CPU samples after 2.4 seconds. Is the behavior with HPC Performance reproducible? If so, as a workaround I recommend switching on call stack collection (it uses a different driver). Meanwhile, if it is reproducible, we will still need to investigate why the first collector stops unexpectedly.

In the Advanced Hotspots result we can see that the imbalance is too small to be worth tuning (the wall time impact is 0.33 sec).

What I observed in the AH result is that at times two worker threads are scheduled on logical cores of the same physical core. That is not necessarily bad, but I would set affinity like this:

export OMP_PLACES=cores

export OMP_PROC_BIND=close

to avoid this.

cisternino__marco · ‎10-12-2018

Vladimir,

I ran an HPC analysis after exporting the variables you suggested, with only the OpenMP region analysis enabled.

Now I can understand the results, attached as resultsHPC.tar.gz.

I ran the same analysis with memory bandwidth analysis enabled (resultsHPCBandwidth.tar.gz). I think the results are almost the same.

I also enabled stack collection (resultsHPCStack.tar.gz). The original issue did not reappear, but I see differences relative to the previous analyses in this comment.

However, I would need stacks collections, what can I do to have reliable results?

Thanks a lot,

Marco

Vladimir_R_Intel · ‎10-15-2018

Hi Marco,

Since I don't see any difference between the settings of the HPC result showing the original issue and those of the resultsHPC.tar.gz you sent with your last message, I conclude that it was a sporadic issue.

Regarding your question:

>>However, I would need stacks collections, what can I do to have reliable results?

The only things I can suggest here are:

1. Use the latest VTune version; currently that is 2019 Gold, and 2019 U1 will be available soon.

2. Identify an area for possible improvement based on the common results. You currently have HPC and Advanced Hotspots results, which is a good entry point. Then dive into this area with all unnecessary features switched off (to reduce possible noise). Looking at your results, I would focus on hardware utilization (general-exploration analysis and advanced-hotspots analysis with stacks).

BR,

Vladimir