Hagen_G_ · ‎01-16-2015

Hi,

We regularly use VTune to optimize our software. We recently switched from linking against Qt4 to linking against Qt5. Right after this change, Advanced Hotspots analyses of some of our software workflows started taking significantly longer. Example runtimes for exactly the same workflow:

Qt4: 14 s runtime without VTune attached, 48 s runtime with VTune attached
Qt5: 18 s runtime without VTune attached, 4 minutes with VTune attached

So the analysis runtime increases by a factor of 5. (By "analysis runtime" I mean the time that elapses between pressing the start and the stop button.)
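As a quick check of the figures above, the overhead can be expressed as the ratio of profiled runtime to plain runtime. This is only a small arithmetic sketch; the function name `overheadFactor` is made up for illustration, and the 4 minutes are taken as 240 s.

```cpp
// Ratio of the runtime with VTune attached to the runtime without it.
// Both arguments are in seconds.
constexpr double overheadFactor(double plainSeconds, double profiledSeconds)
{
    return profiledSeconds / plainSeconds;
}

// Qt4: overheadFactor(14.0, 48.0)   is roughly 3.4
// Qt5: overheadFactor(18.0, 240.0)  is roughly 13.3
// Qt5 vs. Qt4 analysis runtime: overheadFactor(48.0, 240.0) is 5
```

So the plain runtime grows by less than a third, while the profiling overhead factor grows roughly fourfold.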

I am currently analysing why our software runs slower with Qt5. First hint is (I have found this with VTune) that the QMutex implementation in Qt5 is completely different from Qt4 (e.g. there is no spinning anymore before locking a mutex).
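For readers unfamiliar with the term, "spinning before locking" can be sketched roughly as follows. This is a minimal illustration using `std::mutex`, not the Qt4 or Qt5 implementation; the helper name `spinThenLock` and the `spinLimit` default are assumptions invented for this example.

```cpp
#include <mutex>
#include <thread>

// Hypothetical helper, NOT Qt code: try to take the mutex a bounded number
// of times without blocking, and only then fall back to a blocking lock().
// spinLimit is an arbitrary illustrative value.
inline void spinThenLock(std::mutex& m, int spinLimit = 1000)
{
    for (int i = 0; i < spinLimit; ++i) {
        if (m.try_lock())
            return;                 // acquired without blocking
        std::this_thread::yield();  // let the current owner make progress
    }
    m.lock();                       // still contended: block until available
}
```

A caller would pair this with `m.unlock()`, or adopt the lock via `std::lock_guard<std::mutex> guard(m, std::adopt_lock)`. The idea is that short critical sections are often released during the spin phase, which avoids putting the waiting thread to sleep.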

Has anyone else experienced similar slowdowns, especially concerning Qt4/Qt5? Can anyone suggest useful analysis steps? (What I am currently trying: patching the Qt4 mutex into Qt5 to see what happens. First result: the VTune runtime is still slower with the patched Qt5.)

Best regards, Hagen

Hagen_G_ · ‎01-16-2015

By the way: I am using VTune XE 2015 build 367959

Peter_W_Intel · ‎01-16-2015

I'm not familiar with Qt. In my view, if the Qt5 build already runs 4 s longer than the Qt4 build without VTune, VTune might add more overhead with Qt5 than with Qt4, since VTune instruments QMutex to trace spin time / overhead time. You can verify:

1. Top-down report: see how much CPU time comes from your code, from QMutex, and from VTune itself (columns "CPU time Self" and "CPU time total").

2. Use the Locks and Waits analysis to see the wait time / wait count of sync objects.

Do you know whether a lock-free mechanism is supported on Haswell processors? I tried this in a simple demo and performance improved further; read this article. If you don't use the Intel compiler on a Haswell processor, you might use a lightweight mutex; for example, the queuing mutex from Intel TBB performs well. Another thought is to reduce the amount of global state you protect, for example by using atomics on individual elements.
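To make the last suggestion concrete, here is a minimal sketch of replacing a mutex-protected shared counter with an atomic. It uses only the C++11 standard library; the function name `countWithAtomic` and the counter scenario are assumptions for illustration and are not taken from the application discussed here.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical example: several threads increment one shared counter.
// std::atomic makes each increment indivisible, so no mutex is needed.
long countWithAtomic(int threadCount, int incrementsPerThread)
{
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&counter, incrementsPerThread] {
            for (int i = 0; i < incrementsPerThread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers)
        w.join();              // join() makes all increments visible here
    return counter.load();
}
```

This only works when the protected state is a single value that can be updated in one atomic operation; invariants spanning several variables still need a lock.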

Vitaly_S_Intel · ‎01-17-2015

Which analysis type are you using within VTune?

I suppose Advanced Hotspots itself doesn't cause that slowdown; can you confirm? Please try with and without stacks analysis.

Hagen_G_ · ‎01-20-2015

Thanks to Peter and Vitaly for the hints.

In the meantime I have investigated whether using the Qt4 mutex implementation in Qt5 has any effect. Result: no effect, so Qt5's new mutex implementation is not the reason for my slowdowns.

I have 2 problems:

(1) Problem not related to VTune: My software runs slower with Qt5 than with Qt4. @Peter: Yes, I will look at the analysis results again and try to find out more... I'm using an Ivy Bridge CPU, no Haswell.

(2) Problem related to VTune: A VTune Advanced Hotspots analysis takes considerably longer when I use Qt5 instead of Qt4. @Vitaly: I have tried with and without stacks analysis (using another PC, not the same one used for my first posting). Result: without stacks analysis 20 s, with stacks analysis 5 min 20 s (!) => Yes, it's really the stack tracing which is the bottleneck here. Thanks for the hint. It probably makes sense that tracing Qt5 stacks is just more complicated than tracing Qt4 stacks (the Qt5 project is considerably more complex here and there), although I wouldn't have expected such a big difference. Btw: I used a sampling interval of 10 ms. I don't know, maybe VTune has an O(n²) problem concerning the depth of call stacks?

Vitaly_S_Intel · ‎01-21-2015

Hi Hagen!

Advanced Hotspots (AH) with stacks obviously causes higher collection overhead than AH without stacks. Moreover, it depends on the application. May I ask you to run one more experiment to track down the overhead? You'll need to upgrade to Update 1 for it, though.

- Press the "New Analysis" button, select Advanced Hotspots, then select "Hotspots, stacks and context switches"

- Press the "Copy" button in the top right corner; this creates a custom analysis

- Scroll down the options list and unselect the "Collect context switches" option

- Press OK, launch your custom analysis, and measure the application runtime

Is it still too high?

Hagen_G_ · ‎01-28-2015

Hi Vitaly,

thanks for the hints! Here are my results (for Qt5):

- VTune 2015, with call stacks: 320 s

- VTune 2015, without call stacks: 20 s

- VTune 2015 Update 1, with call stacks and context switches: 305 s

- VTune 2015 Update 1, with call stacks, but without context switches: 45 s

This tells me that I have a thread/context switching problem. I will investigate further into that direction.
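The measurements above can be broken down by simple subtraction. The sketch below is only arithmetic on the four numbers; the struct and function names are made up for illustration, and comparing the 20 s run (VTune 2015) with the Update 1 runs mixes two tool versions, so the stack-collection figure is approximate.

```cpp
// Hypothetical container for the runtimes reported above, in seconds.
struct Runtimes {
    double stacksAndSwitches;  // Update 1, call stacks + context switches
    double stacksOnly;         // Update 1, call stacks, no context switches
    double noStacks;           // VTune 2015, no call stacks
};

// Extra time that appears only when context switches are collected.
constexpr double contextSwitchCost(const Runtimes& r)
{
    return r.stacksAndSwitches - r.stacksOnly;
}

// Extra time from stack collection alone (cross-version, so approximate).
constexpr double stackCost(const Runtimes& r)
{
    return r.stacksOnly - r.noStacks;
}

// With Runtimes{305.0, 45.0, 20.0}:
//   contextSwitchCost is 260 s, stackCost is 25 s
```

Roughly 260 of the 305 seconds disappear once context-switch collection is turned off, which is consistent with suspecting heavy thread/context switching in the Qt5 build.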

Thanks again and best regards, Hagen

Significant analysis time increase after switching from Qt4 to Qt5