Tuning Advice given by Intel VTune for tbb.dll and tbbmalloc.dll

Shankar1 · ‎11-23-2009

I sampled my application with the Intel VTune profiler; the results it gave are below.

My application consists of an application.dll and a Test.exe, both of which make extensive use of tbb.dll and tbbmalloc.dll. I use the task interface, concurrent_queue and concurrent_hash_map from tbb.dll, and cache_aligned_allocator from tbbmalloc.dll. I am using the tbb22_20090809oss version.

Here are the results the Intel Tuning Assistant gave:
Process/Module Summary (Process: test.exe, Module: msvcr80.dll, RVA: 0x23ed-0x504a7)
CPU_CLK_UNHALTED.CORE: 18,501,600,000

Time Statistics
CPU_CLK_UNHALTED.CORE: 18,501,600,000 events
Processor Time: 7.73 sec
Accounts for 44.48% (workload)

Process/Module Summary (Process: test.exe, Module: test.exe, RVA: 0x10b0-0x7f06)
CPU_CLK_UNHALTED.CORE: 18,400,800,000

Time Statistics
CPU_CLK_UNHALTED.CORE: 18,400,800,000 events
Processor Time: 7.69 sec
Accounts for 44.24% (workload)

Process/Module Summary (Process: test.exe, Module: application.dll, RVA: 0x1017-0x488a)
CPU_CLK_UNHALTED.CORE: 225,600,000

Time Statistics
CPU_CLK_UNHALTED.CORE: 225,600,000 events
Processor Time: 0.094 sec
Accounts for 0.54% (workload)

Other Possible Problems
CPI (Cycles Per retired Instruction) is poor: 1.92 clockticks per instructions retired

Process/Module Summary (Process: test.exe, Module: tbb.dll, RVA: 0x91d0-0x1de61)
CPU_CLK_UNHALTED.CORE: 1,156,800,000

Time Statistics
CPU_CLK_UNHALTED.CORE: 1,156,800,000 events
Processor Time: 0.48 sec
Accounts for 2.78% (workload)

Other Possible Problems
CPI (Cycles Per retired Instruction) is poor: 2.8 clockticks per instructions retired

Process/Module Summary (Process: test.exe, Module: tbbmalloc.dll, RVA: 0x1904-0x4139)
CPU_CLK_UNHALTED.CORE: 206,400,000

Time Statistics
CPU_CLK_UNHALTED.CORE: 206,400,000 events
Processor Time: 0.086 sec
Accounts for 0.5% (workload)

Other Possible Problems
Branch mispredictions impact performance: 15.29 % cycles spent in branch misprediction recovery

Advice:
Use the precise events to focus on instructions of interest.
Eliminate branches
Use constants rather than variables or parameters
Improve branch predictability.
Compile with the Interprocedural Optimizations (IPO) switch
Compile with the Profile-guided Basic-block Optimization.
Consider assembly-level branch-prediction tuning.
Measure events required to compute advanced event ratios.

CPI (Cycles Per retired Instruction) is poor: 3.07 clockticks per instructions retired
Advice:
Measure events required to compute advanced event ratios.

Many L2 cache demand misses: 0.0081 L2 cache demand misses per instruction retired
Advice:
Use the precise events to focus on instructions of interest.
Improve data locality, if possible.
Consume data in chunks that fit in the L2 cache.
Better exploit the hardware prefetchers.
Use software prefetching.

Many L2 data cache misses: 0.022 L2 cache misses per instruction retired
Advice:
Use the precise events to focus on instructions of interest.
Improve data locality, if possible.
Consume data in chunks that fit in the L2 cache.
Better exploit the hardware prefetchers.

Many TLB misses: 7.31 % cycles spent on TLB misses
Advice: Measure events required to understand the type of TLB misses.

As seen in the results, the Intel VTune profiler doesn't have much advice for my application.dll and Test.exe, but it gives a lot of advice on tbbmalloc.dll and tbb.dll. Does this have anything to do with the way I am using TBB?

What is CPI (cycles per retired instruction)? What value of it is not considered poor?

Why does it report such a large branch misprediction impact in tbbmalloc.dll? Are the values for L2 cache demand misses, L2 data cache misses and TLB misses acceptable? Also, are L2 cache demand misses and L2 data cache misses related to each other?

Dmitry_Vyukov · ‎11-23-2009

Quoting - Shankar

What is CPI(cycles per retired instruction)? what value of it is not poor?

It's just how many cycles the CPU spends per instruction on average.
Modern Intel CPUs are able to retire (execute) some 3-4 instructions per cycle, so the ideal CPI may be as low as 0.33-0.25.
I guess a CPI of 1 is OK for most applications.
A CPI of 3 is rather high. It means that the CPU retires 1 instruction every 3 cycles on average, i.e. the CPU runs at roughly 1/10 of its full speed. It suggests that there is a significant number of cache misses, TLB misses, mispredicted branches, etc.
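
To make the arithmetic above concrete, here is a minimal sketch of how CPI and the "fraction of full speed" figure can be computed. The function names `cpi` and `fraction_of_peak` are made up for illustration and are not part of VTune or TBB; the ideal CPI passed in is an assumption about the processor's retirement width.

```cpp
#include <cstdint>

// Cycles per retired instruction: clockticks divided by instructions retired.
double cpi(std::uint64_t clockticks, std::uint64_t instructions_retired) {
    return static_cast<double>(clockticks) /
           static_cast<double>(instructions_retired);
}

// Fraction of peak throughput, given an assumed ideal CPI
// (0.25 if the core can retire 4 instructions per cycle, 1/3 for 3-wide).
double fraction_of_peak(double measured_cpi, double ideal_cpi) {
    return ideal_cpi / measured_cpi;
}
```

With a measured CPI of 3, an assumed ideal of 0.25 gives about 0.083 of peak and an assumed ideal of 1/3 gives about 0.11, which is where the rough "1/10 of full speed" comes from.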

Shankar1 · ‎11-23-2009

OK. Is there anything I can do to reduce those numbers? I mean, can I influence them by using the TBB cache_aligned_allocator in a specific way, or is it something that TBB has to address?

Dmitry_Vyukov · ‎11-23-2009

It highly depends.

High CPI does not necessarily mean there is a problem. For example, some optimization may increase CPI by a factor of 2 but at the same time reduce the total number of instructions by a factor of 4, so you get a 2x improvement in total.
Moreover, code that deals with inter-thread synchronization has high CPI basically by definition, because synchronization is costly: there are always cache misses (cache line transfers between cores) and atomic RMW operations that take up to 50 cycles each.

You may optimize your own code (which probably does not deal with inter-thread synchronization directly), but high CPI is not reported for your module test.exe. High CPI is reported for application.dll, which I guess is also your module, but it accounts for only 0.54% of execution...

There are always some ways to optimize performance further, but they are highly dependent on the particular application.
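
The point about CPI versus instruction count can be checked with a small sketch. It assumes the simple model that total cycles equal instructions retired times CPI; the function names and the figures used below are illustrative only, not measurements from this thread.

```cpp
// Total cycles under the simple model: instructions retired times CPI.
double total_cycles(double instructions, double cpi) {
    return instructions * cpi;
}

// Speedup of the "after" version relative to the "before" version.
double speedup(double instr_before, double cpi_before,
               double instr_after, double cpi_after) {
    return total_cycles(instr_before, cpi_before) /
           total_cycles(instr_after, cpi_after);
}
```

For example, going from 1e9 instructions at CPI 1.0 to 2.5e8 instructions at CPI 2.0 doubles the CPI yet gives a speedup of 2, so CPI is best read together with the cycle count rather than on its own.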

Shankar1 · ‎11-23-2009

Yes, application.dll is my module, and I have used atomic RMW operations (fetch_and_increment and fetch_and_decrement) to serialize the execution of a function. That, I guess, explains the high CPI.

Thank you for all this information, which I didn't know earlier.
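
For readers wondering what such a guard might look like, here is a rough sketch. It is a guess at the pattern, not the actual code from application.dll: it uses `std::atomic` with `fetch_add` and `fetch_sub` as stand-ins for `fetch_and_increment` and `fetch_and_decrement` on `tbb::atomic`, and the names `g_entered`, `g_runs` and `try_run_exclusive` are hypothetical.

```cpp
#include <atomic>

// Hypothetical guard: number of threads that have entered the function.
std::atomic<int> g_entered{0};
// Stand-in for the serialized work; only touched while holding the guard.
int g_runs = 0;

// Returns true if this call ran the body, false if another caller was
// already inside (in which case the body is skipped, not queued).
bool try_run_exclusive() {
    // fetch_add returns the previous value, like fetch_and_increment.
    bool first = (g_entered.fetch_add(1) == 0);
    if (first) {
        ++g_runs;  // the serialized work would go here
    }
    g_entered.fetch_sub(1);  // counterpart of fetch_and_decrement
    return first;
}
```

Every call performs two atomic read-modify-write operations on the same cache line, and under contention that line moves between cores, which fits the explanation given above for the high CPI in such code.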