
Doubled thread cycles with hyperthreading enabled (Core i7)

lkleen

I'm currently comparing the performance of an application running 8 audio threads on 8 logical cores with the performance of the same project running on 4 cores with hyperthreading disabled. With hyperthreading enabled, the audio processing gains a speedup of about 7%. When analyzing the project with VTune, I measure a CPI of 0.47 with hyperthreading disabled but 0.86 with it enabled.

When I run the same project with 4 audio threads, the CPI is 0.47 on 4 cores and 0.54 on 8 logical cores, while the processing time gets worse on 8 logical cores. This is the relationship I would expect; the other values confuse me. How is the value of CPU_CLK_UNHALTED.THREAD determined on a system with hyperthreading enabled?

Shannon_C_Intel

Hello,

This is a great question! While Hyperthreading usually helps performance, as you have seen it can complicate performance analysis. All the CPIs you measured are correct. The problem is that with Hyperthreading enabled, there are actually 2 different definitions of CPI. We call them "Per Thread CPI" and "Per Core CPI". The 0.47 CPI you measured with HT disabled is a per core CPI. The 0.86 you measured with HT enabled is a per thread CPI. (More specifically, with HT disabled, per core CPI and per thread CPI are the same, so the 0.47 CPI you measured in the non-HT case is both a per core and a per thread CPI - but we call it "per core".)

The CPU_CLK_UNHALTED.THREAD counter measures clockticks on a per thread basis: each logical processor keeps its own count. So when HT is enabled and both hardware threads on a core are busy, each tick of the core's clock is counted twice in the aggregate, once per thread; when HT is disabled it is counted once. In your second case, with HT enabled, sampling with this event therefore gives you 2 clockticks for every 1 actual elapsed cycle. In the same period of wall-clock time, your total unhalted cycles are double what they would be in non-HT mode. From a thread point of view, the CPI is higher with HT on (it does take more clocks to process an instruction, because the two threads share the core's execution resources). From a core point of view, if HT helps your application, as it does in your case, the CPI will be lower with HT on (because the actual time to process instructions is reduced). To convert thread CPI to core CPI, divide the thread CPI by 2. (Only do this when you are dealing with an aggregated CPI, with all the threads added into it, as you get with VTune's process view.)

So your data becomes:

HT disabled (4 logical processors on 4 cores): per core CPI of 0.47

HT enabled (8 logical processors on 4 cores): per core CPI of 0.43 (0.86 / 2)

The lower per core CPI with HT enabled gives you your performance boost.
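
As a rough illustration of the conversion above, here is a minimal Python sketch applied to the two measured values. The function name `thread_cpi_to_core_cpi` and the `threads_per_core` parameter are made up for this example and are not part of VTune. It assumes the CPI is aggregated over all threads and that both hardware threads on each core are busy.

```python
def thread_cpi_to_core_cpi(thread_cpi: float, threads_per_core: int = 2) -> float:
    """Convert an aggregated per thread CPI into a per core CPI.

    Assumption: every hardware thread on the core is busy, so each core
    clock cycle was counted once per thread in the aggregate.
    """
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    return thread_cpi / threads_per_core


# Values measured in the question above.
cpi_ht_off = 0.47         # HT disabled: per thread CPI equals per core CPI
cpi_ht_on_thread = 0.86   # HT enabled: per thread CPI from the process view

cpi_ht_on_core = thread_cpi_to_core_cpi(cpi_ht_on_thread)

print(f"HT off, per core CPI: {cpi_ht_off:.2f}")
print(f"HT on,  per core CPI: {cpi_ht_on_core:.2f}")
print(f"HT on is better per core: {cpi_ht_on_core < cpi_ht_off}")
```

The division by 2 does not apply to a CPI that covers a single thread, or to a run where only one hardware thread per core is active.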

Most (but not all!) of the counters available for Core i7 count per thread. There are some alternative ways to look at efficiency (besides CPI) - one other method I like is (UOPS_EXECUTED.CORE_STALL_CYCLES / (UOPS_EXECUTED.CORE_ACTIVE_CYCLES + UOPS_EXECUTED.CORE_STALL_CYCLES)) * 100. This tells you the percentage of cycles in which execution was stalled. When this goes up, it usually means worse performance.
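
The stall formula above can be sketched in Python as follows. The function name `execution_stall_percent` is hypothetical and the event counts are made-up numbers for illustration only, not measurements from this thread. In practice the two inputs would be the totals collected for UOPS_EXECUTED.CORE_STALL_CYCLES and UOPS_EXECUTED.CORE_ACTIVE_CYCLES.

```python
def execution_stall_percent(core_stall_cycles: int, core_active_cycles: int) -> float:
    """Percentage of core cycles in which no uops were executed.

    Implements:
        stall / (active + stall) * 100
    """
    total_cycles = core_stall_cycles + core_active_cycles
    if total_cycles == 0:
        raise ValueError("no cycles were counted")
    return core_stall_cycles / total_cycles * 100.0


# Made-up counts, for illustration only.
stall_cycles = 250_000
active_cycles = 750_000

print(f"Execution stalls: {execution_stall_percent(stall_cycles, active_cycles):.1f}%")
```

Because these two events count per core rather than per thread, no division by 2 should be needed here, assuming the counts come from the same collection run.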

Hope this helps. Enjoy using VTune.

-Shannon
