VTune Analysis

Amit_T_1 · ‎08-19-2015

Hello,

Our OS: Red Hat Enterprise Linux 6.4 (Santiago) on 32-core servers with HT enabled.

We are a financial trading software firm. During performance optimization with Intel VTune's General Exploration analysis, we observed

L1D Replacement % = 1.0

L2D Replacement % = 1.0

LLC Replacement % = 1.0

in our VTune output.

Our application is completely single-threaded, and we make sure that the main application thread is always pinned to a core. Is it possible that we see the above numbers because HT is enabled, since logical cores on the same physical core share resources like the L1D, DTLB, and ITLB? Has anyone seen this kind of behavior before?

For trading applications where latency matters, is switching HT off advisable?

Thanks.

Dmitry_R_Intel1 · ‎08-20-2015

These metrics are calculated by dividing the number of replacements attributed to an item by the total number of replacements in the whole experiment. They therefore make sense only in the grid view, where you see the replacements broken down by hotspot (e.g. by function). In the summary report/view they are useless, since they will always be 1.0.
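As a rough illustration of the ratio described above (not VTune's actual implementation), the sketch below computes each function's share of the total replacements. The function names and counts are made up for the example.

```python
def replacement_share(per_function_counts):
    """Return each function's fraction of all replacements in the run."""
    total = sum(per_function_counts.values())
    if total == 0:
        # No replacements recorded at all: every share is zero.
        return {name: 0.0 for name in per_function_counts}
    return {name: count / total for name, count in per_function_counts.items()}


# Hypothetical per-function L1D replacement counts (illustrative only).
counts = {"parse_order": 600, "match_engine": 300, "send_ack": 100}

shares = replacement_share(counts)

# The summary row covers the whole experiment, so its numerator equals
# its denominator and the ratio is 1.0 regardless of the workload.
summary_value = sum(counts.values()) / sum(counts.values())

for name, share in shares.items():
    print(f"{name}: {share:.2f}")
print(f"summary: {summary_value:.1f}")
```

The per-function shares add up to 1.0, which is why the summary value says nothing about how many replacements actually occurred.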

Amit_T_1 · ‎08-20-2015

Thanks.

I wanted to confirm, though: in theory, HT can hurt latency but help throughput. Is that a correct statement?

McCalpinJohn · ‎08-20-2015

The performance impacts of HyperThreading are complex, with many special cases -- especially if the workload is sensitive to run-to-run variability or timing "jitter".

*IF* you run one process per core, and use process binding to prevent process migration, and are running with HyperThreading enabled, you will typically see no significant change in system throughput. In this case you may see a *reduction* in OS-induced "jitter" because OS threads will be able to run on the alternate thread context(s) that the user is not using. This will cause a slight reduction in user thread performance (due to sharing of the core), but sharing a core will (usually) add less latency than having the OS steal the entire core while it executes whatever service is required.

If you run more than one user process per physical core, you can very often get increased throughput, but it is difficult to avoid increased performance variability. The OS does not know which processes can share a core with minimal contention, and which processes generate a lot of contention when sharing a core, so user management (i.e., explicit process and/or thread binding) is often required to achieve acceptable throughput improvements with a tolerable increase in run-time variability.

In environments that just want better throughput with little concern for predictability, repeatability, and/or jitter, HyperThreading should be enabled.

In user environments that require predictable and repeatable performance (e.g., multiple processes operating in "lock-step" and periodically waiting for the slowest processes to catch up), you need to explicitly bind processes whether you are using HyperThreading or not.
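A minimal sketch of explicit process binding on Linux, assuming Python 3.3 or later, where `os.sched_setaffinity` is available. The helper name `pin_current_process` is hypothetical, and a production setup might equally use `taskset` or `numactl` at launch time.

```python
import os


def pin_current_process(cpu):
    """Restrict the calling process to one logical CPU and return the new mask.

    Returns None on platforms without os.sched_setaffinity (non-Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    # Pick the lowest-numbered CPU we are currently allowed to run on.
    target = min(os.sched_getaffinity(0))
    print("now bound to:", pin_current_process(target))
```

Binding this way only stops the scheduler from migrating the process. It does not prevent other processes or kernel threads from being scheduled on the same logical CPU.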

We run almost all of our production systems with HyperThreading disabled because we find that this gives the operating system less opportunity to mess up performance for users that don't carefully control their execution environment. This is more about avoiding worst-case behavior (and the subsequent increase in user support that we need to provide) than improving average or best-case behavior.

Amit_T_1 · ‎08-21-2015

Thanks John. Our main concern is process latency and not throughput.

We do pin the process to a single NUMA node and all the critical threads to a core. But it seems to me that problems can creep in if two threads share the resources of a physical core with HT enabled: we could see additional data and instruction cache misses.

McCalpinJohn · ‎08-25-2015

Running two threads on one core will certainly decrease the performance of each thread and will almost always increase the performance variability as well. You can often get more throughput this way, but that just means that you get twice as much work done in less than twice as much wall clock time. If latency is your concern this is probably not the way you want to go.
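The trade-off can be made concrete with a little arithmetic. In the sketch below, the 1.3x per-thread slowdown from sharing a core is a purely hypothetical figure chosen for illustration; the real value depends heavily on the workload.

```python
def smt_tradeoff(solo_time, slowdown):
    """Compare one thread alone on a core with two threads sharing it.

    solo_time: time for one task when the thread has the core to itself.
    slowdown:  assumed factor by which each thread slows when sharing.
    Returns (per-task latency when sharing, throughput gain over solo).
    """
    shared_time = solo_time * slowdown
    solo_throughput = 1.0 / solo_time      # tasks per unit time, one thread
    shared_throughput = 2.0 / shared_time  # two tasks finish per shared_time
    return shared_time, shared_throughput / solo_throughput


# Hypothetical numbers: 10 time units per task alone, 1.3x slower when sharing.
latency, gain = smt_tradeoff(solo_time=10.0, slowdown=1.3)
print(f"per-task latency when sharing: {latency:.1f}")
print(f"throughput relative to one thread: {gain:.2f}x")
```

Under these assumptions, throughput improves whenever the slowdown factor is below 2, but the latency of every individual task gets worse, which is the wrong direction for a latency-sensitive application.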

The issue I was trying to address in my previous note is whether you want to disable HyperThreading in the BIOS or leave it enabled and only place one user application thread on each core. Performance should be very similar in the two cases on recent hardware. My guess is that the system will have slightly lower performance variability if you have HyperThreading enabled, since the OS will be able to run on the alternate thread context of each core. While this will slow the application down a little bit, it should slow it down less than if the OS paused the application thread so that it could use the entire core to execute the OS service or daemon. This is likely to be a small effect in most environments -- much smaller than the negative performance impacts that arise when you don't pin all your application threads and the OS schedules things badly.
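To place only one application thread on each physical core while leaving HyperThreading enabled, it helps to know which logical CPUs are siblings. On Linux this is exposed in sysfs under `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`. The sketch below is one possible way to read it; the helper names are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,16' or '2-3' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_siblings(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return the logical CPUs that share a physical core with `cpu`."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())


def one_cpu_per_core(sibling_sets):
    """Pick the lowest-numbered logical CPU from each physical core."""
    return sorted({min(siblings) for siblings in sibling_sets})


if __name__ == "__main__":
    root = Path("/sys/devices/system/cpu")
    if root.is_dir():
        ids = [int(p.name[3:]) for p in root.glob("cpu[0-9]*")
               if (p / "topology" / "thread_siblings_list").exists()]
        print("one logical CPU per core:",
              one_cpu_per_core(read_siblings(i) for i in ids))
```

Application threads would then be pinned only to the CPUs in that list, leaving each core's other thread context free for the OS, as described above.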