Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

50uS stall when sharing memory over QPI

Nightingale__Will
945 Views

Hello,

We have an issue with a dual socket E5-2650 v4 system. Under certain conditions we are seeing a ~50uS stall and we're trying to figure out exactly what's going on so we can make sure to avoid it.

We have one thread on one package, and a second on the other. Each thread does some work, and the first thread is feeding data to the second. We see the stall happen every few seconds, and it affects all cores in the system.

If the threads are separated such that the second reads its input from a memory address unrelated to the first, the stalls go away. If the threads are on the same package, the stalls go away. If the second thread has a very small workload the stalls go away, and they also go away if it has a very high workload.

So, it seems to be related to memory sharing across the QPI link between two packages.

We think we've ruled out the following:

- P / C state changes - We have EIST and Turbo disabled, and limit C state changes in the BIOS

- AVX warmup / down. We have seen this before (more so in packages with < 10 cores and even more so in Haswell), but it has always been accompanied by a FREQ_TRANS_CYCLES event in the uncore PCU box, and this counter is not incrementing in this case. Also, the bit about the memory sharing doesn't seem to fit this.

- Uncore freq change - no FREQ_TRANS_CYCLES, also tried limiting the uncore freq range using MSR 0x620.
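For readers who want to check what was actually written to MSR 0x620, here is a minimal sketch of decoding it. It assumes the layout I understand the SDM to give for MSR_UNCORE_RATIO_LIMIT (maximum ratio in bits 6:0, minimum ratio in bits 14:8, in units of a 100 MHz bus clock). The function name and the example values are hypothetical, not taken from the system described above.

```python
def decode_uncore_ratio_limit(value, bclk_mhz=100):
    """Split a raw MSR 0x620 value into (min_mhz, max_mhz).

    Assumed layout: bits 6:0 = max ratio, bits 14:8 = min ratio.
    bclk_mhz is the assumed bus clock that the ratios multiply.
    """
    max_ratio = value & 0x7F
    min_ratio = (value >> 8) & 0x7F
    return min_ratio * bclk_mhz, max_ratio * bclk_mhz


# Hypothetical example: min and max both pinned to ratio 0x18 (24).
print(decode_uncore_ratio_limit(0x1818))
```

If the two returned frequencies are equal, the uncore range has been collapsed to a single operating point, which is the condition the test above was aiming for.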

Our first theory was QPI power management (L0 / L0p / L0s / L1 transitions), but we have set the following in the BIOS (SuperMicro X9DRi):

Link L0p - Disabled

Link L1 - Disabled

And we are not seeing the relevant uncore performance counters in the QPI box increment (TxL0P_POWER_CYCLES, RxL0P_POWER_CYCLES, L1_POWER_CYCLES). There was no counter documented for L0s cycles.

But maybe the BIOS is bad, and the uncore counters are not to be fully trusted? I couldn't find any MSRs directly related to the BIOS settings above to confirm they are set.

The other thing we have found is that the stalls go away when NUMA is disabled in the BIOS. Maybe this is the fix - it doesn't seem to affect our application's overall performance - but we'd like to understand why it works before applying it.

Many thanks in advance,

Will

0 Kudos
8 Replies
Vitaly_S_Intel
Employee

Well, just to exclude any kernel-mode activities (e.g. interrupts), try the VTune "System Overview" analysis using "Hardware Tracing" mode. This analysis is based on Intel Processor Trace and captures all the activity on the CPU cores with precision down to nanoseconds. If the stall is caused by CPU activity, you'll catch it.

McCalpinJohn
Honored Contributor III

If you are running Linux, you will want to make sure that automatic NUMA page migration is disabled.  (This will be automatically disabled if NUMA is disabled in the BIOS.)

cat /proc/sys/kernel/numa_balancing

If set to "1", the OS will shoot down TLB entries for user pages so that (on the next page access) it can monitor whether the page is being accessed by a local core or a remote core.  If enough accesses are remote, the OS will schedule the page to be migrated to the NUMA node that uses it the most.   If set to "0", then this function is disabled.
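A minimal sketch of the same check done programmatically, assuming a Linux system where the sysctl file is spelled numa_balancing (with an underscore). The function name is hypothetical, and a kernel built without automatic NUMA balancing will not have the file at all, which the sketch reports as None.

```python
from pathlib import Path


def numa_balancing_state(path="/proc/sys/kernel/numa_balancing"):
    """Return the sysctl value as an int, or None if the file is absent."""
    p = Path(path)
    if not p.exists():
        return None  # kernel built without NUMA balancing, or not Linux
    return int(p.read_text().strip())
```

A return value of 0 means automatic page migration is off, and any non-zero value means it should be ruled out before looking at the hardware.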

 

Nightingale__Will

Thanks for the responses.

We are running an RTOS so unfortunately we can't run VTune, but it does mean we can worry less about what the kernel is doing. The threads in question are set up never to be scheduled out, and we can see through tracing that this is indeed the case. We also checked MSR_SMI_COUNT (0x34) and it is not incrementing. This is the first time we are trying out a Broadwell dual-socket system, but the same application (from which the test described above is distilled) runs on Broadwell single-socket and Ivy Bridge dual-socket setups with no stalls.

The RTOS does not perform any NUMA page migration like you describe.

 

 

 

McCalpinJohn
Honored Contributor III

In addition to stalls related to frequency changes with AVX256 code, it is also possible to have stalls related to "voltage-only" changes.   These are first referenced in the documentation for the Skylake Xeon processor, but it is certainly possible that the infrastructure of Haswell/Broadwell also includes this "feature".   There is an interesting discussion at https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

Does your Broadwell single-socket box use a Xeon E5 v4 processor?  The Xeon E3 v4 processors use the "client" silicon, which does not have the same power management infrastructure as the server silicon.

Does this "stall" correspond to halted CPU cycles?  This should be confirmed using the TSC, fixed-function counters 0 and 1, and the APERF and MPERF MSRs.

The BIOS may have options related to snoop mode.   If the stall is due to something coherence-related, switching from "Home Snoop" to "Early Snoop" (or even "Cluster on Die") may change the behavior enough to make the problem go away (or get worse).

Nightingale__Will

Does your Broadwell single-socket box use a Xeon E5 v4 processor?  The Xeon E3 v4 processors use the "client" silicon, which does not have the same power management infrastructure as the server silicon.

Yes, Xeon E5 v4. Various processors including E5-2618L v4, E5-2680 v4, E5-2695 v4.

Does this "stall" correspond to halted CPU cycles?  This should be confirmed using the TSC, fixed-function counters 0 and 1, and the APERF and MPERF MSRs.

Very interesting question - it appears not.

In our test setup, each thread does some work which normally takes ~6uS. It does this every 20uS, and busy waits the rest of the time. Every few seconds, both threads have a single run that instead takes 55-65uS to complete, before execution times go back to normal in the very next run. This is what we have been referring to as the "stall". Execution time outside of these blips is consistent to within ~1uS.

We take readings of the performance counters at the start and end of the work. Normally the Cycles Per Instruction (CPU_CLK_UNHALTED__THREAD_P / INST_RETIRED_ANY_P) is 0.5 to 0.6. But during a problem run, the CPI jumps to 5.5 to 6.5.

To count halted cycles I subtracted CPU_CLK_UNHALTED__THREAD (fixed counter 1) from the number of TSC ticks. During a problem run, this came out to zero (or close enough; the counters aren't read exactly simultaneously). The APERF/MPERF ratio was 1.000. Am I using these counters correctly?

I guess this means the CPU is not halted or running at a different frequency when these issues occur?
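The arithmetic behind those three checks can be sketched as follows. This is an illustration only: the function name is hypothetical, counter wraparound is ignored, and the example readings assume a 2.2 GHz TSC and the same instruction count in the normal and the slow run, neither of which is stated in the thread.

```python
def run_metrics(tsc, unhalted, instructions, aperf, mperf):
    """Each argument is a (start, end) pair of raw counter readings.

    Returns halted cycles (TSC ticks minus unhalted core cycles),
    cycles per instruction, and the APERF/MPERF ratio for the run.
    Assumes no counter wrapped between the two readings.
    """
    def delta(pair):
        return pair[1] - pair[0]

    return {
        "halted_cycles": delta(tsc) - delta(unhalted),
        "cpi": delta(unhalted) / delta(instructions),
        "aperf_mperf": delta(aperf) / delta(mperf),
    }


# Hypothetical readings: a 6 uS run and a 60 uS run at an assumed 2.2 GHz.
normal = run_metrics((0, 13_200), (0, 13_200), (0, 24_000),
                     (0, 13_200), (0, 13_200))
slow = run_metrics((0, 132_000), (0, 132_000), (0, 24_000),
                   (0, 132_000), (0, 132_000))
print(normal)
print(slow)
```

With these made-up numbers the slow run shows zero halted cycles and an APERF/MPERF ratio of 1.0 while the CPI rises tenfold, which is the pattern described above: the core is running at nominal frequency but retiring far fewer instructions per cycle.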

The BIOS may have options related to snoop mode.   If the stall is due to something coherence-related, switching from "Home Snoop" to "Early Snoop" (or even "Cluster on Die") may change the behavior enough to make the problem go away (or get worse).

I have experimented with different snoop modes, and possibly there was some reduction in the number of stalls seen when CoD and Early Snoop were enabled, but it did not make the problem go away.

 

McCalpinJohn
Honored Contributor III

It sounds like you are using the counters correctly, and have demonstrated that there is no halted time and no change in frequency from the expected nominal (non-Turbo) value.

It may be interesting to examine IA32_DEBUGCTL (MSR 0x1d9).   Some of the trace facilities may be able to generate relatively large "hardware" overheads when triggered.   If FREEZE_WHILE_SMM (bit 14) is set, the performance counters stop counting in SMM while the TSC keeps running, so the absence of a difference between cycle counts and TSC counts is definitive evidence that SMIs are not involved.  (MSR_SMI_COUNT by itself may not be enough, since the SMM code is capable of decrementing the counter before exiting.)
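A minimal sketch of reading and decoding that register on Linux, assuming the msr kernel module is loaded and the script runs as root. The helper names are hypothetical. The bit names follow the SDM as I remember it (bit 14 is called FREEZE_WHILE_SMM there), and the table is not exhaustive. On an RTOS the read would have to be replaced by whatever MSR access the platform provides.

```python
import os
import struct

IA32_DEBUGCTL = 0x1D9

# Subset of documented IA32_DEBUGCTL bits, keyed by bit position.
DEBUGCTL_BITS = {
    0: "LBR",
    1: "BTF",
    6: "TR",
    7: "BTS",
    8: "BTINT",
    9: "BTS_OFF_OS",
    10: "BTS_OFF_USR",
    11: "FREEZE_LBRS_ON_PMI",
    12: "FREEZE_PERFMON_ON_PMI",
    13: "ENABLE_UNCORE_PMI",
    14: "FREEZE_WHILE_SMM",
}


def read_msr(cpu, reg):
    """Read one 64-bit MSR through the Linux /dev/cpu/<n>/msr interface."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def decode_debugctl(value):
    """Return the names of the known bits that are set, lowest bit first."""
    return [name for bit, name in DEBUGCTL_BITS.items() if value & (1 << bit)]
```

For example, `decode_debugctl(read_msr(0, IA32_DEBUGCTL))` returns an empty list when no trace or freeze feature is enabled on CPU 0.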

Xeon E5 v4 is the first generation to support the Intel Resource Director Technology (RDT) "allocation" functions across most models.  You will want to check the output of CPUID to see if Cache Allocation Technology (CAT) and/or Code and Data Prioritization (CDP) are supported.  If so, then you will need to check the relevant MSRs to make sure neither feature was inadvertently enabled.   It looks like IA32_L3_QOS_MASK_0 (0xc90) is the base of the range of registers for setting up CAT, and IA32_L3_QOS_CFG (0xc81) is used to enable CDP.
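Once the raw MSR values have been collected, the "was anything inadvertently enabled" check reduces to a few comparisons. The sketch below assumes the class of service sits in bits 63:32 of IA32_PQR_ASSOC (MSR 0xc8f) and that bit 0 of IA32_L3_QOS_CFG enables CDP. The function names and the example mask values are hypothetical.

```python
def cos_of(pqr_assoc):
    """Extract the class of service from a raw IA32_PQR_ASSOC value."""
    return (pqr_assoc >> 32) & 0xFFFFFFFF


def cat_is_neutral(pqr_assoc_by_cpu, cos_masks, l3_qos_cfg):
    """True if CAT/CDP cannot be partitioning the L3 unevenly.

    pqr_assoc_by_cpu: raw IA32_PQR_ASSOC value read on each logical CPU.
    cos_masks: raw IA32_L3_QOS_MASK_n values, one per class of service.
    l3_qos_cfg: raw IA32_L3_QOS_CFG value.
    """
    same_cos = len({cos_of(v) for v in pqr_assoc_by_cpu}) == 1
    same_masks = len(set(cos_masks)) == 1
    cdp_off = (l3_qos_cfg & 1) == 0
    return same_cos and same_masks and cdp_off


# Hypothetical values: every CPU in CoS 0, identical masks, CDP disabled.
print(cat_is_neutral([0x0, 0x0, 0x0], [0xFFFFF] * 4, 0x0))
```

The check is deliberately conservative: identical masks with differing classes of service would be harmless in practice, but flagging them keeps the rule simple.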

The next step is to run a brute force sweep on uncore performance monitoring events.  These are described in document 334291, which unfortunately can't be found directly by that number.  Today I found a copy at https://www.intel.com/content/www/us/en/products/docs/processors/xeon/xeon-e5-e7-v4-uncore-performance-monitoring.html

I have not looked at these events in a while, but I would start with:

  • CBo events LLC_LOOKUP, LLC_VICTIMS, and events including "RETRY".
  • HA events SNOOP_OCCUPANCY, DIRECTORY_LOOKUP, DIRECTORY_UPDATE, DIRECTORY_LAT_OPT, HITME_LOOKUP, HITME_HIT

The CBo performance counters are in MSRs, so they will require expensive kernel accesses.  On a Xeon E5-2680 v4, there are 14 CBos with a total of 64 counters.  If you have to cross into the kernel for each of these it can get relatively expensive -- probably at least 200 microseconds?  Fortunately the state is unrelated to the processor core you want to monitor, so you can grab all this using another core in a kernel-space driver.  That gets the cost down to a few hundred cycles per RDMSR instruction -- probably under 10 microseconds for 64 counters.

The HA performance counters are in PCI configuration space.  In Linux, I do an mmap() of /dev/mem, starting at the base of PCI configuration space and with a 256MiB range, and assign the result to a pointer to "uint32_t".   Then I can read and write the targets in PCI configuration space using simple array references.  They are internally converted to uncached loads and stores, which (for addresses mapping to uncore devices on the local processor chip) have latencies that are generally similar to the latencies of local RDMSR instructions -- maybe 300-400 cycles per operation.  There are only two Home Agents, so only 8 counters to read.
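A rough sketch of the address arithmetic behind that mapping, assuming the standard PCIe memory-mapped configuration layout (4 KiB per function, 8 functions per device, 32 devices per bus, 256 buses, which is where the 256MiB comes from). The function names are hypothetical and the base address is platform-specific (it has to be taken from the ACPI MCFG table or /proc/iomem). Python does not guarantee that the read is issued as a single 32-bit load and some kernels restrict /dev/mem, so a C pointer as described above is the safer choice on real hardware.

```python
import mmap
import os
import struct

ECAM_SIZE = 256 << 20  # 256 buses * 32 devices * 8 functions * 4 KiB


def ecam_offset(bus, dev, func, reg):
    """Byte offset of a 32-bit config register inside the 256MiB window."""
    if not (0 <= bus < 256 and 0 <= dev < 32 and 0 <= func < 8):
        raise ValueError("bus/device/function out of range")
    if not (0 <= reg < 4096 and reg % 4 == 0):
        raise ValueError("register offset must be 4-byte aligned and < 4096")
    return (bus << 20) | (dev << 15) | (func << 12) | reg


def read_cfg32(mmconfig_base, bus, dev, func, reg):
    """Read one config register via /dev/mem (needs root, Linux only)."""
    fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, ECAM_SIZE, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ, offset=mmconfig_base) as window:
            off = ecam_offset(bus, dev, func, reg)
            return struct.unpack_from("<I", window, off)[0]
    finally:
        os.close(fd)
```

Dividing the result of `ecam_offset` by 4 gives the array index used with a uint32_t pointer in the C version.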

You probably can't monitor at the full transaction rate, but if you can get the monitoring interval down to a few transactions, then you should be able to see differences between intervals that contain "slow" executions and the normal intervals.   I use RDTSCP timestamps to correlate between the timelines from the working processors and the timeline from the monitoring code.
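The correlation step can be sketched as below, assuming the TSC is synchronized across the cores involved so that timestamps from the worker and the monitor are directly comparable. The function names, the data layout and the sample numbers are all hypothetical.

```python
def deltas_by_interval(samples):
    """samples: list of (tsc, counter_value), sorted by tsc.

    Returns (t0, t1, delta) for each pair of consecutive samples.
    """
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(samples, samples[1:])]


def flag_intervals(samples, runs, slow_ticks):
    """Mark each monitoring interval that overlaps a slow work run.

    runs: list of (start_tsc, end_tsc) recorded by the worker thread.
    slow_ticks: a run at least this long counts as a stall.
    """
    slow = [(s, e) for s, e in runs if e - s >= slow_ticks]
    flagged = []
    for t0, t1, delta in deltas_by_interval(samples):
        overlaps = any(s < t1 and e > t0 for s, e in slow)
        flagged.append((t0, t1, delta, overlaps))
    return flagged


# Made-up data: three monitoring intervals, one slow run in the middle.
samples = [(0, 100), (1000, 200), (2000, 900), (3000, 1000)]
runs = [(100, 160), (1100, 1700), (2100, 2160)]
for row in flag_intervals(samples, runs, slow_ticks=500):
    print(row)
```

Events whose deltas are consistently larger in the flagged intervals than in the others are the ones worth a closer look.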

(My general-purpose infrastructure for SKX processors is at https://github.com/jdmccalpin/periodic-performance-counters.   I can use it for monitoring down to ~1 second intervals, but it would need a special kernel module to bundle the RDMSR commands to be able to sample at intervals useful to your application....)

Nightingale__Will

Thanks for the very detailed suggestions.

I have tried those initial checks and indeed:

- There are no trace features enabled in IA32_DEBUGCTL (msr value is 0x0).

- If I enable the FREEZE_WHILE_SMM bit, the number of halted cycles by the previously mentioned metric is still 0.

- L3 CAT is supported, but all processors are set to the same Class of Service (IA32_PQR_ASSOC == 0x0), and all 15 CoS have the same mask settings, so nothing going on there.

- Code and Data Prioritisation Technology is disabled (IA32_L3_QOS_CFG == 0x0).

For the brute force scan in the uncore, we have tooling to access and configure the uncore performance counters in MSR and PCICFG space, but we need to rework it to do this properly and, as usual, other things have taken priority in the short term.

What do you mean by the "transaction rate"? QPI transactions? Do you think we need a higher resolution than just looking at the total count values for each 6 to 60uS run? If not then we can take the reading offline by using the UBox global freeze/unfreeze feature which will greatly simplify things.

Again, thanks for your help.

 

 

McCalpinJohn
Honored Contributor III

By "monitor at full transaction rate", I meant measuring the counters at a time scale comparable to your 20 usec work repeat rate or the 6 usec (typical) work periods.

I don't have a kernel extension that lets me read all of the counters in a single transaction.  (The "msr_batch" function in https://github.com/LLNL/msr-safe looks like a step in the right direction?)  I have also not looked into the timing/throughput issues of having different cores read different uncore MSRs concurrently (within a kernel module).  So I don't have a good estimate of how long it would take to read the counters across all of the uncore boxes.

The global freeze/unfreeze functionality may do exactly what you need, but I have not tested it at very fine granularity so I would not immediately trust it....

The slowdown is big enough that you can probably get clear counter deltas if you can get pretty close to synchronizing your measurements to an integral multiple of the 20 usec repeat rate.  Even 10 "cycles" (200 usec) is a very short repeat rate for reading a full set of uncore counters, but if my SWAG of ~250 cycles per RDMSR instruction is close, then inside a kernel module a single core could read 64 CBo counters in about 8 microseconds (at 2.0 GHz), which should be good enough.
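The estimate in the last paragraph is simple enough to write down, which makes it easy to redo with a measured RDMSR cost. The function name is hypothetical, and the inputs are the guesses from the post, not measurements.

```python
def sweep_time_us(n_counters, cycles_per_read, core_ghz):
    """Time to read n_counters back to back on one core, in microseconds."""
    cycles = n_counters * cycles_per_read
    return cycles / (core_ghz * 1e3)  # 1 GHz = 1000 cycles per microsecond


# 64 CBo counters, ~250 cycles per RDMSR, 2.0 GHz core clock.
sweep = sweep_time_us(64, 250, 2.0)
print(sweep)       # microseconds per full sweep
print(sweep / 20)  # fraction of one 20 usec work period
```

With those inputs a full sweep costs 8 microseconds, or 40% of one 20 usec period, so sampling once per period on a dedicated monitoring core looks feasible if the per-read guess holds.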
