Xeon Phi 7250 64-second period event

Basden__Alastair · ‎12-08-2017

Hi,

We are running a real-time process on the 7250, which takes about 560 microseconds per cycle. Every 64 seconds, we see a spike in time, where it takes about 200 microseconds longer.

We've done significant investigation into this, but can't reach any conclusions. Does anyone have any ideas?

Is it something to do with management engine, or with memory refresh or something like that?

Thanks.

Basden__Alastair · ‎12-08-2017

Some additonal info:

Running in flat mode, with the process entirely within the MCDRAM

Kernel 4.14.4-1.el7

MGRAV · ‎12-12-2017

What did you try?

What is the affinity that you use ?

Can you always run the same amount of cycle before the spike appears ? or it fluctuates a little ?

Are you using all the cores ? all the threads ?

Basden__Alastair · ‎12-13-2017

We've isolated cores, locked threads to cores, lots of debugging, high thread priorities.

We're typically using 50-60 cores, and hyperthreading is switched off.

Its basically always 64 seconds - but the number of cycles fluctuates by 1 or 2 cycles, since the cycle time doesn't fit into an integer number of 64s periods (I guess).

Lots of kernel options, lots of bios options...

Basden__Alastair · ‎12-13-2017

The time of the spikes tyically varies by a few milliseconds - i.e. separated by about 64s, but varying by a few milliseconds...

jimdempseyatthecove · ‎12-13-2017

Can you sketch your loop, being very specific on where and how you take your timings, as well as how you are coordinating your threads. And show how you are performing any timed delays.

Notes:

You should be aware that on KNL that 2 cores share an L2. When using less than the full complement of cores .AND. if the selected number of cores does not have all selected cores with an L2 sibling then the execution time may vary amongst the cores. If your code is L2 sensitive, then your core selection should be such that no other O/S or other process running on the system interferes with your process L2 cache utilization.

If you desire to run with a subset number of cores, for example 50 of 64. Try setting your thread pool to some number over 50 (e.g. 60) then have the start of your DoWork code permit only the first 50 threads to enter the region with the others circumventing the code. If this produces desirable (better) results, then experiment tweaking the ratios. Note, changing core utilization will affect L2 utilization, this may or may not be adverse to your requirements.

Jim Dempsey

Basden__Alastair · ‎12-13-2017

Loop is not trivial. However, we take a timing once per loop, at the start. Threads are coordinated using spinlocks.

I think it is unlikely to be a L2 issue, since that wouldn't explain a 64s period (at least, not that I can tell) - that is approximately 115000 loop cycles.

With different parameters, problem sizes, the loop can run faster or slower, and still there is a 64s periodic event.

Thanks...

Basden__Alastair · ‎12-13-2017

We wonder whether it could be the Intel® Remote Management Module 4 (Intel® RMM4) Lite, which we think periodically polls the processor.

Does anyone know the polling period? Could it be this?

jimdempseyatthecove · ‎12-13-2017

>>Threads are coordinated using spinlocks

A typical spinlock is for a specific number of iterations. What you should consider using instead is to spin until the RDTSC is .GE. the next time interval.

// See: https://stackoverflow.com/questions/11706563/how-can-i-programmatically-find-the-cpu-frequency-with-c
// Or Google: get cpu frequency linux c __rdtsc
uint64_t ticks_per_second = get_ticks_per_second(); // you supply this using information from above links
// ** assumes no fractional ticks/second

// Assuming barrier to wait for 1ms time interval
uint64_t oo_interval_ticks = 1000; // 1 for second, 10 for 1/10s, 100 for 1/100s, 1000 for 1/1000s, ...
...
uint64_t start_tick = __rdtsc();
#pragma omp parallel
{
  // start at one interval past start_tick
  for(uint64_t interval = 1; interval <= n_intervals; ++interval)
  {
    // barrier until interval
    for(uint64_t until_tick = start_tick + ((ticks_per_second * interval) / oo_interval_ticks); __rdtsc() < until_tick;)
      continue; // or _mm_pause();

    // threads resume at interval
    ...
  }
}

Jim Dempsey

Basden__Alastair · ‎12-13-2017

We are using a pthread_spinlock in place of a pthread_mutex. So, not for a set number of iterations, but until other threads have unblocked.

I don't see how the use of a spinlock (with a maximum locking period of 0.00056 s would result in a 64s period spike...!

We're not blocking threads for a set period of time - the 560 microseconds is just how long the computation takes... the threads acquire a spin lock to synchronise. We also use atomics instead of a barrier... for global frame synchronisation at the end of each iteration.

However, I'm pretty sure the 64s spike isn't something due to our code (given that it happens regularly, after approx 115,000 iterations!).

Thanks...

jimdempseyatthecove · ‎12-13-2017

On my KNL system (CentOS 7.2), when I run top I see many instances of

watchdog/...
kworker/...
ksoftirqd/...
migration/...

processes (hadn't counted, likely one per core or thread).

These are likely helper processes, and I am guessing run on a timed interval (once per 64 seconds??). One or more of them is likely the culprit.

A bit of googleing might give some insight of what they do and possible tuning hacks to correct for this behavior.

Jim Dempsey

Basden__Alastair · ‎12-14-2017

Yes, likewise - on each core we have:

cpuhp

migration

ksoftirqd

kworker (4 times).

So - it could be one of those - but to get our observed behaviour, then a timed interval would have to be fired on all 64 or 72 cores at once.

So, the question is are these the culprit - or is it something lower level. My reason for posting on this forum was that I suspect there might be some sort of low level hardware (e.g. management engine or something, given recent press reports about Intel processors) that causes a slight delay every 64s (actually, 63.7s), Are there any Intel Phi architects who read this forum, and could shed some light?

Thanks...

jimdempseyatthecove · ‎12-14-2017

>>So - it could be one of those - but to get our observed behaviour, then a timed interval would have to be fired on all 64 or 72 cores at once.

Correct - all waiting on same event (clock tick).

I am guessing here about purpose.

One of those processes (per core/thread) is likely a low priority "NULL" job that runs when nothing else is available to run. IOW its runtime is the "not busy" time.

A second process is likely a "Has the core/thread locked not responding to normal system interrupts". This is likely a higher (highest) priority job that gets scheduled periodically, and if not observed to run when scheduled, assume core/thread locked at higher priority (e.g. inside O/S call).

A third process (run at lower/lowest priority) on the core/thread is "can I resume a process thread that is not previously preempted on my core/thread".

Not sure what the fourth process is.

>> or is it something lower level

The only other think I can think of is if the MCDRAM refresh occurs on all memory sticks/chips at the same time.

Maybe John McCalpin can comment on this.

Jim Dempsey

Malcolm_S_ · ‎12-14-2017

Have you disabled the SMI features in the BIOS?

There seems to be some evidence that it can trigger latency spikes at 32 or 64 seconds.

http://wiki.linuxcnc.org/cgi-bin/wiki.pl?FixingSMIIssues

You may also want to ensure that you have transparent huge pages disabled.

echo never > /sys/kernel/mm/transparent_hugepage/enabled

Cheers,

Malcolm

McCalpinJohn · ‎12-14-2017

It sounds like the event is predictable. If so, then you can start an ftrace shortly before the event and run it for a few seconds. I used the "trace-cmd" front end for ftrace -- it made it easier to figure out how to use ftrace, but did not help make it easier to understand the results.

I spent most of two months crawling through gigabytes of trace output to show that there was a performance bug in the interaction of the "intel_pstate" driver and the "intel_idle" driver in CentOS 7.2 that occasionally caused logical processors that were supposed to be idle (C1 state) to be placed in "poll_idle" instead (spin-looping) for extended periods. If I was trying to use the sibling thread, it ran slower during these periods because the spinning thread ate up 1/2 of the reorder buffers (or reservation stations, or maybe both?). This was not fun.

After I was finished, I realized that I could have shown that the behavior existed by simply dumping the output of /sys/devices/system/cpu/cpu*/cpuidle/state0/time before and after my job. It would not have shown why the behavior occurred, but it would have proven that it happens.

My bug was specific to HyperThreading, and disappeared if I ran on the Logical Processor that was showing the bad behavior (rather than running on its sibling).

There is a piece of the Intel architecture that tries to make itself invisible to users -- System Management Mode.

If bit 14 of MSR 0x1d9 (IA32_DEBUGCTL) is set, then the performance counters and trace facilities are frozen while in System Management Mode. This is the default on some of my systems, but not on all.
MSR 0x34 (MSR_SMI_COUNT) is a running count of the System Management Interrupts (which cause the processor to enter System Management Mode).
- I have some systems that only use SMIs during boot, then the counter never increments again.
- I have other systems that generate 10 or more SMIs per second.

It should be easy to see if MSR_SMI_COUNT is incrementing on your system near the time when you have the slow iterations. If it is, then the notes below might help.

SMM mode is a package-wide state. All cores enter SMM mode when the package receives an SMI, but each core is allowed to leave independently. If performance counters are frozen in SMM mode, then you will see a difference between the fixed-function counter 2 CPU_CLK_Unhalted.Ref and the TSC -- even if you have kept the processor busy (so it should not have been halted). Frequency changes are another possible cause of halted cycles, so I recommend pinning the frequency. In my case, the number of "lost" cycles (using fixed-function counter 2) was small on most cores, but larger on core 0. This is not surprising -- all cores must enter SMM mode together, but if the SMM code is single-threaded, the rest of the cores will return immediately and only one core will actually run the SMM code.

I have not tested this, but I have seen no indication that the IA32_MPERF MSR (0xE7) will be "frozen" in SMM mode. It it continues running, then you can compare changes in IA32_MPERF with changes in fixed-function counter 2 to look for periods when the processor is not halted, but during which the performance counters are frozen.

One bizarre complication on KNL is the undocumented (?) feature that IA32_MPERF increments at 1/1024 of the TSC frequency, instead of at the full TSC frequency (as is the case for every other Intel processor that I have looked at). This makes my approach a little fuzzier, but it should still be workable.

McCalpinJohn · ‎12-14-2017

Just for fun, I decided to check one of my KNL systems for System Management Interrupts...

Surprise! I found that this system gets one about every 63 seconds!

$ for SEC in `seq 0 200`; do echo -n "$SEC "; rdmsr -p 0 -d 0x34; sleep 1; done
0 6246
1 6246
2 6246
[...]
57 6246
58 6247
59 6247
[...]
120 6247
121 6248
122 6248
[...]
183 6248
184 6249
185 6249
[...]

Unfortunately, there is no way of knowing what SMM code is running without direct information from the developer of the BIOS/firmware for your system. Sometimes there are BIOS options that enable/disable SMM functions -- often with names like "Ultra Spiffy Super Automatic System Health Checker!!!! (TM)". More commonly, there is no documentation of the existence of SMM functions or documentation of BIOS configuration controls of these functions.

It can be quite challenging to get a vendor to respond to SMM-related issues. Most customer-facing folks (even technical specialists) have no idea that this stuff exists, and no idea who in their company might know who is responsible for it. With sustained effort, I have managed to get vendors to fix BIOS bugs in the past, but I have never been able to get in direct contact with anyone at any of my vendors who had first-hand knowledge of this part of the system. I hope you will have better luck....

Basden__Alastair · ‎12-15-2017

Hi John,

Thanks - that's great - at least I'm not imagining things now!

Its a supermicro motherboard, but whether they also make the bios I'll have to check. We've already been through the bios options in quite a lot of detail, but now that I know the problem is there, I'll take another look...

JJK · ‎12-15-2017

@John: I learn something from you almost every day :)

Of course, I had to try this for myself ; on my KNL 7210 (Ninja dev machine, supermicro based board) I do not see this increase in interrupts; the result is steady at 261 for quite some time now. Similarly for a Skylake Gold server. On my Haswell laptop I see an SMI every 15 seconds, however.

jimdempseyatthecove · ‎12-15-2017

Great (complete) report John. Thanks for your time.

Alistair, from John's information ( In my case, the number of "lost" cycles (using fixed-function counter 2) was small on most cores, but larger on core 0.), and your statement of turning off HT, as hacks:

1) Try excluding core 0 from your worker pool.

2) try turning on HT, affinity pin your worker threads to HT sibling 1 (or any .not. 0), exclude core 0 as done in 1), and for the non-worker threads place them in a _mm_pause(); loop (if on KNC use the delay intrinsic). The idea is if SMM hits HT sibling 0 of each core, and not the others, you will have less lost CPU time (it won't be 0 but it could be less).

If this has any beneficial effect, please let others know.

Jim Dempsey

Basden__Alastair · ‎12-15-2017

Hi,

Just checked on our Ninja system, and agree with JJK that there is no interrupt there... will run the code here to check, and post back.

On our 4-node chassis systems, all get an interrupt at approx 60s. These actually use an Intel S7200AP motherboard, not supermicro as I mistakenly said previously. (the chassis is supermicro).

Jim - I'll try turning on HT as you suggest.

Core 0 is excluded from the pool (actually, cores 0-10 or so are excluded, depending on how many threads we're using).

Thanks...

Basden__Alastair · ‎12-18-2017

So, on the Ninja system, as expected, we don't see the 64s spikes.

Using hyperthreading actually makes the spikes worse - rather than 200us, they're more like 800us, and also, the computation time is longer.

So, I suppose the question now is whether it is possible to turn off the system management interrupts on the Intel motherboard... (if only there was an Intel forum we could ask on!!!).