Solved: unexpected NMI received -- with or without SEP

McCalpinJohn · ‎05-14-2018

We have KNL and SKX systems running CentOS kernel 3.10.0-693.17.1.

The KNL systems are currently running the Intel sep4_1 driver that came with VTune Amplifier 2018.0.2 build 525261, while the SKX systems are running with the "perf events" driver.

In both cases, attempting to use the "-collect memory-access" option to amplxe-cl results in repeated kernel emergency messages along the lines of:

Uhhuh. NMI received for unknown reason xx on CPU yy.

Do you have a strange power saving mode enabled?

Dazed and confused, but trying to continue

On the KNL systems the "unknown reason" alternates between 29 and 39, and the message typically shows up for all cores. On SKX systems the "unknown reason" typically alternates between 20 and 30, and the message also typically shows up for all cores.

The nodes don't crash -- indeed, the amplxe-cl job finishes and prints out its summary report. BUT, these messages printed by "pr_emerg()" are echoed to all root windows on the master node, where they make the system operators cranky. Cranky operators often kill the offending jobs.

On the SKX nodes, about the time the unexpected NMIs start, we see a handful of messages like:

INFO: NMI handler (perf_event_nmi_handler) took too long to run: 585758.001 msec

and sometimes:

hrtimer: interrupt took 25076958 ns

The perf_event_nmi_handler message seems weird -- 585758 msec is almost 10 minutes, and this message appeared within 3 minutes of the start of the job. The hrtimer number (25 milliseconds) is more plausible, but no less concerning.
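For reference, the unit conversions for the two durations quoted in the messages above can be checked with a few lines of Python. This is only arithmetic on the numbers as they appear in the log lines; the variable names are mine.

```python
# Convert the two durations quoted in the kernel messages to friendlier units.
nmi_handler_msec = 585758.001   # from the perf_event_nmi_handler message
hrtimer_ns = 25076958           # from the hrtimer message

nmi_handler_minutes = nmi_handler_msec / 1000.0 / 60.0   # msec -> sec -> min
hrtimer_msec = hrtimer_ns / 1e6                          # ns -> msec

print(f"NMI handler: {nmi_handler_minutes:.2f} minutes")
print(f"hrtimer:     {hrtimer_msec:.2f} msec")
```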

On the KNL nodes (running sep), there are no other interesting messages in the log -- just repetitions of the trio of "Dazed and confused" messages for the duration of the job. The log that I am staring at now repeats this trio of messages 1722 times during the 18 minutes that VTune was running, then everything appears to have returned to normal.
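For scale, 1722 groups in 18 minutes is roughly 1.6 groups per second. To see how the messages are distributed across reason codes and CPUs, something like the following could be used. It is a minimal sketch that assumes the log lines contain the "NMI received for unknown reason xx on CPU yy" text exactly as quoted above; the function name and the sample lines are made up for illustration and are not taken from the real logs.

```python
import re
from collections import Counter

# Pattern for the first line of each three-message group, as quoted above.
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-fA-F]+) on CPU (\d+)")

def tally_unknown_nmis(lines):
    """Return (count per reason code, count per CPU) for unknown-NMI lines."""
    by_reason = Counter()
    by_cpu = Counter()
    for line in lines:
        m = NMI_RE.search(line)
        if m:
            by_reason[m.group(1)] += 1
            by_cpu[int(m.group(2))] += 1
    return by_reason, by_cpu

# Made-up sample lines for illustration only (not from the actual logs).
sample = [
    "kernel: Uhhuh. NMI received for unknown reason 29 on CPU 3.",
    "kernel: Do you have a strange power saving mode enabled?",
    "kernel: Dazed and confused, but trying to continue",
    "kernel: Uhhuh. NMI received for unknown reason 39 on CPU 3.",
    "kernel: Uhhuh. NMI received for unknown reason 29 on CPU 7.",
]
by_reason, by_cpu = tally_unknown_nmis(sample)
print(dict(by_reason))
print(dict(by_cpu))
```

On a real node the same function could be fed an open log file, for example `tally_unknown_nmis(open("/var/log/messages"))`, assuming that is where the kernel messages end up on your system.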

As a short-term workaround, I have found that collecting uncore counters "manually" using "-collect-with runsa -knob event-config=..." does data collection without generating irritating kernel messages, but I have not looked in detail at the collected data.

In the slightly longer term, we plan to install and test Intel Parallel Studio 2018 update 2 along with the corresponding SEP kernel module. Does anyone know if this is likely to provide any benefit with regard to this class of problems?

McCalpinJohn · ‎05-15-2018

A bit more information from this morning's testing....

I installed the Intel 18 Update 2 compilers and the corresponding VTune:

  Intel(R) VTune(TM) Amplifier 2018 Update 2 (build 551022) Command Line Tool

I built the kernel modules in the sepdk directory and installed them on an SKX system (Xeon Platinum 8160) running kernel 3.10.0-693.17.1.

Some VTune collections worked OK, and some continued to generate the NMI "Dazed and confused" messages:
1. Two consecutive "-collect memory-access" runs worked OK.
2. The first "-collect hpc-performance" run worked OK.
3. The second "-collect hpc-performance" run generated four sets of the three messages (NMI received for unknown reason, Do you have a strange power saving mode? Dazed and confused).
  1. These happened on four different cores almost simultaneously (same 1-second time stamp in /var/log/messages) and occurred just after VTune reported "amplxe: Warning: The specified data limit of 500 MB is reached. Data collection is stopped."
  2. Despite the kernel warnings, the collection completed and generated a good report.
4. A third "memory-access" run worked OK.
5. A third "hpc-performance" generated two more sets of "Dazed and confused" messages.

One thing that is a bit different in our systems is that the NMI watchdog is normally disabled. VTune re-enables it after each run. Could this result in enabled PMIs without a handler installed?

chen__chao · ‎06-28-2018

Hi John,

I ran into the same problem as you. I am working on Ubuntu Server with kernel 4.5.0 and vtune_amplifier_2018.2.0.551022.

Do you know whether this problem has any impact on the final profiling result? Do you have a solution for it?

Thanks

McCalpinJohn · ‎07-09-2018

My comment about VTune enabling the NMI watchdog after each run is incorrect -- it only re-enables the NMI watchdog if the NMI watchdog was enabled before VTune was run.

It looks like these "dazed and confused" messages are benign. They don't happen very often, they happen immediately after collection has completed, and when they do happen the output files look fine (i.e., generating reports works fine).

My guess is that the sep driver does not ensure that all logical processors have completed the WRMSR instructions that disable the interrupt on performance counter overflow before unregistering the VTune NMI handler. I have tried to figure this out from the VTune sep source, but (like most kernel code) it is very hard to follow the flow of control.
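To make the ordering in that guess concrete, here is a toy model in Python. It is not sep driver code and does not touch any hardware: it just replays a hand-written list of events and counts overflows that would arrive as NMIs with no handler registered. The event names and the function are hypothetical, chosen only for this illustration.

```python
def count_unknown_nmis(events):
    """Replay an ordered list of events and count NMIs with no handler.

    Events are tuples: ("disable", cpu), ("overflow", cpu), ("unregister",).
    A counter overflow raises an NMI only while that CPU's overflow
    interrupt is still enabled; the NMI is "unknown" if the handler is gone.
    """
    handler_registered = True
    pmi_enabled = {}          # cpu -> bool; a missing entry means still enabled
    unknown = 0
    for ev in events:
        if ev[0] == "disable":
            pmi_enabled[ev[1]] = False
        elif ev[0] == "unregister":
            handler_registered = False
        elif ev[0] == "overflow":
            if pmi_enabled.get(ev[1], True) and not handler_registered:
                unknown += 1
    return unknown

# Hypothesized bad ordering: handler removed before CPU 1 has disabled its PMI.
racy = [("disable", 0), ("unregister",), ("overflow", 1), ("disable", 1)]
# Safe ordering: every CPU disables its PMI before the handler is removed.
safe = [("disable", 0), ("disable", 1), ("unregister",), ("overflow", 1)]

print(count_unknown_nmis(racy))
print(count_unknown_nmis(safe))
```

In this model the only way to get an unknown NMI is for some CPU to still have its overflow interrupt enabled at the moment the handler goes away, which would fit the messages showing up right at the end of collection.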

The messages are irritating, so I have recommended that the last two messages of each 3-message group be filtered out in /etc/rsyslog.conf. Adding these three lines to the beginning of the "RULES" section of /etc/rsyslog.conf will block all of the messages. I recommend just including the last two of these three lines, so that you don't lose track of "NMI received for unknown reason" (which can occur due to real hardware problems).

:msg, contains, "Uhhuh. NMI received for unknown reason " ~
:msg, contains, "Do you have a strange power saving mode enabled?" ~
:msg, contains, "Dazed and confused, but trying to continue" ~

After editing /etc/rsyslog.conf, restart rsyslog with

systemctl restart rsyslog

(The rsyslog control may vary by system version -- I have trouble keeping track of the changes in management infrastructure....)
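As a quick sanity check on which lines of a three-message group survive the recommended two-line filter, here is a small Python sketch. It only mimics the substring matching that the "contains" comparison performs; it is not rsyslog itself, and the helper name, reason code, and CPU number in the sample are made up for illustration.

```python
# The two substrings from the recommended (last two) filter lines.
FILTERED = [
    "Do you have a strange power saving mode enabled?",
    "Dazed and confused, but trying to continue",
]

def is_discarded(msg, patterns=FILTERED):
    """Mimic a 'contains' filter: discard if any pattern is a substring."""
    return any(p in msg for p in patterns)

# One illustrative three-message group.
group = [
    "Uhhuh. NMI received for unknown reason 30 on CPU 12.",
    "Do you have a strange power saving mode enabled?",
    "Dazed and confused, but trying to continue",
]
kept = [m for m in group if not is_discarded(m)]
print(kept)
```

Because the match is a plain substring comparison, the filter text has to agree with the kernel message character for character; a hyphenated "power-saving", for example, would not match.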

Dmitry_R_Intel1 · ‎07-10-2018

Recently, we made a tentative fix for the unknown NMI issues. It should hopefully remove the problem (at least, my local testing no longer reproduces the messages on a machine where they were reliably reproduced before). The patch we applied clears the PMU control registers after every collection instead of restoring them to their original values.

The fix should appear in the next VTune release.