This is a PSA to let impacted devs know that a kernel bug in fully patched MacOS makes AVX-512 unsafe to use under most circumstances.
Apple has been recently notified, but it is unclear when (or if) this issue will be resolved.
The issue occurs any time the kernel dispatches a signal to a process with a registered signal handler. When the handler returns, the K0-K7 AVX-512 opmask registers in the preempted thread are not properly restored. This leads to unexpected, difficult-to-debug intermittent failures in AVX-512 optimized code paths that use opmasks.
Because optimized libraries do not control the signal handling configuration of the apps they are linked into, there is major potential for rare, intermittent undefined behavior at runtime (data corruption, buffer overruns, truncated results, subtly incorrect calculations, etc.).
Some programming environments make extensive use of signals in their runtimes (e.g. Go), so all applications written in such languages that use AVX-512 are impacted by this issue.
This issue appears to go all of the way back to MacOS High Sierra, when AVX-512 kernel support was added, and persists up to and including currently supported versions of Catalina and Big Sur. Testing has not yet occurred on Monterey, but it is assumed to be impacted as well.
Affected Apple Hardware includes:
- iMac Pro (late 2017), discontinued early 2021
- Mac Pro (late 2019), still shipping
- MacBook Pro 13” Ice Lake (late 2020), discontinued late 2021
A minimal reproduction is available here:
The bug was originally reported/isolated here. See thread for Apple bug report near bottom.
Note: the Darwin kernel bug causing this issue has since been located, which considerably clarifies the scope of the issue:
- The scope of the bug in time is now clear. It appears to have been introduced in Catalina 10.15.6, which was released in mid-July 2020, rather than in High Sierra as first thought. So this has been in the wild for only around 16 months.
- The scope of the bug in terms of the sufficient processor state is considerably narrowed: it only occurs when all bits in both the ZMM_Hi256 and Hi16_ZMM state are zero. There are still real valid cases when that can be true, but a lot fewer than were originally assumed.
- The scope of the bug in terms of how it is triggered has very likely expanded. All of our reproductions depend on BSD signal handling to produce an easily repeatable trigger, but the implicated kernel code is not specific to signal handler returns. It seems likely that other kernel mechanisms that require restoring thread state could also trigger this, perhaps even ordinary multitasking preemption by the Darwin thread scheduler.
See here for more details: https://github.com/golang/go/issues/49233#issuecomment-964422587
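To make the narrowed processor-state condition a bit more concrete, here is a minimal Python sketch that only models the check described in the second point above. It is not kernel code. The function name and the byte-buffer representation are made up for illustration. The component sizes follow from the register layout: ZMM_Hi256 holds the upper 256 bits of ZMM0-ZMM15, and Hi16_ZMM holds all 512 bits of ZMM16-ZMM31.

```python
# Illustrative model only. The names and the byte-buffer representation are
# hypothetical; they are not taken from the kernel source.

ZMM_HI256_BYTES = 16 * 32  # upper 256 bits of ZMM0-ZMM15
HI16_ZMM_BYTES = 16 * 64   # all 512 bits of ZMM16-ZMM31


def matches_trigger_condition(zmm_hi256, hi16_zmm):
    """Return True when every bit in both state components is zero."""
    if len(zmm_hi256) != ZMM_HI256_BYTES or len(hi16_zmm) != HI16_ZMM_BYTES:
        raise ValueError("unexpected state component size")
    # any() over a bytes object is True if at least one byte is non-zero.
    return not any(zmm_hi256) and not any(hi16_zmm)
```

If that reading is right, a thread could plausibly meet the condition when it uses opmasks but has only ever written 128-bit or 256-bit values to ZMM0-ZMM15 and has never touched ZMM16-ZMM31.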
A quick update: testing indicates that this issue seems to have been resolved with the release of MacOS 12.2.0, with the corresponding kernel update to Darwin 21.3.0. The corresponding open source XNU kernel archive has not yet been updated for this release, so this conclusion is based on previously failing reproduction tests now passing, not on inspection of kernel source changes showing an explicit fix.
Note however! The MacOS 10.15.x (Catalina) and MacOS 11.x (Big Sur) security patches released at the same time as MacOS 12.2 (Monterey) did not contain corresponding kernel updates, and testing indicates that the fully patched Darwin 19.6.0 and 20.6.0 kernels in Catalina and Big Sur, respectively, remain susceptible to this issue. There is no indication of if or when Apple intends to fix this issue in pre-Monterey versions of MacOS that are still receiving updates.
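For anyone who wants to gate AVX-512 opmask code paths on the kernel version, here is a rough Python sketch of such a check. The version bounds are assumptions read off this post: Darwin 19.6.0 (the kernel in Catalina 10.15.6) as the first affected release and Darwin 21.3.0 as the first one where the reproductions pass. The helper names are hypothetical.

```python
import platform

# Assumed bounds, taken from the testing described in this post, not from
# any statement by Apple.
FIRST_AFFECTED = (19, 6, 0)  # Darwin kernel shipped with Catalina 10.15.6
FIRST_FIXED = (21, 3, 0)     # Darwin kernel shipped with MacOS 12.2


def parse_darwin_release(release):
    """Turn a release string such as '20.6.0' into a tuple of three ints."""
    numbers = [int(part) for part in release.split(".")[:3]]
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def opmasks_at_risk(system, release):
    """Return True if the kernel falls inside the assumed affected range."""
    if system != "Darwin":
        return False
    return FIRST_AFFECTED <= parse_darwin_release(release) < FIRST_FIXED


if __name__ == "__main__":
    # platform.release() reports the Darwin kernel release on MacOS.
    print(opmasks_at_risk(platform.system(), platform.release()))
```

Keep in mind that the kernel release number alone cannot tell a patched build from an unpatched one if Apple ever ships a fix for Catalina or Big Sur without bumping the Darwin version. The check also says nothing about whether the CPU supports AVX-512 in the first place.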