I'm experiencing weird crashes (screen freeze/CPU soft lockup/random app segfaults) on Linux, mainly when the workload involves video decoding. My best reproducer at the moment is playing two videos with `mpv` in parallel; these crash in about 10 minutes and often bring the whole computer down with them.
This is not just a graphics/i915 bug despite what I first thought: playing the videos with `-vo null` (no video output: no OpenGL/graphics involved) crashes mpv as well. This seems less likely to kill everything else, but at this point that might be down to luck.
I've also experienced crashes with a single video when multitasking, or just with firefox. Others using the same computer have reported the problem here: https://forums.puri.sm/t/is-anyone-else-experiencing-freezing-issues-with-librem-15-v3/1233
I think this is related to cpu frequency changes, because setting the cpu governor to performance works around the issue perfectly: I've never been able to crash when this setting is on. I've also had a stable usage with only one cpu (offlining the other 3 cores) even with the ondemand governor.
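For anyone who wants to try the same two workarounds, here is a minimal sketch of what they amount to. It assumes the standard Linux sysfs layout under `/sys/devices/system/cpu` and root privileges; the function names are made up for illustration, and the `root` parameter only exists so the code can be pointed at a test directory.

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")  # assumed standard sysfs location


def cpu_dirs(root=SYSFS_CPU):
    """Return the per-core directories (cpu0, cpu1, ...) sorted by core number."""
    dirs = [p for p in Path(root).glob("cpu[0-9]*") if p.name[3:].isdigit()]
    return sorted(dirs, key=lambda p: int(p.name[3:]))


def set_governor(governor, root=SYSFS_CPU):
    """Write the governor name to every core that exposes cpufreq (needs root)."""
    changed = []
    for cpu in cpu_dirs(root):
        target = cpu / "cpufreq" / "scaling_governor"
        if target.exists():
            target.write_text(governor + "\n")
            changed.append(cpu.name)
    return changed


def keep_only_cpu0(root=SYSFS_CPU):
    """Take every core except cpu0 offline by writing 0 to its 'online' file."""
    offlined = []
    for cpu in cpu_dirs(root):
        online = cpu / "online"
        if cpu.name != "cpu0" and online.exists():
            online.write_text("0\n")
            offlined.append(cpu.name)
    return offlined
```

Calling `set_governor("performance")` corresponds to the first workaround and `keep_only_cpu0()` to the second; both settings are lost on reboot.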
For what it's worth, the "BIOS" is coreboot. It should not be "locking" anything, so linux is free to activate features as it finds them.
- I have run the Intel® Processor Diagnostic Tool (64-bit), which passed (I ran it multiple times to be sure)
- I have run memtest86, because random crashes can be due to faulty ram, which did not find any defect in 10 hours (had time for multiple passes as well)
- I have attached the output of the Intel® System Support Utility script, please note that the kernel there is old but I have reproduced the behavior with multiple kernels: 4.4.88, fedora 25's 4.8.6-300.fc25.x86_64, debian's 4.12.0-2-amd64, upstream 4.14.0-rc2
- I should have the latest available microcode (20170707 release, /sys/devices/system/cpu/cpu0/microcode/version tells me 0xba)
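A quick way to double-check the microcode point on every core, not just cpu0, could look like the sketch below. It assumes each core exposes `microcode/version` under `/sys/devices/system/cpu` as a hexadecimal string such as `0xba`; the helper names are hypothetical.

```python
from pathlib import Path


def microcode_versions(root="/sys/devices/system/cpu"):
    """Map each core name to its reported microcode revision as an int."""
    versions = {}
    for f in sorted(Path(root).glob("cpu[0-9]*/microcode/version")):
        # The file holds a hex string like "0xba"; base 16 accepts the prefix.
        versions[f.parent.parent.name] = int(f.read_text().strip(), 16)
    return versions


def all_cores_match(versions):
    """True when at least one core was found and all share one revision."""
    return len(set(versions.values())) == 1
```

If `all_cores_match(microcode_versions())` came back false, that would suggest the update was not applied to every core.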
I am honestly out of ideas on what to try next. For now at least my computer is usable if I restrict myself to the performance governor when plugged in / 1 core when on battery, but this is not a proper solution and I'd like to understand what's happening.
I'm obviously willing to test more things or help further diagnose the issue if possible; guidance is welcome though!
Some more info:
mpv version (debian testing's):
mpv 0.26.0 (C) 2000-2017 mpv/MPlayer/mplayer2 projects
built on UNKNOWN
ffmpeg library versions:
ffmpeg version: 3.3.4-1
Example of crash, logs from this morning:
Sep 29 08:03:58 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000246
Sep 29 08:03:58 kernel: IP: __list_del_entry_valid+0x29/0x90
Sep 29 08:03:58 kernel: PGD 0 P4D 0
Sep 29 08:03:58 kernel: Oops: 0000 [#1] SMP
Sep 29 08:03:58 kernel: Modules linked in: ctr ccm fuse cpufreq_powersave cpufreq_userspace cpufreq_conservative snd_hda_codec_hdmi ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt arc4 nf_conntrack_ipv6 ath9k nf_defrag_ipv6 ath9k_common ipt_REJECT nf_reject_ipv4 ath9k_hw nf_log_ipv4 nf_log_common xt_LOG xt_recent ath xt_limit xt_tcpudp snd_soc_skl mac80211 xt_addrtype snd_soc_skl_ipc snd_hda_codec_realtek snd_hda_codec_generic snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_sst_match snd_soc_core intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp snd_hda_codec coretemp kvm_intel snd_hwdep snd_hda_core kvm cfg80211 snd_pcm snd_timer irqbypass snd intel_cstate intel_uncore joydev intel_rapl_perf pcspkr serio_raw sg iTCO_wdt iTCO_vendor_support soundcore rfkill nf_conntrack_ipv4 nf_defrag_ipv4
Sep 29 08:03:58 kernel: xt_conntrack shpchp intel_pch_thermal battery ac topstar_laptop sparse_keymap processor_thermal_device evdev intel_soc_dts_iosf int340x_thermal_zone ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack libcrc32c crc32c_generic parport_pc ppdev lp parport iptable_filter ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto btrfs xor zstd_decompress zstd_compress xxhash raid6_pq algif_skcipher af_alg dm_crypt dm_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel i915 ghash_clmulni_intel pcbc video i2c_algo_bit drm_kms_helper i2c_i801 psmouse aesni_intel prime_numbers ahci xhci_pci aes_x86_64 crypto_simd cryptd glue_helper libahci nvme xhci_hcd drm libata nvme_core usbcore scsi_mod button
Sep 29 08:03:58 kernel: CPU: 1 PID: 5781 Comm: mpv/ao Tainted: G W 4.14.0-rc2 #14
Sep 29 08:03:58 kernel: Hardware name: Purism Librem 15 v3/Librem 15 v3, BIOS 4.6-a86d1b-Purism-5 07/27/2017
Sep 29 08:03:58 kernel: task: ffff924822b20040 task.stack: ffffa4fa83880000
Sep 29 08:03:58 kernel: RIP: 0010:__list_del_entry_valid+0x29/0x90
Sep 29 08:03:58 kernel: RSP: 0018:ffffa4fa83883cb0 EFLAGS: 00010203
Sep 29 08:03:58 kernel: RAX: 0000000000000000 RBX: ffffa4fa837fbd58 RCX: dead000000000200
Sep 29 08:03:58 kernel: RDX: 0000000000000246 RSI: ffffa4fa80d88448 RDI: ffffa4fa837fbd60
Sep 29 08:03:58 kernel: RBP: ffffa4fa83883cb0 R08: ffffa4fa837fbdb8 R09: ffffa4fa80d88448
Sep 29 08:03:58 kernel: R10: 0000000000000001 R11: 000000007fffffff R12: ffffa4fa837fbd60
Sep 29 08:03:58 kernel: R13: ffffa4fa837fbdd0 R14: ffffa4fa837fbdc0 R15: ffffa4fa80d88448
Sep 29 08:03:58 kernel: FS: 00007f54175c0700(0000) GS:ffff92483ec80000(0000) knlGS:0000000000000000
Sep 29 08:03:58 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 29 08:03:58 kernel: CR2: 0000000000000246 CR3: 000000026c196004 CR4: 00000000003606e0
Sep 29 08:03:58 kernel: Call Trace:
Sep 29 08:03:58 kernel: plist_del+0x3b/0xc0
Sep 29 08:03:58 kernel: __unqueue_futex+0x2f/0x40
Sep 29 08:03:58 kernel: mark_wake_futex+0x3d/0x50
Sep 29 08:03:58 kernel: futex_requeue+0x8a9/0xa40
Sep 29 08:03:58 kernel: do_futex+0x2ae/0xb10
Sep 29 08:03:58 kernel: SyS_futex+0x13b/0x180
Sep 29 08:03:58 kernel: ? SyS_write+0x79/0xc0
Sep 29 08:03:58 kernel: entry_SYSCALL_64_fastpath+0x1e/0xa9
Sep 29 08:03:58 kernel: RIP: 0033:0x7f5454a2d91d
Sep 29 08:03:58 kernel: RSP: 002b:00007f54175bf8e8 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca
Sep 29 08:03:58 kernel: RAX: ffffffffffffffda RBX: 0000560690f967a0 RCX: 00007f5454a2d91d
Sep 29 08:03:58 kernel: RDX: 0000000000000001 RSI: 0000000000000084 RDI: 0000560690847fbc
Sep 29 08:03:58 kernel: RBP: 0000560690f96938 R08: 0000560690847f90 R09: 000000000001a394
Sep 29 08:03:58 kernel: R10: 000000007fffffff R11: 0000000000000283 R12: 0000000000000e50
Sep 29 08:03:58 kernel: R13: 0000560690f95a78 R14: 0000560690f95a70 R15: 0000560690f625c0
Sep 29 08:03:58 kernel: Code: 00 00 55 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57 08 48 89 e5 48 39 c8 74 27 48 b9 00 02 00 00 00 00 ad de 48 39 ca 74 2c <48> 8b 32 48 39 fe 75 35 48 8b 50 08 48 39 f2 75 40 b8 01 00 00
Sep 29 08:03:58 kernel: RIP: __list_del_entry_valid+0x29/0x90 RSP: ffffa4fa83883cb0
Thank you for using the Intel(R) Communities.
I understand you are facing system crashes while having two videos playing and some other scenarios.
In this case, it would be recommended to have this inquiry handled by the Linux*/Graphics support team to have their expertise handling your problem.
That was my first thought as well, but since I was able to reproduce the crash while playing without any output, I do not believe that video output is involved; `mpv -vo null` really just reads the file, decodes it and throws the output away. There is no graphics acceleration, no OpenGL... The buffer is just not displayed.
I do not think it is their time to shine here.
Decoding video (in this case h264) is a very complex operation; mpv uses ffmpeg, which has optimized the process heavily.
Part of the code is written in assembly with SSE vector instructions, the kind of thing that has been known to trigger "complex loads" which have led to crashes in the past (cf. the well-known prime95 freeze).
Without help I will keep trying to minimize the reproducer; I might try to take the code out of ffmpeg and run it in a loop, but this is a lot of work. I'm especially perplexed by the apparent need for CPU frequency changes here.
Dominique Martinet | Asmadeus
By the way, I said two videos but it actually depends on the media being played: basically I adjust the load to be big enough to force the CPU frequency to increase a bit but small enough to have it drop back down "often". Getting it to change back and forth between ~1GHz and ~2.5GHz is what has given me the best results.
To give concrete examples, using a slightly more intensive video (60fps@1080p, for example http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_60fps_normal.mp4 ), I can get it to crash with just one player.
With a smaller one (e.g. the older 24fps@720p http://download.blender.org/peach/bigbuckbunny_movies/big_buck_bunny_720p_h264.mov ), I actually needed to start three instances of mpv to get the CPU frequency to oscillate and crash "quickly".
These files are pretty famous and have been playing fine for over half a day/a whole night with either 3 of the 4 CPUs taken offline (writing 0 to `/sys/devices/system/cpu/cpu[1-3]/online`) or with `cpufreq-set -g performance`, so I am confident that both the media and this version of mpv/ffmpeg are fine by themselves.
It is only when using multiple cores at varying frequencies that I get frequent crashes.
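To put a number on "oscillating", one option is to count how often a core's frequency crosses between a low band and a high band. The sketch below is only an illustration: the two thresholds (1.2GHz and 2.4GHz, expressed in kHz) are my own assumptions chosen around the ~1GHz and ~2.5GHz figures above, and `count_swings` is a hypothetical helper.

```python
def count_swings(samples_khz, low_khz=1_200_000, high_khz=2_400_000):
    """Count low->high and high->low transitions in a series of frequency samples.

    Samples between the two thresholds are ignored (hysteresis), so a reading
    that hovers around one threshold is not counted as repeated swings.
    """
    swings = 0
    state = None  # "low", "high", or None before the first clear reading
    for khz in samples_khz:
        if khz <= low_khz:
            new_state = "low"
        elif khz >= high_khz:
            new_state = "high"
        else:
            continue  # in between the bands: keep the previous state
        if state is not None and new_state != state:
            swings += 1
        state = new_state
    return swings
```

A run with the performance governor should give a count close to zero, while a run that crashes would presumably show many swings per minute.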
Once again, that link is for the Intel *graphics* component, whereas I do not use the graphics driver at all. I could blacklist all drm/i915 modules and still reproduce crashes. (Hm, I need to actually try that; I will report back tonight.)
If you have an actual general Linux support channel I would not mind switching, but the linuxgraphics support "forum" has nothing to do with the CPU itself. They will just laugh at me if I complain about a non-graphics issue there.
What do you actually need to take this report seriously? I'd honestly rather not have to install windows on this laptop, I guess there are trial versions that do not need a license but it is as much a matter of principle...
> I could blacklist all drm/i915 modules and still reproduce crashes
I can confirm this part: I removed all graphics kernel modules (drm.ko, drm_kms_helper.ko and i915.ko), rebuilt the initrd, rebooted in single-user mode (X wouldn't start anymore) and reproduced just fine.
There is no graphics involved in this bug. It is about h264 decoding instructions and CPU frequency changes.
Thank you for the answers provided.
I have proceeded to perform a test with the same configuration you have but within a Windows 10* environment.
These would be the results:
At this point I would like to check, are there certain video configurations within the player that have been changed? I can certainly reproduce them to see if the problem happens with the official driver from our site.
Some screenshots have been attached.
Thank you very much for giving it a try.
I have not changed any option with mpv, I believe that even for windows it should use ffmpeg and similar acceleration instructions (sse or similar).
What I did notice as being important to reproduce, though, is that the CPU frequently changes frequency during the playback. I am not sure how to check under Windows, but Linux has an interface file called `/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq` (one per core) that gives the current CPU frequency, and basically I could only get crashes if the frequency alternated regularly between roughly 1GHz and 2.8+GHz. I believe Windows should exhibit similar frequency swings if the laptop is configured in power saving mode (it should actually be done in hardware, through dynamic voltage & frequency scaling (DVFS)), but you might need slightly more usage than 20% if that is what you were observing. Have you tried opening the video multiple times in parallel?
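In case it helps to watch the frequency on the Linux side, here is a minimal sketch that polls that file for every core. It assumes the usual sysfs layout and that the value is reported in kHz; the function names and the default 0.1s polling interval are arbitrary choices of mine.

```python
import time
from pathlib import Path


def read_cur_freqs(root="/sys/devices/system/cpu"):
    """Return {core name: current frequency in kHz} for cores exposing cpufreq."""
    freqs = {}
    for f in Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        freqs[f.parent.parent.name] = int(f.read_text().strip())
    return freqs


def sample(count, interval_s=0.1, root="/sys/devices/system/cpu"):
    """Take `count` snapshots, sleeping `interval_s` between them."""
    snapshots = []
    for i in range(count):
        snapshots.append(read_cur_freqs(root))
        if i + 1 < count:
            time.sleep(interval_s)
    return snapshots
```

For example, `sample(600)` would record roughly one minute of per-core readings, which can then be inspected for swings.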
The CPU model is also different, I am not sure how specific this issue might be. I have other laptops/NUCs with intel CPUs and have never experienced such an issue.
For information, I have asked the laptop makers to see if they can reproduce the issue on a wider range of models to check if this would be a bad series as well (there are only a handful of reports at this point). I will report when I hear back from them.
I am also starting to think it could be related to the motherboard itself, for example if the CPU input voltage is not steady enough. I believe a sudden frequency increase combined with power-hungry instructions could also cause unwanted voltage fluctuations, which might cause this kind of behavior. I do not know how to confirm or rule that out though. I will try to look with an oscilloscope towards the end of the month if I can find a suitable pin to probe close enough to the CPU (do not hold your breath).
Dominique Martinet | Asmadeus
Thank you for the reply, Asmadeus
Would you please clarify what you mean by: "I believe that even for windows it should use ffmpeg and similar acceleration instructions (sse or similar)"? Is that something you believe should be added, or is it something that is used by mpv already?
About the CPU frequency changes
The frequency faced during the tests performed (new tests) did change, but that is a regular behavior, depending on the tasks performed by the CPU, the frequency can change.
These frequency changes happened, and no crashes or performance problems were faced in a windows 10 environment.
Related to the power plan
The system used has Balanced mode only and this is intended for the Surface Pro 4*.
Three instances of the video were played while the tests were performed, with mpv and with VLC players.
About the CPU model used
Are you confirming that other systems with the same OS configuration do not face the problem?
It is great to hear you have reached the manufacturer of the system to get this tested. Please do report back when results are present.
It is possible that the motherboard has an effect in the way you mentioned (CPU voltage management).
replying in order:
mpv and ffmpeg/sse instructions: It's something mpv/vlc already should do by default so I do not think anything will change.
Frequency changes: OK. Frequency changes do appear to work here; it just looks like it needs a lot of them to exhibit crashes.
Power plans: I'm not sure if the question was for me, but using a "performance" power plan (CPU always stays close to the maximum frequency) I have not experienced any crash, which is why I believe these "frequency swings" are important.
CPU models: Yes, I have the same system on another skylake CPU (an intel NUC with i5-6260U) as well as another older laptop (not skylake though) running the same software with no issue.
I'll update here when I have heard back from the manufacturer.
Dominique Martinet | Asmadeus
Thank you for reporting back, Asmadeus
Let's see what the manufacturer says. Since it is possible that this is related to a single configuration, we can then proceed accordingly.
This is to do a follow up to your inquiry and find out if you have further questions or if the system's manufacturer has provided some details.
Thanks for the follow up.
I haven't had many replies, but I'm still traveling so I can't investigate as much as I would like right now.
I'll hopefully have more details around the end of the month.
Dominique Martinet | Asmadeus