I'm troubleshooting an issue on an Arria 10 SoC board running the Linux 4.20 kernel where we randomly see corruption reported on filesystems located on the eMMC. It doesn't happen frequently and is not deterministic, but after many reboots, one or more of the mounted filesystems report errors (even filesystems mounted read-only). As part of tracking down the issue (and before we had narrowed it down to the eMMC), we tried other filesystem types (ext3->ext4 and ext3->cramfs), but the problem persisted.
An interesting data point is that previously we were using the Linux 3.10 kernel and did NOT see this issue. We went back and confirmed that the issue does not occur when using the 3.10 kernel by testing multiple boards over the course of 10 days. Given this and other debugging performed, the issue appears related to MMC driver changes in the 4.20 kernel (which appear to be substantial based on a cursory comparison of the two versions).
I'm currently comparing the MMC controller registers between the two kernels, and I have added CMD logging to the MMC kernel driver so that I can compare the CMDs sent during configuration.
Has anyone encountered this issue? Any suggestions on things to try?
If you have any questions for me, please feel free to ask.
In my experience, I have not come across this kind of file corruption before. Also, which Quartus and SoC EDS versions are you currently using?
Thanks for the clarification regarding the kernel versions. First, I recommend trying one of the kernel versions available on our GitHub page; one you could try is version 4.14:
Please let me know if the above still doesn't fix the issue.
We are currently using Quartus Prime Pro 18.0.
The full version is:
Quartus Prime Version 18.0.1 Build 261 06/28/2018 Patches 1.20,1.34,1.45 SJ Pro Edition
We have tried using the 4.14-lt release but it had the same issue.
Was there any particular error seen during any bootup?
Did you spot any differences in the MMC controller registers?
I will check with our internal team whether any driver is missing or whether there are any alternatives. Can you share the part number of the eMMC device?
No errors from the MMC kernel driver. The only errors are related to the filesystem(s) when they are mounted.
I do see differences in the MMC controller registers and I'm testing with changes now.
The part number of the eMMC device is MTFC4GACAANA-4M IT.
From what I have checked, this part number does not appear to have been tested, so we cannot confirm full compatibility with the latest kernel. I will check with our internal team for further information, but it will take some time.
If you could share any information regarding the MMC controller registers, that would help.
Also, could you share what sort of file corruption you are seeing? A screenshot is fine if that is easier for you.
Okay, thank you for checking into the eMMC device compatibility.
Regarding the MMC controller registers, the differences between Linux 3.10 and Linux 4.20 are:
- The wait_priv_data bit in the CMD register seems to be normally set in Linux 3.10 but not in Linux 4.20.
- The msize field and rx and tx watermark fields in the FIFOTH register are set differently between Linux 3.10 and Linux 4.20. For example, nominally I see a value of 0x21ff0200 for Linux 3.10 and a value of 0x607f0200 for Linux 4.20.
- Interestingly, the PWREN register has a value of 0 for Linux 3.10 but a value of 1 (which I would expect) for Linux 4.20.
As a test, I tried modifying these registers for the Linux 4.20 based image to match those for Linux 3.10 but it did not help.
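For reference when comparing the two FIFOTH values above, here is a minimal Python sketch that splits the register into its fields. The bit layout (tx watermark in bits 11:0, rx watermark in bits 27:16, msize in bits 30:28) is an assumption based on the layout used by the Linux dw_mmc driver and should be checked against the Arria 10 HPS register map. The function name is only illustrative.

```python
def decode_fifoth(value):
    """Split a FIFOTH register value into its fields.

    Assumed layout (dw_mmc style, verify against the HPS register map):
      bits 11:0  -> tx watermark
      bits 27:16 -> rx watermark
      bits 30:28 -> msize (DMA burst size encoding)
    """
    return {
        "tx_wmark": value & 0xFFF,
        "rx_wmark": (value >> 16) & 0xFFF,
        "msize": (value >> 28) & 0x7,
    }


# The two values quoted above for each kernel.
for label, raw in (("3.10", 0x21FF0200), ("4.20", 0x607F0200)):
    fields = decode_fifoth(raw)
    print(f"Linux {label}: 0x{raw:08x} -> {fields}")
```

Under that assumed layout, the 3.10 value decodes to msize 2, rx watermark 511, tx watermark 512, and the 4.20 value decodes to msize 6, rx watermark 127, tx watermark 512, so only msize and the rx watermark differ.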
In addition, I noticed differences in the ext CSD mode set for the eMMC itself between Linux 3.10 and Linux 4.20. For Linux 3.10, HPI_MGMT is enabled but in Linux 4.20 it is not. Also, for Linux 4.20 CACHE_CTRL is enabled but it is not in Linux 3.10. Finally, for Linux 4.20 POWER_OFF_NOTIFICATION is enabled but it is not in Linux 3.10.
Also as a test, I tried modifying the switch commands used to configure these mode settings in the ext CSD for the eMMC device to match Linux 3.10 but it did not help.
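To make the ext CSD comparison repeatable, something like the following Python sketch could diff two dumps for the three fields mentioned above. It assumes each dump is a 512-byte register written as two hex digits per byte starting at byte 0 (the format I would expect from the kernel's debugfs ext_csd file, but please verify on your kernel). The byte offsets are assumed from the JEDEC eMMC layout as mirrored in Linux's include/linux/mmc/mmc.h, and the function names are hypothetical.

```python
# Byte offsets within the 512-byte EXT_CSD register. These are assumed from
# the JEDEC eMMC layout as mirrored in Linux's include/linux/mmc/mmc.h.
EXT_CSD_FIELDS = {
    "CACHE_CTRL": 33,
    "POWER_OFF_NOTIFICATION": 34,
    "HPI_MGMT": 161,
}


def parse_ext_csd(hex_dump):
    """Convert a hex dump (two hex digits per byte, byte 0 first) to bytes."""
    data = bytes.fromhex(hex_dump.strip())
    if len(data) != 512:
        raise ValueError(f"expected 512 bytes, got {len(data)}")
    return data


def diff_ext_csd(old_dump, new_dump):
    """Return {field: (old_value, new_value)} for tracked fields that differ."""
    old, new = parse_ext_csd(old_dump), parse_ext_csd(new_dump)
    return {
        name: (old[idx], new[idx])
        for name, idx in EXT_CSD_FIELDS.items()
        if old[idx] != new[idx]
    }


if __name__ == "__main__":
    # Synthetic dumps for illustration only, not values read from the board.
    old = bytearray(512)
    new = bytearray(512)
    old[161] = 1  # HPI_MGMT enabled in the "old" dump
    new[33] = 1   # CACHE_CTRL enabled in the "new" dump
    new[34] = 1   # POWER_OFF_NOTIFICATION set in the "new" dump
    print(diff_ext_csd(old.hex(), new.hex()))
```

Running the same diff after changing the switch commands would confirm whether the card really ended up in the intended mode.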
Some examples of file corruption that are seen (note that the occurrences are random):
[ 1.130239] Waiting for root device /dev/mmcblk0p5...
[ 1.136248] mmc_host mmc0: Bus speed (slot 0) = 50000000Hz (slot req 52000000Hz, actual 50000000HZ div = 0)
[ 1.150205] mmc0: new high speed MMC card at address 0001
[ 1.156915] mmcblk0: mmc0:0001 P1XXXX 3.60 GiB
[ 1.163152] mmcblk0boot0: mmc0:0001 P1XXXX partition 1 16.0 MiB
[ 1.170272] mmcblk0boot1: mmc0:0001 P1XXXX partition 2 16.0 MiB
[ 1.278405] mmcblk0: p1 p2 p3 p4 < >
[ 1.312344] VFS: Cannot open root device "mmcblk0p5" or unknown-block(179,5): error -6
[ 1.320237] Please append a correct "root=" boot option; here are the available partitions:
[ 1.328661] 0100 8192 ram0
[ 1.328663] (driver?)
[ 1.334759] 0101 8192 ram1
[ 1.334761] (driver?)
[ 1.340838] b300 3776512 mmcblk0
[ 1.340840] driver: mmcblk
[ 1.347618] b301 143360 mmcblk0p1 55af3d06-01
[ 1.354398] b302 1024 mmcblk0p2 55af3d06-02
[ 1.361168] b303 143360 mmcblk0p3 55af3d06-03
[ 1.367945] b304 1 mmcblk0p4
[ 1.373775] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(179,5)
[ 46.336168] EXT4-fs error (device mmcblk0p5): htree_dirblock_to_tree:1007: inode #1999: block 10117: comm charon: bad entry in directory: directory entry overrun - offset=0, inode=3892174576, rec_len=57072, name_len=253, size=1024
[ 305.111761] EXT4-fs (mmcblk0p5): error count since last fsck: 1
[ 305.117683] EXT4-fs (mmcblk0p5): initial error at time 46: htree_dirblock_to_tree:1007: inode 1999: block 10117
[ 305.127746] EXT4-fs (mmcblk0p5): last error at time 46: htree_dirblock_to_tree:1007: inode 1999: block 10117
[ 1.257384] mmc0: new high speed MMC card at address 0001
[ 1.264015] mmcblk0: mmc0:0001 P1XXXX 3.60 GiB
[ 1.269703] mmcblk0boot0: mmc0:0001 P1XXXX partition 1 16.0 MiB
[ 1.276776] mmcblk0boot1: mmc0:0001 P1XXXX partition 2 16.0 MiB
[ 1.385343] mmcblk0: p1 p2 p3 p4 < p5 >
[ 1.402715] EXT4-fs (mmcblk0p5): mounting ext3 file system using the ext4 subsystem
[ 1.511161] EXT4-fs (mmcblk0p5): ext4_check_descriptors: Block bitmap for group 0 not in group (block 0)!
[ 1.520713] EXT4-fs (mmcblk0p5): group descriptors corrupted!
[ 1.526557] VFS: Cannot open root device "mmcblk0p5" or unknown-block(179,5): error -117
[ 1.534634] Please append a correct "root=" boot option; here are the available partitions:
[ 87.942916] EXT4-fs error (device mmcblk0p5): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 81 vs 82 free clusters
[ 92.464179] JBD2: Spotted dirty metadata buffer (dev = mmcblk0p5, blocknr = 1). There's a risk of filesystem corruption in case of system crash.
[ 92.488648] JBD2: Spotted dirty metadata buffer (dev = mmcblk0p5, blocknr = 1). There's a risk of filesystem corruption in case of system crash.
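Since the failures are random and spread over many reboots, a small script can help tally which symptom appears in each captured boot log. The Python sketch below is only an illustration: the patterns are taken from the messages quoted above, while the category names and the classify function are hypothetical.

```python
import re

# Patterns are drawn from the kernel messages quoted above; the category
# names are hypothetical labels chosen for this sketch.
PATTERNS = [
    ("root_open_failed",
     re.compile(r'VFS: Cannot open root device "(\S+)".*error (-\d+)')),
    ("group_descriptors",
     re.compile(r"EXT4-fs \((\S+)\): group descriptors corrupted")),
    ("ext4_error",
     re.compile(r"EXT4-fs error \(device (\S+)\): (\w+):\d+")),
    ("jbd2_dirty_metadata",
     re.compile(r"JBD2: Spotted dirty metadata buffer \(dev = (\S+),")),
]


def classify(log_text):
    """Return a list of (category, captured groups), one per matching line."""
    hits = []
    for line in log_text.splitlines():
        for name, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((name, match.groups()))
                break  # report at most one category per line
    return hits
```

For the root-device failures, the captured error codes -6 and -117 correspond to ENXIO and EUCLEAN on Linux, which separates the "partition missing" case from the "filesystem structure corrupted" case.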
Let me know if you need more information.
Thanks, I will look further into the logs you have provided.
I am also checking with our internal team whether they have seen similar error logs before.
It will take some time; I will come back with my findings.
Thanks very much for your help.
Could you provide a list of eMMC devices that were tested with the Linux 4 kernel? I think it could be an interesting test for us to try with one of the known supported devices.
We do have a list of supported flash devices for Arria 10 SoC here (bottom of the page for eMMC):
I am still checking whether anyone on our internal team has tested the device you are using.