We are using the Edison as a standalone embedded device, and we've noticed a boot reliability issue with the board. The board boots successfully about 95% of the time (soft boots and hard boots have slightly different numbers). On other platforms the U-Boot boot delay is a factor, but we set this delay to 0 on the Edison (confirmed via observation on the console) and found no improvement. For an unattended system, booting with several nines of reliability is quite important.
Have others seen similar boot reliability numbers? Your experience, as well as suggestions on how to get to high boot reliability, are welcome.
Note: power cycling until successful boot is not possible because there's no person around to go through this process. Board boots successfully after a hard reset after a failed boot attempt.
Could you share some more details about your project? How's the scenario exactly? We would like to know how you're powering the board, if there's any external circuitry connected to it, the image that you're using, etc. If you're using it as standalone embedded device my guess is that you're using the Mini-breakout board, right?
Any other detail about your connection and possible software changes would be useful. Also, we would like to know how you're testing the boot reliability so we can replicate it.
We have an Edison on a custom adapter board that is modeled on the Intel Mini-Breakout board. We've added our own peripherals, and the largest I/O-related change is that we have the USB On-The-Go port disconnected (floating). We also use the UART and some GPIOs, but only a couple of GPIOs are used in our reliability testing. It is worth noting that in our full application we have no problems with the Edison once it is fully running. The board is powered by a switching converter on our board that outputs 4.3V.
We find that about 5% of the time, the Edison doesn't completely boot. You can watch the boot process on the serial console, and it appears to proceed normally. However, when you enter your user name at the login prompt, the Edison appears to freeze after it displays the password prompt. In addition, our services launched by systemd do not seem to be running (one controls the state of an LED, and we don't see the LED states we expect), even though you can see the service launched in the boot log. To add to the strangeness, at the password prompt you can hit Ctrl-C and see the tty restart and a new login prompt displayed. I have included an example boot failure log below so you can see this behavior. There are a bunch of odd escape characters polluting the log output, but I think you'll be able to see what we're seeing.
Based on prior experience, we set the bootdelay parameter in U-Boot to -2. This eliminates the pause for user input that can prevent a normal boot. However, this change had no significant impact on boot reliability.
In order to test boot reliability, we have created a simple service launched by systemd. This service reboots the Edison when the uptime exceeds five minutes, and it also posts uptime information to a database so we can observe the behavior. Several Edisons run this service, and they are all connected to a wall timer that toggles wall power to the power supply every 30 minutes. This tests both hard and soft reboots, and we can run it autonomously for hours or days at a time.
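For readers who want to reproduce this kind of test, here is a minimal sketch of the reboot loop described above. It is not the actual service: the function names, the poll interval, and the `report` placeholder (standing in for the database post) are assumptions made for illustration, and it assumes a Linux system where `/proc/uptime` is available and the `reboot` command is on the path.

```python
import subprocess
import time

REBOOT_AFTER_SECONDS = 5 * 60  # five minutes of uptime, as described above


def read_uptime_seconds(path="/proc/uptime"):
    """Return system uptime in seconds (first field of /proc/uptime on Linux)."""
    with open(path) as f:
        return float(f.read().split()[0])


def should_reboot(uptime_seconds, threshold=REBOOT_AFTER_SECONDS):
    """True once uptime has exceeded the threshold."""
    return uptime_seconds > threshold


def report(uptime_seconds):
    """Hypothetical placeholder for posting uptime to a database."""
    print("uptime=%ds" % uptime_seconds)


def main(poll_interval=10):
    """Poll uptime, report it, and trigger a soft reboot past the threshold."""
    while True:
        uptime = read_uptime_seconds()
        report(uptime)
        if should_reboot(uptime):
            subprocess.call(["reboot"])  # requires root privileges
            return
        time.sleep(poll_interval)
```

Calling `main()` from a script started by a systemd unit would give the soft-reboot half of the test; the hard reboots come from the wall timer, not from the code.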
Thanks for your thoughts.
PSH KERNEL VERSION: b0182b2b
SCU IPC: 0x800000d0 0xfffce92c
PSH miaHOB version: TNG .B0 .VVBD .0000000c
microkernel built 11:24:08 Feb 5 2015
******* PSH loader *******
PCM page cache size = 192 KB
Cache Constraint = 0 Pages
Arming IPC driver ..
Adding page store pool ..
PagestoreAddr(IMR Start Address) = 0x04899000
pageStoreSize(IMR Size) = 0x00080000
*** Ready to receive application ***
U-Boot 2014.04 (Dec 30 2015 - 15:20:03)
DRAM: 980.6 MiB
MMC: tangier_sdhci: 0
Partitioning already done...
Flashing already done...
GADGET DRIVER: usb_dnl_dfu
5330528 bytes read in 132 ms (38.5 MiB/s)
Valid Boot Flag
Setup Size = 0x00003c00
Magic signature found
Using boot protocol version 2.0c
Linux kernel version 3.10.17-yocto-standard (slanzise@build) # 2 SMP PREEMPT Wed Aug 26 17:32:38 PDT 2015
Building boot_params at 0x00090000
Loading bzImage at address 00100000 (5315168 bytes)
Magic signature found
Kernel command line: "rootwait root=PARTUUID=012b3303-34ac-284d-99b4-34e03a2335f4 rootfstype=ext4 console=ttyMFD2 earlyprintk=ttyMFD2,keep loglevel=4 g_multi.ethernet_config=cdc systemd.unit=multi-user.target hardware_id=00 g_multi.iSerialNumber=7a1dea0a4b43e8c5cf4b42170ab3013b g_multi.dev_addr=02:00:86:b3:01:3b platform_mrfld_audio.audio_codec=dummy"
Starting kernel ...
[ 0.696400] pca953x 1-0020: failed reading register
[ 0.701543] pca953x 1-0021: failed reading register
[ 0.706595] pca953x 1-0022: failed reading register
[ 0.711772] pca953x 1-0023: failed reading register
[ 1.068917] snd_soc_sst_platform: Enter:sst_soc_probe
[ 1.466893] pmic_ccsm pmic_ccsm: Error reading battery profile from battid frmwrk
[ 1.475296] pmic_ccsm pmic_ccsm: Battery Over heat exception
[ 1.475381] pmic_ccsm pmic_ccsm: Battery0 temperature inside boundary
Welcome to [1mLinux [0m...
Although we don't have many of them, we re-ran the test using Intel Mini-Breakout boards to rule out our specific hardware design as the cause, and we still see the same booting issue.
The stock image downloaded from the Edison downloads page also exhibits this issue on the default hardware. I'm guessing this is a systemd issue, but we haven't isolated the problem to the point where we can be sure.
Hello, Pablo. I work with slanzise and have been attempting to mitigate this boot issue by patching our meta-intel-edison layer. Based on log output at boot (pasted above) and a few similar-sounding bug reports regarding systemd [1-2], I wonder if the root cause could be a race condition somewhere in systemd during boot. Without a good way to test this theory, I went about trying to upgrade the version of systemd baked into our Yocto images, without success. I've attempted to cherrypick the latest systemd recipe from openembedded (systemd version 228 vs meta-intel-edison's 213) into our layer, but have been foiled thus far by library and kernel-module dependency fallout.
Before investing more time in this speculative upgrade, I was wondering two things:
1) Have there been any reported issues related to the version of systemd installed by meta-intel-edison (v213)?
2) Is there any precedent for upgrading this version of systemd in the meta-intel-edison layer?
Thanks for helping us look into this.
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1385630 (Ubuntu bug #1385630, "systemd 215 hangs during boot")
Thank you for sharing all this information. So just to try and replicate the issue, you're using the latest Edison image, right? Or at least you downloaded the one from this site, I believe: https://software.intel.com/en-us/iot/hardware/edison/downloads
Also, there's no need then for any external circuitry to conduct this test, right? We'll be using the Mini-Breakout board, just as you did.
Could you please provide the service used to test reliability? We would like to have your other custom services if you're ok with that, but they are not a priority.
Regarding evanmeagher's questions about systemd, we will investigate and get back to you with an answer.
Thanks for the reply, Pablo. Let me answer your questions in-line.
> you're using the latest Edison image, right?
That's correct. We've run our boot test (source provided below) with images from the latest official release from Intel (2.1) and with the latest meta-intel-edison Yocto layer in Git. Additionally, we extended each of these "stock" images to disable U-Boot's bootdelay feature by patching the u-boot recipe in meta-intel-edison (i.e. setting `bootdelay=-2` in the relevant u-boot configuration). We've found the bootdelay feature to be problematic on other SoC platforms, where noise on a serial line manifests as input that irrevocably pauses the boot sequence.
We observe the same ~95% boot reliability with all four of these images.
> there's no need then for any external circuitry to conduct this test, right?
Correct, our testing was done with a stock Mini-Breakout board. We've also run tests with our custom adapter board, which as slanzise mentioned above, is based on Intel's Mini-Breakout board.
> Could you please provide the service used to test reliability?
Here is the Python source code of our test (wifi_status.py), with the server interaction removed: https://gist.github.com/evnm/fd127a047ddb74042edb
As slanzise described, this script boils down to a loop which posts wifi signal strength and system uptime to our server and reboots the machine after uptime has exceeded five minutes. This script is wired into systemd with the wifi_status.service file included in the above gist. It's worth mentioning that we've observed boot failures with the same symptoms (login prompt accepting input, but hangs after password receipt) when this Python script is not installed, so it doesn't seem to be an issue related to our test itself.
Devices running this service are attached to a wall timer which toggles power every 30 minutes. Thus, we're able to test six soft reboots and one hard reboot per device per hour.
http://git.yoctoproject.org/cgit/cgit.cgi/meta-intel-edison/ (meta-intel-edison, Layer for the Intel Edison Development Platform)
We already have some more information to share with you. We set up a simple environment to test the boot reliability of the Edison. We modified the code by removing everything except the part that obtains the system uptime and checks whether the system has been up for five minutes. If so, the shutdown command is executed. We didn't encounter any issue at reboot, and the board kept running for hours. So we are assuming the issue is related to the other part of your script.
Please let us know if you want us to share the code that was used.
Thanks for looking into this. I'm not clear on your result. How many reboots did the system undergo without a boot failure? With a single Edison rebooting every 5 minutes with a 95% success probability, it can take quite a long time to see a failure. For example, after 60 boots (at least 5 hours, assuming a reboot after 5 minutes of uptime), there's still about a 5% chance you wouldn't have seen a failure.
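The estimate above can be checked with a few lines of Python. This is a sketch that assumes each boot is an independent trial with the same 95% success probability, which is a simplification; the function names are made up for illustration.

```python
import math


def p_no_failure(n_boots, p_success=0.95):
    """Probability that n_boots independent boots all succeed."""
    return p_success ** n_boots


def boots_for_confidence(confidence, p_success=0.95):
    """Smallest number of boots such that at least one failure is seen
    with the given probability, assuming independent boots."""
    return math.ceil(math.log(1.0 - confidence) / math.log(p_success))


# Chance of a clean run of 60 boots at 95% per-boot reliability.
print("P(no failure in 60 boots) = %.3f" % p_no_failure(60))

# Boots needed to see at least one failure with 95% probability.
print("boots needed:", boots_for_confidence(0.95))
```

The same functions show why a test that "ran for hours" without a failure says little unless the number of boots is stated.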
If you can share your code, we will run it here to verify we have the same performance.
Thanks again for your attention.
We have now tried running your code; we made a few small changes so that it would run. In the first test, the boot was unsuccessful 2 out of 20 times. In the second test (still with your code) the boot was unsuccessful 2 out of 10 times. We saw even higher failure rates than you did.
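As a rough way to put these counts in context, the sketch below pools the two runs and asks how likely it would be to see at least that many failures if the true failure rate were the 5% reported earlier in the thread. It assumes boots are independent and that the two runs can be pooled; both are simplifications, and the function name is illustrative.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed from the lower tail."""
    lower = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - lower


failures = 2 + 2   # failures in the first and second test
boots = 20 + 10    # boots in the first and second test

observed_rate = failures / boots
tail = prob_at_least(failures, boots, 0.05)

print("observed failure rate: %.3f" % observed_rate)
print("P(>= %d failures in %d boots at 5%%): %.3f" % (failures, boots, tail))
```

With only 30 boots the sample is small, so the observed rate alone does not show whether the underlying failure rate really differs from 5%.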
After this, we modified the code and left only the necessary parts for it to reboot every 5 minutes. We didn't find any issue using this code. You can find the script and the service attached.