Here is an interesting situation that we've seen pop up with about 5 devices -- all of these failures occurred in the last several weeks (since late January 2016).
I wondered if anyone had experienced something similar. The most obvious solution is to run the firmware update from DFU mode, which I'm assuming should work to recover these units.
But the main question: has anyone had these units stop working completely at random, and then hang at this place on boot? If so, any known causes?
These units had uptimes ranging from 38 days all the way to 170 days. We are running a mix of software: Python, Java, and Node.js. I'm sure there could be memory leaks that could lead to a crash, but I don't see how that would corrupt the image.
I also wondered if there was a recent time adjustment (1 second?) that might affect Yocto in some harmful way. But these failures didn't happen at the same moment. Or perhaps this image version had bugs that have since been fixed, and it's reasonable to think this has already been solved?
Below is the console output. Obviously it won't boot into Yocto and is stuck until I restore the firmware ... thanks for the help!
microkernel built 11:24:08 Feb 5 2015
******* PSH loader *******
PCM page cache size = 192 KB
Cache Constraint = 0 Pages
Arming IPC driver ..
Adding page store pool ..
PagestoreAddr(IMR Start Address) = 0x04899000
pageStoreSize(IMR Size) = 0x00080000
*** Ready to receive application ***
U-Boot 2014.04 (Apr 29 2015 - 03:53:19)
DRAM: 980.6 MiB
MMC: tangier_sdhci: 0
Hit any key to stop autoboot: 0
Partitioning already done...
Flashing already done...
GADGET DRIVER: usb_dnl_dfu
5383904 bytes read in 133 ms (38.6 MiB/s)
Valid Boot Flag
Setup Size = 0x00003c00
Magic signature found
Using boot protocol version 2.0c
Linux kernel version 3.10.17-poky-edison+ (sys_dswci@tlsndgbuild004) # 1 SMP PREEMPT Wed Apr 29 03:54:01 CEST 2015
Building boot_params at 0x00090000
Loading bzImage at address 00100000 (5368544 bytes)
Magic signature found
Kernel command line: "root=PARTUUID=012b3303-34ac-284d-99b4-34e03a2335f4 rootfstype=ext4 console=ttyMFD2 earlyprintk=ttyMFD2,keep loglevel=4 systemd.unit=multi-user.target hardware_id=00 g_multi.iSerialNumber=fbfb587a17211b3cc3312b0c682ba577"
Starting kernel ...
[ 0.760318] pca953x 1-0020: failed reading register
[ 0.765532] pca953x 1-0021: failed reading register
[ 0.770630] pca953x 1-0022: failed reading register
[ 0.775747] pca953x 1-0023: failed reading register
[ 1.614815] snd_soc_sst_platform: Enter:sst_soc_probe
[ 2.018010] pmic_ccsm pmic_ccsm: Error reading battery profile from battid frmwrk
[ 2.026390] pmic_ccsm pmic_ccsm: Battery Over heat exception
[ 2.026475] pmic_ccsm pmic_ccsm: Battery0 temperature inside boundary
[ 2.040372] pmic_ccsm pmic_ccsm: Battery temperature zone changed
[ 2.046552] pmic_ccsm pmic_ccsm: Battery0 temperature inside boundary
You said that there are 5 boards with the same problem; do all the boards have the same configuration?
I would like to know more about the environment and configuration you are using.
1. How are you using the board? What kind of project do you have?
2. How are you powering the board?
3. Have you detected power problems like current peaks or others?
4. Which image are you using on the board? (run configure_edison --version)
5. Which expansion board are you using?
6. Is that the full log the board sends?
7. Are the boards getting hot?
Please let us know as much as you can, including any information you think could help us identify the source of the problem.
Sure thing -- here is some additional info:
1 - We have about a dozen of these units running, testing some sensors and just saving the data. Each is connected to the breakout board, then to the SparkFun I2C board, then to a simple custom board with an ADC chip and some LEDs, and a Li-ion battery.
2 - Powered through USB on the breakout board, which is left plugged in 24/7 but also has a Li-ion battery attached for power outages.
3 - I haven't tested the current draw over any real length of time, other than sporadic DMM tests, which tend to read max around ~370mA, avg ~110mA (using wifi)
4 - I believe we used this file to update the images that are loaded on these boards: edison-image-ww18-15, however it's very possible we may have been using edison-image-ww05-15. When I run configure_edison --version on a unit that was flashed around the same time, it yields "146"
5 - see above, just the Intel breakout board
6 - yes, that's the full log through the FTDI port and then it hangs
7 - I haven't checked the board temperature. I could try to test at this point, however it would obviously be after the error / failure occurred.
Thanks for any additional info you might have!
I suggest you check the configuration you are using with the Edison (Edison + Breakout Board + SparkFun Board + Custom Board + Battery).
How are you using the battery? Are you manually controlling the functionality of the battery between work-mode and charge-mode?
There could be a problem in the way you are powering the board that may have caused the behavior you mentioned.
Also, you said that you are running a mix of software (Python, Java and Node.js); are you working on or editing services on the board?
Are you just reading analog sensors with your board?
One way to check whether the problem lies in the hardware configuration or in the mix of software that is running is:
Flash two modules with the same image, one with all the code you are using and the other one without it. Both modules with the same hardware configuration. Have you tried this?
I would like to know if you were able to recover the functionality of the board with the flashall script.
Hi Pablo and Carlos --
Although I haven't been able to test any more, I can try to answer a few of the questions you raised --
You asked: "How are you using the battery? Are you manually controlling the functionality of the battery between work-mode and charge-mode?"
> We are just attaching a 2600 mAh battery to J2 on the breakout board. I have the therm (J1) jumpered. I'm confused as to your comment about changing modes, as the onboard TI BQ24074 handles the charge cycles automatically. Via what means could we control the behavior of the BQ24074?
You mentioned: "There could be a problem in the way you are powering the board that may have caused the behavior you mentioned."
> May I ask what you saw in the logs that points you in this direction?
Your question: "Also, you said that you are running a mix of software, python, java and nodejs; are you working or editing services in the board?"
> We are running 1 service, but all that service does is fire a script initially upon boot to load our software.
Your question: "Are you just reading analog sensors with your board?"
> Yes, we are using the SparkFun I2C block to connect to an ADC chip. We are also driving two LEDs, each connected to a different GPIO. Otherwise, everything else is software and is using wifi/bluetooth.
In terms of running a test, and using a version without software, it isn't really feasible in this scenario, as some of these issues popped up after being online for ~170 days. Furthermore, only some of these devices (about 4 of them, of about 20 devices) have had this issue.
I will attempt to run the flashall script in the near future. In the meantime, I was just hoping someone else may have seen an issue like this and found a reason.
There are battery blocks, such as the SparkFun Blocks for Intel® Edison - Battery Block (https://learn.sparkfun.com/tutorials/sparkfun-blocks-for-intel-edison---battery-block), that allow controlling the power with a switch, so I wanted to know if you were using something like this.
I mentioned that there could be a problem with how you were powering the board because the problem is the same on different boards and the scripts they run have not changed. If the code and services are not changing, and the scripts are just reading analog values and worked for ~170 days, the problem shouldn't be related to the scripts; if there were a problem with them, the issue should have appeared earlier.
One possibility is that the board lost power for a time and, when it booted again and loaded the script and service, a conflict occurred and they requested resources that the module didn't have available at that moment. In that case, the service and scripts you were running could have corrupted the image.
Since you don't have log files from the moment the problem happened, I think we can only guess at its source. If you want, you can create another script that sends information about the service, power, script, board, etc. to a server, so that if another board fails you can see what the board was running, the battery level, the state of the pins, and other data that may help to find the root of the problem.
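For what it's worth, here is a rough sketch of what such a reporting script might look like. It is only an illustration, not something tested on an Edison: it assumes Python 3 is available on the board (urllib.request does not exist under that name in Python 2), and the device ID and server URL are hypothetical placeholders. It only reads uptime and memory from /proc. Battery level and pin state are left out because where to read them depends on the image and the hardware.

```python
import json
import time
import urllib.request


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def read_text(path):
    """Return the file contents, or None if the file cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


def build_snapshot(uptime_text, meminfo_text, device_id, now=None):
    """Assemble one status record from raw /proc contents."""
    mem = parse_meminfo(meminfo_text or "")
    uptime_fields = (uptime_text or "").split()
    return {
        "device_id": device_id,  # hypothetical label chosen per board
        "timestamp": time.time() if now is None else now,
        "uptime_s": float(uptime_fields[0]) if uptime_fields else None,
        "mem_free_kb": mem.get("MemFree"),
        "mem_total_kb": mem.get("MemTotal"),
    }


def send_snapshot(snapshot, url, timeout=10):
    """POST the snapshot as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(snapshot).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def report_once(device_id, url):
    """Collect one snapshot from /proc and send it to the given URL."""
    snapshot = build_snapshot(
        read_text("/proc/uptime"), read_text("/proc/meminfo"), device_id
    )
    return send_snapshot(snapshot, url)
```

You could call report_once("edison-01", "http://your-server.example/report") every few minutes from a systemd timer or a simple loop. If a board then stops reporting, the last few records on the server would at least show whether free memory had been shrinking before the failure.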