CSkeg
Novice
1,443 Views

SPI bug (program freezes and becomes unkillable) during Edison startup

TL;DR: I ran into an SPI issue on the Edison: my program hangs and becomes unkillable if it starts early enough in the boot process. The workaround is to force it to start later.

I'm using libmraa through its python wrapper. My code looks, very approximately (these are the relevant snippets of the larger project), like this:

# to init SPI
# fix power-on SPI glitches
os.system("echo on >/sys/devices/pci0000\:00/0000\:00\:07.1/power/control")
spi = mraa.Spi(0)
spi.mode(mraa.SPI_MODE3)
spi.frequency(8000000)

# to write SPI, which occurs about once every 10 milliseconds
byte_array = bytearray(396)
# ... populate byte array ...
spi.write(byte_array)
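For reference, the "echo on" step above can also be done without spawning a shell. The following is a minimal sketch, assuming the same sysfs path as in the snippet; the function name force_runtime_pm_on is made up for illustration, and whether this step has any bearing on the hang is not known.

```python
# Sysfs path copied from the snippet above; it is specific to this board.
SPI_POWER_CONTROL = "/sys/devices/pci0000:00/0000:00:07.1/power/control"


def force_runtime_pm_on(path=SPI_POWER_CONTROL):
    """Write "on" to a sysfs power/control file.

    Returns True if the write succeeded and False otherwise, so the caller
    can log a failure instead of silently ignoring it as os.system() does.
    (Hypothetical helper, not part of mraa.)
    """
    try:
        with open(path, "w") as f:
            f.write("on\n")
        return True
    except (IOError, OSError):
        return False


if __name__ == "__main__":
    print("runtime PM forced on: %s" % force_runtime_pm_on())
```

Unlike the os.system() call, this reports whether the write actually happened, which may help rule the power-control step in or out when debugging.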

I manage this code with (more or less) the following systemd unit:

[Unit]
Description=CFRS Main Application
Requires=bluetooth.target bluetooth.service pulseaudio.service
After=bluetooth.target bluetooth.service

[Service]
ExecStart=/usr/bin/python2 /home/root/cfrs/main.py
Restart=always

[Install]
WantedBy=multi-user.target

The program immediately becomes unresponsive to other devices attempting to contact it over its serial interface, which runs in a different thread from the SPI handler. When I attempt to run "systemctl stop" on this unit, "systemctl stop" waits indefinitely, and I have to kill it (^C). At this point, running "systemctl status" on this unit displays a result similar to the following:

root@cfrs-edison-alpha:~# systemctl status cfrs
==> cfrs.service - CFRS Main Application
   Loaded: loaded (/usr/lib/systemd/system/cfrs.service; enabled)
   Active: deactivating (stop-sigterm) since Fri 2016-07-01 21:37:42 UTC; 1min 3s ago
 Main PID: 204 (python2)
   CGroup: /system.slice/cfrs.service
           ==> 204 [python2]

This means that systemd is attempting to terminate the process, but the process is not exiting. I can then try "killall -9 python2", but the program stays there:

root@cfrs-edison-alpha:~# ps | grep python

204 root 0 Z [python2]

After a while, the kernel prints out the following on the serial console:

[ 240.630970] INFO: task kworker/u4:2:74 blocked for more than 120 seconds.
[ 240.631063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.632216] INFO: task python2:290 blocked for more than 120 seconds.
[ 240.632277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
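To see which thread is actually stuck, the per-thread state can be read from /proc. The sketch below is an illustration only: the helper names are made up, and it assumes a Linux /proc layout. A "Z" state means the task has exited but has not been reaped, while "D" means uninterruptible sleep, which is the state the hung-task messages above report on. One possible (unconfirmed) reading of the output above is that the main thread is a zombie while another thread is blocked in "D" inside the kernel.

```python
import os


def parse_proc_state(status_text):
    """Return the one-letter state code from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            if len(fields) > 1:
                return fields[1]
    return None


def thread_states(pid):
    """Map each thread ID of `pid` to its state letter (empty dict if unavailable)."""
    states = {}
    task_dir = "/proc/%d/task" % pid
    try:
        tids = os.listdir(task_dir)
    except OSError:
        return states
    for tid in tids:
        try:
            with open(os.path.join(task_dir, tid, "status")) as f:
                states[int(tid)] = parse_proc_state(f.read())
        except (IOError, OSError):
            pass  # the thread went away while we were looking
    return states


if __name__ == "__main__":
    # Inspect this interpreter's own threads as a harmless demonstration.
    print(thread_states(os.getpid()))
```

Running thread_states(204) on the affected system (204 being the Main PID shown by systemctl) would list every thread of the stuck process along with its state.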

Here is the complete log showing this issue: http://pastebin.com/Wv3cd28Q - it includes the startup messages, the dmesg output from the errors that occur, and some further attempts at poking the process to understand why it isn't working. The program never appears to recover, so a hard reboot is necessary. A soft reboot would also work, except that it waits a couple of minutes trying to kill the program first.

I resolved this with the workaround of delaying the program's start until later. I changed the After line of the systemd unit to the following:

After=bluetooth.target bluetooth.service multi-user.target

This works around the issue, because the program will wait until the rest of the system has started up, at which point the issue does not appear to occur.

I'm using the release iot-devkit-prof-dev-image-edison-20160606.zip, which contains kernel 3.10.98-poky-edison+. That seems recent enough that it should include any recent SPI fixes.

Is there any way to actually resolve this problem? I would rather have the program start up as soon as possible, rather than waiting a bunch of extra time.
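One possible middle ground, sketched below, would be to let the process start early (so the serial thread is available) and defer only the SPI setup until the system has settled. This is an assumption, not a confirmed fix: the only thing established above is that starting after multi-user.target avoided the hang. The uptime threshold is a guessed stand-in for "late enough", and wait_until, uptime_seconds and system_settled are hypothetical helper names.

```python
import time


def wait_until(predicate, timeout, interval=0.1):
    """Poll predicate() until it returns true or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.time() + timeout
    while True:
        if predicate():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)


def uptime_seconds(path="/proc/uptime"):
    """Read the system uptime in seconds from /proc/uptime (Linux only)."""
    with open(path) as f:
        return float(f.read().split()[0])


def system_settled(min_uptime=30.0):
    """GUESSED heuristic: treat the system as settled after `min_uptime` seconds."""
    try:
        return uptime_seconds() >= min_uptime
    except (IOError, OSError, ValueError, IndexError):
        return False


# Intended wiring (not executed here; the serial part is application code):
#   start the serial thread first, then
#   if wait_until(system_settled, timeout=120):
#       spi = mraa.Spi(0)
#       spi.mode(mraa.SPI_MODE3)
#       spi.frequency(8000000)
```

The threshold would need to be tuned on the device, and a more precise readiness condition would be preferable if the real trigger of the hang is ever identified.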

6 Replies
idata
Community Manager
71 Views

Hi celskeggs,

Is it possible to have your original code to check this behavior? We would like to run some tests on your issue to see what the problem is exactly. Any other detail that you could provide would be very helpful.

Regards,
-Pablo
CSkeg
Novice
71 Views

Hi!

Unfortunately, my original code is large and not something I'm allowed to share. I've put together a short example that demonstrates the issue:

import os
import threading
import time

import mraa

LIGHT_STRIP_LENGTH = 96
FRAME_PERIOD = (10.0 / 1000)  # 100 Hz


class LightExample:
    def __init__(self):
        # fix power-on SPI glitches
        os.system("echo on >/sys/devices/pci0000\:00/0000\:00\:07.1/power/control")
        self.spi = mraa.Spi(0)
        self.spi.mode(mraa.SPI_MODE3)
        self.spi.frequency(8000000)
        threading.Thread(target=self._loop).start()

    def _loop(self):
        while True:
            self.write_update([(0.9, 255, 0, 0)] * LIGHT_STRIP_LENGTH)
            time.sleep(FRAME_PERIOD)

    def write_words(self, words):
        ba = bytearray(4 * len(words))
        for i, word in enumerate(words):
            ba[i * 4:i * 4 + 4] = ((word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, (word >> 0) & 0xFF)
        self.spi.write(ba)

    def write_update(self, colors):
        words = [0x00000000]
        for bright, r, g, b in colors:
            assert 0.0 <= bright <= 1.0 and 0 <= r < 256 and 0 <= g < 256 and 0 <= b < 256
            words += [(0b111 << 29 | int(bright * 31) << 24 | b << 16 | g << 8 | r << 0)]
        words += [0xFFFFFFFF] * int((len(colors) + 63) / 64)
        self.write_words(words)


if __name__ == "__main__":
    e = LightExample()
    time.sleep(60)

Put that in /home/root/example.py.

[Unit]
Description=Example Application
Requires=bluetooth.target bluetooth.service pulseaudio.service
After=bluetooth.target bluetooth.service

[Service]
ExecStart=/usr/bin/python2 /home/root/example.py
Restart=always

[Install]
WantedBy=multi-user.target

Put that in /usr/lib/systemd/system/example.service. (You may need to create a folder first.) Enable the service with "systemctl enable example".

Reboot the system, and try stopping the service with "systemctl stop example".
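As a cross-check that does not need any hardware, the frame that example.py sends can be built as a pure function. This is a sketch of the same packing logic using struct; build_frame is a name made up here. For 96 LEDs it yields one start word, 96 LED words and two end words, i.e. 99 words or 396 bytes, which matches the bytearray(396) in the first post.

```python
import struct


def build_frame(colors):
    """Pack (brightness, r, g, b) tuples into the byte frame written over SPI.

    Layout mirrors write_update()/write_words() in example.py: a zero start
    word, one big-endian 32-bit word per LED, then one 0xFFFFFFFF end word
    per 64 LEDs (rounded up).
    """
    words = [0x00000000]
    for bright, r, g, b in colors:
        assert 0.0 <= bright <= 1.0 and 0 <= r < 256 and 0 <= g < 256 and 0 <= b < 256
        words.append(0b111 << 29 | int(bright * 31) << 24 | b << 16 | g << 8 | r)
    words += [0xFFFFFFFF] * ((len(colors) + 63) // 64)
    return struct.pack(">%dI" % len(words), *words)


if __name__ == "__main__":
    frame = build_frame([(0.9, 255, 0, 0)] * 96)
    print(len(frame))
```

Separating the packing from the SPI write also makes it possible to test the frame contents on a development machine where mraa is not installed.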

idata
Community Manager
71 Views

Hi celskeggs,

We are still investigating your case, and we'll let you know once we have some updates. Thank you for your patience.

Regards,
-Pablo
idata
Community Manager
71 Views

Hi Celskeggs,

We haven't been able to reproduce the issue: the system service is not able to start, as it enters a fail state. We followed the steps exactly as you described them here, but with no success. Is there any other detail that you can provide, such as your image version, Python version, any external hardware connected, the board used, or any specific external power supply? Thanks in advance.

Regards,
-Pablo
CSkeg
Novice
71 Views

The system service should not enter a fail state, even after being stopped - that's probably a mismatch in your replication of my environment, not the bug being resolved.

As I stated earlier, I'm using the release iot-devkit-prof-dev-image-edison-20160606.zip, which contains kernel 3.10.98-poky-edison+. External hardware included, at various times: unpowered USB hub, USB soundcard, FTDI serial cable, device connected to UART, SPI (APA102) lightstrip connected. I used a 12VDC 1A power supply (model SM-333B). I used python 2.7.3. I removed the following packages from my device: clloader xdk-daemon ofono wyliodrin-server redis mosquitto-dev iotkit-comm-c-dev iotkit-comm-c iotkit-comm-js mosquitto tinyb-dev tinyb connman bluez5-dev bluez5 bluez5-obex (with the opkg remove command) and compiled and installed bluez-5.40 on the device itself. I also installed pyserial 3.1.1. I used the Edison Arduino Breakout board.
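Some of the software details above can be collected by a short script, which may make it easier to compare two environments when trying to reproduce the problem. This is only a sketch: environment_report is a made-up name, it covers the Python, kernel and mraa versions but not the hardware or package changes, and it assumes that the installed mraa Python binding exposes getVersion().

```python
import platform


def environment_report():
    """Collect basic version information for a bug report as a dict."""
    report = {
        "python": platform.python_version(),
        "kernel": platform.release(),
        "machine": platform.machine(),
    }
    try:
        import mraa  # only present on the target board
    except ImportError:
        report["mraa"] = "not installed"
    else:
        # Assumption: the binding provides getVersion(); fall back if it does not.
        get_version = getattr(mraa, "getVersion", None)
        report["mraa"] = get_version() if get_version else "unknown"
    return report


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("%s: %s" % (key, value))
```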

My development environment has changed significantly in the past few weeks. I tried to reproduce this issue again on a fresh device, but wasn't easily able to do so - the service stopped successfully, but did not enter a fail state. I've moved on and need to be doing other work and can't spend the time necessary to make the issue occur again.

The issue isn't resolved for us, but the workaround works well enough and we don't have the time to assist with finding the real cause.

 

Thank you for your help!

idata
Community Manager
71 Views

Hi Celskeggs,

It's good to know that you were able to continue with a workaround. Please let us know if you go back to investigate this issue at some point, and we'll be more than glad to help you. Hopefully we'll have better luck next time.

Regards,
-Pablo