JPile
New Contributor I
1,486 Views

wait_for_completion never ends: SPI bug in Edison Linux kernel?

Hi,

I am using an Intel Edison with an MCP2515 CAN controller.

 

(my fork: https://github.com/jpilet/edison-linux/tree/edison-3.10.17-anemobox-mcp2515)

I configured the mcp251x driver to use the Edison SPI CS0. It works: I can run "candump" on the can0 interface for a while, and read from and write to the bus correctly.

However, after a few hours, the CAN interface stops working and dmesg reports:

[ 5759.821413] INFO: task irq/304-mcp251x:521 blocked for more than 120 seconds.
[ 5759.821504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5759.821579] irq/304-mcp251x D f5757a60 6544 521 2 0x00000000
[ 5759.821623] f579bd64 00000046 c18673a6 f5757a60 f579bd20 c1c00000 f5cd8000 a2b704e8
[ 5759.821681] 0000043e f5757a60 f73fdc00 f5751bd0 f5757a60 f64eeb44 f6c0e000 f579bd2c
[ 5759.821735] c127289f f6438900 f579bd34 c125ae1e f579bd50 c125c593 f6438900 00000007
[ 5759.821790] Call Trace:
[ 5759.821833] [] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5759.821866] [] ? wake_up_process+0x1f/0x40
[ 5759.821893] [] ? wake_up_worker+0x1e/0x20
[ 5759.821919] [] ? insert_work+0x53/0x90
[ 5759.821947] [] ? _raw_spin_unlock+0x17/0x40
[ 5759.821973] [] ? __queue_work+0x10f/0x340
[ 5759.822000] [] schedule+0x23/0x60
[ 5759.822027] [] schedule_timeout+0x165/0x2a0
[ 5759.822060] [] ? get_parent_ip+0xb/0x40
[ 5759.822086] [] ? sub_preempt_count+0x95/0xf0
[ 5759.822112] [] ? get_parent_ip+0xb/0x40
[ 5759.822140] [] wait_for_completion+0xab/0xe0
[ 5759.822165] [] ? wake_up_state+0x20/0x20
[ 5759.822194] [] __spi_sync+0x68/0xb0
[ 5759.822218] [] ? panic+0xfe/0x178
[ 5759.822247] [] spi_sync+0xf/0x20
[ 5759.822279] [] mcp251x_spi_trans+0x94/0xc0 [mcp251x]
[ 5759.822307] [] ? kmem_cache_alloc+0xd4/0x1b0
[ 5759.822338] [] ? spi_sync_locked+0x20/0x20
[ 5759.822369] [] mcp251x_hw_rx+0x72/0x290 [mcp251x]
[ 5759.822402] [] mcp251x_can_ist+0x292/0x3b0 [mcp251x]
[ 5759.822429] [] ? _raw_spin_unlock_irq+0x1d/0x40
[ 5759.822459] [] irq_thread_fn+0x18/0x30
[ 5759.822486] [] irq_thread+0x110/0x140
[ 5759.822512] [] ? irq_finalize_oneshot.part.29+0xb0...
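Since the hang only shows up after a few hours, it may help to scan the dmesg output for hung-task reports automatically instead of watching it by hand. Below is a minimal Python sketch, assuming the message format shown in the log above; the name `find_hung_tasks` is hypothetical and not part of any existing tool.

```python
import re

# Matches the hung-task line format seen in the dmesg output above, e.g.
# "INFO: task irq/304-mcp251x:521 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(dmesg_text):
    """Return a list of (task name, pid, seconds) for each hung-task report."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(dmesg_text)
    ]
```

The text can come from any source, for example the captured output of `dmesg`; an empty list means no hung-task report was found.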

1 Solution
JPile
New Contributor I
259 Views

Hi,

I finally managed to fix the bug.

The fixed kernel is here:

https://github.com/jpilet/linux-anemobox/tree/jp-anemobox-dma-merged (jpilet/linux-anemobox at jp-anemobox-dma-merged: Linux sources adapted to Anemomind's anemobox)

I started from

And I fixed 8-bit SPI reads with the following patch:

https://github.com/jpilet/linux-anemobox/commit/a785086b27cf2707e6591731137269f723e60c07 (Fix 8bit spi reads, jpilet/linux-anemobox@a785086)


17 Replies
Sergio_A_Intel
Employee
259 Views

Hi,

We received a case with a dmesg output very similar to yours. It was also about DMA SPI. Are you using a custom-built image?

What code are you running? Are you using an IDE? When you start running the code, it creates a process. With the CAN chip connected to the SPI bus, can you see that process's memory usage increasing?

Sergio

JPile
New Contributor I
259 Views

Hi Sergio,

Yes, I'm using a custom kernel. The sources are here: https://github.com/jpilet/edison-linux/tree/edison-3.10.17-anemobox-mcp2515

Basically, the base kernel is the one from github/01org/edison-linux, branch edison-3.10.17, with my own patch to enable the mcp251x driver, and patches for J1939 support (that's a protocol over CAN, it should not influence SPI at all).

I am not using any IDE. A userland program that exhibits the issue is simply "candump" from can-utils. I did not notice any memory leak, but I did not check. I'll tell you if I notice anything strange regarding memory consumption.

I have never been able to successfully enable DMA SPI to communicate with the mcp2515 device, so I keep DMA off.

Julien

Sergio_A_Intel
Employee
259 Views

Please let us know if you notice anything that looks strange with memory consumption. Other users have reported having issues with memory leaks using SPI so your issue might be related. Keep us updated of your progress.

Sergio

JPile
New Contributor I
259 Views

I'm running a test right now, and the process allocated 4k of additional RAM in 3 hours. I do not think it is a leak.

I'll let the process run during the night and I'll update the thread if I notice anything strange when running longer.
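For anyone who wants to repeat this kind of check, here is a minimal Python sketch of one way to sample a process's resident memory over time. It assumes a Linux `/proc/<pid>/status` file with a `VmRSS:` line; the function names are hypothetical and this is not the method used for the measurement above.

```python
import re


def parse_vm_rss_kb(status_text):
    """Extract VmRSS in kB from the text of /proc/<pid>/status.

    Returns None when there is no VmRSS line (kernel threads have none).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_vm_rss_kb(pid):
    """Read the current resident set size of a running process, in kB."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_rss_kb(status_file.read())


def rss_growth_kb(samples):
    """Growth between the first and last sample; 0 for an empty list."""
    return samples[-1] - samples[0] if samples else 0
```

Calling `read_vm_rss_kb` on the candump PID every few minutes and passing the collected values to `rss_growth_kb` gives a rough growth figure. A few kB over several hours, as reported here, would not normally indicate a leak.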

 

When the bug occurs, SPI communication stops working entirely. A full reboot is the only cure; rmmod + insmod is not enough. The bug is in the kernel, not in userspace.

JPile
New Contributor I
259 Views

I confirm that there is no memory leak: After 8 hours, the memory consumption is still the same.

Sergio_A_Intel
Employee
259 Views

There is a known issue with the current release. The SPI on the latest image has a bug that has been reported before by other users. Take a look at for example. We are working on getting this fixed.

Can you try to run the same scenario using the previous image?

https://downloadcenter.intel.com/downloads/eula/24910/Intel-Edison-Software-Release-2-1?httpDown=htt... (Intel Edison Software Release 2.1). Download the file edison-image-ww18-15

Let us know your results.

Sergio

JPile
New Contributor I
259 Views

Thanks for the link. It looks like it is related.

To test with another kernel version, I have to recompile it with support for mcp251x and can networking. How can I compile the kernel of release 2.1?

JPile
New Contributor I
259 Views

I have a more precise idea about where the bug is: in spi-pxa2xx.c.

 

I noticed that after the bug, the CPU usage is quite high. I profiled the kernel using perf, and I got the following results:

# Overhead  Command       Shared Object      Symbol
# ........  ............  .................  ..................................
#
    48.77%  kworker/u4:2  [kernel.kallsyms]  [k] u8_writer
            |
            --- u8_writer
               |
               |--99.01%-- poll_writer
               |          process_one_work
               |          worker_thread
               |          kthread
               |          ret_from_kernel_thread
               |
                --0.99%-- process_one_work
                          worker_thread
                          kthread
                          ret_from_kernel_thread

    47.15%  kworker/u4:1  [kernel.kallsyms]  [k] u8_reader
            |
            --- u8_reader
               |
               |--96.54%-- handle_message
               |          pump_messages
               |          process_one_work
               |          worker_thread
               |          kthread
               |          ret_from_kernel_thread
               |
                --3.46%-- pump_messages
                          process_one_work
                          worker_thread
                          kthread
                          ret_from_kernel_thread

     2.61%  kworker/u4:1  [kernel.kallsyms]  [k] handle_message

 

 

u8_writer and u8_reader are part of spi-pxa2xx (https://github.com/01org/edison-linux/blob/8cd9234c64c584432f6992fe944ca9e46ca8ea76/drivers/spi/spi-... L338)

It seems that one thread is polling, trying to write, while the other is polling, trying to read, both without success.
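To illustrate why an unbounded poll loop shows up as near-100% CPU in a profile like the one above, here is a small userspace Python model of polling with a deadline. It is only a sketch of the general technique, not the driver's code: `poll_until` and its parameters are hypothetical names, and how u8_reader and u8_writer actually poll is not shown in this thread.

```python
import time


def poll_until(condition, timeout_s, interval_s=0.001,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns True or timeout_s elapses.

    Returns True if the condition was met and False on timeout, so the
    caller can report an error instead of spinning forever.
    """
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With a deadline, a status flag that never changes produces a reportable timeout rather than two worker threads spinning indefinitely.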

Do you have any idea of what could put the SPI logic in such a locked state?

I updated my GitHub issue in the hope of reaching one of the developers of spi-pxa2xx.c. Is it the right place?

 

cchao6
New Contributor I
259 Views

Hi JulienP,

I have run into the same problem as you.

My SPI driver blocks after running for a few hours.

Do you know how to fix it now?

PS:

Do you use a workqueue in your device driver?

Regards

KEckh
Valued Contributor III
259 Views

I ran into some of the same problems with that build when it was released, and mentioned it in a thread or two.

I have debugged similar problems in other OSes but have never done Linux kernel debugging, so I did not get very far, except to notice the same things mentioned in JulienP's response.

At the time, my assumption was that it was something like:

a) Classic Semaphore/Lock issue, where two threads both want the same resources and ask for them in different order. Thread 1 asks for A and then B, and Thread 2 asks for B and then A. And you get the case where 1 has A and 2 has B and both ask for the other one and hang... But I assumed at the time this was not likely as I would assume in this case both threads would leave the ready queue and simply sleep...

b) simple consumer/provider code, that uses something like shared memory to communicate and the access to the memory is not properly serialized. Again on some other processors, I have run into code that does something like: Consumer: and the Server: ... Problem was in some cases where the Server signals the response before the Consumer does the and so the Consumer lost the response and spins forever (or hopefully until some timeout...)...

The response I was seeing looked more like some form of b) but again I have not done much diving into Linux Kernel code, so I have punted for now...
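Case b) above is the classic lost-wakeup problem. A common way to avoid it is to record the signal in a counter, so that a signal sent before the waiter starts waiting is not lost; the Linux kernel's completion objects work along these lines. The Python sketch below is a userspace model of that idea, with a hypothetical `Completion` class. It is an illustration only and makes no claim about what was actually wrong in the Edison SPI driver.

```python
import threading


class Completion:
    """Userspace model of a completion: a 'done' counter plus a condition."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = 0

    def complete(self):
        """Signal one waiter. The counter remembers early signals."""
        with self._cond:
            self._done += 1
            self._cond.notify()

    def wait(self, timeout=None):
        """Block until signaled. Returns False if the timeout expires."""
        with self._cond:
            signaled = self._cond.wait_for(lambda: self._done > 0, timeout)
            if signaled:
                self._done -= 1  # consume exactly one signal
            return signaled
```

Because the counter is checked under the same lock that protects the signal, a `complete()` that arrives before `wait()` is still seen. Passing a timeout to `wait()` also means a signal that never arrives results in an error return rather than a task blocked forever.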

Hopefully they will figure it out soon, but...

JPile
New Contributor I
259 Views

cer1991: yes, the mcp251x driver in 3.10 kernel uses workqueues.

JPile
New Contributor I
259 Views

My mistake: the problem is most likely not in pxa2xx.c, but in drivers/spi/intel_mid_ssp_spi.c (https://github.com/01org/edison-linux/blob/edison-3.10.17/drivers/spi/intel_mid_ssp_spi.c)

Sergio_A_Intel
Employee
259 Views

Thank you for the feedback, we'll pass it to the team that's working on fixing the SPI bug.

Sergio

JPile
New Contributor I
259 Views

Thank you Sergio.

Who is that team? Is there a way to follow the progress of this issue?

Sergio_A_Intel
Employee
259 Views

We can't disclose who is working on the fix; that information is internal. The team is aware that there is a bug in SPI and is working on it. There is no estimated release date for the software fix.

Sergio

RYoun8
Beginner
259 Views

Thanks for your great work. I am very interested in testing this out. I have cloned the repo but this setup is a little different than rebuilding from the normal linux sources... Do you have any instructions on how to set up the configuration and deploy it to the Edison?
