Kernel panic when launching a kernel with work-group size 512

Altera_Forum · 07-28-2014

Hello!

I'm seeing some undesired behavior (OpenCL 14.0 on CentOS 6.5 with Pico m506).

I have two otherwise identical kernels that differ only in the value of __attribute__((reqd_work_group_size(32,1,1))).

One has work-group size 32, the other 512.

My host program has this:

size_t global_work_size[1] = { NDRANGE };

size_t local_work_size[1] = { WORK_GROUP_SIZE };

status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 1, write_event, &kernel_event);

I have a problem launching the kernel with work-group size 512 when NDRANGE is large. It works with NDRANGE = 512, but with NDRANGE = 100000 I get a kernel panic (listed below).

I don't have this problem with the kernel with work-group size 32 and NDRANGE = 40 000 000.
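
One thing worth checking with these numbers (not confirmed anywhere in this thread as the cause of the panic): in OpenCL 1.x the global work size must be evenly divisible by the local work size, otherwise clEnqueueNDRangeKernel is specified to return CL_INVALID_WORK_GROUP_SIZE. 512 and 40 000 000 are multiples of their work-group sizes, but 100000 is not a multiple of 512. Below is a minimal sketch of a host-side check; the helper names is_valid_ndrange and round_up_global_size are hypothetical and are not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 when `global` work-items divide evenly
   into work-groups of `local` work-items, 0 otherwise. */
static int is_valid_ndrange(size_t global, size_t local)
{
    return local > 0 && global % local == 0;
}

/* Hypothetical helper: rounds `global` up to the next multiple of `local`,
   so the padded value can be passed as global_work_size. */
static size_t round_up_global_size(size_t global, size_t local)
{
    assert(local > 0);
    size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}
```

If the global size is padded this way, the kernel also needs a guard such as if (get_global_id(0) >= n) return; (with n passed as an extra kernel argument) so the padding work-items do not touch memory outside the buffers. It is also worth checking the status returned by clEnqueueNDRangeKernel before waiting on kernel_event.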

Crash report:

<4>

<4>Pid: 3987, comm: aclkmdq Tainted: P --------------- 2.6.32-431.20.3.el6.x86_64 #1 ASUS All Series/Z87-PLUS

<4>RIP: 0010:[<ffffffff8152a4f3>] [<ffffffff8152a4f3>] down_write+0x23/0x40

<4>RSP: 0018:ffff8807ad497d40 EFLAGS: 00010246

<4>RAX: 0000000000000068 RBX: 0000000000000068 RCX: 00000000000ff940

<4>RDX: ffffffff00000001 RSI: ffff8807f48aa000 RDI: 0000000000000068

<4>RBP: ffff8807ad497d50 R08: 0000000000000000 R09: 00000000ffffffff

<4>aclpci_close (185):

<7>aclpci = ffff88080d87a000, pid = 3978, dma_idle = 0

<4>R10: 000000a47dceaa6c R11: 0000000000000001 R12: 0000000000000100

<4>R13: ffff8807f48aa000 R14: ffff8807ad497fd8 R15: ffffe8ffffc0f048

<4>FS: 0000000000000000(0000) GS:ffff880028340000(0000) knlGS:0000000000000000

<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b

<4>CR2: 0000000000000068 CR3: 00000007e882a000 CR4: 00000000001407e0

<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

<4>Process aclkmdq (pid: 3987, threadinfo ffff8807ad496000, task ffff8807f1a22040)

<4>Stack:

<4> 0000000000000000 ffff88080f773500 ffff8807ad497d80 ffffffffa01458f2

<4><d> 0000000000000000 ffff8807ad4e8030 ffff88080d87a000 00000001000633ad

<4><d> ffff8807ad497db0 ffffffffa01434b9 ffff8807ad497db0 ffff88080d87a000

<4>Call Trace:

<4> [<ffffffffa01458f2>] aclpci_release_user_pages+0x32/0x70 [aclpci_drv]

<4> [<ffffffffa01434b9>] unlock_dma_buffer+0x39/0x80 [aclpci_drv]

<4> [<ffffffffa01441a0>] ? wq_func_dma_update+0x0/0x20 [aclpci_drv]

<4> [<ffffffffa0143bf3>] aclpci_dma_update+0xe3/0x690 [aclpci_drv]

<4> [<ffffffffa01441a0>] ? wq_func_dma_update+0x0/0x20 [aclpci_drv]

<4> [<ffffffffa01441b7>] wq_func_dma_update+0x17/0x20 [aclpci_drv]

<4> [<ffffffff81094a20>] worker_thread+0x170/0x2a0

<4> [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40

<4> [<ffffffff810948b0>] ? worker_thread+0x0/0x2a0

<4> [<ffffffff8109abf6>] kthread+0x96/0xa0

<4> [<ffffffff8100c20a>] child_rip+0xa/0x20

<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0

<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

<4>Code: c3 e8 62 77 b4 ff 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb e8 0a ed ff ff 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 48 85 d2 74 05 e8 fe 4c d6 ff 48 83 c4 08 5b c9

<1>RIP [<ffffffff8152a4f3>] down_write+0x23/0x40

<4> RSP <ffff8807ad497d40>

<4>CR2: 0000000000000068

Altera_Forum · 08-11-2014

The Linux kernel panic happens during a DMA transfer, which the host CPU performs with the help of the Altera OpenCL Linux kernel driver (aclpci_drv). My guess is that you're trying to DMA more memory than you actually allocated. Double-check the sizes passed to clCreateBuffer against those passed to clEnqueueReadBuffer or clEnqueueWriteBuffer. I suspect that clEnqueueReadBuffer or clEnqueueWriteBuffer got a larger size argument than clCreateBuffer did.
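
The size comparison suggested above could be sketched roughly as follows. This is only an illustration: transfer_fits and require_transfer_fits are hypothetical helper names, and buffer_size is assumed to be the value originally passed to clCreateBuffer (it can also be queried with clGetMemObjectInfo and CL_MEM_SIZE).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 when the byte range [offset, offset + size)
   lies inside a buffer of `buffer_size` bytes, 0 otherwise.
   The subtraction form avoids size_t overflow in offset + size. */
static int transfer_fits(size_t buffer_size, size_t offset, size_t size)
{
    if (offset > buffer_size)
        return 0;
    return size <= buffer_size - offset;
}

/* Hypothetical debug-build guard: abort before enqueueing a read or write
   whose range would run past the end of the buffer. */
static void require_transfer_fits(size_t buffer_size, size_t offset, size_t size)
{
    assert(transfer_fits(buffer_size, offset, size));
}
```

Calling require_transfer_fits with the same offset and size that are about to be passed to clEnqueueReadBuffer or clEnqueueWriteBuffer would catch a mismatch on the host side, before the request reaches the driver.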

Altera_Forum · 11-20-2014

No, all sizes are equal.

Altera_Forum · 07-13-2015

I am also stuck in the same situation. Have you found a solution?