- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
encountered some strange bug and prepared the demonstration.
1. Simple kernel:
--------------------------
#pragma OPENCL EXTENSION cl_intel_printf : enable __kernel void glitch( __local uint *scratch ) { uint k,v; k=0; v=0; printf(" START\n"); uint id = (uint)get_global_id(0); if (id==0) { scratch= v; } barrier( CLK_LOCAL_MEM_FENCE ); while ( scratch <2 ) { if (id==0) { scratch ++; } barrier( CLK_LOCAL_MEM_FENCE ); } printf(" FINISH\n"); }
---------------------------
2. Run with natural grid dimensions: single group 16x1x1 , or 256x1x1
and local memory size, for example, 1024
3. what expected:
the items are started (you see 16 lines of START)
the first of them increments the counter in local memory
and all items are finished (you see 16 lines of FINISH)
4. but FINISH repeat forever!
maybe some stack corruption?
When you switch off the printf and remove its pragma,
the kernel simply does not return which is the same.
---------------------------
cases:
Intel CPU with "-cl-opt-disable" runs fine
Nvidia GPU - no problem with or without optimization.
Intel CPU with optimization and workgroup sizes 8 or 1 runs fine. 2, 4, 16, 64, 256 are bad.
(8 may be the special case: the number of threads in my CPU, i7-4790)
Intel OpenCL SDK and CPU driver are re-installed today.
---------------------------
additional demonstration:
I added two #define's (see 1st and 2nd lines of the code)
GLITCH=0 whows how to overcome with nonzero index for local memory array
and initial value of counter from get_global_id(1)
zero index is bad.
initial value of counter = explicitly written 0 is bad.
GLITCH2 shows that the item counter does not go after 8:
FINISH: z=0 y=0 x=0
FINISH: z=0 y=0 x=1
FINISH: z=0 y=0 x=2
FINISH: z=0 y=0 x=3
FINISH: z=0 y=0 x=4
FINISH: z=0 y=0 x=5
FINISH: z=0 y=0 x=6
FINISH: z=0 y=0 x=7
FINISH: z=0 y=0 x=0
FINISH: z=0 y=0 x=1
FINISH: z=0 y=0 x=2
FINISH: z=0 y=0 x=3
FINISH: z=0 y=0 x=4
FINISH: z=0 y=0 x=5
FINISH: z=0 y=0 x=6
FINISH: z=0 y=0 x=7
FINISH: z=0 y=0 x=0
...etc.
but if you don't call printf with x,y,z, then again everything is OK.
-----
Rather strange.
Would like to know if it works for you?
I don't see any rough errors in my code...
Regards, Petr
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
Yes, I agree that there should be get_local_id(0). This demo is made to run in single group. Now replaced with get_local_id(0), nothing changed.
Global and local sizes are 16, full demonstration is attached to first message (UPD: updated ZIP file, corrected mistyping). Does it show the same behaviour for You?
Yes, looks like first item goes too fast and switches everything off without respect to barrier.
But I can't understand why it prints 'FINISH' more times than 'START'.
UPD: It is really executed more times! I added local counter:
printf(" FINISH\n"); atomic_inc(&w); // barrier( CLK_LOCAL_MEM_FENCE ); printf(" FINISH-1: %d\n", w);
(the barrier is commented, as it stops the kernel completely)
Now the program prints
FINISH-1: 1312 FINISH-1: 1312 FINISH-1: 1312 FINISH-1: 1312 FINISH-1: 1312 FINISH-1: 1312 FINISH-1: 1312 FINISH-1: 1312 FINISH FINISH FINISH FINISH FINISH FINISH FINISH FINISH FINISH-1: 1320 FINISH-1: 1320 FINISH-1: 1320 FINISH-1: 1320 FINISH-1: 1320 FINISH-1: 1320 FINISH-1: 1320 FINISH-1: 1320 ... etc.
Cool picture, looks like the system starts two bunches of 8 items,
but cannot get the second right, and starts it again and again.
Is not it a bug? I started only 16 work-items...
Thanks for your interest!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Petr,
Could you fix the second issue mentioned by Ben and check if it fixes you problem, please?
You can do this by placing additional barrier in the while loop before the if statement.
I would avoid putting any expectations on compiler optimizations results for invalid code.
Thanks,
Alexey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks!
Half of the problem gone :) two of three demonstration cases are fixed now.
This one persists:
__kernel void glitch( __local uint *scratch ) { printf(" START\n"); uint id = (uint)get_local_id(0); if (id==0) { scratch[1] = 0; } barrier( CLK_LOCAL_MEM_FENCE ); while ( scratch[1]<1 ) { barrier( CLK_LOCAL_MEM_FENCE ); if (id==0) { scratch[1]++; } barrier( CLK_LOCAL_MEM_FENCE ); } printf(" FINISH\n"); }
If I replace the index occurencies (all three) from [1] to [0] then the problem disappears. All other nonzero values are bad, too.
Do I set wrong local argument size? This is how I do it:
ret = clSetKernelArg( kernel, 0, 1024, NULL);
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Alexey, I am sorry if looked arrogant.
P.S. Now I understood the logic of the second issue. Yes, tricky...
Any loop should not contain barriers, then.
Moreover, the loop is not guaranteed to run in the whole width?
This one freezes completely and does not print 'FINISH':
uint id = (uint)get_local_id(0); printf(" Hello! my id is %d\n", id); if (id==0) { scratch[0] = 0; } barrier( CLK_LOCAL_MEM_FENCE ); while ( scratch[0]<1 ) { if (id==0) { atomic_inc( &scratch[0] ); } } printf(" FINISH\n");
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi PetrF,
Devs have suggested the spinning and atomic behavior in the example exhibits behavior outside of whats defined in OpenCL 2.0 Section 3.2.2 standard.
"In the absence of work-group functions (e.g. a barrier), work-items within a workgroup may be serialized. In the presence of work-group functions, work-items within a workgroup may be serialized before any given work-group function, between dynamically encountered pairs of work-group functions and between a work-group function and the end of the kernel."
So, if the order of work-items execution is, for example, 3, 2, 1, 0 and the were serialized (this is correct by the spec), first executed work-item with id = 3 will never exit the loop.
Secondly, this section from 3.2.2:
"The work-items within a single work-group execute concurrently but not necessarily in parallel (i.e. they are not guaranteed to make independent forward progress). "
"The potential of the work-items within a workgroup to be serialized means that independent forward progress of the work-items cannot be assumed; therefore, synchronization between subsets of work-items within a work-group (e.g. using spin-locks) cannot portably be supported in OpenCL. "
-MichaelC
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page