code optimization + while-loop + local memory counter + CPU = infinite run of the kernel, demo inside

Petr_F__Kartsev · ‎02-24-2018

Hi,

encountered some strange bug and prepared the demonstration.

1. Simple kernel:
--------------------------

#pragma OPENCL EXTENSION cl_intel_printf : enable

__kernel void glitch( 
	__local uint *scratch
) {

	uint k,v;
	k=0;
	v=0; 

	printf(" START\n");
	uint id = (uint)get_global_id(0);
	if (id==0) {
		scratch = v;
	}
	barrier( CLK_LOCAL_MEM_FENCE );
	while ( scratch<2 ) {
		if (id==0) {
			scratch++;
		}
		barrier( CLK_LOCAL_MEM_FENCE );
	}
	printf(" FINISH\n");
}

---------------------------

2. Run with natural grid dimensions: single group 16x1x1 , or 256x1x1
and local memory size, for example, 1024

3. what expected:
the items are started (you see 16 lines of START)
the first of them increments the counter in local memory
and all items are finished (you see 16 lines of FINISH)

4. but FINISH repeat forever!
maybe some stack corruption?
When you switch off the printf and remove its pragma,
the kernel simply does not return which is the same.

---------------------------
cases:

Intel CPU with "-cl-opt-disable" runs fine

Nvidia GPU - no problem with or without optimization.

Intel CPU with optimization and workgroup sizes 8 or 1 runs fine. 2, 4, 16, 64, 256 are bad.
(8 may be the special case: the number of threads in my CPU, i7-4790)

Intel OpenCL SDK and CPU driver are re-installed today.

---------------------------

additional demonstration:
I added two #define's (see 1st and 2nd lines of the code)

GLITCH=0 whows how to overcome with nonzero index for local memory array
and initial value of counter from get_global_id(1)

zero index is bad.

initial value of counter = explicitly written 0 is bad.

GLITCH2 shows that the item counter does not go after 8:

FINISH: z=0 y=0 x=0
FINISH: z=0 y=0 x=1
FINISH: z=0 y=0 x=2
FINISH: z=0 y=0 x=3
FINISH: z=0 y=0 x=4
FINISH: z=0 y=0 x=5
FINISH: z=0 y=0 x=6
FINISH: z=0 y=0 x=7
FINISH: z=0 y=0 x=0
FINISH: z=0 y=0 x=1
FINISH: z=0 y=0 x=2
FINISH: z=0 y=0 x=3
FINISH: z=0 y=0 x=4
FINISH: z=0 y=0 x=5
FINISH: z=0 y=0 x=6
FINISH: z=0 y=0 x=7
FINISH: z=0 y=0 x=0
...etc.

but if you don't call printf with x,y,z, then again everything is OK.

-----
Rather strange.
Would like to know if it works for you?
I don't see any rough errors in my code...

Regards, Petr

Ben_A_Intel · ‎02-26-2018

Hello, I'd need to see both your global work size and your local work size to be sure, but since scratch is only incremented by work item global id zero, won't there be an infinite loop any time there is more than one work group? I think there's also a chance that the loop could execute a varying number of times for work items in a work group, and hence could violate the spec restriction that "If work_group_barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the work_group_barrier." It's a little tricky to see this one, but consider that one of the work items in the work group enters the while loop, increments the counter, and hits the barrier before another work item even evaluates the while loop condition. Barriers in control flow are tricky! Good luck, hope this helps. -- Ben

Petr_F__Kartsev · ‎02-26-2018

Hello,

Yes, I agree that there should be get_local_id(0). This demo is made to run in single group. Now replaced with get_local_id(0), nothing changed.

Global and local sizes are 16, full demonstration is attached to first message (UPD: updated ZIP file, corrected mistyping). Does it show the same behaviour for You?

Yes, looks like first item goes too fast and switches everything off without respect to barrier.

But I can't understand why it prints 'FINISH' more times than 'START'.

UPD: It is really executed more times! I added local counter:

    printf(" FINISH\n");
    atomic_inc(&w);
//    barrier( CLK_LOCAL_MEM_FENCE );
    printf(" FINISH-1: %d\n", w);

(the barrier is commented, as it stops the kernel completely)

Now the program prints

 FINISH-1: 1312
 FINISH-1: 1312
 FINISH-1: 1312
 FINISH-1: 1312
 FINISH-1: 1312
 FINISH-1: 1312
 FINISH-1: 1312
 FINISH-1: 1312
 FINISH
 FINISH
 FINISH
 FINISH
 FINISH
 FINISH
 FINISH
 FINISH
 FINISH-1: 1320
 FINISH-1: 1320
 FINISH-1: 1320
 FINISH-1: 1320
 FINISH-1: 1320
 FINISH-1: 1320
 FINISH-1: 1320
 FINISH-1: 1320
... etc.

Cool picture, looks like the system starts two bunches of 8 items,
but cannot get the second right, and starts it again and again.

Is not it a bug? I started only 16 work-items...

Thanks for your interest!

Alexey_B_Intel1 · ‎02-26-2018

Hi Petr,

Could you fix the second issue mentioned by Ben and check if it fixes you problem, please?

You can do this by placing additional barrier in the while loop before the if statement.

I would avoid putting any expectations on compiler optimizations results for invalid code.

Thanks,

Alexey

Petr_F__Kartsev · ‎02-26-2018

Thanks!

Half of the problem gone :) two of three demonstration cases are fixed now.

This one persists:

__kernel void glitch( 
	__local uint *scratch
) {

	printf(" START\n"); 
	uint id = (uint)get_local_id(0);
	if (id==0) {
		scratch[1] = 0;
	}
	barrier( CLK_LOCAL_MEM_FENCE );
	while ( scratch[1]<1 ) {
		barrier( CLK_LOCAL_MEM_FENCE );
		if (id==0) {
			scratch[1]++;
		}
		barrier( CLK_LOCAL_MEM_FENCE );
	}
	printf(" FINISH\n");
}

If I replace the index occurencies (all three) from [1] to [0] then the problem disappears. All other nonzero values are bad, too.

Do I set wrong local argument size? This is how I do it:

    ret = clSetKernelArg( kernel, 0, 1024, NULL);

Petr_F__Kartsev · ‎02-26-2018

Alexey, I am sorry if looked arrogant.

P.S. Now I understood the logic of the second issue. Yes, tricky...
Any loop should not contain barriers, then.

Moreover, the loop is not guaranteed to run in the whole width?

This one freezes completely and does not print 'FINISH':

	uint id = (uint)get_local_id(0);
	printf(" Hello! my id is %d\n", id);
	if (id==0) {
		scratch[0] = 0;
	}
	barrier( CLK_LOCAL_MEM_FENCE );
	while ( scratch[0]<1 ) {
		if (id==0) {
			atomic_inc( &scratch[0] );
		}
	}
	printf(" FINISH\n");

Michael_C_Intel1 · ‎08-13-2018

Hi PetrF,

Devs have suggested the spinning and atomic behavior in the example exhibits behavior outside of whats defined in OpenCL 2.0 Section 3.2.2 standard.

"In the absence of work-group functions (e.g. a barrier), work-items within a workgroup may be serialized. In the presence of work-group functions, work-items within a workgroup may be serialized before any given work-group function, between dynamically encountered pairs of work-group functions and between a work-group function and the end of the kernel."

So, if the order of work-items execution is, for example, 3, 2, 1, 0 and the were serialized (this is correct by the spec), first executed work-item with id = 3 will never exit the loop.

Secondly, this section from 3.2.2:

"The work-items within a single work-group execute concurrently but not necessarily in parallel (i.e. they are not guaranteed to make independent forward progress). "

"The potential of the work-items within a workgroup to be serialized means that independent forward progress of the work-items cannot be assumed; therefore, synchronization between subsets of work-items within a work-group (e.g. using spin-locks) cannot portably be supported in OpenCL. "

-MichaelC