topic Inefficient clz implementation in OpenCL* for CPU

Inefficient clz implementation

Gregory_S__Chudov — Sat, 20 Nov 2010 11:23:32 GMT

My opencl code was running much slower than it should, i was surprised to find out that clz function (count-leading-zeroes) was the culprit. Writing opencl code i got used to clz being fast, and it took me a while to find out why my code performance was twice lower that it should have.

I do understand that unlike GPUs x86 command set doesn't include anything useful for this operation, but still, it can be much more efficient.

Current implementation seems to just loop throgh the bits until it finds a nonzero one. That's up to 32 loop cycles. 32 unpredictable conditional jumps are very slow.

[plain]__Z3clzi:                               # @_Z3clzi
# BB#0:
	mov	ECX, -2147483648
	xor	EAX, EAX
	mov	EDX, DWORD PTR [ESP + 4]
	jmp	LBB1_1
	.align	16, 0x90
LBB1_3:                                 #   in Loop: Header=BB1_1 Depth=1
	inc	EAX
	shr	ECX
LBB1_1:                                 # =>This Inner Loop Header: Depth=1
	test	ECX, ECX
	je	LBB1_4
# BB#2:                                 #   in Loop: Header=BB1_1 Depth=1
	test	ECX, EDX
	je	LBB1_3
LBB1_4:
	ret
[/plain]

More efficient implementation could at least use a lookup table of 256 ints to find leading zero within a byte,
so clz would only need to cycle through four bytes.

Other problem is that even when given a constant argument it still generates a slow code instead of calculating result at compile time.

Inefficient clz implementation

Gregory_S__Chudov — Sat, 20 Nov 2010 13:03:45 GMT

Here's an example of faster clz:

[cpp]inline int fastclz(int iv)
{
 unsigned int v = (unsigned int)iv;
 int x = (0 != (v >> 16)) * 16;
 x += (0 != (v >> (x + 8))) * 8;
 x += (0 != (v >> (x + 4))) * 4;
 x += (0 != (v >> (x + 2))) * 2;
 x += (0 != (v >> (x + 1)));
 x += (0 != (v >> x));
 return 32 - x;
}
[/cpp]

Inefficient clz implementation

Boaz_O_Intel — Mon, 22 Nov 2010 15:15:54 GMT

Hi Gregory,

This is a good suggestion and we will consider adopting this kind of approach for clzinour future releases

Thanks for the post,
Boaz Ouriel

Inefficient clz implementation

neni — Tue, 30 Nov 2010 21:44:20 GMT

or use 32-bsr :-)