Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
104 Views

Inefficient clz implementation

Jump to solution
My opencl code was running much slower than it should, i was surprised to find out that clz function (count-leading-zeroes) was the culprit. Writing opencl code i got used to clz being fast, and it took me a while to find out why my code performance was twice lower that it should have.

I do understand that unlike GPUs x86 command set doesn't include anything useful for this operation, but still, it can be much more efficient.

Current implementation seems to just loop throgh the bits until it finds a nonzero one. That's up to 32 loop cycles. 32 unpredictable conditional jumps are very slow.
[plain]__Z3clzi:                               # @_Z3clzi
# BB#0:
	mov	ECX, -2147483648
	xor	EAX, EAX
	mov	EDX, DWORD PTR [ESP + 4]
	jmp	LBB1_1
	.align	16, 0x90
LBB1_3:                                 #   in Loop: Header=BB1_1 Depth=1
	inc	EAX
	shr	ECX
LBB1_1:                                 # =>This Inner Loop Header: Depth=1
	test	ECX, ECX
	je	LBB1_4
# BB#2:                                 #   in Loop: Header=BB1_1 Depth=1
	test	ECX, EDX
	je	LBB1_3
LBB1_4:
	ret
[/plain]

More efficient implementation could at least use a lookup table of 256 ints to find leading zero within a byte,
so clz would only need to cycle through four bytes.

Other problem is that even when given a constant argument it still generates a slow code instead of calculating result at compile time.
0 Kudos

Accepted Solutions
Highlighted
Employee
104 Views
Hi Gregory,

This is a good suggestion and we will consider adopting this kind of approach for clzinour future releases

Thanks for the post,
Boaz Ouriel

View solution in original post

0 Kudos
3 Replies
Highlighted
104 Views
Here's an example of faster clz:
[cpp]inline int fastclz(int iv)
{
unsigned int v = (unsigned int)iv;
int x = (0 != (v >> 16)) * 16;
x += (0 != (v >> (x + 8))) * 8;
x += (0 != (v >> (x + 4))) * 4;
x += (0 != (v >> (x + 2))) * 2;
x += (0 != (v >> (x + 1)));
x += (0 != (v >> x));
return 32 - x;
}
[/cpp]
0 Kudos
Highlighted
Employee
105 Views
Hi Gregory,

This is a good suggestion and we will consider adopting this kind of approach for clzinour future releases

Thanks for the post,
Boaz Ouriel

View solution in original post

0 Kudos
Highlighted
New Contributor II
104 Views
or use 32-bsr :-)
0 Kudos