OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be ask in the FPGA Intel® High Level Design forum.
1718 Discussions

Inefficient clz implementation

Gregory_S__Chudov
1,556 Views
My opencl code was running much slower than it should, i was surprised to find out that clz function (count-leading-zeroes) was the culprit. Writing opencl code i got used to clz being fast, and it took me a while to find out why my code performance was twice lower that it should have.

I do understand that unlike GPUs x86 command set doesn't include anything useful for this operation, but still, it can be much more efficient.

Current implementation seems to just loop throgh the bits until it finds a nonzero one. That's up to 32 loop cycles. 32 unpredictable conditional jumps are very slow.
[plain]__Z3clzi:                               # @_Z3clzi
# BB#0:
	mov	ECX, -2147483648
	xor	EAX, EAX
	mov	EDX, DWORD PTR [ESP + 4]
	jmp	LBB1_1
	.align	16, 0x90
LBB1_3:                                 #   in Loop: Header=BB1_1 Depth=1
	inc	EAX
	shr	ECX
LBB1_1:                                 # =>This Inner Loop Header: Depth=1
	test	ECX, ECX
	je	LBB1_4
# BB#2:                                 #   in Loop: Header=BB1_1 Depth=1
	test	ECX, EDX
	je	LBB1_3
LBB1_4:
	ret
[/plain]

More efficient implementation could at least use a lookup table of 256 ints to find leading zero within a byte,
so clz would only need to cycle through four bytes.

Other problem is that even when given a constant argument it still generates a slow code instead of calculating result at compile time.
0 Kudos
1 Solution
Boaz_O_Intel
Employee
1,556 Views
Hi Gregory,

This is a good suggestion and we will consider adopting this kind of approach for clzinour future releases

Thanks for the post,
Boaz Ouriel

View solution in original post

0 Kudos
3 Replies
Gregory_S__Chudov
1,556 Views
Here's an example of faster clz:
[cpp]inline int fastclz(int iv)
{
unsigned int v = (unsigned int)iv;
int x = (0 != (v >> 16)) * 16;
x += (0 != (v >> (x + 8))) * 8;
x += (0 != (v >> (x + 4))) * 4;
x += (0 != (v >> (x + 2))) * 2;
x += (0 != (v >> (x + 1)));
x += (0 != (v >> x));
return 32 - x;
}
[/cpp]
0 Kudos
Boaz_O_Intel
Employee
1,557 Views
Hi Gregory,

This is a good suggestion and we will consider adopting this kind of approach for clzinour future releases

Thanks for the post,
Boaz Ouriel
0 Kudos
neni
New Contributor II
1,556 Views
or use 32-bsr :-)
0 Kudos
Reply