- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
My opencl code was running much slower than it should, i was surprised to find out that clz function (count-leading-zeroes) was the culprit. Writing opencl code i got used to clz being fast, and it took me a while to find out why my code performance was twice lower that it should have.
I do understand that unlike GPUs x86 command set doesn't include anything useful for this operation, but still, it can be much more efficient.
Current implementation seems to just loop throgh the bits until it finds a nonzero one. That's up to 32 loop cycles. 32 unpredictable conditional jumps are very slow.
More efficient implementation could at least use a lookup table of 256 ints to find leading zero within a byte,
so clz would only need to cycle through four bytes.
Other problem is that even when given a constant argument it still generates a slow code instead of calculating result at compile time.
I do understand that unlike GPUs x86 command set doesn't include anything useful for this operation, but still, it can be much more efficient.
Current implementation seems to just loop throgh the bits until it finds a nonzero one. That's up to 32 loop cycles. 32 unpredictable conditional jumps are very slow.
[plain]__Z3clzi: # @_Z3clzi # BB#0: mov ECX, -2147483648 xor EAX, EAX mov EDX, DWORD PTR [ESP + 4] jmp LBB1_1 .align 16, 0x90 LBB1_3: # in Loop: Header=BB1_1 Depth=1 inc EAX shr ECX LBB1_1: # =>This Inner Loop Header: Depth=1 test ECX, ECX je LBB1_4 # BB#2: # in Loop: Header=BB1_1 Depth=1 test ECX, EDX je LBB1_3 LBB1_4: ret [/plain]
More efficient implementation could at least use a lookup table of 256 ints to find leading zero within a byte,
so clz would only need to cycle through four bytes.
Other problem is that even when given a constant argument it still generates a slow code instead of calculating result at compile time.
1 Solution
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Gregory,
This is a good suggestion and we will consider adopting this kind of approach for clzinour future releases
Thanks for the post,
Boaz Ouriel
This is a good suggestion and we will consider adopting this kind of approach for clzinour future releases
Thanks for the post,
Boaz Ouriel
Link Copied
3 Replies
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Here's an example of faster clz:
[cpp]inline int fastclz(int iv)
{
unsigned int v = (unsigned int)iv;
int x = (0 != (v >> 16)) * 16;
x += (0 != (v >> (x + 8))) * 8;
x += (0 != (v >> (x + 4))) * 4;
x += (0 != (v >> (x + 2))) * 2;
x += (0 != (v >> (x + 1)));
x += (0 != (v >> x));
return 32 - x;
}
[/cpp]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Gregory,
This is a good suggestion and we will consider adopting this kind of approach for clzinour future releases
Thanks for the post,
Boaz Ouriel
This is a good suggestion and we will consider adopting this kind of approach for clzinour future releases
Thanks for the post,
Boaz Ouriel
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
or use 32-bsr :-)

Reply
Topic Options
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page