<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Inefficient clz implementation in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/Inefficient-clz-implementation/m-p/782068#M422</link>
    <description>Or use 31 - bsr (for nonzero input) :-)</description>
    <pubDate>Tue, 30 Nov 2010 21:44:20 GMT</pubDate>
    <dc:creator>neni</dc:creator>
    <dc:date>2010-11-30T21:44:20Z</dc:date>
    <item>
      <title>Inefficient clz implementation</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Inefficient-clz-implementation/m-p/782065#M419</link>
      <description>My OpenCL code was running much slower than it should, and I was surprised to find that the clz function (count leading zeros) was the culprit. Writing OpenCL code, I got used to clz being fast, and it took me a while to find out why my code was running at half the speed it should have.&lt;BR /&gt;&lt;BR /&gt;I do understand that, unlike GPUs, the x86 instruction set doesn't include anything useful for this operation, but still, it can be much more efficient.&lt;BR /&gt;&lt;BR /&gt;The current implementation seems to just loop through the bits until it finds a nonzero one. That's up to 32 loop iterations, and 32 unpredictable conditional jumps are very slow.&lt;PRE&gt;[plain]__Z3clzi:                               # @_Z3clzi
# BB#0:
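	mov	ECX, -2147483648                # ECX = 0x80000000, a mask starting at the top bit
	xor	EAX, EAX                        # EAX = 0, the running count of leading zeros
	mov	EDX, DWORD PTR [ESP + 4]        # EDX = the argument
	jmp	LBB1_1
	.align	16, 0x90
LBB1_3:                                 #   in Loop: Header=BB1_1 Depth=1
	inc	EAX                             # one more leading zero
	shr	ECX                             # move the mask down one bit
LBB1_1:                                 # =&amp;gt;This Inner Loop Header: Depth=1
	test	ECX, ECX
	je	LBB1_4                          # mask exhausted: all 32 bits were zero
# BB#2:                                 #   in Loop: Header=BB1_1 Depth=1
	test	ECX, EDX
	je	LBB1_3                          # bit is clear: keep counting
LBB1_4:
	ret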
[/plain]&lt;/PRE&gt; &lt;BR /&gt;A more efficient implementation could at least use a 256-entry lookup table to find the leading zeros within a byte,&lt;BR /&gt;so clz would only need to step through at most four bytes.&lt;BR /&gt;&lt;BR /&gt;Another problem is that even when given a constant argument, it still generates slow code instead of calculating the result at compile time.</description>
      <pubDate>Sat, 20 Nov 2010 11:23:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Inefficient-clz-implementation/m-p/782065#M419</guid>
      <dc:creator>Gregory_S__Chudov</dc:creator>
      <dc:date>2010-11-20T11:23:32Z</dc:date>
    </item>
    <item>
      <title>Inefficient clz implementation</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Inefficient-clz-implementation/m-p/782066#M420</link>
      <description>Here's an example of faster clz:&lt;BR /&gt;&lt;PRE&gt;[cpp]inline int fastclz(int iv)&lt;BR /&gt;{&lt;BR /&gt; unsigned int v = (unsigned int)iv;&lt;BR /&gt; int x = (0 != (v &amp;gt;&amp;gt; 16)) * 16;&lt;BR /&gt; x += (0 != (v &amp;gt;&amp;gt; (x + 8))) * 8;&lt;BR /&gt; x += (0 != (v &amp;gt;&amp;gt; (x + 4))) * 4;&lt;BR /&gt; x += (0 != (v &amp;gt;&amp;gt; (x + 2))) * 2;&lt;BR /&gt; x += (0 != (v &amp;gt;&amp;gt; (x + 1)));&lt;BR /&gt; x += (0 != (v &amp;gt;&amp;gt; x));&lt;BR /&gt; return 32 - x;&lt;BR /&gt;}&lt;BR /&gt;[/cpp]&lt;/PRE&gt;</description>
      <pubDate>Sat, 20 Nov 2010 13:03:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Inefficient-clz-implementation/m-p/782066#M420</guid>
      <dc:creator>Gregory_S__Chudov</dc:creator>
      <dc:date>2010-11-20T13:03:45Z</dc:date>
    </item>
    <item>
      <title>Inefficient clz implementation</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Inefficient-clz-implementation/m-p/782067#M421</link>
      <description>Hi Gregory,&lt;BR /&gt;&lt;BR /&gt;This is a good suggestion, and we will consider adopting this kind of approach for clz in our future releases.&lt;BR /&gt;&lt;BR /&gt;Thanks for the post,&lt;BR /&gt;Boaz Ouriel</description>
      <pubDate>Mon, 22 Nov 2010 15:15:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Inefficient-clz-implementation/m-p/782067#M421</guid>
      <dc:creator>Boaz_O_Intel</dc:creator>
      <dc:date>2010-11-22T15:15:54Z</dc:date>
    </item>
    <item>
      <title>Inefficient clz implementation</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Inefficient-clz-implementation/m-p/782068#M422</link>
      <description>Or use 31 - bsr (for nonzero input) :-)</description>
      <pubDate>Tue, 30 Nov 2010 21:44:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Inefficient-clz-implementation/m-p/782068#M422</guid>
      <dc:creator>neni</dc:creator>
      <dc:date>2010-11-30T21:44:20Z</dc:date>
    </item>
  </channel>
</rss>

