<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic I tried a bunch of in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/sub-group-broadcast-broken-on-GEN9-21-20-16-4552/m-p/1113377#M5341</link>
    <description>&lt;P&gt;I tried a bunch of workarounds this morning including building a repro case.&lt;/P&gt;

&lt;P&gt;The repro case (attached at bottom) works in isolation, so it does not reproduce the failure.&lt;/P&gt;

&lt;P&gt;I'm broadcasting a 64-bit ulong across the subgroup, so I resorted to printf(). It revealed that only the low dword of the 64-bit ulong was being broadcast -- the high dword was 0.&lt;/P&gt;

&lt;P&gt;The quick workaround? &amp;nbsp;The ulong I was broadcasting lives in a union type that also exposes its lo and hi halves as uints, so explicitly splitting the broadcast into separate lo and hi broadcasts worked around the problem.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;// sg_lid = [0,7]
// keys is a sub group wide register with a different key in each lane/item
// key is broadcast and then processed by the subgroup
#if   0
          key.b64    = sub_group_broadcast(keys.b64,sg_lid);    // FAIL
#elif 1
          key.lo.b32 = sub_group_broadcast(keys.lo.b32,sg_lid); // WORKS
          key.hi.b32 = sub_group_broadcast(keys.hi.b32,sg_lid);
#else
          key.b64    = work_group_broadcast(keys.b64,sg_lid);   // WORKS BUT BAD
#endif
&lt;/PRE&gt;

&lt;P&gt;So... the compiler is failing somewhere.&lt;/P&gt;

&lt;P&gt;I can't send my codebase at this time, so my report isn't very helpful.&lt;/P&gt;

&lt;P&gt;The &lt;EM&gt;working &lt;/EM&gt;repro case for broadcasting ulongs is below:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;__kernel
__attribute__((intel_reqd_sub_group_size(8)))
void
bug_sub_group_broadcast(__global ulong const * restrict const vin, __global ulong * restrict const vout)
{
	// global index of this subgroup across all work groups
	uint const base = (uint)get_group_id(0) * get_enqueued_num_sub_groups() + get_sub_group_id();

	// each lane loads one ulong
	ulong t_s = vin[base * 8 + get_sub_group_local_id()];

	// broadcast each lane's value in turn; every lane stores what it received
	for (int ii=0; ii&amp;lt;8; ii++)
	{
		vout[base * 8 * 8 + ii * 8 + get_sub_group_local_id()] = sub_group_broadcast(t_s, ii);
	}
}&lt;/PRE&gt;

</description>
    <pubDate>Thu, 15 Dec 2016 16:44:33 GMT</pubDate>
    <dc:creator>allanmac1</dc:creator>
    <dc:date>2016-12-15T16:44:33Z</dc:date>
    <item>
      <title>sub_group_broadcast() broken on GEN9 (21.20.16.4552)</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/sub-group-broadcast-broken-on-GEN9-21-20-16-4552/m-p/1113375#M5339</link>
      <description>&lt;P&gt;I have a kernel with a "required subgroup size" of 8.&lt;/P&gt;

&lt;P&gt;My test is launching a grid of 24 global work items and 8 local work items (only for testing purposes).&lt;/P&gt;

&lt;P&gt;After much debugging, the sub_group_broadcast() function was determined to be the culprit.&lt;/P&gt;

&lt;P&gt;Replacing it with work_group_broadcast() resulted in a working kernel.&lt;/P&gt;

&lt;P&gt;Is this a known bug? &amp;nbsp;&lt;/P&gt;

&lt;P&gt;All of the other sub_group_XXX() functions appear to be working.&lt;/P&gt;

&lt;P&gt;-Allan&lt;/P&gt;

&lt;P&gt;Platform: Win10 x64, HD 530, 21.20.16.4552.&lt;/P&gt;

</description>
      <pubDate>Wed, 14 Dec 2016 01:14:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/sub-group-broadcast-broken-on-GEN9-21-20-16-4552/m-p/1113375#M5339</guid>
      <dc:creator>allanmac1</dc:creator>
      <dc:date>2016-12-14T01:14:27Z</dc:date>
    </item>
    <item>
      <title>Thanks for this report.  I</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/sub-group-broadcast-broken-on-GEN9-21-20-16-4552/m-p/1113376#M5340</link>
      <description>&lt;P&gt;Thanks for this report. &amp;nbsp;I have not seen this on the bug list. &amp;nbsp;Is there anything you can send us as a reproducer?&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 01:52:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/sub-group-broadcast-broken-on-GEN9-21-20-16-4552/m-p/1113376#M5340</guid>
      <dc:creator>Jeffrey_M_Intel1</dc:creator>
      <dc:date>2016-12-14T01:52:36Z</dc:date>
    </item>
    <item>
      <title>I tried a bunch of</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/sub-group-broadcast-broken-on-GEN9-21-20-16-4552/m-p/1113377#M5341</link>
      <description>&lt;P&gt;I tried a bunch of workarounds this morning including building a repro case.&lt;/P&gt;

&lt;P&gt;The repro case (attached at bottom) works in isolation, so it does not reproduce the failure.&lt;/P&gt;

&lt;P&gt;I'm broadcasting a 64-bit ulong across the subgroup, so I resorted to printf(). It revealed that only the low dword of the 64-bit ulong was being broadcast -- the high dword was 0.&lt;/P&gt;

&lt;P&gt;The quick workaround? &amp;nbsp;The ulong I was broadcasting lives in a union type that also exposes its lo and hi halves as uints, so explicitly splitting the broadcast into separate lo and hi broadcasts worked around the problem.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;// sg_lid = [0,7]
// keys is a sub group wide register with a different key in each lane/item
// key is broadcast and then processed by the subgroup
#if   0
          key.b64    = sub_group_broadcast(keys.b64,sg_lid);    // FAIL
#elif 1
          key.lo.b32 = sub_group_broadcast(keys.lo.b32,sg_lid); // WORKS
          key.hi.b32 = sub_group_broadcast(keys.hi.b32,sg_lid);
#else
          key.b64    = work_group_broadcast(keys.b64,sg_lid);   // WORKS BUT BAD
#endif
&lt;/PRE&gt;
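
&lt;P&gt;The split workaround above can be sketched in plain host-side C. This is only an illustration under stated assumptions: the names key64_t, broadcast_b32 and broadcast_b64_split are hypothetical, the arrays merely stand in for a subgroup-wide register, and the lo/hi layout assumes a little-endian target. It is not the kernel code and it does not call any OpenCL function.&lt;/P&gt;

```c
/* Hypothetical stand-in for the union described above: one 64-bit view
   plus separate lo/hi 32-bit views (lo is the low dword only on a
   little-endian target, which is an assumption of this sketch). */
typedef union {
    unsigned long long b64;
    struct { unsigned int lo, hi; } b32;
} key64_t;

#define SG_SIZE 8 /* subgroup size used in the thread */

/* Emulates a 32-bit subgroup broadcast: every lane receives lane src. */
static unsigned int broadcast_b32(const unsigned int lanes[SG_SIZE], int src)
{
    return lanes[src];
}

/* 64-bit broadcast assembled from two independent 32-bit broadcasts. */
static unsigned long long broadcast_b64_split(const key64_t keys[SG_SIZE], int src)
{
    unsigned int lo[SG_SIZE];
    unsigned int hi[SG_SIZE];
    key64_t key;
    int i;

    /* Split each lane's key into its two 32-bit halves. */
    for (i = 0; i != SG_SIZE; i++) {
        lo[i] = keys[i].b32.lo;
        hi[i] = keys[i].b32.hi;
    }

    /* Broadcast the halves separately, then view them as one 64-bit key. */
    key.b32.lo = broadcast_b32(lo, src);
    key.b32.hi = broadcast_b32(hi, src);
    return key.b64;
}
```

&lt;P&gt;Because both halves travel through the 32-bit path, a value with a nonzero high dword should come back unchanged, which is the property the failing 64-bit broadcast lost.&lt;/P&gt;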

&lt;P&gt;So... the compiler is failing somewhere.&lt;/P&gt;

&lt;P&gt;I can't send my codebase at this time, so my report isn't very helpful.&lt;/P&gt;

&lt;P&gt;The &lt;EM&gt;working &lt;/EM&gt;repro case for broadcasting ulongs is below:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;__kernel
__attribute__((intel_reqd_sub_group_size(8)))
void
bug_sub_group_broadcast(__global ulong const * restrict const vin, __global ulong * restrict const vout)
{
	// global index of this subgroup across all work groups
	uint const base = (uint)get_group_id(0) * get_enqueued_num_sub_groups() + get_sub_group_id();

	// each lane loads one ulong
	ulong t_s = vin[base * 8 + get_sub_group_local_id()];

	// broadcast each lane's value in turn; every lane stores what it received
	for (int ii=0; ii&amp;lt;8; ii++)
	{
		vout[base * 8 * 8 + ii * 8 + get_sub_group_local_id()] = sub_group_broadcast(t_s, ii);
	}
}&lt;/PRE&gt;

</description>
      <pubDate>Thu, 15 Dec 2016 16:44:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/sub-group-broadcast-broken-on-GEN9-21-20-16-4552/m-p/1113377#M5341</guid>
      <dc:creator>allanmac1</dc:creator>
      <dc:date>2016-12-15T16:44:33Z</dc:date>
    </item>
  </channel>
</rss>

