<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Broadwell IGP needs more sub_group functions in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060331#M4197</link>
    <description>&lt;P&gt;OpenCL 2.0 has no support for a "ballot" style sub-group function. A ballot returns a bitmask containing the conditional flag for each "lane" in the sub-group. As long as the sub-group (SIMD) size is 32 or less, this fits in a cl_uint.&lt;/P&gt;

&lt;P&gt;Presumably sub-group any() and all() are implemented on Broadwell IGP by returning an ARF flag register?&lt;/P&gt;

&lt;P&gt;It would be great if Broadwell IGP unofficially implemented sub_group_any() by returning the actual flag bitmask so that developers could apply popcount() and other operations to the mask.&lt;/P&gt;

&lt;P&gt;For those not aware, a classic use case for a ballot mask is packing data in a sub-group into a local memory array without having to use a full exclusive add scan. &amp;nbsp;It's very efficient.&lt;/P&gt;

&lt;P&gt;You can implement a ballot() with an inclusive scan but that's going to be ~8x as many ops for SIMD16.&lt;/P&gt;

</description>
    <pubDate>Tue, 27 Jan 2015 17:50:50 GMT</pubDate>
    <dc:creator>Allan_M_</dc:creator>
    <dc:date>2015-01-27T17:50:50Z</dc:date>
    <item>
      <title>Broadwell IGP needs more sub_group functions</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060331#M4197</link>
      <description>&lt;P&gt;OpenCL 2.0 has no support for a "ballot" style sub-group function. A ballot returns a bitmask containing the conditional flag for each "lane" in the sub-group. As long as the sub-group (SIMD) size is 32 or less, this fits in a cl_uint.&lt;/P&gt;

&lt;P&gt;Presumably sub-group any() and all() are implemented on Broadwell IGP by returning an ARF flag register?&lt;/P&gt;

&lt;P&gt;It would be great if Broadwell IGP unofficially implemented sub_group_any() by returning the actual flag bitmask so that developers could apply popcount() and other operations to the mask.&lt;/P&gt;

&lt;P&gt;For those not aware, a classic use case for a ballot mask is packing data in a sub-group into a local memory array without having to use a full exclusive add scan. &amp;nbsp;It's very efficient.&lt;/P&gt;
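
<![CDATA[<P>To make the packing use case concrete, here is a minimal host-side sketch in plain C. It only emulates one sub-group with ordinary arrays; it is not OpenCL device code. The helper names (ballot, popcount32, pack_lanes) are hypothetical and chosen for illustration, and the 32-lane limit is the assumption stated in the post.</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side emulation: bit i of the result is lane i's predicate. */
static uint32_t ballot(const int *pred, size_t lanes)
{
    assert(lanes <= 32); /* the mask must fit in 32 bits */
    uint32_t mask = 0;
    for (size_t i = 0; i < lanes; ++i)
        if (pred[i])
            mask |= UINT32_C(1) << i;
    return mask;
}

/* Portable popcount; real hardware would typically use one instruction. */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    for (; x != 0; x &= x - 1) /* clear the lowest set bit */
        ++n;
    return n;
}

/* Pack the values whose predicate is set into out[]; returns the count. */
static size_t pack_lanes(const int *vals, const int *pred, size_t lanes, int *out)
{
    uint32_t mask = ballot(pred, lanes);
    for (size_t lane = 0; lane < lanes; ++lane) {
        if (!pred[lane])
            continue;
        /* Bits for all lanes below this one (a "lanes less than" mask). */
        uint32_t below = (UINT32_C(1) << lane) - 1;
        out[popcount32(mask & below)] = vals[lane];
    }
    return popcount32(mask);
}
```

<P>Each active lane finds its output slot by counting the set bits below it in the mask, so no add scan across the sub-group is needed.</P>
]]>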

&lt;P&gt;You can implement a ballot() with an inclusive scan but that's going to be ~8x as many ops for SIMD16.&lt;/P&gt;
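
<![CDATA[<P>One way the scan-based fallback could look, again as a host-side C emulation rather than device code: each lane contributes a single bit at its own position, and the inclusive add scan leaves the complete mask in the last lane. The function names are hypothetical, and on a real device the last lane's value would still need to be broadcast to the others.</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulated inclusive add scan across the lanes of one sub-group. */
static void scan_inclusive_add(const uint32_t *in, uint32_t *out, size_t lanes)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < lanes; ++i) {
        sum += in[i];
        out[i] = sum;
    }
}

/* Ballot built from the scan: lane i contributes bit i if its predicate is set. */
static uint32_t ballot_via_scan(const int *pred, size_t lanes)
{
    assert(lanes > 0 && lanes <= 32);
    uint32_t contrib[32], scanned[32];
    for (size_t i = 0; i < lanes; ++i)
        contrib[i] = pred[i] ? UINT32_C(1) << i : 0;
    scan_inclusive_add(contrib, scanned, lanes);
    return scanned[lanes - 1]; /* the last lane holds the full mask */
}
```

<P>Because every lane adds a distinct bit, the additions never carry and the sum equals the bitwise OR of the contributions.</P>
]]>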

</description>
      <pubDate>Tue, 27 Jan 2015 17:50:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060331#M4197</guid>
      <dc:creator>Allan_M_</dc:creator>
      <dc:date>2015-01-27T17:50:50Z</dc:date>
    </item>
    <item>
      <title>Allan,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060332#M4198</link>
      <description>&lt;P&gt;Allan,&lt;/P&gt;

&lt;P&gt;Internally, we do have such functionality. I am trying to find out from our driver architects when we can get it into a production driver.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Feb 2015 22:50:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060332#M4198</guid>
      <dc:creator>Robert_I_Intel</dc:creator>
      <dc:date>2015-02-02T22:50:27Z</dc:date>
    </item>
    <item>
      <title>Thanks Robert!
-Allan M.</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060333#M4199</link>
      <description>&lt;P&gt;Thanks Robert!&lt;/P&gt;

&lt;P&gt;-Allan M.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Feb 2015 22:53:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060333#M4199</guid>
      <dc:creator>allanmac1</dc:creator>
      <dc:date>2015-02-02T22:53:24Z</dc:date>
    </item>
    <item>
      <title>One way of exposing portable</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060334#M4200</link>
      <description>&lt;P&gt;One way of exposing portable ballot() functionality might be to use my suggestion here:&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&lt;A href="https://github.com/KhronosGroup/SPIRV-Headers/issues/9" target="_blank"&gt;https://github.com/KhronosGroup/SPIRV-Headers/issues/9&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;The alternative solution at the bottom can be implemented with a &lt;U&gt;simple compiler optimization&lt;/U&gt;&amp;nbsp;and integrated immediately into Intel's OpenCL compiler.&lt;/P&gt;

&lt;P&gt;Perhaps you're already doing this?&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px; line-height: 19.512px;"&gt;——————————————————————————&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="box-sizing: border-box; margin-bottom: 16px; color: rgb(51, 51, 51); font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 14px; line-height: 22.4px;"&gt;A native&amp;nbsp;&lt;CODE style="box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 11.9px; padding: 0.2em 0px; margin: 0px; border-radius: 3px; background-color: rgba(0, 0, 0, 0.0392157);"&gt;ballot()&lt;/CODE&gt;&amp;nbsp;operation is a useful primitive to exploit for warp/wave/simd work compaction.&lt;/P&gt;

&lt;P style="box-sizing: border-box; margin-bottom: 16px; color: rgb(51, 51, 51); font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 14px; line-height: 22.4px;"&gt;A subgroup&amp;nbsp;&lt;CODE style="box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 11.9px; padding: 0.2em 0px; margin: 0px; border-radius: 3px; background-color: rgba(0, 0, 0, 0.0392157);"&gt;ballot()&lt;/CODE&gt;&amp;nbsp;operation is not exposed in SPIR-V or OpenCL (right?) and the existence of architectures with sub_group widths over 32 lanes preclude this from being represented with a uint32_t.&lt;/P&gt;

&lt;P style="box-sizing: border-box; margin-bottom: 16px; color: rgb(51, 51, 51); font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 14px; line-height: 22.4px;"&gt;If the&amp;nbsp;&lt;A href="https://www.khronos.org/registry/spir-v/specs/1.1/SPIRV.html#_a_id_group_a_group_instructions" style="box-sizing: border-box; color: rgb(64, 120, 192); background-color: transparent;"&gt;OpGroupIAdd opcode&lt;/A&gt;&amp;nbsp;was relaxed to support differing return and argument types — specifically, an integer return type and boolean argument — then SPIR-V would be able to&amp;nbsp;&lt;EM style="box-sizing: border-box;"&gt;optionally&lt;/EM&gt;&amp;nbsp;efficiently express:&lt;/P&gt;

&lt;P style="box-sizing: border-box; margin-bottom: 16px; color: rgb(51, 51, 51); font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 14px; line-height: 22.4px;"&gt;&lt;CODE style="box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 11.9px; padding: 0.2em 0px; margin: 0px; border-radius: 3px; background-color: rgba(0, 0, 0, 0.0392157);"&gt;popcount( ballot() &amp;amp; lanes_less_than() )&lt;/CODE&gt;&lt;BR style="box-sizing: border-box;" /&gt;
	&lt;CODE style="box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 11.9px; padding: 0.2em 0px; margin: 0px; border-radius: 3px; background-color: rgba(0, 0, 0, 0.0392157);"&gt;popcount( ballot() &amp;amp; lanes_less_than_or_equal() )&lt;/CODE&gt;&lt;BR style="box-sizing: border-box;" /&gt;
	&lt;CODE style="box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 11.9px; padding: 0.2em 0px; margin: 0px; border-radius: 3px; background-color: rgba(0, 0, 0, 0.0392157);"&gt;popcount( ballot() )&lt;/CODE&gt;&lt;/P&gt;

&lt;P style="box-sizing: border-box; margin-bottom: 16px; color: rgb(51, 51, 51); font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 14px; line-height: 22.4px;"&gt;This would then allow OpenCL to expose the following potentially optimal sub_group functions:&lt;/P&gt;

&lt;P style="box-sizing: border-box; margin-bottom: 16px; color: rgb(51, 51, 51); font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 14px; line-height: 22.4px;"&gt;&lt;CODE style="box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 11.9px; padding: 0.2em 0px; margin: 0px; border-radius: 3px; background-color: rgba(0, 0, 0, 0.0392157);"&gt;int sub_group_scan_exclusive_add(bool pred)&lt;/CODE&gt;&lt;BR style="box-sizing: border-box;" /&gt;
	&lt;CODE style="box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 11.9px; padding: 0.2em 0px; margin: 0px; border-radius: 3px; background-color: rgba(0, 0, 0, 0.0392157);"&gt;int sub_group_scan_inclusive_add(bool pred)&lt;/CODE&gt;&lt;BR style="box-sizing: border-box;" /&gt;
	&lt;CODE style="box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 11.9px; padding: 0.2em 0px; margin: 0px; border-radius: 3px; background-color: rgba(0, 0, 0, 0.0392157);"&gt;int sub_group_reduce_add(bool pred)&lt;/CODE&gt;&lt;/P&gt;

&lt;P style="box-sizing: border-box; margin-bottom: 16px; color: rgb(51, 51, 51); font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 14px; line-height: 22.4px;"&gt;&lt;EM style="box-sizing: border-box;"&gt;Alternatively&lt;/EM&gt;, simply recognizing cases where the integer subgroup scan argument is guaranteed to be 0 or 1 would allow a native&amp;nbsp;&lt;CODE style="box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 11.9px; padding: 0.2em 0px; margin: 0px; border-radius: 3px; background-color: rgba(0, 0, 0, 0.0392157);"&gt;popcount( ballot() &amp;amp; lanes_mask_xxx() )&lt;/CODE&gt;&amp;nbsp;sequence to be emitted and the&amp;nbsp;&lt;A href="https://www.khronos.org/registry/spir-v/specs/1.1/SPIRV.html#_a_id_group_a_group_instructions" style="box-sizing: border-box; color: rgb(64, 120, 192); background-color: transparent;"&gt;OpGroupIAdd opcode&lt;/A&gt;&amp;nbsp;specification left as is.&lt;/P&gt;

&lt;P style="box-sizing: border-box; color: rgb(51, 51, 51); font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 14px; line-height: 22.4px; margin-bottom: 0px !important;"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 22 May 2016 17:44:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060334#M4200</guid>
      <dc:creator>allanmac1</dc:creator>
      <dc:date>2016-05-22T17:44:00Z</dc:date>
    </item>
    <item>
      <title>Hi Allan,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060335#M4201</link>
      <description>&lt;P&gt;Hi Allan,&lt;/P&gt;

&lt;P&gt;A couple of questions: 1) Are you or your company a Khronos member? 2) Does your company have an NDA with Intel in place?&lt;/P&gt;

&lt;P&gt;Our OpenCL driver architect just pointed out:&lt;/P&gt;

&lt;P style="margin: 0in 0in 0pt;"&gt;&lt;SPAN style="color:#1F497D"&gt;&lt;FONT face="Calibri" size="3"&gt;Of note, there’s also a related GLSL extension that the Vulkan folks are looking at adding:&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="margin: 0in 0in 0pt;"&gt;&lt;SPAN style="color:#1F497D"&gt;&lt;A href="https://www.opengl.org/registry/specs/ARB/shader_ballot.txt"&gt;&lt;U&gt;&lt;FONT color="#0563c1" face="Calibri" size="3"&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/A&gt;&lt;A href="https://www.opengl.org/registry/specs/ARB/shader_ballot.txt" target="_blank"&gt;https://www.opengl.org/registry/specs/ARB/shader_ballot.txt&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;We have a lot of activity on this subject but nothing to announce publicly yet.&lt;/DIV&gt;</description>
      <pubDate>Tue, 24 May 2016 23:58:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060335#M4201</guid>
      <dc:creator>Robert_I_Intel</dc:creator>
      <dc:date>2016-05-24T23:58:13Z</dc:date>
    </item>
    <item>
      <title>I could do with ballot too. </title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060336#M4202</link>
      <description>&lt;P&gt;I could do with ballot too. I am a Khronos member, but it's for an open-source project, so I don't think that will be particularly useful. Note that I'm fine with the solution being vendor-specific, e.g. inline assembler. For example, ballot is available on NVIDIA using inline assembler, even though NVIDIA itself only supports OpenCL 1.2: &lt;A href="https://github.com/hughperkins/neonCl-underconstruction/blob/52d46b105dd9780ef7120831e143bba466c0d165/neoncl/backends/kernels/cl/convolution_cl.py#L615-L630" target="_blank"&gt;https://github.com/hughperkins/neonCl-underconstruction/blob/52d46b105dd9780ef7120831e143bba466c0d165/neoncl/backends/kernels/cl/convolution_cl.py#L615-L630&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Jun 2016 03:14:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Broadwell-IGP-needs-more-sub-group-functions/m-p/1060336#M4202</guid>
      <dc:creator>Hugh_P_</dc:creator>
      <dc:date>2016-06-30T03:14:36Z</dc:date>
    </item>
  </channel>
</rss>

