<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Thank you Igor, in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935259#M17114</link>
    <description>&lt;P&gt;Thank you Igor,&lt;/P&gt;
&lt;P&gt;now it's much clearer. It was extremely kind of you!&lt;/P&gt;</description>
    <pubDate>Thu, 26 Sep 2013 12:27:30 GMT</pubDate>
    <dc:creator>Daniil_Osokin</dc:creator>
    <dc:date>2013-09-26T12:27:30Z</dc:date>
    <item>
      <title>Asynchronous C/C++ GPU optimization</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935255#M17110</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;
&lt;P&gt;If I need to process a fairly large image (for example, 4K), would it be efficient to split the input image into tiles (with the tile size based on the number of GMA cores, to ensure full GPU utilization) and execute the whole sequence of image processing operations tile by tile, e.g. compute Sobel on tile 0, then on tile 1, ...?&lt;/P&gt;</description>
      <pubDate>Sat, 21 Sep 2013 13:57:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935255#M17110</guid>
      <dc:creator>Daniil_Osokin</dc:creator>
      <dc:date>2013-09-21T13:57:02Z</dc:date>
    </item>
    <item>
      <title>Hi Daniil,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935256#M17111</link>
      <description>&lt;P&gt;Hi Daniil,&lt;/P&gt;
&lt;P&gt;it's not clear whether you are going to use the IPP Async functionality for this purpose or your own implementation. If you mean IPP Async, you don't need to care about tiling: it is done internally. Almost all Async functions work with 16x8 pixel blocks for the best Gen EU utilization.&lt;/P&gt;
&lt;P&gt;regards, Igor&lt;/P&gt;</description>
      <pubDate>Wed, 25 Sep 2013 06:45:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935256#M17111</guid>
      <dc:creator>Igor_A_Intel</dc:creator>
      <dc:date>2013-09-25T06:45:24Z</dc:date>
    </item>
    <item>
      <title>Igor, thanks for the response!</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935257#M17112</link>
      <description>&lt;P&gt;Igor, thanks for the response!&lt;/P&gt;
&lt;P&gt;I'm going to use IPP Async, but I want to achieve an additional speedup from data locality, e.g. execute the whole set of image processing operations through IPP Async on the first image tile (not the GPU tile), then on the next tile. So the pipeline is: divide the image into slices and, for each slice, call the same sequence of hpp* image processing functions. Could this be more efficient than the regular pipeline, which passes the whole image through the sequence of hpp* functions?&lt;/P&gt;</description>
      <pubDate>Wed, 25 Sep 2013 08:54:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935257#M17112</guid>
      <dc:creator>Daniil_Osokin</dc:creator>
      <dc:date>2013-09-25T08:54:49Z</dc:date>
    </item>
    <item>
      <title>It depends on image size -</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935258#M17113</link>
      <description>&lt;P&gt;It depends on the image size. Gen video memory can hold a surface of at most 8Kx8K bytes, so if your image has larger dimensions, it is better to apply tiling on the application side. Tiles that are too small lead to a correspondingly large number of enqueues (== number of tiles), and each enqueue adds a huge constant overhead to processing because it goes through the video driver. The Async library has some internal logic/analyzer that will be extended in the future into a full graph analyzer like DMIP. So the "slices" approach can be effective for classic sync IPP, while for Async it's better to use the regular pipeline if the image fits in 8Kx8K.&lt;/P&gt;
&lt;P&gt;regards, Igor&lt;/P&gt;</description>
      <pubDate>Thu, 26 Sep 2013 09:25:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935258#M17113</guid>
      <dc:creator>Igor_A_Intel</dc:creator>
      <dc:date>2013-09-26T09:25:40Z</dc:date>
    </item>
    <item>
      <title>Thank you Igor,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935259#M17114</link>
      <description>&lt;P&gt;Thank you Igor,&lt;/P&gt;
&lt;P&gt;now it's much clearer. It was extremely kind of you!&lt;/P&gt;</description>
      <pubDate>Thu, 26 Sep 2013 12:27:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Asynchronous-C-C-GPU-optimization/m-p/935259#M17114</guid>
      <dc:creator>Daniil_Osokin</dc:creator>
      <dc:date>2013-09-26T12:27:30Z</dc:date>
    </item>
  </channel>
</rss>

