<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Understanding Performance of Modified Vector Add Example in Intel® oneAPI DPC++/C++ Compiler</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Understanding-Performance-of-Modified-Vector-Add-Example/m-p/1252375#M938</link>
    <description>&lt;P&gt;Hi Austin,&lt;/P&gt;&lt;P&gt;We are escalating this thread to the Subject Matter Expert (SME), who will guide you further.&lt;/P&gt;&lt;P&gt;Have a good day!&lt;/P&gt;&lt;P&gt;&lt;I&gt;Thanks &amp;amp; Regards,&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;Goutham&lt;/I&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 02 Feb 2021 12:57:47 GMT</pubDate>
    <dc:creator>GouthamK_Intel</dc:creator>
    <dc:date>2021-02-02T12:57:47Z</dc:date>
    <item>
      <title>Understanding Performance of Modified Vector Add Example</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Understanding-Performance-of-Modified-Vector-Add-Example/m-p/1252069#M933</link>
      <description>&lt;P&gt;I'd like to understand the excess overhead I'm measuring when submitting a SYCL command group in the attached modified vector add example. &amp;nbsp;Some key points:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Vector add works on 100M complex elements and writes to output vector sum&lt;/LI&gt;
&lt;LI&gt;Target: CPU with 2 cores&lt;/LI&gt;
&lt;LI&gt;Profiling the overall command group with a wall clock, and using SYCL event profiling to track submit time and execution time&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;I'm finding that the overall execution time is about 7 times longer than the execution + submit time. &amp;nbsp;I'm curious to know the source of the additional overhead when submitting the command group.&lt;/P&gt;
&lt;P&gt;Even when the kernel has no work (the body of the parallel_for is commented out) and array_size is set to 1, the overall overhead is still half a second (much larger than the kernel submit time or execution time).&lt;/P&gt;
&lt;P&gt;Note: I'm using the latest Intel oneAPI DPC++ Compiler included in the Basekit_p_2021.1.0.2659 release.&lt;/P&gt;
&lt;P&gt;Thanks for your help.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;#include &amp;lt;CL/sycl.hpp&amp;gt;
#include &amp;lt;chrono&amp;gt;
#include &amp;lt;complex&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;vector&amp;gt;

constexpr sycl::access::mode dp_read = sycl::access::mode::read;
constexpr sycl::access::mode dp_write = sycl::access::mode::write;

constexpr size_t array_size = 100000000;
typedef std::vector&amp;lt;std::complex&amp;lt;float&amp;gt;&amp;gt; IntArray;

//************************************
// Initialize the array: element i is set to (i, -i)
//************************************
void initialize_array(IntArray &amp;amp;a) {
  for (size_t i = 0; i &amp;lt; a.size(); i++) {
    float x = (float)i;
    float y = -x;
    a[i] = std::complex&amp;lt;float&amp;gt;(x, y);
  }
}

//************************************
// Compute vector addition in DPC++ on device: the sum of the data is returned
// in the 3rd parameter "sum_parallel"
//************************************
void VectorAddInDPCPP(const IntArray &amp;amp;addend_1, const IntArray &amp;amp;addend_2,
                      IntArray &amp;amp;sum_parallel) {

  auto property_list = sycl::property_list{sycl::property::queue::enable_profiling()};
  sycl::queue q = sycl::queue(sycl::cpu_selector{}, property_list);

  // print out the device information used for the kernel code
  std::cout &amp;lt;&amp;lt; "Device: " &amp;lt;&amp;lt; q.get_device().get_info&amp;lt;sycl::info::device::name&amp;gt;()
            &amp;lt;&amp;lt; std::endl;

  std::cout &amp;lt;&amp;lt; "Compute Units: "
            &amp;lt;&amp;lt; q.get_device().get_info&amp;lt;sycl::info::device::max_compute_units&amp;gt;()
            &amp;lt;&amp;lt; std::endl;

  // create the range object for the arrays managed by the buffers
  sycl::range&amp;lt;1&amp;gt; num_items{array_size};

  sycl::buffer&amp;lt;std::complex&amp;lt;float&amp;gt;, 1&amp;gt; addend_1_buf(addend_1.data(), num_items);
  sycl::buffer&amp;lt;std::complex&amp;lt;float&amp;gt;, 1&amp;gt; addend_2_buf(addend_2.data(), num_items);
  sycl::buffer&amp;lt;std::complex&amp;lt;float&amp;gt;, 1&amp;gt; sum_buf(sum_parallel.data(), num_items);

  auto start_overall = std::chrono::system_clock::now();

  // submit a command group to the queue via a lambda function that
  // contains the data access permissions and device computation (kernel)
  auto event = q.submit([&amp;amp;](sycl::handler &amp;amp;h) {
    auto addend_1_accessor = addend_1_buf.get_access&amp;lt;dp_read&amp;gt;(h);
    auto addend_2_accessor = addend_2_buf.get_access&amp;lt;dp_read&amp;gt;(h);
    auto sum_accessor = sum_buf.get_access&amp;lt;dp_write&amp;gt;(h);

    h.parallel_for(num_items, [=](sycl::id&amp;lt;1&amp;gt; i) {
      float real = addend_1_accessor[i].real() + addend_2_accessor[i].real();
      float imag = addend_1_accessor[i].imag() + addend_2_accessor[i].imag();
      sum_accessor[i] = std::complex&amp;lt;float&amp;gt;(real, imag);
    });
  });

  event.wait();
  auto end_overall = std::chrono::system_clock::now();
  auto submit_time = event.get_profiling_info&amp;lt;sycl::info::event_profiling::command_submit&amp;gt;();
  auto start_time = event.get_profiling_info&amp;lt;sycl::info::event_profiling::command_start&amp;gt;();
  auto end_time = event.get_profiling_info&amp;lt;sycl::info::event_profiling::command_end&amp;gt;();

  auto submission_time = (start_time - submit_time) / 1000000.0f;
  std::cout &amp;lt;&amp;lt; "Submit Time: " &amp;lt;&amp;lt; submission_time &amp;lt;&amp;lt; " ms" &amp;lt;&amp;lt; std::endl;

  auto execution_time = (end_time - start_time) / 1000000.0f;
  std::cout &amp;lt;&amp;lt; "Execution Time: " &amp;lt;&amp;lt; execution_time &amp;lt;&amp;lt; " ms" &amp;lt;&amp;lt; std::endl;

  auto execution_overall = std::chrono::duration_cast&amp;lt;std::chrono::milliseconds&amp;gt;(end_overall - start_overall);
  std::cout &amp;lt;&amp;lt; "Overall Execution Time: " &amp;lt;&amp;lt; execution_overall.count() &amp;lt;&amp;lt; " ms" &amp;lt;&amp;lt; std::endl;
}

//************************************
// Demonstrate summation of arrays both in scalar on CPU and parallel on device
//************************************
int main() {
  // Vector Add using SYCL
  IntArray addend_1(array_size);
  IntArray addend_2(array_size);
  IntArray sum_parallel(array_size);

  initialize_array(addend_1);
  initialize_array(addend_2);

  VectorAddInDPCPP(addend_1, addend_2, sum_parallel);

  // Vector Add on host, single threaded
  IntArray sum_scalar(array_size);

  for (size_t i = 0; i &amp;lt; sum_scalar.size(); i++) {
    float real = addend_1[i].real() + addend_2[i].real();
    float imag = addend_1[i].imag() + addend_2[i].imag();
    sum_scalar[i] = std::complex&amp;lt;float&amp;gt;(real, imag);
  }

  // Verify both sum arrays are equal
  for (size_t i = 0; i &amp;lt; sum_parallel.size(); i++) {
    if (sum_parallel[i] != sum_scalar[i]) {
      std::cout &amp;lt;&amp;lt; "i = " &amp;lt;&amp;lt; i &amp;lt;&amp;lt; ", sum_parallel[i].real() = " &amp;lt;&amp;lt; sum_parallel[i].real() &amp;lt;&amp;lt; std::endl;
      std::cout &amp;lt;&amp;lt; "i = " &amp;lt;&amp;lt; i &amp;lt;&amp;lt; ", sum_scalar[i].real() = " &amp;lt;&amp;lt; sum_scalar[i].real() &amp;lt;&amp;lt; std::endl;
      std::cout &amp;lt;&amp;lt; "fail" &amp;lt;&amp;lt; std::endl;
      return -1;
    }
  }
  std::cout &amp;lt;&amp;lt; "success" &amp;lt;&amp;lt; std::endl;

  return 0;
}&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 01 Feb 2021 20:55:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Understanding-Performance-of-Modified-Vector-Add-Example/m-p/1252069#M933</guid>
      <dc:creator>Austin1</dc:creator>
      <dc:date>2021-02-01T20:55:35Z</dc:date>
    </item>
    <item>
      <title>Re: Understanding Performance of Modified Vector Add Example</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Understanding-Performance-of-Modified-Vector-Add-Example/m-p/1252375#M938</link>
      <description>&lt;P&gt;Hi Austin,&lt;/P&gt;&lt;P&gt;We are escalating this thread to the Subject Matter Expert (SME), who will guide you further.&lt;/P&gt;&lt;P&gt;Have a good day!&lt;/P&gt;&lt;P&gt;&lt;I&gt;Thanks &amp;amp; Regards,&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;Goutham&lt;/I&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 02 Feb 2021 12:57:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Understanding-Performance-of-Modified-Vector-Add-Example/m-p/1252375#M938</guid>
      <dc:creator>GouthamK_Intel</dc:creator>
      <dc:date>2021-02-02T12:57:47Z</dc:date>
    </item>
    <item>
      <title>Re: Understanding Performance of Modified Vector Add Example</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Understanding-Performance-of-Modified-Vector-Add-Example/m-p/1255713#M955</link>
      <description>&lt;P&gt;This is an interesting finding.&lt;/P&gt;&lt;P&gt;There are a couple of important sources of overhead for each kernel:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;JIT compilation of the kernel: this happens once per kernel&lt;/LI&gt;&lt;LI&gt;Data transfer between CPU memory &amp;amp; GPU memory&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Neither is counted in the submit time or the execution time, but the overall wall-clock measurement includes both, so it appears slower.&lt;/P&gt;&lt;P&gt;There is an open-source tool, &lt;A href="https://github.com/intel/pti-gpu/tree/master/samples/ze_tracer" rel="noopener noreferrer" target="_blank"&gt;https://github.com/intel/pti-gpu/tree/master/samples/ze_tracer&lt;/A&gt;, that can show more detail on where the time is spent. It may help.&lt;/P&gt;&lt;P&gt;Hope this answers your questions.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Jennifer&lt;/P&gt;</description>
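One practical way to separate the one-time JIT cost described in this reply from the steady-state kernel time is a warm-up run: execute the kernel once to trigger compilation, then time the later submissions. A minimal sketch of that timing pattern, in plain Python rather than DPC++; the `workload` callable is a hypothetical stand-in for a queue submission plus wait and is not part of the original thread:

```python
import time

def timed_runs(workload, n=4):
    """Time n successive calls to workload.

    The first call typically pays one-time setup costs (for a SYCL
    kernel: JIT compilation); later calls show the steady-state cost.
    """
    durations = []
    for _ in range(n):
        t0 = time.perf_counter()
        workload()  # stand-in for q.submit(...) followed by event.wait()
        durations.append(time.perf_counter() - t0)
    return durations

# Placeholder workload standing in for a kernel submission and wait.
runs = timed_runs(lambda: sum(x * x for x in range(100000)))
print("first run: %.3f ms" % (runs[0] * 1000.0))
print("later runs: %s" % [round(d * 1000.0, 3) for d in runs[1:]])
```

Comparing the first duration against the later ones gives a rough estimate of the one-time setup cost; the same pattern applies to the thread's example by calling VectorAddInDPCPP (or just its submit/wait pair) more than once.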
      <pubDate>Fri, 12 Feb 2021 21:22:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Understanding-Performance-of-Modified-Vector-Add-Example/m-p/1255713#M955</guid>
      <dc:creator>JenniferJ</dc:creator>
      <dc:date>2021-02-12T21:22:03Z</dc:date>
    </item>
  </channel>
</rss>

