<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Long overhead for initial use of multiple queues in Intel® oneAPI DPC++/C++ Compiler</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1314875#M1556</link>
    <description>&lt;P&gt;Yes, I understand how to make it re-do the JIT compilation overhead each pass by creating new queues, so taking a long time on all measurements.&amp;nbsp;&amp;nbsp; It isn't a very useful example, since the objective is, in general, to reduce the execution time.&lt;/P&gt;
&lt;P&gt;The real problem here is that the JIT compilation time appears to increase linearly for each added queue.&amp;nbsp; For example, if I change N_QUEUES from 100 to 1000, the first-pass execution time goes up as follows.&lt;/P&gt;
&lt;P&gt;for cpu:&lt;/P&gt;
&lt;P&gt;N_QUEUES=1000, N=100&lt;BR /&gt;xpu-time :123.282&lt;BR /&gt;xpu-time :0.0236045&lt;BR /&gt;xpu-time :0.0236197&lt;/P&gt;
&lt;P&gt;for gpu:&lt;/P&gt;
&lt;P&gt;N_QUEUES=1000, N=100&lt;BR /&gt;xpu-time :147.845&lt;BR /&gt;xpu-time :0.931922&lt;BR /&gt;xpu-time :0.954865&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, from the same example, with N_QUEUES=1000 and using cpu_selector, exit from program takes over a minute.&lt;/P&gt;
&lt;P&gt;It exits immediately when using gpu_selector.&amp;nbsp;&amp;nbsp;&amp;nbsp; This is a different issue, but you already have the code.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;On the JIT compilation overhead ... why should there be the linear overhead for every queue/kernel?&amp;nbsp; Isn't this a good candidate for doing multiple compilations in parallel?&lt;/P&gt;</description>
    <pubDate>Wed, 15 Sep 2021 19:12:10 GMT</pubDate>
    <dc:creator>JNorw</dc:creator>
    <dc:date>2021-09-15T19:12:10Z</dc:date>
    <item>
      <title>Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1313720#M1546</link>
      <description>&lt;P&gt;I'm setting up 100 queues, each with a shared-memory array of 1k floats. Each queue runs a kernel with a simple parallel_for that does a single multiply and assign to fill its array, and I then wait on each of the returned events.&lt;/P&gt;
&lt;P&gt;I time three passes.&lt;/P&gt;
&lt;P&gt;on cpu:&lt;/P&gt;
&lt;P&gt;N_QUEUES=100, N=1000&lt;BR /&gt;xpu-time :1.26718&lt;BR /&gt;xpu-time :0.00582737&lt;BR /&gt;xpu-time :0.00395401&lt;BR /&gt;Passed&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;on GPU:&lt;/P&gt;
&lt;P&gt;N_QUEUES=100, N=1000&lt;BR /&gt;xpu-time :10.3693&lt;BR /&gt;xpu-time :0.0118774&lt;BR /&gt;xpu-time :0.0101834&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In both cases, the arrays are filled with expected values, but it seems to me that the time for the first execution is extremely long.&amp;nbsp; I'm wondering if this is some known start-up overhead for tbb, since the book examples usually have some warm-up pass that excludes measuring the initial task.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Assuming that is the issue, is there some method of initializing tbb for a large number of tasks rather than adding them one at a time?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm substituting this for the fig_1_1_hello in the dpc++ book, and build with the makefile created by its cmake on ubuntu 20.04 linux, using the current docker distribution for oneapi.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;After thinking about this some more, is this perhaps the JIT compilation time being added for each queue?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;After searching for JIT issues, I see that this is a known issue of around 140ms per kernel.&amp;nbsp; I changed my queue count to 1000 and saw initial pass go up to around 140 secs on GPU.&amp;nbsp; The proposed solutions are to use Ahead of Time Compilation.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I also noted, during further testing, that there is a very long exit time from the app, so there must be some associated clean-up for multiple queues.&amp;nbsp; Is there some uninstall of the kernel executables?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'll attach my example code.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 11 Sep 2021 00:12:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1313720#M1546</guid>
      <dc:creator>JNorw</dc:creator>
      <dc:date>2021-09-11T00:12:09Z</dc:date>
    </item>
    <item>
      <title>Re:Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1314138#M1548</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for reaching out to us.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thank you for providing the reproducible code. We are looking into it and we will get back to you soon.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Santosh&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 13 Sep 2021 12:32:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1314138#M1548</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-09-13T12:32:21Z</dc:date>
    </item>
    <item>
      <title>Re: Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1314404#M1550</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We looked into your code. The 100 queues created in the 1st iteration are reused in the 2nd &amp;amp; 3rd iterations without being freed at the end of the 1st iteration. Because queue initialization takes time, the 1st iteration takes longer than the 2nd &amp;amp; 3rd iterations, which reuse the already-initialized queues.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We have attached sample code in which we free the queues at the end of each iteration. Each iteration then takes a similar time to execute, as shown in the screenshot below.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="SantoshY_Intel_0-1631614419595.png" style="width: 559px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/19369iA22FE15625741DAA/image-dimensions/559x175?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="559" height="175" role="button" title="SantoshY_Intel_0-1631614419595.png" alt="SantoshY_Intel_0-1631614419595.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;
&lt;P&gt;Santosh&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 14 Sep 2021 10:20:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1314404#M1550</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-09-14T10:20:54Z</dc:date>
    </item>
    <item>
      <title>Re: Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1314875#M1556</link>
      <description>&lt;P&gt;Yes, I understand how to make it re-do the JIT compilation overhead each pass by creating new queues, so taking a long time on all measurements.&amp;nbsp;&amp;nbsp; It isn't a very useful example, since the objective is, in general, to reduce the execution time.&lt;/P&gt;
&lt;P&gt;The real problem here is that the JIT compilation time appears to increase linearly for each added queue.&amp;nbsp; For example, if I change N_QUEUES from 100 to 1000, the first-pass execution time goes up as follows.&lt;/P&gt;
&lt;P&gt;for cpu:&lt;/P&gt;
&lt;P&gt;N_QUEUES=1000, N=100&lt;BR /&gt;xpu-time :123.282&lt;BR /&gt;xpu-time :0.0236045&lt;BR /&gt;xpu-time :0.0236197&lt;/P&gt;
&lt;P&gt;for gpu:&lt;/P&gt;
&lt;P&gt;N_QUEUES=1000, N=100&lt;BR /&gt;xpu-time :147.845&lt;BR /&gt;xpu-time :0.931922&lt;BR /&gt;xpu-time :0.954865&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
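&lt;P&gt;As a rough cross-check, the one-time cost can be divided by the queue count. The Python sketch below is only back-of-the-envelope arithmetic on the timings reported in this thread. The helper name is hypothetical, and it assumes the second pass is a fair estimate of steady-state time. Note that the 100-queue runs used N=1000 while the 1000-queue runs used N=100, so the rows are not strictly comparable.&lt;/P&gt;
&lt;PRE&gt;
```python
# First-pass and second-pass timings (seconds) copied from this thread.
# Note: the 100-queue runs used N=1000 and the 1000-queue runs used N=100.
runs = [
    ("cpu", 100, 1.26718, 0.00582737),
    ("cpu", 1000, 123.282, 0.0236045),
    ("gpu", 100, 10.3693, 0.0118774),
    ("gpu", 1000, 147.845, 0.931922),
]


def per_queue_overhead_ms(first_pass, steady_pass, n_queues):
    """One-time cost per queue in milliseconds (hypothetical helper)."""
    return (first_pass - steady_pass) / n_queues * 1000.0


overheads = {}
for device, n_queues, first_pass, steady_pass in runs:
    ms = per_queue_overhead_ms(first_pass, steady_pass, n_queues)
    overheads[(device, n_queues)] = ms
    print(f"{device} N_QUEUES={n_queues}: {ms:.1f} ms per queue")
```
&lt;/PRE&gt;
&lt;P&gt;On these numbers the 1000-queue GPU run works out to about 147 ms per queue, in line with the roughly 140 ms per kernel mentioned earlier in the thread.&lt;/P&gt;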
&lt;P&gt;Also, from the same example, with N_QUEUES=1000 and using cpu_selector, exit from program takes over a minute.&lt;/P&gt;
&lt;P&gt;It exits immediately when using gpu_selector.&amp;nbsp;&amp;nbsp;&amp;nbsp; This is a different issue, but you already have the code.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;On the JIT compilation overhead ... why should there be the linear overhead for every queue/kernel?&amp;nbsp; Isn't this a good candidate for doing multiple compilations in parallel?&lt;/P&gt;</description>
      <pubDate>Wed, 15 Sep 2021 19:12:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1314875#M1556</guid>
      <dc:creator>JNorw</dc:creator>
      <dc:date>2021-09-15T19:12:10Z</dc:date>
    </item>
    <item>
      <title>Re:Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1316702#M1579</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;To understand your query better, could you please tell us the use case or the intention behind creating 1000 queues and doing the same task with each queue?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Santosh&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 23 Sep 2021 10:58:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1316702#M1579</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-09-23T10:58:23Z</dc:date>
    </item>
    <item>
      <title>Re: Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1318731#M1600</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We haven't heard back from you. Could you please tell us the use case or the intention behind creating 1000 queues and doing the same task with each queue?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;/P&gt;
&lt;P&gt;Santosh&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 02 Oct 2021 04:58:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1318731#M1600</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-10-02T04:58:41Z</dc:date>
    </item>
    <item>
      <title>Re: Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1318747#M1601</link>
      <description>&lt;P&gt;I don't have a specific use case that requires that many queues.&lt;/P&gt;
&lt;P&gt;I suspect FPGA tasks with kernel-to-kernel pipes could require explicit locking of kernel tasks to hardware. Perhaps that would be a use case.&lt;/P&gt;
&lt;P&gt;I see the taskflow project referencing apps that require millions of tasks.&amp;nbsp; Perhaps their application could provide a use case. See &lt;A href="https://taskflow.github.io/" target="_blank"&gt;https://taskflow.github.io/&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;This Intel video, &lt;A href="https://youtu.be/p7HWSciMAms?t=994" target="_blank"&gt;https://youtu.be/p7HWSciMAms?t=994&lt;/A&gt;, suggests using multiple in-order queues as a way to obtain more parallelism.&amp;nbsp; Perhaps there is a use case there.&lt;/P&gt;
&lt;P&gt;I would guess that neural net execution with a high batch size could specify a queue per batch index.&amp;nbsp; For example, batch sizes of 256 are used in the Intel Ponte Vecchio resnet configurations at &lt;A href="https://edc.intel.com/content/www/us/en/products/performance/benchmarks/architecture-day-2021/?r=1849242047" target="_blank"&gt;https://edc.intel.com/content/www/us/en/products/performance/benchmarks/architecture-day-2021/?r=1849242047&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 02 Oct 2021 07:11:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1318747#M1601</guid>
      <dc:creator>JNorw</dc:creator>
      <dc:date>2021-10-02T07:11:53Z</dc:date>
    </item>
    <item>
      <title>Re: Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1318751#M1602</link>
      <description>&lt;P&gt;This resnet-50 description shows large numbers of filter channels ... 1024 and 2048 in conv4 and conv5 layers.&amp;nbsp; Assuming you are trying to lock specific filter channels to specific cores, so the specific filter parameters stay in specific core caches,&amp;nbsp; this could provide a use case that could benefit from specifying a large number of queues.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.researchgate.net/figure/ResNet-50-architecture-26-shown-with-the-residual-units-the-size-of-the-filters-and_fig1_338603223" target="_blank" rel="noopener"&gt;https://www.researchgate.net/figure/ResNet-50-architecture-26-shown-with-the-residual-units-the-size-of-the-filters-and_fig1_338603223&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 02 Oct 2021 07:43:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1318751#M1602</guid>
      <dc:creator>JNorw</dc:creator>
      <dc:date>2021-10-02T07:43:31Z</dc:date>
    </item>
    <item>
      <title>Re:Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1323098#M1646</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We are analyzing your issue and we will get back to you soon.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; regards,&lt;/P&gt;&lt;P&gt;Santosh&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 19 Oct 2021 11:55:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1323098#M1646</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-10-19T11:55:49Z</dc:date>
    </item>
    <item>
      <title>Re: Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1323985#M1653</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We analyzed your code and attached the log files regarding the API timing results for your reference.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For Q=100:&lt;/P&gt;
&lt;TABLE&gt;
&lt;TR&gt;&lt;TH&gt;Function&lt;/TH&gt;&lt;TH&gt;Calls&lt;/TH&gt;&lt;TH&gt;Time (%)&lt;/TH&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;zeModuleCreate&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;100&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;97.22&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;zeCommandQueueExecuteCommandLists&lt;/TD&gt;&lt;TD&gt;300&lt;/TD&gt;&lt;TD&gt;1.98&lt;/TD&gt;&lt;/TR&gt;
&lt;/TABLE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For Q=1000:&lt;/P&gt;
&lt;TABLE&gt;
&lt;TR&gt;&lt;TH&gt;Function&lt;/TH&gt;&lt;TH&gt;Calls&lt;/TH&gt;&lt;TH&gt;Time (%)&lt;/TH&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;zeModuleCreate&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;1000&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;94.76&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;zeCommandQueueExecuteCommandLists&lt;/TD&gt;&lt;TD&gt;3000&lt;/TD&gt;&lt;TD&gt;4.39&lt;/TD&gt;&lt;/TR&gt;
&lt;/TABLE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;From the above analysis, we can see that module creation was done only Q times (once per queue), not 3xQ times, and that it accounts for most of the time (97.22% for Q=100, 94.76% for Q=1000). As a result, there is a long overhead for the initial use of multiple queues, but not in the successive iterations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
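&lt;P&gt;The call counts and time shares above can be turned into a rough per-call comparison. The Python sketch below uses only the figures from the profile in this post. The helper name is hypothetical, and the result is a ratio of time shares per call, not an absolute timing.&lt;/P&gt;
&lt;PRE&gt;
```python
# (calls, percent of total API time) copied from the profile in this post.
profile = {
    100: {
        "zeModuleCreate": (100, 97.22),
        "zeCommandQueueExecuteCommandLists": (300, 1.98),
    },
    1000: {
        "zeModuleCreate": (1000, 94.76),
        "zeCommandQueueExecuteCommandLists": (3000, 4.39),
    },
}


def relative_cost_per_call(entry_a, entry_b):
    """Ratio of time share per call between two entries (hypothetical helper)."""
    calls_a, share_a = entry_a
    calls_b, share_b = entry_b
    return (share_a / calls_a) / (share_b / calls_b)


ratios = {}
passes = {}
for n_queues, rows in profile.items():
    create = rows["zeModuleCreate"]
    execute = rows["zeCommandQueueExecuteCommandLists"]
    # Module creation happens once per queue; execution once per queue per pass.
    passes[n_queues] = execute[0] // create[0]
    ratios[n_queues] = relative_cost_per_call(create, execute)
    print(f"Q={n_queues}: {passes[n_queues]} passes, one zeModuleCreate call "
          f"takes about {ratios[n_queues]:.0f}x the share of one execute call")
```
&lt;/PRE&gt;
&lt;P&gt;In both profiles the execute call count is three times the module-creation count, which matches three timed passes over the same queues.&lt;/P&gt;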
&lt;P&gt;Because the modules are created one after another, we can expect the time to increase linearly as the number of queues increases from 100 to 1000.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;
&lt;P&gt;Santosh&lt;/P&gt;</description>
      <pubDate>Mon, 08 Nov 2021 09:18:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1323985#M1653</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-11-08T09:18:06Z</dc:date>
    </item>
    <item>
      <title>Re: Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1325792#M1662</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We haven't heard back from you. Could you please provide us with an update on your issue?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;
&lt;P&gt;Santosh&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 08 Nov 2021 09:17:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1325792#M1662</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-11-08T09:17:21Z</dc:date>
    </item>
    <item>
      <title>Re:Long overhead for initial use of multiple queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1327517#M1673</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Santosh&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 05 Nov 2021 11:46:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Long-overhead-for-initial-use-of-multiple-queues/m-p/1327517#M1673</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-11-05T11:46:09Z</dc:date>
    </item>
  </channel>
</rss>

