<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic If each thread is in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035399#M43704</link>
    <description>&lt;P&gt;If each thread is initializing its own data page, the sequential effect during thread ramp-up may limit effective parallel scaling.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Transparent huge pages may automatically capture much of the benefit of huge pages.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;MIC has hardware provision for medium pages, but somehow they didn't work out.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Sat, 01 Aug 2015 16:54:52 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2015-08-01T16:54:52Z</dc:date>
    <item>
      <title>offload large overhead</title>
      <link>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035395#M43700</link>
      <description>&lt;P&gt;Dear forum,&lt;/P&gt;

&lt;P&gt;I have run into some problems when using the offload pragma.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Here is the pseudo-code.&lt;/SPAN&gt;&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;AllocateMemMIC(data); // nocopy(data:length(numElement) ALLOC RETAIN)
AllocateMemMIC(result);
for(int i = 0; i &amp;lt; 4; ++i)
{
    UpdateData(data); // change data content
    
    MemcpyHostToMIC(data);

    OffloadCompute(data, result); // in(data:length(0) REUSE RETAIN), in(result:length(0) REUSE RETAIN)

    MemcpyMICToHost(result);

    UpdateResult(result); // accumulate result
}
FreeMemMIC(data);
FreeMemMIC(result);&lt;/PRE&gt;

&lt;P&gt;I used OFFLOAD_REPORT for the timing information and OFFLOAD_INIT=on_start to initialize the device before main() runs. Two things look strange:&lt;/P&gt;

&lt;P&gt;1. AllocateMemMIC() appears extremely slow: allocating 2 GB of memory takes ~7 seconds.&lt;/P&gt;

&lt;P&gt;2. OffloadCompute() in loop iteration 0 appears extremely slow, taking ~14 seconds, while the same call in the subsequent iterations 1, 2, and 3 takes only ~0.8 seconds.&lt;/P&gt;

&lt;P&gt;Could anyone give me some hints? Many thanks.&lt;/P&gt;

&lt;P&gt;BTW:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;MIC model --- 5110P&lt;/LI&gt;
	&lt;LI&gt;compiler --- icpc 15.0.3&lt;/LI&gt;
	&lt;LI&gt;MPSS --- 3.5.1&lt;/LI&gt;
&lt;/UL&gt;

</description>
      <pubDate>Fri, 31 Jul 2015 21:12:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035395#M43700</guid>
      <dc:creator>King_Crimson</dc:creator>
      <dc:date>2015-07-31T21:12:02Z</dc:date>
    </item>
    <item>
      <title>The Linux process in the MIC,</title>
      <link>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035396#M43701</link>
      <description>&lt;P&gt;The Linux process on the MIC, upon start, has a heap whose addresses are assigned but not yet mapped to physical RAM. Apart from the allocated node header (which the heap manager touches), the mapping does not occur at allocation time. Rather, it occurs at run time, the first time you touch the memory, at page granularity. Each "first touch" of a page causes a page fault into the Linux O/S, which then maps the address to RAM (it may wipe the RAM too) and returns to the app. This repeats for each page in the allocation, which would likely account for both the slow allocation and the slow first offload.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 31 Jul 2015 22:07:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035396#M43701</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2015-07-31T22:07:29Z</dc:date>
    </item>
    <item>
      <title>Lots of thanks for your</title>
      <link>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035397#M43702</link>
      <description>&lt;P&gt;Many thanks for your insight. That explains why using large page size helps.&lt;/P&gt;</description>
      <pubDate>Sat, 01 Aug 2015 01:17:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035397#M43702</guid>
      <dc:creator>King_Crimson</dc:creator>
      <dc:date>2015-08-01T01:17:23Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;That explains why using</title>
      <link>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035398#M43703</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;That explains why using large page size helps.&lt;/P&gt;

&lt;P&gt;Sometimes it doesn't. The cache system not only has a capacity in KB (a multiple of the cache line size); there is also a restriction on the number of different pages whose translations can be held in the TLB. Using a large page size &lt;EM&gt;&lt;STRONG&gt;may&lt;/STRONG&gt;&lt;/EM&gt; reduce the number of different pages that can be mapped at once (at each TLB level). Therefore, while this may speed up "first touch" initialization, it may also slow down the application later on. Each application may have different page size requirements. You can find this out by testing.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sat, 01 Aug 2015 16:04:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035398#M43703</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2015-08-01T16:04:03Z</dc:date>
    </item>
    <item>
      <title>If each thread is</title>
      <link>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035399#M43704</link>
      <description>&lt;P&gt;If each thread is initializing its own data page, the sequential effect during thread ramp-up may limit effective parallel scaling.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Transparent huge pages may automatically capture much of the benefit of huge pages.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;MIC has hardware provision for medium pages, but somehow they didn't work out.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 01 Aug 2015 16:54:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/offload-large-overhead/m-p/1035399#M43704</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-08-01T16:54:52Z</dc:date>
    </item>
  </channel>
</rss>

