<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Performance difference between 32bit and 64bit memcpy in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856706#M2111</link>
    <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/99850"&gt;jimdempseyatthecove&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; Was the CPUID checking performed in intel_fast_memcpy on every call (not a call once)?&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;No, it's just the first memcpy which contains a lot of that sort of stuff. I suspect if I'd called some other CRT function first I'd have seen it there instead.&lt;BR /&gt;&lt;BR /&gt;Subsequent calls to memcpy get to the "meat" much quicker; about 20-30 instructions to hit the main loop, involving tests of stored values __intel_cpu_indicator, __intel_memcpy_mem_ops_method and __intel_memcpy_largest_cache_size (which were presumably all set up by the first call). &lt;BR /&gt;&lt;BR /&gt;</description>
    <pubDate>Thu, 12 Feb 2009 09:46:17 GMT</pubDate>
    <dc:creator>Tim_Day</dc:creator>
    <dc:date>2009-02-12T09:46:17Z</dc:date>
    <item>
      <title>Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856699#M2104</link>
      <description>We have Core2 machines (Dell T5400) with XP64.&lt;BR /&gt;&lt;BR /&gt;We observe that when running 32-bit processes, the throughput of memcpy is on the order of 1.2GByte/s; however memcpy in a 64-bit process achieves about 2.2GByte/s (or in fact 2.4GByte/s with the Intel compiler CRT's memcpy).&lt;BR /&gt;&lt;BR /&gt;While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which uses 128-bit wide loads and stores regardless of the 32/64-bitness of the process) demonstrates identical upper limits on the copy bandwidth it achieves&lt;BR /&gt;&lt;BR /&gt;I'm puzzled as to the origin of this difference... Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM ? Is it something to do with TLBs or prefetchers or... what &lt;BR /&gt;&lt;BR /&gt;Thanks for any insight.&lt;BR /&gt;Tim&lt;BR /&gt;</description>
      <pubDate>Tue, 27 Jan 2009 14:50:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856699#M2104</guid>
      <dc:creator>Tim_Day</dc:creator>
      <dc:date>2009-01-27T14:50:32Z</dc:date>
    </item>
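The throughput figures quoted in the question can be sanity-checked with a quick micro-benchmark. The following is a minimal, hypothetical sketch (not from the thread) that times the C runtime's memmove via Python's ctypes; the buffer size and iteration count are illustrative choices, not the poster's 512 MByte setup.

```python
import ctypes, time

N = 16 * 1024 * 1024              # 16 MiB per copy; illustrative size
src = ctypes.create_string_buffer(b"x" * N, N)
dst = ctypes.create_string_buffer(N)

best = float("inf")
for _ in range(5):                # best-of-5 to reduce timing noise
    t0 = time.perf_counter()
    ctypes.memmove(dst, src, N)   # calls into the CRT's memmove/memcpy
    best = min(best, time.perf_counter() - t0)

print("throughput: %.2f GByte/s" % (N / best / 1e9))
```

Taking the best of several runs hides the one-off cost of the first call (page faults, and the CPUID/dispatch work discussed later in the thread), so the number reflects the steady-state copy loop.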
    <item>
      <title>Re: Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856700#M2105</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="margin-top: 5px; width: 100%;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/410300"&gt;Tim Day&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;We have Core2 machines (Dell T5400) with XP64.&lt;BR /&gt;&lt;BR /&gt;We observe that when running 32-bit processes, the throughput of memcpy is on the order of 1.2GByte/s; however memcpy in a 64-bit process achieves about 2.2GByte/s (or in fact 2.4GByte/s with the Intel compiler CRT's memcpy).&lt;BR /&gt;&lt;BR /&gt;While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which uses 128-bit wide loads and stores regardless of the 32/64-bitness of the process) demonstrates identical upper limits on the copy bandwidth it achieves&lt;BR /&gt;&lt;BR /&gt;I'm puzzled as to the origin of this difference... Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM ? Is it something to do with TLBs or prefetchers or... what &lt;BR /&gt;&lt;BR /&gt;Thanks for any insight.&lt;BR /&gt;Tim&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;You should expect results like those you obtain with your own code-- that you get identical upper bounds on the copy bandwidth.&lt;BR /&gt;&lt;BR /&gt;I would guess that the different numbers you got with 32 and 64 bit processes are likely to be due to their executing different code. To verify that, one would have to look at the specific implementations. But that is the most likely cause-- different code giving different performance.&lt;BR /&gt;&lt;BR /&gt;For a piece of code with a very simple description, there are a very large number of different ways to write a memcpy(). You can get many different performances, depending on how you structure your code and what sort of data you feed to that code. But by and large, the same peak bandwidth should be obtainable in either 32 or 64 bit mode.&lt;BR /&gt;</description>
      <pubDate>Mon, 02 Feb 2009 17:58:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856700#M2105</guid>
      <dc:creator>Seth_A_Intel</dc:creator>
      <dc:date>2009-02-02T17:58:10Z</dc:date>
    </item>
    <item>
      <title>Re: Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856701#M2106</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
Hi Tim,&lt;BR /&gt;&lt;BR /&gt;It looks like the 32-bit implementation of memcpy is far from optimal for the modern CPU you're using. It's probably still compatible with 386 CPUs, and not taking advantage of MMX or SSE. The 64-bit version knows the CPU supports everything up to at least SSE2, so it can achieve better bandwidth.&lt;BR /&gt;&lt;BR /&gt;In particular I believe the older 32-bit memcpy implementations make use of "rep movsd", which is microcoded on processors with out-of-order instruction execution. If I recall correctly it generates one micro-instruction per cycle, so unlike a regular copy loop it won't issue a load and a store in parallel. It might also impose limitations on prefetching and such.&lt;BR /&gt;&lt;BR /&gt;Anyway, what used to be fast on a 386 is no longer optimal today. So if performance is critical I'd certainly advise using your own memcpy implementation that makes use of all the processor's capabilities.&lt;BR /&gt;&lt;BR /&gt;Cheers,&lt;BR /&gt;&lt;BR /&gt;Nicolas</description>
      <pubDate>Tue, 03 Feb 2009 16:14:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856701#M2106</guid>
      <dc:creator>capens__nicolas</dc:creator>
      <dc:date>2009-02-03T16:14:58Z</dc:date>
    </item>
    <item>
      <title>Re: Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856702#M2107</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;BR /&gt;Nicolas,&lt;BR /&gt;&lt;BR /&gt;Thank you for checking the implementation of memcpy() on the 32 bit system. That was the most likely cause.&lt;BR /&gt;&lt;BR /&gt;However, please let me update you on the REP MOVSD instruction. It is indeed, micro-coded, and there are differences between exactly how it is implemented between different Intel CPUs. However, it can (and does) issue multiple micro-instructions per cycle, and can issue loads and stores at the same time on most Intel CPU products. There are substantial improvements in its performance over the years, with the newest products having the best implementations. It is possible to get better performance with your own home grown code if you do everything right, but it is becoming more challenging to do that as REP MOVSD improves in performance. The latest optimization guide has more information about the performance and usage of this instruction. (see &lt;A href="http://www.intel.com/products/processor/manuals/" target="_blank"&gt;http://www.intel.com/products/processor/manuals/&lt;/A&gt;, section 2.2.6)&lt;/DIV&gt;</description>
      <pubDate>Tue, 03 Feb 2009 19:45:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856702#M2107</guid>
      <dc:creator>Seth_A_Intel</dc:creator>
      <dc:date>2009-02-03T19:45:31Z</dc:date>
    </item>
    <item>
      <title>Re: Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856703#M2108</link>
      <description>&lt;DIV&gt;&lt;/DIV&gt;
Thanks Seth, I wasn't aware of the micro-code optimizations.</description>
      <pubDate>Wed, 04 Feb 2009 15:39:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856703#M2108</guid>
      <dc:creator>capens__nicolas</dc:creator>
      <dc:date>2009-02-04T15:39:40Z</dc:date>
    </item>
    <item>
      <title>Re: Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856704#M2109</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/336341"&gt;Seth Abraham (Intel)&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; &lt;BR /&gt;I would guess that the different numbers you gotwith32 and 64 bit process are likely to be due to their executing different code. To verify that, one would have to look at the specific implementations. But that is the most likely cause-- different code giving different performance.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Yes I've just been digging into this in more depth and finally got to the bottom of it.&lt;BR /&gt;&lt;BR /&gt;In the below, dst and src are 512 MByte std::vector&amp;lt;unsigned char&amp;gt;&lt;BR /&gt;I'm using the Intel 10.1.029 compiler and CRT on a Dell Precision T5400.&lt;BR /&gt;&lt;BR /&gt;On 64bit both&lt;BR /&gt; memcpy(&amp;amp;dst[0],&amp;amp;src[0],dst.size()) &lt;BR /&gt;and&lt;BR /&gt; memcpy(&amp;amp;dst[0],&amp;amp;src[0],N)  (where N is previously declared const size_t N=512*(1&amp;lt;&amp;lt;20);)&lt;BR /&gt;call &lt;BR /&gt; __intel_fast_memcpy&lt;BR /&gt;the bulk of which consists of:&lt;BR /&gt; 000000014004ED80  lea         rcx,[rcx+40h] &lt;BR /&gt; 000000014004ED84  lea         rdx,[rdx+40h] &lt;BR /&gt; 000000014004ED88  lea         r8,[r8-40h] &lt;BR /&gt; 000000014004ED8C  prefetchnta [rdx+180h] &lt;BR /&gt; 000000014004ED93  movdqu      xmm0,xmmword ptr [rdx-40h] &lt;BR /&gt; 000000014004ED98  movdqu      xmm1,xmmword ptr [rdx-30h] &lt;BR /&gt; 000000014004ED9D  cmp         r8,40h &lt;BR /&gt; 000000014004EDA1  movntdq     xmmword ptr [rcx-40h],xmm0 &lt;BR /&gt; 000000014004EDA6  movntdq     xmmword ptr [rcx-30h],xmm1 &lt;BR /&gt; 000000014004EDAB  movdqu      xmm2,xmmword ptr [rdx-20h] &lt;BR /&gt; 000000014004EDB0  movdqu      xmm3,xmmword ptr [rdx-10h] &lt;BR /&gt; 000000014004EDB5  movntdq     xmmword ptr [rcx-20h],xmm2 &lt;BR /&gt; 000000014004EDBA  movntdq     xmmword ptr [rcx-10h],xmm3 &lt;BR /&gt; 000000014004EDBF  jge         000000014004ED80 &lt;BR /&gt;and runs at ~2200 MByte/s.&lt;BR /&gt;&lt;BR /&gt;But on 32bit&lt;BR /&gt; memcpy(&amp;amp;dst[0],&amp;amp;src[0],dst.size()) &lt;BR /&gt;calls&lt;BR /&gt; __intel_fast_memcpy&lt;BR /&gt;the bulk of which consists of&lt;BR /&gt; 004447A0  sub         ecx,80h &lt;BR /&gt; 004447A6  movdqa      xmm0,xmmword ptr [esi] &lt;BR /&gt; 004447AA  movdqa      xmm1,xmmword ptr [esi+10h] &lt;BR /&gt; 004447AF  movdqa      xmmword ptr [edx],xmm0 &lt;BR /&gt; 004447B3  movdqa      
xmmword ptr [edx+10h],xmm1 &lt;BR /&gt; 004447B8  movdqa      xmm2,xmmword ptr [esi+20h] &lt;BR /&gt; 004447BD  movdqa      xmm3,xmmword ptr [esi+30h] &lt;BR /&gt; 004447C2  movdqa      xmmword ptr [edx+20h],xmm2 &lt;BR /&gt; 004447C7  movdqa      xmmword ptr [edx+30h],xmm3 &lt;BR /&gt; 004447CC  movdqa      xmm4,xmmword ptr [esi+40h] &lt;BR /&gt; 004447D1  movdqa      xmm5,xmmword ptr [esi+50h] &lt;BR /&gt; 004447D6  movdqa      xmmword ptr [edx+40h],xmm4 &lt;BR /&gt; 004447DB  movdqa      xmmword ptr [edx+50h],xmm5 &lt;BR /&gt; 004447E0  movdqa      xmm6,xmmword ptr [esi+60h] &lt;BR /&gt; 004447E5  movdqa      xmm7,xmmword ptr [esi+70h] &lt;BR /&gt; 004447EA  add         esi,80h &lt;BR /&gt; 004447F0  movdqa      xmmword ptr [edx+60h],xmm6 &lt;BR /&gt; 004447F5  movdqa      xmmword ptr [edx+70h],xmm7 &lt;BR /&gt; 004447FA  add         edx,80h &lt;BR /&gt; 00444800  cmp         ecx,80h &lt;BR /&gt; 00444806  jge         004447A0&lt;BR /&gt;and runs at ~1350 MByte/s only.&lt;BR /&gt;&lt;BR /&gt;HOWEVER,&lt;BR /&gt; memcpy(&amp;amp;dst[0],&amp;amp;src[0],N)  (where N is previously declared const size_t N=512*(1&amp;lt;&amp;lt;20);)&lt;BR /&gt;compiles (on 32bit) to a direct call to&lt;BR /&gt; __intel_VEC_memcpy&lt;BR /&gt;the bulk of which consists of&lt;BR /&gt; 0043FF40  movdqa      xmm0,xmmword ptr [esi] &lt;BR /&gt; 0043FF44  movdqa      xmm1,xmmword ptr [esi+10h] &lt;BR /&gt; 0043FF49  movdqa      xmm2,xmmword ptr [esi+20h] &lt;BR /&gt; 0043FF4E  movdqa      xmm3,xmmword ptr [esi+30h] &lt;BR /&gt; 0043FF53  movntdq     xmmword ptr [edi],xmm0 &lt;BR /&gt; 0043FF57  movntdq     xmmword ptr [edi+10h],xmm1 &lt;BR /&gt; 0043FF5C  movntdq     xmmword ptr [edi+20h],xmm2 &lt;BR /&gt; 0043FF61  movntdq     xmmword ptr [edi+30h],xmm3 &lt;BR /&gt; 0043FF66  movdqa      xmm4,xmmword ptr [esi+40h] &lt;BR /&gt; 0043FF6B  movdqa      xmm5,xmmword ptr [esi+50h] &lt;BR /&gt; 0043FF70  movdqa      xmm6,xmmword ptr [esi+60h] &lt;BR /&gt; 0043FF75  movdqa      xmm7,xmmword ptr 
[esi+70h] &lt;BR /&gt; 0043FF7A  movntdq     xmmword ptr [edi+40h],xmm4 &lt;BR /&gt; 0043FF7F  movntdq     xmmword ptr [edi+50h],xmm5 &lt;BR /&gt; 0043FF84  movntdq     xmmword ptr [edi+60h],xmm6 &lt;BR /&gt; 0043FF89  movntdq     xmmword ptr [edi+70h],xmm7 &lt;BR /&gt; 0043FF8E  lea         esi,[esi+80h] &lt;BR /&gt; 0043FF94  lea         edi,[edi+80h] &lt;BR /&gt; 0043FF9A  dec         ecx  &lt;BR /&gt; 0043FF9B  jne         ___intel_VEC_memcpy+244h (43FF40h) &lt;BR /&gt;and runs at ~2100 MByte/s.&lt;BR /&gt;&lt;BR /&gt;I withdraw the claim that my own memcpy-like SSE code suffers from a&lt;BR /&gt;similar ~1300 MByte bandwidth limit in 32bit builds; I now don't have&lt;BR /&gt;any problems getting &amp;gt;2GByte/s on 32 or 64bit; the trick (as the above&lt;BR /&gt;results hint) is to use non-temporal ("streaming") stores (e.g.&lt;BR /&gt;_mm_stream_ps intrinsic).&lt;BR /&gt;&lt;BR /&gt;It seems a bit strange that the 32bit "dst.size()"-invoked memcpy&lt;BR /&gt;doesn't eventually call the faster "movnt" version (if you step&lt;BR /&gt;into memcpy there is the most incredible amount of CPUID checking&lt;BR /&gt;and heuristic logic, e.g. comparing the number of bytes to be copied with&lt;BR /&gt;the cache size etc, before it goes anywhere near your actual data) but&lt;BR /&gt;at least I understand the observed behaviour now (and it's&lt;BR /&gt;down to simple code differences, not SysWow64 or H/W related&lt;BR /&gt;as previously suspected).  Arguably a bug in the Intel CRT, or maybe&lt;BR /&gt;there are good reasons for it being the way it is?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 11 Feb 2009 13:15:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856704#M2109</guid>
      <dc:creator>Tim_Day</dc:creator>
      <dc:date>2009-02-11T13:15:44Z</dc:date>
    </item>
    <item>
      <title>Re: Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856705#M2110</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Tim,&lt;BR /&gt;&lt;BR /&gt;Was the CPUID checking performed in intel_fast_memcpy on every call (not a call once)?&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Wed, 11 Feb 2009 20:07:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856705#M2110</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-02-11T20:07:16Z</dc:date>
    </item>
    <item>
      <title>Re: Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856706#M2111</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/99850"&gt;jimdempseyatthecove&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; Was the CPUID checking performed in intel_fast_memcpy on every call (not a call once)?&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;No, it's just the first memcpy which contains a lot of that sort of stuff. I suspect if I'd called some other CRT function first I'd have seen it there instead.&lt;BR /&gt;&lt;BR /&gt;Subsequent calls to memcpy get to the "meat" much quicker; about 20-30 instructions to hit the main loop, involving tests of stored values __intel_cpu_indicator, __intel_memcpy_mem_ops_method and __intel_memcpy_largest_cache_size (which were presumably all set up by the first call). &lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 12 Feb 2009 09:46:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856706#M2111</guid>
      <dc:creator>Tim_Day</dc:creator>
      <dc:date>2009-02-12T09:46:17Z</dc:date>
    </item>
    <item>
      <title>Re: Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856707#M2112</link>
      <description>What you have observed may have to do with data alignment and with the ability of the compiler to determine the copy size at compile time. I can see that your 64-bit code uses MOVDQU (unaligned move) while the 32-bit code uses MOVDQA.&lt;BR /&gt;&lt;BR /&gt;If you are making such a large allocation (512 MB), it would be wise to use the OS memory allocation API (VirtualAlloc() in particular), which returns a page-aligned memory pointer. If you have enough memory and your application is "alone" in the system you may as well use VirtualLock() to prevent paging of the source and destination buffers, but for that you will have to increase the process working set size. Compiling with /Qopt-prefetch or using TLB priming in advance may help performance as well.&lt;BR /&gt;&lt;BR /&gt;Finally, don't forget to issue the SFENCE instruction after the copy if you are using non-temporal stores; at least the software developer's manual suggests that.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 13 Feb 2009 19:08:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856707#M2112</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2009-02-13T19:08:27Z</dc:date>
    </item>
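As a cross-platform illustration of the page-aligned allocation suggested above (VirtualAlloc() itself is Windows-only), the sketch below uses Python's mmap module, whose anonymous mappings also start on a page boundary. This is my own illustrative example, not code from the thread.

```python
import ctypes, mmap

N = 1024 * 1024                    # 1 MiB buffer; illustrative size
buf = mmap.mmap(-1, N)             # anonymous mapping, handed out in whole pages
view = ctypes.c_char.from_buffer(buf)
addr = ctypes.addressof(view)      # virtual address of the mapping's first byte

# Because mappings start on a page boundary, the address satisfies the
# 16-byte alignment that aligned SSE copies (movdqa/movntdq) require.
print("page size:", mmap.PAGESIZE)
print("page-aligned:", addr % mmap.PAGESIZE == 0)
```

Page-aligned source and destination buffers let a memcpy implementation use the aligned-load/aligned-store fast path without a fix-up prologue.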
    <item>
      <title>Re: Performance difference between 32bit and 64bit memcpy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856708#M2113</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="margin-top: 5px; width: 100%;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/410300"&gt;Tim Day&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;We have Core2 machines (Dell T5400) with XP64.&lt;BR /&gt;&lt;BR /&gt;We observe that when running 32-bit processes, the throughput of memcpy is on the order of 1.2GByte/s; however memcpy in a 64-bit process achieves about 2.2GByte/s (or in fact 2.4GByte/s with the Intel compiler CRT's memcpy).&lt;BR /&gt;&lt;BR /&gt;While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which uses 128-bit wide loads and stores regardless of the 32/64-bitness of the process) demonstrates identical upper limits on the copy bandwidth it achieves&lt;BR /&gt;&lt;BR /&gt;I'm puzzled as to the origin of this difference... Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM ? Is it something to do with TLBs or prefetchers or... what &lt;BR /&gt;&lt;BR /&gt;Thanks for any insight.&lt;BR /&gt;Tim&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;It's probably caused by cache alignment and by how many reads/writes can happen within a cache boundary.&lt;BR /&gt;32-bit processes do not have to jump through hoops; the addresses are padded to 64-bit addresses.&lt;BR /&gt;
      <pubDate>Wed, 25 Feb 2009 08:03:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-difference-between-32bit-and-64bit-memcpy/m-p/856708#M2113</guid>
      <dc:creator>delacy__david</dc:creator>
      <dc:date>2009-02-25T08:03:05Z</dc:date>
    </item>
  </channel>
</rss>

