I recently noticed that Intel icpc-compiled programs are noticeably slower when memory operations are intensive, so I did some benchmarking of memcpy. It appears that __intel_memcpy is slower than the platform memcpy. Below are some details and results of the benchmark. The operating system is Mac OS X Yosemite (10.10), and all tools are up to date. The CPU is a Haswell; detailed information about it is also pasted below. The test program uses some utilities from my own library, mostly two classes that count cycles and measure time intervals. For each size of memory chunk, the memcpy call is repeated so that in total 20GB is copied, making the cycle counts and time measurements accurate enough.
I have also run the test program through a profiler, and it appears that the icpc-compiled test program calls __intel_memcpy instead of __intel_fast_memcpy or __intel_new_memcpy. I imagine __intel_fast_memcpy should be faster. When will it be called? Is there a compiler option I need to enable?
In addition, if I am not mistaken, on this CPU the theoretical peak cpB should be about 0.03125 (32 bytes per cycle) when the data fits into cache. It appears that with clang, for 2KB - 16KB buffers, the performance is not far off from this value. In contrast, the Intel compiler never comes close; instead it appears to peak at about 0.06 cpB, which is roughly 16 bytes per cycle.
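For reference, the peak cpB figures quoted here follow from simple store-throughput arithmetic (a sketch; one 32-byte store per cycle is the commonly cited figure for Haswell):

```cpp
// If the core can retire bytes_per_cycle bytes of stores per cycle, a
// cache-resident copy can cost no less than 1/bytes_per_cycle cycles per
// byte (cpB), since every byte must be stored once.
inline double store_bound_cpB(double bytes_per_cycle)
{
    return 1.0 / bytes_per_cycle;
}

// Haswell AVX path, one 32-byte store per cycle: store_bound_cpB(32.0) == 0.03125
// SSE path, one 16-byte store per cycle:         store_bound_cpB(16.0) == 0.0625
```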
I would very much appreciate it if anyone could explain to me why Intel's supposedly optimized memcpy is slower. Many thanks in advance.
$ cpuid_info # a small program I wrote to query basic flags in cpuid
==========================================================================================
Vendor ID                      GenuineIntel
Processor brand                Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
==========================================================================================
Deterministic cache parameters
------------------------------------------------------------------------------------------
Cache level                    1      1            2        3
Cache type                     Data   Instruction  Unified  Unified
Cache size (byte)              32K    32K          256K     6M
Maximum Proc sharing           2      2            2        16
Maximum Proc physical          8      8            8        8
Coherency line size (byte)     64     64           64       64
Physical line partitions       1      1            1        1
Ways of associative            8      8            8        12
Number of sets                 64     64           512      8192
Self initializing              Yes    Yes          Yes      Yes
Fully associative              No     No           No       No
Write-back invalidate          No     No           No       No
Cache inclusiveness            No     No           No       Yes
Complex cache indexing         No     No           No       Yes
==========================================================================================
Processor info and features
------------------------------------------------------------------------------------------
ACPI AES APIC AVX CLFSH CMOV CX16 CX8 DE DS DS_CPL DTES64 EST F16C FMA FPU FXSR HTT
MCA MCE MMX MONITOR MOVBE MSR MTRR OSXSAVE PAE PAT PBE PCID PCLMULQDQ PDCM PGE POPCNT
PSE PSE_36 RDRAND SEP SMX SS SSE SSE2 SSE3 SSE4_1 SSE4_2 SSSE3 TM TM2 TSC TSC_DEADLINE
VME VMX X2APIC XSAVE XTPR
==========================================================================================
Extended features
------------------------------------------------------------------------------------------
AVX2 BMI1 BMI2 ERMS FSGSBASE HLE INVPCID RTM SMEP
==========================================================================================
Extended processor info and features
------------------------------------------------------------------------------------------
ABM GBPAGES LAHF_LM LM NX RDTSCP SYSCALL
==========================================================================================
$ icpc --version
icpc (ICC) 15.0.0 20140716
Copyright (C) 1985-2014 Intel Corporation. All rights reserved.
$ clang --version
Apple LLVM version 6.0 (clang-600.0.54) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.0.0
Thread model: posix
$ cat test.cpp
// Note: the forum software stripped everything between "<" and ">" from the
// original post; the includes and template arguments below are reconstructed.
// AlignedAllocator stands for the poster's custom 32-byte-aligning allocator.
#include <cstddef>
#include <cstring>
#include <fstream>
#include <iostream>
#include <random>
#include <vector>
#include <vsmc/vsmc.hpp> // poster's library: RDTSCPCounter, StopWatch, AlignedAllocator

int main(int argc, char **argv)
{
    std::mt19937_64 eng;
    std::uniform_real_distribution<double> runif(0, 1);
    const std::size_t NMax = 1024U * 1024U * 256U;

    // vectors with memory aligned at 32 bytes
    std::vector<double, vsmc::AlignedAllocator<double> > x(NMax);
    std::vector<double, vsmc::AlignedAllocator<double> > y(NMax);
    for (std::size_t i = 0; i != NMax; ++i)
        x[i] = runif(eng);

    std::vector<std::size_t> bytes;
    std::vector<double> cpB;
    std::vector<double> GBs;

    std::size_t N = NMax;
    while (N > 0) {
        // R: Number of repeats
        // The total size of memory copied (R * N * sizeof(double)) will be
        // about 20GB
        std::size_t R = NMax / N * 10;
        std::size_t B = N * sizeof(double); // bytes

        vsmc::RDTSCPCounter counter; // A class to count cycles using RDTSCP
        vsmc::StopWatch watch;       // A class to measure time
        watch.start();
        counter.start();
        for (std::size_t r = 0; r != R; ++r) {
            memcpy(y.data(), x.data(), B);
            // prevent the compiler from being too smart and seeing that the
            // loop does nothing
            x.front() += 1.0;
        }
        counter.stop();
        watch.stop();

        double dbytes = static_cast<double>(B * R);
        bytes.push_back(B);
        cpB.push_back(counter.cycles() / dbytes);
        GBs.push_back(dbytes / watch.nanoseconds());
        N /= 2;
    }

    for (std::size_t i = 0; i != bytes.size(); ++i) {
        if (bytes[i] >= 1024 * 1024 * 1024)
            std::cout << bytes[i] / (1024.0 * 1024 * 1024) << "GB\t";
        else if (bytes[i] >= 1024 * 1024)
            std::cout << bytes[i] / (1024.0 * 1024) << "MB\t";
        else if (bytes[i] >= 1024)
            std::cout << bytes[i] / 1024.0 << "KB\t";
        else
            std::cout << bytes[i] << "B: ";
        std::cout << cpB[i] << "cpB" << '\t';
        std::cout << GBs[i] << "GB/s" << '\n';
    }

    // Output y.front(), again to prevent a too-clever compiler from seeing
    // that all those memcpy calls were for nothing
    std::ofstream dummy("dummy_file");
    dummy << y.front();
    dummy.close();

    return 0;
}
$ clang++ -std=c++11 -march=native -mavx2 -O3 -DNDEBUG -o test test.cpp; nm test | grep memcpy; ./test
                 U _memcpy
2GB     0.416656cpB     7.66246GB/s
1GB     0.385954cpB     8.27198GB/s
512MB   0.387439cpB     8.24028GB/s
256MB   0.382988cpB     8.33605GB/s
128MB   0.383861cpB     8.3171GB/s
64MB    0.382132cpB     8.35473GB/s
32MB    0.385583cpB     8.27994GB/s
16MB    0.381789cpB     8.36224GB/s
8MB     0.333294cpB     9.57894GB/s
4MB     0.21869cpB      14.5988GB/s
2MB     0.157544cpB     20.2649GB/s
1MB     0.151573cpB     21.0631GB/s
512KB   0.150837cpB     21.1659GB/s
256KB   0.135304cpB     23.5957GB/s
128KB   0.113332cpB     28.1704GB/s
64KB    0.11268cpB      28.3335GB/s
32KB    0.113212cpB     28.2001GB/s
16KB    0.0344147cpB    92.7684GB/s
8KB     0.031262cpB     102.124GB/s
4KB     0.0334106cpB    95.5566GB/s
2KB     0.0388868cpB    82.0998GB/s
1KB     0.0440008cpB    72.5579GB/s
512B:   0.0601694cpB    53.0602GB/s
256B:   0.0923376cpB    34.5754GB/s
128B:   0.156117cpB     20.4501GB/s
64B:    0.328138cpB     9.72948GB/s
32B:    0.513387cpB     6.21871GB/s
16B:    0.617065cpB     5.17386GB/s
8B:     1.32244cpB      2.41418GB/s
$ icpc -std=c++11 -xHost -O3 -DNDEBUG -o test test.cpp; nm test | grep memcpy; ./test
00000001000087c0 T ___intel_memcpy
00000001000087c0 T ___intel_new_memcpy
0000000100005480 T __intel_fast_memcpy
2GB     0.618551cpB     5.16144GB/s
1GB     0.562556cpB     5.67518GB/s
512MB   0.552512cpB     5.77835GB/s
256MB   0.531465cpB     6.00719GB/s
128MB   0.537667cpB     5.93789GB/s
64MB    0.537064cpB     5.94456GB/s
32MB    0.534799cpB     5.96974GB/s
16MB    0.530338cpB     6.01995GB/s
8MB     0.593578cpB     5.37859GB/s
4MB     0.364334cpB     8.76286GB/s
2MB     0.206092cpB     15.4912GB/s
1MB     0.187371cpB     17.039GB/s
512KB   0.183633cpB     17.3858GB/s
256KB   0.184216cpB     17.3308GB/s
128KB   0.114155cpB     27.9673GB/s
64KB    0.113653cpB     28.0908GB/s
32KB    0.114386cpB     27.9109GB/s
16KB    0.0781414cpB    40.8568GB/s
8KB     0.060229cpB     53.0078GB/s
4KB     0.0622433cpB    51.2924GB/s
2KB     0.0675664cpB    47.2514GB/s
1KB     0.078558cpB     40.6401GB/s
512B:   0.0994097cpB    32.1157GB/s
256B:   0.142734cpB     22.3676GB/s
128B:   0.179139cpB     17.822GB/s
64B:    0.239995cpB     13.3028GB/s
32B:    0.443505cpB     7.19858GB/s
16B:    0.878139cpB     3.63566GB/s
8B:     1.54955cpB      2.06035GB/s
P.S., it seems I formatted the verbatim text incorrectly: everything between a pair of "<" and ">" was not shown. I hope this does not make what I was trying to do much more difficult to understand. I can't find an option to edit my original post.
Basically, the x and y vectors in the program are of type double.
The automatic switching to nontemporal stores inside the memcpy functions may be affecting the results. Switching may occur at too small a B value if you are testing by overwriting the same data repeatedly, or at too large a B value in other cases. The internal thresholds of some of them try to account for the possibility of being called in a parallel region.
Your version of icpc may also make a difference.
In order to gain control over the choice of nontemporal stores, you would need a comparison case with a for() loop using #pragma simd (to stop automatic memcpy substitution) and, optionally, #pragma vector nontemporal.
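Such a comparison case might look like the following sketch (#pragma simd is Intel-specific and is ignored as an unknown pragma by other compilers):

```cpp
#include <cstddef>

// Copying with an explicit loop prevents icpc from substituting a call to
// __intel_memcpy, so the compiler vectorizes the copy in place.
void copy_loop(double *dst, const double *src, std::size_t n)
{
#pragma simd
// #pragma vector nontemporal  // optionally force streaming stores (icc only)
    for (std::size_t i = 0; i != n; ++i)
        dst[i] = src[i];
}
```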
I was informed in the past that the choice among the icpc library versions is made at run time, according to the characteristics of the arrays. For consistent results, you should explicitly specify the alignments you want.
Many thanks for the comments. However, I don't think non-temporal stores were the problem here. I tried #pragma simd, and with icpc it gives close to 100GB/s for 8KB buffers, which is the performance of the system memcpy at this size.
I also tried vectorization with other compilers (clang 3.5, gcc 4.9). The results are that for larger arrays (64MB and up), their performance is no longer ~8GB/s but closer to 5.6GB/s, which is the __intel_memcpy performance. For smaller arrays, the performance is similar to that of the system memcpy, and clang actually significantly improved the performance for small arrays (< 1KB).
All the std::vectors involved are allocated with a custom allocator, which explicitly aligns on 32-byte boundaries. (I formatted the code in the original post incorrectly, and the allocator, among other things between angle brackets, was not shown.)
So I did some more profiling of the test program. Instead of using a test program with many array sizes, I tested one size at a time (so I could see exactly which instructions were executed for any specific size).
First, in no case were any non-temporal instructions (MOVNTDQA etc.) executed. Almost all time was spent on a few cached write instructions. Below is a list of those:
With clang/gcc, which link to the system memcpy (actually, on OS X memcpy is replaced by platform_memmove, but for the non-overlapping case it should be just as fast; the overhead should be at most a branch that tests whether the ranges overlap), most time was spent on the AVX instruction VMOVUPS, the unaligned move.
With icpc, which uses __intel_memcpy, most time was spent on MOVDQA, the SSE2 aligned move: 15% of the time on MOVDQA loads from memory into XMM registers and 70% on MOVDQA stores from XMM to memory; the remaining 15% was not accounted for. The profiler is OS X Instruments. My guess is that it has low overhead (I notice little difference between profiled and non-profiled tests) but is not very accurate. (Or maybe the CPU uses two ports for loads and the profiler, which only counts time, only sees one?)
When vectorized, for the 8KB buffer, which seems to be the fastest case, all compilers generate AVX instructions; clang generates VMOVUPS while icpc generates VMOVUPD, which I believe makes little difference here.
It appears to me that the best performance (at least for moderate sizes, 1KB - 1MB) is obtained by using AVX instructions. I was under the impression that in AVX, aligned and unaligned operations have almost the same performance. However, when using __intel_memcpy, no AVX execution path is selected; only the SSE2 path is used, even though with the same compiler flags the compiler itself can generate AVX instructions. Is this the expected behavior? I also tried the same tests on Linux, with the same results.
I disassembled libirc.a, which I believe contains the __intel_memcpy function. I only glanced through the code, but it seems there are only SSE2 and AVX-512F code paths; I don't see any AVX execution path there.
gcc (and, I suppose, clang) doesn't offer nontemporal code in for() loops except via intrinsics. icc controls it (once automatic memcpy substitution is blocked) with the opt-streaming-stores options (default auto, meaning the compiler tries to guess the best choice), in the absence of #pragma vector nontemporal. For double data, AVX code uses unaligned instructions even when alignment is expected, as this should not lose performance (although actual unaligned 256-bit data gives very poor performance on Sandy Bridge). On Sandy or Ivy Bridge, SSE, AVX-128, and AVX-256 aligned moves may all give similar performance. The compile options don't influence the memcpy library functions, so they may use SSE or SSE2 to avoid checking whether AVX is available (but they would then need to finish with vzeroupper if running on a CPU with AVX).
When the Linux glibc memcpy was first optimized for Opteron, it added 64-bit non-temporal moves for the case of 8-byte-aligned data, as those were faster on platforms that split 128-bit moves in hardware. I don't know whether it's still true that the glibc memcpy would choose different instructions from the __intel_ ones.
Thanks for your post. I'd like to try your test case; could you list what the includes were and anything else that was lost between the "<" and ">" in your code?
Thanks,
Richard
Thanks for the reply. I attached the complete source of the tests.
The tests are in memcpy.cpp and memset.cpp. After the original posting, out of curiosity, I implemented my own versions of these functions, so the tests now compare my implementations against the system/compiler-provided functions. In the folder, the file test.txt contains the results from one of my MacBook Pros.
The tests simply copy/set buffers repeatedly. The columns in the output are:
Size: Size of the buffer
Size.H: Size in human readable format
Offset: The offset of the destination buffer. The buffers are allocated to be aligned at 32 bytes, and thus for non-zero offsets the test cases are for unaligned buffers.
Next are the cpB and GB/s for my library (vSMC) and the compiler-provided functions (System). The Verify column simply tests that my implementation is correct, and the Speedup column compares my library against the system, using GB/s. (I did not bother to set thread affinity, so the cpB numbers might be meaningless. The cycle counter uses RDTSCP and will return zero if the AUX value differs between the start and the end.)
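The cycle counter described here can be sketched as follows (a simplified, hypothetical version of the vSMC class; __rdtscp is the GCC/Clang/ICC intrinsic):

```cpp
#include <cstdint>
#include <x86intrin.h> // __rdtscp

// Count cycles with RDTSCP. The AUX value identifies the core; if it
// differs between start and stop, the thread migrated and the two TSC
// readings are not comparable, so report zero.
class RDTSCPCounter
{
public:
    RDTSCPCounter() : start_(0), cycles_(0), aux_(0) {}

    void start() { start_ = __rdtscp(&aux_); }

    void stop()
    {
        unsigned aux_stop = 0;
        std::uint64_t tsc = __rdtscp(&aux_stop);
        cycles_ = (aux_stop == aux_) ? tsc - start_ : 0;
    }

    std::uint64_t cycles() const { return cycles_; }

private:
    std::uint64_t start_;
    std::uint64_t cycles_;
    unsigned aux_;
};
```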
Along with the sources memset.cpp and memcpy.cpp are the library files used. I removed most of the headers that are irrelevant here.
The Intel compiler provided memcpy and memset can be twice or even more as slow as my AVX-aware implementation. This is most noticeable when 1) the buffer is very large and non-temporal stores are used in my library; or 2) the buffer is small (<16KB) and the data fits into L1D.
Of course, no real program will repeatedly copy the same buffer. But moving large buffers is sometimes unavoidable, and it is also possible that small buffers are copied repeatedly, with some operations on them between each copy/set, while the whole loop still leaves the buffers in L1D or L2.
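For illustration, an AVX-aware copy of the kind described above can be sketched with intrinsics roughly as follows (a simplified, hypothetical version: it assumes the byte count is a multiple of 32 and, for the non-temporal path, a 32-byte-aligned destination; a real implementation handles heads and tails):

```cpp
#include <immintrin.h>
#include <cstddef>

// Copy with 32-byte AVX moves; for buffers much larger than the last-level
// cache, non-temporal (streaming) stores avoid polluting the cache.
// The target attribute lets GCC/Clang compile this without -mavx.
__attribute__((target("avx")))
void avx_copy(double *dst, const double *src, std::size_t n_bytes,
              bool nontemporal)
{
    std::size_t n = n_bytes / 32; // number of 32-byte chunks (4 doubles)
    if (nontemporal) {
        for (std::size_t i = 0; i != n; ++i)
            _mm256_stream_pd(dst + i * 4, _mm256_loadu_pd(src + i * 4));
        _mm_sfence(); // order streaming stores with later loads/stores
    } else {
        for (std::size_t i = 0; i != n; ++i)
            _mm256_storeu_pd(dst + i * 4, _mm256_loadu_pd(src + i * 4));
    }
}
```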
Thanks for the test case and the additional information. I wasn't able to compile it because the project requires Boost. Can you send me the .i file instead? There is an article here that explains how to generate one: https://software.intel.com/en-us/articles/how-to-create-a-preprocessed-file-i-file/
Thank you,
Richard
I tried the new code you provided and saw similar results on my system, so I sent this issue to engineering. Since there is an issue with both memcpy and memset, I filed 2 tickets: the memset ticket number is DPD200363398 and the memcpy ticket number is DPD200363373. Also, you asked in your first post about how to access __intel_fast_memcpy. It cannot be called directly, but is used by the compiler when the necessary conditions are met. There is an article about usage of memcpy and memset here: https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control
Thank you,
Richard
Thanks for the info. I figured out the __intel_fast_memcpy problem myself by examining the disassembled program. It seems that it simply calls __intel_memcpy, and that's why I did not see it in the profiler.
Also, though I only included memset and memcpy in the test cases, I am wondering why there is no __intel_memmove. The symbol can be found in libirc, but it was never generated by the compiler.
Quick update for you: engineering is continuing to work on the issues mentioned above. As for your question on __intel_memmove: like memcpy and memset, it may not show up in your code for various reasons. When speaking with our developers, I was told that we have been generating optimized versions of memmove at least as far back as the 14.0 compiler. There are a number of situations that might prevent us from calling the optimized implementation.
Here are a couple of examples:
1) We may inline the call under some conditions (which would result in the improved performance without the function being visibly called)
2) Some compiler flags will prevent us from calling the optimized implementation (e.g. -ffreestanding or -fno-builtin)
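As an illustration of point 2, a call like the one below is normally recognized by the compiler and replaced by an optimized library version (or inlined), while -fno-builtin forces a plain call to the libc memmove (the function here is a made-up example, not from the attached tests):

```cpp
#include <cstring>
#include <cstddef>

// An overlapping copy, so memmove (not memcpy) is required. With default
// flags icpc may substitute its optimized memmove; with -fno-builtin it
// must emit a plain call to the libc symbol.
void shift_left_one(double *buf, std::size_t n)
{
    if (n > 1)
        std::memmove(buf, buf + 1, (n - 1) * sizeof(double));
}
```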
Hope that helps!
Thank you,
Richard
An update on the memcpy slowness issue on OS X: it is fixed now. The fix is in the latest 15.0 Update 6.
Thanks,
Jennifer
