Beginner

Intel ___intel_memcpy slower than platform memcpy

I recently noticed that programs compiled with Intel icpc are noticeably slower when memory operations are intensive, so I benchmarked memcpy. It appears that __intel_memcpy is slower than the platform memcpy. Below are the details and results of the benchmark. The operating system is Mac OS X Yosemite (10.10), and all tools are up to date. The CPU is a Haswell, and detailed information about it is also pasted below. The test program uses some utilities from my own library, mainly two classes that count cycles and measure time intervals. For each buffer size, the memcpy call is repeated so that about 20GB is copied in total, which should make the cycle counts and time measurements accurate enough.

 

I have also run the test program through a profiler, and it appears that the icpc-compiled test program calls __intel_memcpy instead of __intel_fast_memcpy or __intel_new_memcpy. I imagine __intel_fast_memcpy should be faster. When is it called? Is there a compiler option I need to enable for it?

In addition, if I am not mistaken, the theoretical peak on this CPU should be about 0.03125 cpB (32 bytes per cycle) when the data fits in cache. With clang, for 2KB - 16KB buffers, the performance is not far off this value. In contrast, the Intel compiler never gets close to it. Instead it appears to peak at about 0.06 cpB, which is roughly 16 bytes per cycle.
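
The arithmetic behind these figures is easy to check in a few lines. The sketch below is illustrative only: the helper names are made up, and the 3.2 GHz clock is the nominal frequency from the brand string below, assumed here to match the rate at which the cycle counter ticks.

```cpp
#include <cassert>

// Hypothetical helpers for illustration; not part of the test program.

// Cycles per byte (cpB) -> bytes per cycle.
double bytes_per_cycle(double cpb)
{
    assert(cpb > 0.0);
    return 1.0 / cpb;
}

// cpB -> throughput in GB/s (1 GB = 1e9 bytes, as in the test program,
// which divides bytes by nanoseconds), for a clock given in GHz.
double gb_per_second(double cpb, double clock_ghz)
{
    assert(cpb > 0.0);
    return clock_ghz / cpb;
}
```

With these, bytes_per_cycle(0.03125) is 32 and gb_per_second(0.03125, 3.2) is 102.4, close to the roughly 102GB/s that clang reports for 8KB buffers; a cpB of 0.0625 corresponds to 16 bytes per cycle.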

I would very much appreciate it if anyone could explain to me why Intel's supposedly optimized memcpy is slower. Many thanks in advance.

$ cpuid_info # a small program I wrote to query basic flags in cpuid
==========================================================================================
Vendor ID                  GenuineIntel
Processor brand            Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
==========================================================================================
Deterministic cache parameters
------------------------------------------------------------------------------------------
Cache level                           1           1           2           3
Cache type                         Data Instruction     Unified     Unified
Cache size (byte)                   32K         32K        256K          6M
Maximum Proc sharing                  2           2           2          16
Maximum Proc physical                 8           8           8           8
Coherency line size (byte)           64          64          64          64
Physical line partitions              1           1           1           1
Ways of associative                   8           8           8          12
Number of sets                       64          64         512        8192
Self initializing                   Yes         Yes         Yes         Yes
Fully associative                    No          No          No          No
Write-back invalidate                No          No          No          No
Cache inclusiveness                  No          No          No         Yes
Complex cache indexing               No          No          No         Yes
==========================================================================================
Processor info and features
------------------------------------------------------------------------------------------
ACPI           AES            APIC           AVX            CLFSH          CMOV           
CX16           CX8            DE             DS             DS_CPL         DTES64         
EST            F16C           FMA            FPU            FXSR           HTT            
MCA            MCE            MMX            MONITOR        MOVBE          MSR            
MTRR           OSXSAVE        PAE            PAT            PBE            PCID           
PCLMULQDQ      PDCM           PGE            POPCNT         PSE            PSE_36         
RDRAND         SEP            SMX            SS             SSE            SSE2           
SSE3           SSE4_1         SSE4_2         SSSE3          TM             TM2            
TSC            TSC_DEADLINE   VME            VMX            X2APIC         XSAVE          
XTPR           
==========================================================================================
Extended features
------------------------------------------------------------------------------------------
AVX2           BMI1           BMI2           ERMS           FSGSBASE       HLE            
INVPCID        RTM            SMEP           
==========================================================================================
Extended processor info and features
------------------------------------------------------------------------------------------
ABM            GBPAGES        LAHF_LM        LM             NX             RDTSCP         
SYSCALL        
==========================================================================================
$ icpc --version
icpc (ICC) 15.0.0 20140716
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.
$ clang --version
Apple LLVM version 6.0 (clang-600.0.54) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.0.0
Thread model: posix
$ cat test.cpp 
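// The header names were lost in posting (text between "<" and ">" was
// dropped). The standard headers below are inferred from usage; the vSMC
// headers providing vsmc::RDTSCPCounter, vsmc::StopWatch, and the aligned
// allocator are not recoverable from the post.
#include <cstring>
#include <fstream>
#include <iostream>
#include <random>
#include <vector>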

int main (int argc, char **argv)
{
    std::mt19937_64 eng;
    std::uniform_real_distribution<double> runif(0, 1);

    const std::size_t NMax = 1024U * 1024U * 256U;

    // vectors with memory aligned as 32 bytes
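    // The second template argument, a custom 32-byte aligned allocator, was
    // lost in posting; its name is not recoverable from the post.
    std::vector<double> x(NMax);
    std::vector<double> y(NMax);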
    for (std::size_t i = 0; i != NMax; ++i)
        x[i] = runif(eng);

    std::vector<std::size_t> bytes;
    std::vector<double> cpB;
    std::vector<double> GBs;
    
    std::size_t N = NMax;
    while (N > 0) {
        // R: Number of repeats
        // The total size of memory copied (R * N * sizeof(double)) will be
        // about 20GB
        std::size_t R = NMax / N * 10;
        std::size_t B = N * sizeof(double); // bytes
        vsmc::RDTSCPCounter counter; // A class to count cycles using RDTSCP
        vsmc::StopWatch watch;       // A class to measure time

        watch.start();
        counter.start();
        for (std::size_t r = 0; r != R; ++r) {
            memcpy(y.data(), x.data(), B);
            // Modify the source on every pass so the compiler cannot prove
            // that the loop does nothing and optimize it away
            x.front() += 1.0;
        }
        counter.stop();
        watch.stop();

        double dbytes = static_cast<double>(B * R);
        bytes.push_back(B);
        cpB.push_back(counter.cycles() / dbytes);
        GBs.push_back(dbytes / watch.nanoseconds());

        N /= 2;
    }

    for (std::size_t i = 0; i != bytes.size(); ++i) {
        if (bytes[i] >= 1024 * 1024 * 1024)
            std::cout << bytes[i] / (1024.0 * 1024 * 1024) << "GB\t";
        else if (bytes[i] >= 1024 * 1024)
            std::cout << bytes[i] / (1024.0 * 1024) << "MB\t";
        else if (bytes[i] >= 1024)
            std::cout << bytes[i] / 1024.0 << "KB\t";
        else
            std::cout << bytes[i] << "B: ";
        std::cout << cpB[i] << "cpB" << '\t';
        std::cout << GBs[i] << "GB/s" << '\n';
    }

    // Output y.front(), again to keep an overly clever compiler from
    // deciding that all those memcpy calls are for nothing
    std::ofstream dummy("dummy_file");
    dummy << y.front();
    dummy.close();

    return 0;
}
$ clang++ -std=c++11 -march=native -mavx2 -O3 -DNDEBUG -o test test.cpp; nm test | grep memcpy; ./test
                 U _memcpy
2GB    0.416656cpB    7.66246GB/s
1GB    0.385954cpB    8.27198GB/s
512MB    0.387439cpB    8.24028GB/s
256MB    0.382988cpB    8.33605GB/s
128MB    0.383861cpB    8.3171GB/s
64MB    0.382132cpB    8.35473GB/s
32MB    0.385583cpB    8.27994GB/s
16MB    0.381789cpB    8.36224GB/s
8MB    0.333294cpB    9.57894GB/s
4MB    0.21869cpB    14.5988GB/s
2MB    0.157544cpB    20.2649GB/s
1MB    0.151573cpB    21.0631GB/s
512KB    0.150837cpB    21.1659GB/s
256KB    0.135304cpB    23.5957GB/s
128KB    0.113332cpB    28.1704GB/s
64KB    0.11268cpB    28.3335GB/s
32KB    0.113212cpB    28.2001GB/s
16KB    0.0344147cpB    92.7684GB/s
8KB    0.031262cpB    102.124GB/s
4KB    0.0334106cpB    95.5566GB/s
2KB    0.0388868cpB    82.0998GB/s
1KB    0.0440008cpB    72.5579GB/s
512B: 0.0601694cpB    53.0602GB/s
256B: 0.0923376cpB    34.5754GB/s
128B: 0.156117cpB    20.4501GB/s
64B: 0.328138cpB    9.72948GB/s
32B: 0.513387cpB    6.21871GB/s
16B: 0.617065cpB    5.17386GB/s
8B: 1.32244cpB    2.41418GB/s
$ icpc -std=c++11 -xHost -O3 -DNDEBUG -o test test.cpp; nm test | grep memcpy; ./test
00000001000087c0 T ___intel_memcpy
00000001000087c0 T ___intel_new_memcpy
0000000100005480 T __intel_fast_memcpy
2GB    0.618551cpB    5.16144GB/s
1GB    0.562556cpB    5.67518GB/s
512MB    0.552512cpB    5.77835GB/s
256MB    0.531465cpB    6.00719GB/s
128MB    0.537667cpB    5.93789GB/s
64MB    0.537064cpB    5.94456GB/s
32MB    0.534799cpB    5.96974GB/s
16MB    0.530338cpB    6.01995GB/s
8MB    0.593578cpB    5.37859GB/s
4MB    0.364334cpB    8.76286GB/s
2MB    0.206092cpB    15.4912GB/s
1MB    0.187371cpB    17.039GB/s
512KB    0.183633cpB    17.3858GB/s
256KB    0.184216cpB    17.3308GB/s
128KB    0.114155cpB    27.9673GB/s
64KB    0.113653cpB    28.0908GB/s
32KB    0.114386cpB    27.9109GB/s
16KB    0.0781414cpB    40.8568GB/s
8KB    0.060229cpB    53.0078GB/s
4KB    0.0622433cpB    51.2924GB/s
2KB    0.0675664cpB    47.2514GB/s
1KB    0.078558cpB    40.6401GB/s
512B: 0.0994097cpB    32.1157GB/s
256B: 0.142734cpB    22.3676GB/s
128B: 0.179139cpB    17.822GB/s
64B: 0.239995cpB    13.3028GB/s
32B: 0.443505cpB    7.19858GB/s
16B: 0.878139cpB    3.63566GB/s
8B: 1.54955cpB    2.06035GB/s
13 Replies
Beginner

P.S. It seems I formatted the verbatim text incorrectly: everything between a pair of "<" and ">" was not shown. I hope this does not make what I was trying to do much more difficult to understand. I can't find an option to edit my original post.

Basically, the x and y vectors in the program are of type double.

Black Belt

The automatic switching to nontemporal stores inside the memcpy functions may be impacting the issue.  Switching may occur at too small a B value if you are testing by over-writing the same data repeatedly, or at too large a B value in other cases.  The internal thresholds of some of them try to take account of the possibility of calling in a parallel region.

Your version of icpc may also make a difference.

To control whether nontemporal stores are used, you would need a comparison case with a for() loop using #pragma simd (to stop automatic memcpy substitution) and, optionally, #pragma vector nontemporal.
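
A comparison case of that kind might look like the sketch below. It assumes the Intel compiler, where #pragma simd and #pragma vector nontemporal are recognized; other compilers ignore unknown pragmas (possibly with a warning) and simply compile plain loops. The function names are hypothetical, and whether the two pragmas combine as intended is best confirmed from the compiler's optimization report.

```cpp
#include <cassert>
#include <cstddef>

// Plain copy loop. "#pragma simd" asks icpc to vectorize the loop itself
// rather than substitute a call to its memcpy routine.
void copy_loop(double *dst, const double *src, std::size_t n)
{
    assert(n == 0 || (dst != nullptr && src != nullptr));
#pragma simd
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

// Same loop, additionally requesting nontemporal (streaming) stores.
void copy_loop_nontemporal(double *dst, const double *src, std::size_t n)
{
    assert(n == 0 || (dst != nullptr && src != nullptr));
#pragma vector nontemporal
#pragma simd
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```

Timing both functions in place of the memcpy call, for the same buffer sizes, would show whether streaming stores help or hurt at each size.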

I was informed in the past that the choice among the icpc library versions was made at run time, according to the characteristics of the arrays. For consistent results, you should specify explicitly the alignments you want.

Black Belt

sorry, mis-placed post, no longer have delete privilege

Beginner

Many thanks for the comments. However, I don't think nontemporal stores were the problem here. I tried #pragma simd, and with icpc it gives close to 100GB/s for 8KB, which matches the performance of the system memcpy at this size.

I also tried vectorization with other compilers (clang 3.5, gcc 4.9). For larger arrays (64MB and up), their performance is no longer ~8GB/s but close to 5.6GB/s, which is the __intel_memcpy performance. For smaller arrays, the performance is similar to that of the system memcpy. And clang actually improved the performance significantly for small arrays (< 1KB).

All the std::vector objects involved are allocated with a custom allocator, which explicitly aligns on 32-byte boundaries. (I formatted the code in the original post incorrectly, and the allocator, among other things between angle brackets, was not shown.)
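
For readers who want to reproduce the setup, a minimal C++11 allocator with 32-byte alignment could be sketched as follows. This is not the allocator from the test program (its name was lost in posting); Aligned32Allocator and is_aligned32 are hypothetical names, and posix_memalign is assumed to be available, as it is on OS X and Linux.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// True if p lies on a 32-byte boundary.
inline bool is_aligned32(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Hypothetical minimal allocator returning 32-byte aligned storage.
template <typename T>
struct Aligned32Allocator {
    typedef T value_type;

    Aligned32Allocator() noexcept {}
    template <typename U>
    Aligned32Allocator(const Aligned32Allocator<U> &) noexcept {}

    T *allocate(std::size_t n)
    {
        void *p = nullptr;
        // The alignment must be a power of two and a multiple of
        // sizeof(void *); 32 satisfies both.
        if (posix_memalign(&p, 32, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        assert(is_aligned32(p));
        return static_cast<T *>(p);
    }

    void deallocate(T *p, std::size_t) noexcept { std::free(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const Aligned32Allocator<T> &, const Aligned32Allocator<U> &)
{
    return true;
}

template <typename T, typename U>
bool operator!=(const Aligned32Allocator<T> &, const Aligned32Allocator<U> &)
{
    return false;
}
```

A vector declared as std::vector<double, Aligned32Allocator<double>> then has data() on a 32-byte boundary, which can be confirmed with is_aligned32.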

So I did some more profiling of the test program. And instead of using a test program with many array sizes, I tested one size at a time (so I can see exactly what instructions were executed for any specific size.)

First, in every case, no nontemporal (NT) instructions (MOVNTDQA etc.) were executed. Almost all the time was spent on a few cached write instructions. Below is a list of those:

With clang/gcc, which link to the system memcpy (actually, on OS X memcpy is replaced by platform_memmove, but for the non-overlapping case it should be just as fast; the overhead should be at most a branch that tests for overlap), most of the time was spent on the AVX instruction VMOVUPS, the unaligned move.

With icpc, which uses __intel_memcpy, most of the time was spent on MOVDQA, the SSE2 aligned move. 15% of the time is MOVDQA from memory into XMM registers and 70% is MOVDQA from XMM registers to memory. The remaining 15% was not accounted for. The profiler is OS X Instruments. My guess is that it has low overhead (I notice little difference between profiled and non-profiled tests) but is not very accurate. (Or maybe the CPU uses two ports for loads and the profiler only counts time and thus only sees one?)

When the loop is vectorized, for the 8KB buffer (the size that seems to be fastest), all compilers generate AVX instructions. clang generates VMOVUPS while icpc generates VMOVUPD, which I believe makes little difference here.

It appears to me that the best performance (at least for moderate sizes, 1KB - 1MB) was obtained by using AVX instructions. I was under the impression that with AVX, aligned and unaligned operations have almost the same performance. However, when using __intel_memcpy, no AVX execution path was selected. Instead, only the SSE2 path was used, even though with the same compiler flags the compiler itself can generate AVX instructions. Is this the expected behavior? I also ran the same tests on Linux and obtained the same results.

I disassembled libirc.a, which I believe contains the __intel_memcpy function. I only glanced through the code, but it seems there are only SSE2 and AVX512F code paths. I don't see an AVX-only execution path there.

Black Belt

gcc (and, I suppose, clang) doesn't offer nontemporal code in for() loops except through intrinsics.  icc controls it (once automatic memcpy substitution is blocked) by the opt-streaming-store options (default auto, meaning the compiler hopes to guess the best choice), in the absence of #pragma vector nontemporal.  For double data, movntpd instructions seem most likely.  movdqu might indicate a misalignment was found, but apparently you didn't have that problem.

AVX code uses unaligned instructions even when alignment is expected, as it should not lose performance (but actual unaligned 256-bit data give very poor performance on Sandy Bridge).  On Sandy or Ivy Bridge, SSE, AVX-128, and AVX-256 aligned may all give similar performance.  The compile options don't influence the memcpy library functions, so they may use SSE or SSE2 to avoid checking whether AVX is available (but would need to wind up with vzeroupper if running on AVX).
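
Because a library routine is compiled once and cannot rely on the caller's compile options, any AVX path has to be selected at run time. The sketch below illustrates that idea only; it is not how __intel_memcpy or any system memcpy is implemented. It assumes GCC or Clang on x86 (for the target attribute and __builtin_cpu_supports), and copy_avx and copy_dispatch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_X86_GNUC 1
#endif

#ifdef SKETCH_X86_GNUC
// AVX is enabled for this one function only, so the rest of the translation
// unit stays runnable on CPUs without AVX.
__attribute__((target("avx")))
static void copy_avx(double *dst, const double *src, std::size_t n)
{
    std::size_t i = 0;
    // 256-bit unaligned loads and stores, four doubles at a time
    // (typically compiled to VMOVUPD).
    for (; i + 4 <= n; i += 4)
        _mm256_storeu_pd(dst + i, _mm256_loadu_pd(src + i));
    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i)
        dst[i] = src[i];
}
#endif

// Choose the code path at run time, independent of the caller's flags.
void copy_dispatch(double *dst, const double *src, std::size_t n)
{
    assert(n == 0 || (dst != nullptr && src != nullptr));
#ifdef SKETCH_X86_GNUC
    if (__builtin_cpu_supports("avx")) {
        copy_avx(dst, src, n);
        return;
    }
#endif
    // Fallback when AVX is unavailable or the platform is not x86.
    std::memcpy(dst, src, n * sizeof(double));
}
```

The run-time check costs a branch on every call, which is one reason a library might prefer a single SSE2 path for small copies.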

When linux glibc memcpy was first optimized for Opteron, it added 64-bit non-temporal moves for the case of 8-byte aligned data, as those were faster on platforms which split 128-bit moves in hardware.  I don't know if it's still true that the glibc memcpy would choose different instructions from the _intel_ ones.


Thanks for your post, I'd like to try your test case, could you list what the includes were and anything else that was lost between the "<" and ">" in your code?

Thanks,
Richard

Beginner

Thanks for the reply. I have attached the complete source of the tests.

The tests are in memcpy.cpp and memset.cpp. After this posting, out of curiosity, I implemented my own versions of these functions. The tests now compare my implementation against the system/compiler-provided functions. In the folder, the file test.txt contains the results from one of my MacBook Pros.

The tests simply copy/set buffers repeatedly. The columns in the output are

Size: Size of the buffer

Size.H: Size in human readable format

Offset: The offset of the destination buffer. The buffers are allocated aligned at 32 bytes, so a non-zero offset tests the non-aligned case.

The next columns are the cpB and GBs for my library (vSMC) and for the compiler-provided functions (System). The Verify column simply tests that my implementation is correct, and the Speedup column compares my library against System by GBs. (I did not bother to set thread affinity, so the cpB numbers might be meaningless. The cycle counter uses RDTSCP and returns zero if the AUX value differs between the start and the end.)
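
The behavior described for the cycle counter can be sketched roughly as below. This is not vSMC's RDTSCPCounter; CycleCounter is a hypothetical class, and the code assumes GCC or Clang on x86, where the __rdtscp intrinsic is available from <x86intrin.h>. On other platforms the sketch compiles but always reports zero.

```cpp
#include <cassert>
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <x86intrin.h>
#define SKETCH_HAVE_RDTSCP 1
#endif

// Hypothetical counter based on RDTSCP.
class CycleCounter {
public:
    void start()
    {
        start_ = read(start_aux_);
        running_ = true;
    }

    // Returns the elapsed time-stamp counter ticks, or 0 if the AUX value
    // changed between start() and stop(), which suggests that the thread
    // moved to another core and the two readings are not comparable.
    std::uint64_t stop()
    {
        assert(running_); // stop() must follow start()
        running_ = false;
        unsigned aux = 0;
        const std::uint64_t end = read(aux);
        if (aux != start_aux_ || end < start_)
            return 0;
        return end - start_;
    }

private:
    // Reads the time-stamp counter and stores the AUX value in aux.
    static std::uint64_t read(unsigned &aux)
    {
#ifdef SKETCH_HAVE_RDTSCP
        return __rdtscp(&aux);
#else
        aux = 0;
        return 0;
#endif
    }

    std::uint64_t start_ = 0;
    unsigned start_aux_ = 0;
    bool running_ = false;
};
```

Pinning the thread to one core before measuring would avoid most of the zero readings, which is why the cpB figures are less reliable without thread affinity.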

Along with the source memset.cpp and memcpy.cpp are the library used. I removed most of the headers that are irrelevant here.

The Intel compiler's memcpy and memset can be two or more times slower than my AVX-aware implementation. This is most noticeable when 1) the buffer is very large and nontemporal stores are used in my library; 2) the buffer is small (<16KB), so that the data fits in L1D.

Of course, no real program will repeatedly copy the same buffer. But moving large buffers is sometimes unavoidable, and it is also possible that small buffers are copied repeatedly with some operation on them between each copy/set, while the whole loop still leaves the buffer in L1D or L2.


Thanks for the test case and the additional information, I wasn't able to compile because the project requires boost.  Can you send me the .i file instead?  There is an article here that explains how to generate one:  https://software.intel.com/en-us/articles/how-to-create-a-preprocessed-file-i-file/

Thank you,
Richard

Beginner

Thanks for the reply. I have added some compiler flags such that the binaries can be compiled with nothing but the standard library in C++11 mode, provided recent enough GCC (4.7 or later) or Clang (on OS X). I don't know anything about Windows. I also added the *.i files in the ZIP.


I tried the new code you provided and saw similar results on my system, so I sent this issue to engineering.  Since there is an issue with both memcpy and memset I filed 2 tickets: the memset ticket number is DPD200363398 and the memcpy ticket number is DPD200363373.  Also, you asked in your first post about how to access __intel_fast_memcpy.  It cannot be accessed directly, but it is used by the compiler when the necessary conditions are met.  There is an article about usage of memcpy and memset here: https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control

Thank you,
Richard 

Beginner

Thanks for the info. I figured out the __intel_fast_memcpy problem myself by examining the disassembled program. It seems that it simply called __intel_memcpy and that's why I did not see it in the profiler.

Also, though I only included memset and memcpy in the test cases, I am wondering why there is not an __intel_memmove. The symbol can be found in libirc but it was never generated by the compiler.


Quick update for you, engineering is continuing work on the issues mentioned above.  As for your question on  __intel_memmove, like memcpy and memset, this may not show in your code for various reasons.  When speaking with our developers I was told that we have been generating optimized versions of memmove back at least as far as the 14.0 compiler.  There are a number of situations that might prevent us from calling the optimized implementation.

Here are a couple of examples:

1) We may inline the call under some conditions (which would result in the improved performance without the function being visibly called)
2) Some compiler flags will prevent us from calling the optimized implementation (e.g. -ffreestanding or -fno-builtin)

Hope that helps!

Thank you,
Richard

 

Moderator

An update on the memcpy slowness issue on OS X: it is now fixed. The fix is in the latest 15.0 Update 6.

Thanks,

Jennifer
