Hello,
I'm optimizing an application using ICC. VTune hotspot analysis shows that some wrapper functions for the C standard memcpy() function are among those with the highest CPU time. That's quite unexpected, as they only copy small (250-500 B) user-defined data structures. The assembly shows a call to _intel_fast_memcpy and a red bar marking the function as poorly performing.
In some other cases the memcpy wrappers are replaced with inline SSE instructions (on xmm registers), which suggests that SSE4.1 support is enabled.
I was wondering if there are ways to make the poorly performing memcpy wrappers more efficient.
I have tried to align the source and destination pointers to 16 bytes with __attribute__((aligned(16))), but that didn't help.
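For reference, the wrapper and my alignment attempt look roughly like this (the names and the size are illustrative, not my actual code):

#include <string.h>

typedef struct {
    char payload[400];                /* 250-500 B user-defined structure */
} __attribute__((aligned(16))) record_t;

static inline void copy_record(record_t *dst, const record_t *src)
{
    memcpy(dst, src, sizeof(*dst));   /* compiles to a call to _intel_fast_memcpy */
}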
Thanks
In case your strings aren't long enough to amortize the overhead of calling a library memcpy function, you have the option of using a simple for() loop with #pragma simd or #pragma omp simd aligned(...) to suppress the function call and request inlined SIMD code. I hope this isn't what Georg had in mind when he commented about reduced portability. The omp simd pragma requires the -openmp-simd compile option with the 13.1 compiler.
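A minimal sketch of that kind of loop (the float element type and the 32-byte alignment value are assumptions, adjust them to your data):

#include <stddef.h>

void copy_floats(float *restrict dst, const float *restrict src, size_t n)
{
    /* Requests inlined SIMD code instead of a library call; the aligned
       clause asserts that both pointers are 32-byte aligned. */
    #pragma omp simd aligned(dst, src : 32)
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}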
There are some new interprocedural optimizations for _intel_fast_memcpy in the 14.0 compiler, but I don't know in which situations they would help.
I'm getting a significant improvement using __assume_aligned(ptr, n) with n=64, not n=16 as I would have expected. Any guess why that happens?
Several recent Intel CPUs get a significant benefit from 32-byte data alignment over 16-byte alignment, even though the latter is sufficient for aligned SSE instruction execution. Asserting 32- or 64-byte alignment helps the compiler when generating AVX-256 instructions, provided the data actually are at least 32-byte aligned.
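For illustration, assuming the buffers really are 64-byte aligned (for example, allocated with _mm_malloc(size, 64)), the assertion would look like this:

#include <stddef.h>

void copy_block(float *dst, const float *src, size_t n)
{
    /* ICC-specific: promise the compiler that both pointers are 64-byte
       aligned, so the generated loads and stores can skip the runtime
       alignment checks. The behavior is undefined if the promise is false. */
    __assume_aligned(dst, 64);
    __assume_aligned(src, 64);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}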
You may wish to compare the opt-report and asm file output to see what use the compiler is making of your alignment assertions.
Georg Zitzlsberger (Intel) wrote:
However, in cases where (multi-version) functions with such tests are called very frequently, and their actual work is very small, those tests can become noticeable.
I think that's the case in my application.
Thanks
I can only suppose that the Intel implementation of memcpy on the latest architectures (starting from Sandy Bridge) can exploit the two load ports.
iliyapolak wrote:
I can only suppose that the Intel implementation of memcpy on the latest architectures (starting from Sandy Bridge) can exploit the two load ports.
memcpy doesn't normally benefit from the two load ports on Sandy Bridge or Ivy Bridge because of the single store port. Presumably, Haswell could change this.
Actually, on Haswell it could be two loads and one store per cycle.
I just wanted to say that creating a multi-threaded memcpy yields a significant performance increase if you're usually copying large memory blocks. It may be sufficient to use the "/Qparallel" compiler flag, though.
Inge H. wrote:
I just wanted to say that creating a multi-threaded memcpy yields a significant performance increase if you're usually copying large memory blocks. It may be sufficient to use the "/Qparallel" compiler flag, though.
With OpenMP, in the 14.0 compiler, a parallel for such as
#pragma omp parallel for simd
for (int i = 0; i < N; ++i) a[i] = b[i];
may be more effective than a memcpy.
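Spelled out as a complete sketch (float arrays assumed for illustration; compile with -openmp or /Qopenmp):

void par_copy(float *restrict a, const float *restrict b, int n)
{
    /* 14.0 compiler: distributes the copy across threads and vectorizes
       each thread's chunk; only worthwhile for large blocks. */
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        a[i] = b[i];
}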
You can also use vectorization and loop unrolling when copying large memory blocks.
>>>may be more effective than a memcpy.>>>
Probably because of the vectorization and loop unrolling.
iliyapolak wrote:
>>>may be more effective than a memcpy.>>>
Probably because of the vectorization and loop unrolling.
Also in view of saving the time of a library call and not needing to check for cases that aren't aligned to the data type. The optional aligned clause (which should have the same effect as __assume_aligned) lets the inlined code generation skip all alignment checks.
When forcing inline code by means such as this, the check inside fast_memcpy for a size big enough to invoke nontemporal stores is replaced by a compile-time decision according to the -opt-streaming-stores option. Since the OP mentioned structures of up to 500 bytes, a run-time length check for nontemporal stores would simply waste time.
In favor of memcpy: on certain Intel CPUs, those substitutions improved code locality and reduced instruction TLB misses, compared with many separate inline expansions of individual for loops.
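For very large buffers where nontemporal stores are wanted, that decision can also be made at compile time, for example with ICC's nontemporal pragma (a sketch; counterproductive for small copies like the 250-500 B structures discussed here):

void stream_copy(float *restrict dst, const float *restrict src, int n)
{
    /* ICC-specific: request streaming (nontemporal) stores for this loop,
       bypassing the cache; useful only when the destination is much larger
       than the last-level cache. */
    #pragma vector nontemporal
    for (int i = 0; i < n; ++i)
        dst[i] = src[i];
}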
As there is no faster way for memcpy to implement a copy where SSE moves work, you wouldn't expect memcpy to perform any magic beyond switching to nontemporal stores for very long moves. I don't know whether these memcpy functions would benefit from recognizing Haswell and switching to AVX-256.
In a case like the STREAM COPY benchmark, you may improve performance by using an omp parallel for simd loop (preventing the memcpy substitution) and allowing -opt-streaming-stores to kick in.
MKL BLAS offers ?copy functions that could be used to avoid the overhead of checking for odd-byte alignment. These automatically invoke OpenMP threading if you so choose, through your combination of link options and run-time environment settings.
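For illustration, the single-precision variant through the CBLAS interface might be used like this (a sketch; link against MKL, e.g. with -mkl):

#include <mkl_cblas.h>

void mkl_copy(const float *src, float *dst, MKL_INT n)
{
    /* scopy copies n floats from src to dst with unit stride; whether it
       threads depends on the MKL link line and runtime settings. */
    cblas_scopy(n, src, 1, dst, 1);
}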
>>>movaps xmm0, xmmword ptr [ esi ]>>>
You can use more registers (at most two loads will issue per cycle) to utilize the two load ports. For example:
movaps xmm1, xmmword ptr [esi+16]
movaps xmm2, xmmword ptr [esi+32]
and so on.
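The same idea written with SSE intrinsics, as a sketch that assumes both pointers are 16-byte aligned and the size is a multiple of 64 bytes:

#include <emmintrin.h>
#include <stddef.h>

void copy_unrolled(void *dst, const void *src, size_t bytes)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    /* Four independent 16-byte loads per iteration give the two load
       ports work to do in parallel. */
    for (size_t i = 0; i < bytes / 64; ++i) {
        __m128i x0 = _mm_load_si128(s + 4*i + 0);
        __m128i x1 = _mm_load_si128(s + 4*i + 1);
        __m128i x2 = _mm_load_si128(s + 4*i + 2);
        __m128i x3 = _mm_load_si128(s + 4*i + 3);
        _mm_store_si128(d + 4*i + 0, x0);
        _mm_store_si128(d + 4*i + 1, x1);
        _mm_store_si128(d + 4*i + 2, x2);
        _mm_store_si128(d + 4*i + 3, x3);
    }
}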
