I'm trying to optimize the standard memcpy() to use SSE2. However, my tests show little to no difference between the system memcpy(), my own MemCpy(), and my SSE2-optimized MemCpySse2().
When running the release build, the results are as follows:
memcpy() took: 634.193 ms
MemCpy() took: 627.133 ms
MemCpySse2() took: 605.482 ms
Done
Project compiler flags:
/GS- /GA /Qrestrict /W3 /Zc:wchar_t /Zi /O3 /Fd"Release\vc110.pdb" /fp:fast /Quse-intel-optimized-headers /D "_WINDOWS" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /arch:SSE2 /Gd /Oi /MT /Fa"Release\" /EHsc /nologo /Qparallel /Fo"Release\" /Ot /Fp"Release\ConsoleApplication1.pch"
Code snippet from project:
// SSE2 optimized memcpy() (plain byte loop; vectorization is left to the compiler)
void *MemCpySse2(void *__restrict b, const void *__restrict a, size_t n)
{
    char *s1 = (char*)b;
    const char *s2 = (const char*)a;
    for (; 0 < n; --n)
        *s1++ = *s2++;
    return b;
}

// General memcpy
// (_UINT64, _UINT32, _UINT16 and _UINT8 are presumably the project's fixed-width integer typedefs)
void *MemCpy(void *dest, const void *source, size_t count)
{
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);

    // Copy 64-bit blocks first
    const _UINT64 *sourcePtr8 = (const _UINT64*)source;
    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++)
        destPtr8[blockIdx] = sourcePtr8[blockIdx];
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);

    // Copy 32-bit blocks
    const _UINT32 *sourcePtr4 = (const _UINT32*)&sourcePtr8[blockIdx];
    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++)
        destPtr4[blockIdx] = sourcePtr4[blockIdx];
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);

    // Copy 16-bit blocks
    const _UINT16 *sourcePtr2 = (const _UINT16*)&sourcePtr4[blockIdx];
    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++)
        destPtr2[blockIdx] = sourcePtr2[blockIdx];
    if (!bytesLeft) return dest;

    // Copy byte blocks
    const _UINT8 *sourcePtr1 = (const _UINT8*)&sourcePtr2[blockIdx];
    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++)
        destPtr1[blockIdx] = sourcePtr1[blockIdx];
    return dest;
}
Full Visual Studio 2012 project is attached.
ICL normally replaces memcpy(), or equivalent for loops, with its own library functions, which attempt to choose an optimum code path at run time.
If you look at the pre-processed code or the /Qopt-report output, you should be able to see when this happens. According to your options, you have requested the IPP headers, but your code appears non-standard, as you call memcpy() without the corresponding header file. VS2012 itself probably has a good SSE2/AVX memcpy(); on my laptop it seems to perform the same as what ICL provides.
So your observation is probably no surprise, unless it surprises you.
If you wish to suppress the library function call substitution, so that ICL generates in-line code, you should be able to do so with #pragma simd or #pragma omp simd. Then, if you wish to obtain streaming stores for a smaller string move than the library function requires before it makes the switch, you can set #pragma vector nontemporal.
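As a rough sketch of the pragma approach, assuming a byte copy loop similar to the one posted: copy_inline_simd is a hypothetical name, stacking the two pragmas on one loop is an assumption, and the OpenMP pragma may need the compiler's OpenMP SIMD support enabled. Compilers that do not recognize a pragma simply ignore it, so the loop still copies correctly either way.

```c
#include <stddef.h>

/* Hypothetical name, not from the posted project.
   A byte copy loop that asks the compiler to vectorize it in-line
   instead of substituting a library memcpy call. */
void *copy_inline_simd(void *__restrict dst, const void *__restrict src, size_t n)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    size_t i;

    /* Intel-specific pragma: request non-temporal (streaming) stores.
       Assumption: it can be combined with the OpenMP pragma below. */
#pragma vector nontemporal
    /* OpenMP 4.0 pragma: vectorize the loop that follows. */
#pragma omp simd
    for (i = 0; i < n; ++i)
        d[i] = s[i];

    return dst;
}
```

Whether in-line code was actually generated is best checked in the optimization report or the assembly listing, not assumed.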
You don't say much about what you expect or what your goals are.
What header must I include after /Quse-intel-optimized-headers to make use of the ICL memcpy()? I can't find any documentation on it anywhere.
I'm guessing you may be getting the ICL intel_fast_memcpy even without #include <string.h> and intel-optimized-headers, but you have nothing to lose by correcting the usage.
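For reference, correcting the usage only means including the standard header that declares memcpy(). A minimal sketch, where copy_with_library is a hypothetical wrapper name and not something from the posted project; which implementation the call ends up resolving to (for example intel_fast_memcpy) remains the compiler's decision:

```c
#include <string.h>  /* declares memcpy() */

/* Hypothetical helper: copies n bytes through the standard memcpy().
   With the declaration visible, the compiler is free to substitute
   its own optimized routine for this call. */
void *copy_with_library(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}
```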
The Intel Optimization Reference Manual has a good example of a memory copy routine (SSE inline assembly) with software prefetches and loop unrolling. Adding prefetch instructions should improve performance because the pointer arithmetic is linear, that is, the indices are not randomized.
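To illustrate the idea, here is a minimal sketch using SSE2 intrinsics instead of inline assembly. It is not the routine from the manual: the name copy_sse2_prefetch, the four-way unrolling, and the 256-byte prefetch distance are all assumptions chosen for illustration, and the distance would need tuning on real hardware.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_storeu_si128 */

/* Hypothetical name. Copies n bytes with unaligned 16-byte SSE2 loads and
   stores, unrolled four times (64 bytes per iteration), with a software
   prefetch of the source a fixed distance ahead. */
void *copy_sse2_prefetch(void *__restrict dst, const void *__restrict src, size_t n)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;

    while (n >= 64) {
        /* Prefetch 256 bytes ahead (assumed distance), but only while that
           address is still inside the source buffer. */
        if (n >= 64 + 256)
            _mm_prefetch(s + 256, _MM_HINT_T0);

        __m128i r0 = _mm_loadu_si128((const __m128i *)(s + 0));
        __m128i r1 = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i r2 = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i r3 = _mm_loadu_si128((const __m128i *)(s + 48));

        _mm_storeu_si128((__m128i *)(d + 0), r0);
        _mm_storeu_si128((__m128i *)(d + 16), r1);
        _mm_storeu_si128((__m128i *)(d + 32), r2);
        _mm_storeu_si128((__m128i *)(d + 48), r3);

        s += 64;
        d += 64;
        n -= 64;
    }

    /* Copy the remaining 0..63 bytes one at a time. */
    for (; n > 0; --n)
        *d++ = *s++;

    return dst;
}
```

The unaligned load and store intrinsics keep the sketch correct for any pointers; an aligned variant would need a separate head loop to reach a 16-byte boundary first.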
If you invoke memcpy explicitly and don't get a link failure, it means you are using a memcpy from the compiler support library (aside from a few cases where a compiler may decide that a pair of in-line instructions does the job better). You can see from /Qopt-report, or by using dumpbin, whether intel_fast_memcpy was substituted. Again, chances are you got the same substitution even though your source code isn't technically correct.
