Hello all,
I did a performance test with IPP 6.1.5 comparing memset (VS2010) and ippsSet_8u. Filling 1 MB of bytes gives the following results:
- memset is faster than IPP when the value to be set is non-zero (4.5 s vs 5.6 s)
- memset has equal performance when the value to be set is 0 (5.6 s vs 5.6 s)
From Microsoft's CRT source I can see that memset uses SSE instructions only when the value to be set is 0. But does the above mean that SSE is actually slower on this type of CPU?
PC: Dell workstation T3500, 6 GB RAM, x64 (application is 32-bit); CPU: 4-core Xeon W3530 @ 2.8 GHz. IPP dispatches to ippsp8-6.1.dll.
---
> I did a performance test with IPP 6.1.5 comparing memset (VS2010) and ippsSet_8u. Filling 1 MB of bytes gives the following results:
> - memset is faster than IPP when the value to be set is non-zero (4.5 s vs 5.6 s)
> - memset has equal performance when the value to be set is 0 (5.6 s vs 5.6 s)
> From Microsoft's CRT source I can see that memset uses SSE instructions only when the value to be set is 0. But does the above mean that SSE is actually slower on this type of CPU?
I think not. Take into account that in a Debug configuration the 'memset' CRT function can be slower than in a Release configuration.
Not related to the subject, but some time ago I tested 'memcpy' against an SSE-based 'memcpysse' function.
For memory blocks larger than 64 KB, 'memcpysse' outperforms 'memcpy'.
---
Yes, but read my post carefully: memset is always faster. Also, Microsoft's memset is fully implemented in assembly, so there is no difference between Debug and Release. Besides that, the test was performed in Release mode.
So the question still stands: does SSE make things slower on current-day processors?
---
Hello,
The test looks like it takes several minutes to complete. Did you call the memory-set function repeatedly and measure the performance that way? Since the data is only 1 MB, it will be written into the cache after the first call, so the test is most likely measuring the performance of writing data that is already in the cache. You could try the test with a larger buffer (e.g. > 20 MB, larger than the last-level cache) and see what the performance is then.
Thanks,
Chao
---
I looped 100000 times over the same memory for both calls, so the cache would indeed be filled. That is probably not a good test, since in 'normal' code the call would mostly operate on non-cached data, I suppose.
If I re-run the test with 5000 loops and a 20 MB buffer (which is larger than the L2/L3 cache), I get the following results:
- memset is slower than IPP when the value != 0 (9.3 s vs 8.2 s)
- memset is much slower than IPP when the value == 0 (13.0 s vs 8.3 s)
Strange that the Microsoft implementation is so much slower despite its SSE implementation. I cannot explain why IPP is slower with smaller buffers.
---
> ...why IPP is slower with smaller buffers...

I think you need VTune to analyze this case.
---
VTune is not free and requires fairly deep knowledge to operate and to interpret. I suppose I have to learn SSE assembly after all :(
---
I've completed several tests, and the 'ippsSet_8u' IPP function is consistently ~3x faster than the 'memset' CRT function.
Here are some results:
Ratio ( ippsSet_8u / CrtMemset ), real-time priority; lower means IPP is faster:

Data size   Set 0      Set 1
1 MB        0.346832   0.357576
4 MB        0.346939   0.332440
8 MB        0.359064   0.343120
16 MB       0.351576   0.336900
---
Here is my Test-Case:
[cpp]
RTint iDataSize;

iDataSize = 1024 * 1024 * 1;      // 1MB
// iDataSize = 1024 * 1024 * 4;   // 4MB
// iDataSize = 1024 * 1024 * 8;   // 8MB
// iDataSize = 1024 * 1024 * 16;  // 16MB

CrtPrintf( RTU("Data size: %2ldMB\n"), ( iDataSize / 1024 / 1024 ) );

Ipp8u *puData = RTnull;
RTubyte *pubData = RTnull;
// _RTALIGN08 Ipp8u *puData = RTnull;
// _RTALIGN08 RTubyte *pubData = RTnull;
// _RTALIGN16 Ipp8u *puData = RTnull;
// _RTALIGN16 RTubyte *pubData = RTnull;
// _RTALIGN32 Ipp8u *puData = RTnull;
// _RTALIGN32 RTubyte *pubData = RTnull;

RTuint uiTicksDelta1 = 0;
RTuint uiTicksDelta2 = 0;
RTint t;

while( RTtrue )
{
    puData = ( Ipp8u * )::ippsMalloc_8u( iDataSize * sizeof( Ipp8u ) );
    if( puData == RTnull )
        break;
    pubData = ( RTubyte * )CrtMalloc( iDataSize * sizeof( RTubyte ) );
    if( pubData == RTnull )   // fixed: originally re-checked puData here
        break;

    CrtPrintf( RTU("REAL TIME\n") );
    ::SetPriorityClass( ::GetCurrentProcess(), REALTIME_PRIORITY_CLASS );

    // Test-Case 1
    {
        g_uiTicksStart = SysGetTickCount();
        for( t = 0; t < NUMBER_OF_TESTS_0000001024; t++ )
        {
            ::ippsSet_8u( 0x0, puData, iDataSize );
        }
        uiTicksDelta1 = ( SysGetTickCount() - g_uiTicksStart );
        // CrtPrintf( RTU("[ ippsSet_8u - Set 0 ] Executed in: %5ld ticks\n"), ( RTint )uiTicksDelta1 );

        g_uiTicksStart = SysGetTickCount();
        for( t = 0; t < NUMBER_OF_TESTS_0000001024; t++ )
        {
            CrtMemset( pubData, 0x0, iDataSize );
        }
        uiTicksDelta2 = ( SysGetTickCount() - g_uiTicksStart );
        // CrtPrintf( RTU("[ CrtMemset - Set 0 ] Executed in: %5ld ticks\n"), ( RTint )uiTicksDelta2 );

        CrtPrintf( RTU("[ Set 0 - ( ippsSet_8u / CrtMemset ) Ratio ] = %f\n"),
                   ( ( RTfloat )uiTicksDelta1 / ( RTfloat )uiTicksDelta2 ) );
    }

    // Test-Case 2
    {
        g_uiTicksStart = SysGetTickCount();
        for( t = 0; t < NUMBER_OF_TESTS_0000001024; t++ )
        {
            ::ippsSet_8u( 0x1, puData, iDataSize );
        }
        uiTicksDelta1 = ( SysGetTickCount() - g_uiTicksStart );
        // CrtPrintf( RTU("[ ippsSet_8u - Set 1 ] Executed in: %5ld ticks\n"), ( RTint )uiTicksDelta1 );

        g_uiTicksStart = SysGetTickCount();
        for( t = 0; t < NUMBER_OF_TESTS_0000001024; t++ )
        {
            CrtMemset( pubData, 0x1, iDataSize );
        }
        uiTicksDelta2 = ( SysGetTickCount() - g_uiTicksStart );
        // CrtPrintf( RTU("[ CrtMemset - Set 1 ] Executed in: %5ld ticks\n"), ( RTint )uiTicksDelta2 );

        CrtPrintf( RTU("[ Set 1 - ( ippsSet_8u / CrtMemset ) Ratio ] = %f\n"),
                   ( ( RTfloat )uiTicksDelta1 / ( RTfloat )uiTicksDelta2 ) );
    }

    ::SetPriorityClass( ::GetCurrentProcess(), NORMAL_PRIORITY_CLASS );
    break;
}

if( puData != RTnull )
{
    ::ippsFree( puData );
    puData = RTnull;
}
if( pubData != RTnull )
{
    CrtFree( pubData );
    pubData = RTnull;
}
[/cpp]
---
I did the test on 2 DELL pc's with the following cpu's:
- Xeon w3530 (cpu from 2010)
- Pentium 4 ht (cpu from 2002)
I get the following results:
Time in seconds (20 MB fill test):

       Xeon   Pentium
crt    9.2    84
ipp    8.2    24
It also makes a small difference whether you use IPP 5.3.4 or 6.1.5: with IPP 5.3.4, ippsSet_8u is slower on the Xeon processor (9 vs 12 seconds). The factor of 3 you measured I only get on the Pentium CPU. On the Xeon, CRT memset is comparable with ippsSet_8u. Maybe you can post your CPU as well?
---
Hello,
Thanks for the code and results: from the table, it looks like IPP is still faster. Also, IPP 5.3.4 is a rather old release and may not dispatch to the best-optimized code on newer processors.
Thanks,
Chao