gast128
Beginner
252 Views

performance of memset vs ippsSet_8u

Hello all,

I ran a performance test with IPP 6.1.5 comparing memset (VS2010) with ippsSet_8u. Filling 1 MB of bytes gives the following results:

  • memset is faster than ipp when the value to be set is not 0 (4.5s vs 5.6s)
  • memset has equal performance when the value to be set is 0 (5.6s vs 5.6s)

From the Microsoft CRT source I can see that memset uses SSE instructions only when the value to be set is 0. But does the above mean that SSE is actually slower on this type of CPU?

PC: Dell workstation T3500, 6 GB, x64 (application is 32-bit); CPU: 4-core Xeon W3530 @ 2.8 GHz. IPP dispatches to ippsp8-6.1.dll.

10 Replies
SergeyKostrov
Valued Contributor II

Quoting gast128
Hello all,

I did a performance test with IPP 6.1.5 between memset (vs2010) and ippsSet_8u. I get the following results with filling 1MB of bytes:
  • memset is faster than ipp when the value to be set is not 0 (4.5s vs 5.6s)
  • memset has equal performance when the value to be set is 0 (5.6s vs 5.6s)

From the Microsoft CRT source I can see that memset uses SSE instructions only when the value to be set is 0. But does the above mean that SSE is actually slower on this type of CPU?


I think not. Take into account that in a Debug configuration the 'memset' CRT function could be slower than in a Release configuration.

Not related to the subject... Some time ago I tested 'memcpy' and an SSE-based 'memcpysse' function.
For memory blocks greater than 64KB, 'memcpysse' outperforms 'memcpy'.
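A sketch of why an SSE-based copy can beat the CRT on large blocks: non-temporal (streaming) stores write around the cache, avoiding the read-for-ownership of destination lines. This is an illustrative example, not the 'memcpysse' mentioned above; it assumes x86 with SSE2, 16-byte-aligned pointers, and a size that is a multiple of 16.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstring>

// Copy 'n' bytes using non-temporal 16-byte stores.
// Assumes dst/src are 16-byte aligned and n is a multiple of 16;
// a real implementation would also handle the unaligned head and tail.
void memcpy_sse_stream(void* dst, const void* src, size_t n)
{
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (size_t i = 0; i < n / 16; ++i) {
        __m128i v = _mm_load_si128(s + i);  // aligned 16-byte load
        _mm_stream_si128(d + i, v);         // store that bypasses the cache
    }
    _mm_sfence();  // make the streaming stores globally visible
}
```

For small blocks that fit in cache, ordinary cached stores tend to win, which is consistent with the buffer-size-dependent results reported in this thread.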

gast128
Beginner

Yes, but read my post carefully: memset is always faster. Also, the Microsoft memset is fully implemented in assembly (no difference between Debug and Release). Besides that, the test was performed in Release mode.

So the question still stands: does SSE make things slower on current-day processors?

Chao_Y_Intel
Employee

Hello,

The test looks like it needs several minutes to complete. Did you call the memory-set function repeatedly and then check the performance? Since the data is only 1 MB, it will sit in the cache after the first call, so most likely the test is measuring the performance of writing data that is already in the cache. You could test with a larger buffer (e.g. > 20 MB, larger than the last-level cache) and see what the performance is.

Thanks,
Chao

gast128
Beginner

I looped 100,000 times over the same memory for both calls, so the cache would indeed be primed. That is probably not a good test, since in 'normal' code the call would mostly operate on non-cached data, I suppose.

If I re-run the test with 5000 loops and 20 MB buffer to be filled (which is larger than the L2/L3 cache), I get the following results:

  • memset is slower than ipp with value !=0 (9.3s vs 8.2s)
  • memset is much slower than ipp with value ==0 (13.0s vs 8.3s)

Strange that the Microsoft implementation is so much slower despite its SSE code path. I cannot explain why IPP is slower with smaller buffers.

SergeyKostrov
Valued Contributor II

Quoting gast128
...Strange that m$ impl. is so much slower with its SSE implementation. I cannot explain why IPP is slower
with smaller buffers
...


I think you need VTune to analyze this case.

gast128
Beginner

VTune is not free and requires fairly deep knowledge to operate and to interpret. I suppose I have to learn SSE assembly after all :(

SergeyKostrov
Valued Contributor II

I've completed several tests, and the 'ippsSet_8u' IPP function is always ~3x faster than the 'memset' CRT function.
Here are some results:

Data size: 1MB
REAL TIME
[ Set 0 - ( ippsSet_8u / CrtMemset ) Ratio ] = 0.346832
[ Set 1 - ( ippsSet_8u / CrtMemset ) Ratio ] = 0.357576

Data size: 4MB
REAL TIME
[ Set 0 - ( ippsSet_8u / CrtMemset ) Ratio ] = 0.346939
[ Set 1 - ( ippsSet_8u / CrtMemset ) Ratio ] = 0.332440

Data size: 8MB
REAL TIME
[ Set 0 - ( ippsSet_8u / CrtMemset ) Ratio ] = 0.359064
[ Set 1 - ( ippsSet_8u / CrtMemset ) Ratio ] = 0.343120

Data size: 16MB
REAL TIME
[ Set 0 - ( ippsSet_8u / CrtMemset ) Ratio ] = 0.351576
[ Set 1 - ( ippsSet_8u / CrtMemset ) Ratio ] = 0.336900

SergeyKostrov
Valued Contributor II

Here is my Test-Case:

[cpp]
RTint iDataSize;

iDataSize = 1024 * 1024 * 1;      // 1MB
// iDataSize = 1024 * 1024 * 4;   // 4MB
// iDataSize = 1024 * 1024 * 8;   // 8MB
// iDataSize = 1024 * 1024 * 16;  // 16MB

CrtPrintf( RTU("Data size: %2ldMB\n"), ( iDataSize / 1024 / 1024 ) );

Ipp8u   *puData  = RTnull;
RTubyte *pubData = RTnull;
// _RTALIGN08 Ipp8u   *puData  = RTnull;
// _RTALIGN08 RTubyte *pubData = RTnull;
// _RTALIGN16 Ipp8u   *puData  = RTnull;
// _RTALIGN16 RTubyte *pubData = RTnull;
// _RTALIGN32 Ipp8u   *puData  = RTnull;
// _RTALIGN32 RTubyte *pubData = RTnull;

RTuint uiTicksDelta1 = 0;
RTuint uiTicksDelta2 = 0;
RTint  t;

while( RTtrue )
{
    puData = ( Ipp8u * )::ippsMalloc_8u( iDataSize * sizeof( Ipp8u ) );
    if( puData == RTnull )
        break;

    pubData = ( RTubyte * )CrtMalloc( iDataSize * sizeof( RTubyte ) );
    if( pubData == RTnull )   // note: the post as written re-checked puData here
        break;

    CrtPrintf( RTU("REAL TIME\n") );

    ::SetPriorityClass( ::GetCurrentProcess(), REALTIME_PRIORITY_CLASS );

    // Test-Case 1: fill with 0x0
    {
        g_uiTicksStart = SysGetTickCount();
        for( t = 0; t < NUMBER_OF_TESTS_0000001024; t++ )
        {
            ::ippsSet_8u( 0x0, puData, iDataSize );
        }
        uiTicksDelta1 = ( SysGetTickCount() - g_uiTicksStart );
        // CrtPrintf( RTU("[ ippsSet_8u - Set 0 ] Executed in: %5ld ticks\n"), ( RTint )uiTicksDelta1 );

        g_uiTicksStart = SysGetTickCount();
        for( t = 0; t < NUMBER_OF_TESTS_0000001024; t++ )
        {
            CrtMemset( pubData, 0x0, iDataSize );
        }
        uiTicksDelta2 = ( SysGetTickCount() - g_uiTicksStart );
        // CrtPrintf( RTU("[ CrtMemset - Set 0 ] Executed in: %5ld ticks\n"), ( RTint )uiTicksDelta2 );

        CrtPrintf( RTU("[ Set 0 - ( ippsSet_8u / CrtMemset ) Ratio ] = %f\n"),
                   ( ( RTfloat )uiTicksDelta1 / ( RTfloat )uiTicksDelta2 ) );
    }

    // Test-Case 2: fill with 0x1
    {
        g_uiTicksStart = SysGetTickCount();
        for( t = 0; t < NUMBER_OF_TESTS_0000001024; t++ )
        {
            ::ippsSet_8u( 0x1, puData, iDataSize );
        }
        uiTicksDelta1 = ( SysGetTickCount() - g_uiTicksStart );
        // CrtPrintf( RTU("[ ippsSet_8u - Set 1 ] Executed in: %5ld ticks\n"), ( RTint )uiTicksDelta1 );

        g_uiTicksStart = SysGetTickCount();
        for( t = 0; t < NUMBER_OF_TESTS_0000001024; t++ )
        {
            CrtMemset( pubData, 0x1, iDataSize );
        }
        uiTicksDelta2 = ( SysGetTickCount() - g_uiTicksStart );
        // CrtPrintf( RTU("[ CrtMemset - Set 1 ] Executed in: %5ld ticks\n"), ( RTint )uiTicksDelta2 );

        CrtPrintf( RTU("[ Set 1 - ( ippsSet_8u / CrtMemset ) Ratio ] = %f\n"),
                   ( ( RTfloat )uiTicksDelta1 / ( RTfloat )uiTicksDelta2 ) );
    }

    ::SetPriorityClass( ::GetCurrentProcess(), NORMAL_PRIORITY_CLASS );

    break;
}

if( puData != RTnull )
{
    ::ippsFree( puData );
    puData = RTnull;
}
if( pubData != RTnull )
{
    CrtFree( pubData );
    pubData = RTnull;
}
[/cpp]
gast128
Beginner

I did the test on 2 DELL pc's with the following cpu's:

  • Xeon w3530 (cpu from 2010)
  • Pentium 4 ht (cpu from 2002)

I get the following results:

Time in seconds:

        Xeon    Pentium
  crt   9.2     84
  ipp   8.2     24

It also makes a small difference whether you use IPP 5.3.4 or 6.1.5: with IPP 5.3.4, ippsSet_8u is slower on the Xeon processor (9 vs 12 seconds). I only see your measured factor of 3 on the Pentium CPU; on the Xeon, CRT memset is comparable with ippsSet_8u. Maybe you can post your CPU as well?

Chao_Y_Intel
Employee

Hello,

Thanks for the code and results. From the table, it looks like IPP is still faster. Also, IPP 5.3.4 is quite an old release and may not run the best-optimized code on newer processors.

Thanks,
Chao
