Analysis of 128-bit Streaming store codes vs. Non Streaming store codes

SergeyKostrov
[ Abstract ] I recently completed an analysis of some C codes that initialize a large 3-D data set with dimensions 8192 x 4 x 8192 ( X-Y-Z ). Overall, the data set has 268,435,456 elements of the Single Precision Floating Point data type. Since there are only 4 elements in the Y direction, the 128-bit Streaming store Intel intrinsic function _mm_stream_ps was used ( Test-case 2 ) instead of primitive assignments ( Test-case 1 ) in an Unrolled For-Loop with a 4-in-1 schema.

Three C++ compilers were used and their versions are as follows:

Microsoft C++ compiler: 14.00.50727.762 ( default in VS 2005 )
Intel C++ compiler: 12.1.7.371
MinGW C++ compiler: 4.9.0

I would rate all of them as legacy C++ compilers since they were released about 5 to 10 years ago. Take into account that the main purpose of the analysis was to investigate whether Streaming stores make initialization of the data set faster regardless of the C++ compiler used.
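The element count quoted above can be checked with a few lines of C. This is only a sketch of the arithmetic: the function names are hypothetical, and the byte footprint assumes a 4-byte float, which the thread does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Number of scalar elements in an X x Y x Z data set. */
uint64_t ElementCount(uint64_t x, uint64_t y, uint64_t z)
{
    return x * y * z;
}

/* Memory footprint in bytes for elements of the given size. */
uint64_t FootprintBytes(uint64_t elements, uint64_t elementSize)
{
    return elements * elementSize;
}

/* Number of 128-bit stores needed when each store writes 4 floats. */
uint64_t VectorStoreCount(uint64_t elements)
{
    assert(elements % 4 == 0);  /* the Y dimension of 4 guarantees this here */
    return elements / 4;
}
```

For 8192 x 4 x 8192 this gives 268,435,456 elements, 1,073,741,824 bytes under the 4-byte assumption, and 67,108,864 128-bit stores.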
[ Test-case 1 ]
[ C Source codes of Test-Case - without 128-bit Streaming Stores ]
...
    RTssize_t i;
    for( i = 0; i < m_iSize4; i += 4 )
    {
        m_ptData1D[i  ] = ( T )rtValue;
        m_ptData1D[i+1] = ( T )rtValue;
        m_ptData1D[i+2] = ( T )rtValue;
        m_ptData1D[i+3] = ( T )rtValue;
    }
...

[ Test-case 2 ]
[ C Source codes of Test-Case - with 128-bit Streaming Stores ]
...
    RTssize_t i;
    for( i = 0; i < m_iSize4; i += 4 )
    {
        CrtStreamPs128( ( RTfloat * )&m_ptData1D[i  ], rtValue );
        CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+1], rtValue );
        CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+2], rtValue );
        CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+3], rtValue );
    }
...

Note 1: rtValue is declared as a variable of __m128 type, that is, it has 4 members of type float ( Single Precision Floating Point ).
Note 2: CrtStreamPs128 function is a portable wrapper around Intel _mm_stream_ps intrinsic function.
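The thread does not show the body of CrtStreamPs128 or the declarations around the loop. The sketch below is one way such a wrapper and the 4-in-1 loop could look in plain C. The wrapper body, the FillStreaming function, and the trailing _mm_sfence call are all assumptions made for illustration, not the author's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_stream_ps, _mm_sfence */

/* Hypothetical stand-in for the thread's CrtStreamPs128 wrapper:
   writes four floats to a 16-byte-aligned address with a non-temporal store. */
void CrtStreamPs128(float *dst, __m128 value)
{
    assert(((uintptr_t)dst & 15u) == 0);  /* _mm_stream_ps requires 16-byte alignment */
    _mm_stream_ps(dst, value);
}

/* Fill `count` __m128 elements (count must be a multiple of 4)
   using the same 4-in-1 unrolled schema as Test-case 2. */
void FillStreaming(__m128 *data, ptrdiff_t count, __m128 value)
{
    ptrdiff_t i;
    assert(count % 4 == 0);
    for (i = 0; i < count; i += 4) {
        CrtStreamPs128((float *)&data[i    ], value);
        CrtStreamPs128((float *)&data[i + 1], value);
        CrtStreamPs128((float *)&data[i + 2], value);
        CrtStreamPs128((float *)&data[i + 3], value);
    }
    _mm_sfence();  /* order the weakly ordered streaming stores before later stores */
}
```

A call such as FillStreaming( buf, 8, _mm_set1_ps( 1.5f ) ) on an array of 8 __m128 values writes 32 floats. Arrays of __m128 are 16-byte aligned by their type, which satisfies the alignment requirement of both movaps and movntps.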
[ MinGW C++ compiler - Generated almost perfect assembler codes ]
I also looked at the assembler codes generated by these C++ compilers and I was very impressed by how the MinGW C++ compiler generated almost perfect codes. It used the same schema for both cases, without and with Streaming stores, and they differ only in which store instruction was used:

- In case of codes without Streaming stores the movaps instruction was used
...
00403520  movaps xmmword ptr [eax], xmm5
00403523  add eax, 40h
00403526  movaps xmmword ptr [eax-30h], xmm5
0040352A  movaps xmmword ptr [eax-20h], xmm5
0040352E  movaps xmmword ptr [eax-10h], xmm5
00403532  cmp eax, ecx
00403534  jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

- In case of codes with Streaming stores the movntps instruction was used
...
00403520  movntps xmmword ptr [eax], xmm5
00403523  add eax, 40h
00403526  movntps xmmword ptr [eax-30h], xmm5
0040352A  movntps xmmword ptr [eax-20h], xmm5
0040352E  movntps xmmword ptr [eax-10h], xmm5
00403532  cmp eax, ecx
00403534  jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

As you can see, apart from the store instruction itself, the assembler codes for the main processing of the C For-Loop are identical!
[ Test-case 1 - without 128-bit Streaming Stores ]
[ C Source codes of Test-Case - without 128-bit Streaming Stores ]
...
    RTssize_t i;
    for( i = 0; i < m_iSize4; i += 4 )
    {
        m_ptData1D[i  ] = ( T )rtValue;
        m_ptData1D[i+1] = ( T )rtValue;
        m_ptData1D[i+2] = ( T )rtValue;
        m_ptData1D[i+3] = ( T )rtValue;
    }
...

Note 1: rtValue is declared as a variable of __m128 type, that is, it has 4 members of type float ( Single Precision Floating Point ).
[ Microsoft C++ compiler - without 128-bit Streaming Stores ]
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MSC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 23.625 secs
> Test0001 End <
Tests: Completed
...
[ Intel C++ compiler - without 128-bit Streaming Stores ]
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_ICC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 26.216 secs
> Test0001 End <
Tests: Completed
...
[ MinGW C++ compiler - without 128-bit Streaming Stores ]
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MGW ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 21.735 secs
> Test0001 End <
Tests: Completed
...
[ Microsoft C++ compiler assembler codes - without 128-bit Streaming Stores ]
...
00243690  mov edx, dword ptr [esi+80h]
00243696  movaps xmmword ptr [edx+eax], xmm0
0024369A  mov edx, dword ptr [esi+80h]
002436A0  movaps xmmword ptr [eax+edx+10h], xmm0
002436A5  mov edx, dword ptr [esi+80h]
002436AB  movaps xmmword ptr [eax+edx+20h], xmm0
002436B0  mov edx, dword ptr [esi+80h]
002436B6  movaps xmmword ptr [edx+eax+30h], xmm0
002436BB  add ecx, 4
002436BE  add eax, 40h
002436C1  cmp ecx, dword ptr [esi+0D0h]
002436C7  jl CDataSet::RunTest+310h (243690h)
...
[ Intel C++ compiler assembler codes - without 128-bit Streaming Stores ]
...
0040143D  movaps xmm0, xmmword ptr [ebp-358h]
00401444  inc edx
00401445  movaps xmmword ptr [ecx+esi], xmm0
00401449  movaps xmmword ptr [ecx+esi+10h], xmm0
0040144E  movaps xmmword ptr [ecx+esi+20h], xmm0
00401453  movaps xmmword ptr [ecx+esi+30h], xmm0
00401458  add ecx, 40h
0040145B  cmp edx, eax
0040145D  jb CDataSet::RunTest+28Dh (40143Dh)
...
[ MinGW C++ compiler assembler codes - without 128-bit Streaming Stores ]
...
00403520  movaps xmmword ptr [eax], xmm5
00403523  add eax, 40h
00403526  movaps xmmword ptr [eax-30h], xmm5
0040352A  movaps xmmword ptr [eax-20h], xmm5
0040352E  movaps xmmword ptr [eax-10h], xmm5
00403532  cmp eax, ecx
00403534  jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...
[ Test-case 2 - with 128-bit Streaming Stores ]
[ C Source codes of Test-Case - with 128-bit Streaming Stores ]
...
    RTssize_t i;
    for( i = 0; i < m_iSize4; i += 4 )
    {
        CrtStreamPs128( ( RTfloat * )&m_ptData1D[i  ], rtValue );
        CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+1], rtValue );
        CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+2], rtValue );
        CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+3], rtValue );
    }
...

Note 1: rtValue is declared as a variable of __m128 type, that is, it has 4 members of type float ( Single Precision Floating Point ).
Note 2: CrtStreamPs128 function is a portable wrapper around Intel _mm_stream_ps intrinsic function.
[ Microsoft C++ compiler - with 128-bit Streaming Stores ]
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MSC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 23.203 secs
> Test0001 End <
Tests: Completed
...
[ Intel C++ compiler - with 128-bit Streaming Stores ]
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_ICC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 25.766 secs
> Test0001 End <
Tests: Completed
...
[ MinGW C++ compiler - with 128-bit Streaming Stores ]
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MGW ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 21.516 secs
> Test0001 End <
Tests: Completed
...
[ Microsoft C++ compiler assembler codes - with 128-bit Streaming Stores ]
...
00243690  mov ecx, dword ptr [esi+80h]
00243696  movntps xmmword ptr [ecx+eax], xmm0
0024369A  add ecx, eax
0024369C  mov ecx, dword ptr [esi+80h]
002436A2  movntps xmmword ptr [eax+ecx+10h], xmm0
002436A7  mov ebx, dword ptr [esi+80h]
002436AD  lea ecx, [eax+30h]
002436B0  movntps xmmword ptr [ecx+ebx-10h], xmm0
002436B5  mov ebx, dword ptr [esi+80h]
002436BB  add ebx, ecx
002436BD  add edx, 4
002436C0  movntps xmmword ptr [ebx], xmm0
002436C3  add eax, 40h
002436C6  cmp edx, dword ptr [esi+0D0h]
002436CC  jl CDataSet::RunTest+310h (243690h)
...
[ Intel C++ compiler assembler codes - with 128-bit Streaming Stores ]
...
00401BBD  mov ecx, edx
00401BBF  add edx, 4
00401BC2  shl ecx, 4
00401BC5  movaps xmm0, xmmword ptr [ebp-358h]
00401BCC  cmp edx, eax
00401BCE  movntps xmmword ptr [ecx+esi], xmm0
00401BD2  movntps xmmword ptr [ecx+esi+10h], xmm0
00401BD7  movntps xmmword ptr [ecx+esi+20h], xmm0
00401BDC  movntps xmmword ptr [ecx+esi+30h], xmm0
00401BE1  jl CDataSet::RunTest+26Dh (401BBDh)
...
[ MinGW C++ compiler assembler codes - with 128-bit Streaming Stores ]
...
00403520  movntps xmmword ptr [eax], xmm5
00403523  add eax, 40h
00403526  movntps xmmword ptr [eax-30h], xmm5
0040352A  movntps xmmword ptr [eax-20h], xmm5
0040352E  movntps xmmword ptr [eax-10h], xmm5
00403532  cmp eax, ecx
00403534  jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

Note: By the way, all three C++ compilers use an interleave technique ( some call it alternating operations ) when generating binary codes, to get the best from CPU pipelining.
[ Summary of Performance evaluation 128-bit Streaming store codes - 1 ]

1. Codes generated by MinGW C++ compiler with 128-bit Streaming stores were faster by 7.3% than codes generated by Microsoft C++ compiler.

2. Codes generated by MinGW C++ compiler with 128-bit Streaming stores were faster by 16.5% than codes generated by Intel C++ compiler.

3. Without 128-bit Streaming Stores
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Test Completed in 23.625 secs
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Test Completed in 26.216 secs
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Test Completed in 21.735 secs
...

4. With 128-bit Streaming Stores
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Test Completed in 23.203 secs
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Test Completed in 25.766 secs
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Test Completed in 21.516 secs
...
[ Summary of Performance evaluation 128-bit Streaming store codes - 2 ]
Or in another form:

Microsoft C++ compiler: 23.625 secs ( without Streaming store ) vs. 23.203 secs ( with Streaming store )
Summary: With Streaming store initialization of the data set is ~1.8% faster.

Intel C++ compiler: 26.216 secs ( without Streaming store ) vs. 25.766 secs ( with Streaming store )
Summary: With Streaming store initialization of the data set is ~1.7% faster.

MinGW C++ compiler: 21.735 secs ( without Streaming store ) vs. 21.516 secs ( with Streaming store )
Summary: With Streaming store initialization of the data set is ~1.0% faster.
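The percentages in both summaries are consistent with a relative difference taken against the slower time. The thread does not state the formula, so the helper below is an assumption that happens to reproduce the quoted figures; PercentFaster is a hypothetical name.

```c
#include <assert.h>

/* Percentage by which `candidate` (seconds) is faster than `baseline` (seconds),
   measured relative to the baseline time. */
double PercentFaster(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return (baseline - candidate) / baseline * 100.0;
}
```

For example, PercentFaster( 23.625, 23.203 ) is about 1.79 and PercentFaster( 25.766, 21.516 ) is about 16.49, matching the ~1.8% and 16.5% figures after rounding.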
[ Conclusion ] To my surprise, the performance improvement from Streaming stores was insignificant, in a range from ~1.0% to ~1.8%. Additional investigation is needed on how initialization of the data set could be improved by 5%, or 7%, or even more, and I think OpenMP threading needs to be considered first of all.
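The thread stops before any OpenMP version is shown. As a rough sketch only, the streaming-store loop could be split across threads as below. FillStreamingParallel is a hypothetical name, the static schedule is an assumption, and whether threading helps at all depends on how close a single thread already is to the memory bandwidth limit, which was not measured here.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_stream_ps, _mm_sfence */

/* Fill `count` __m128 elements with `value` using streaming stores.
   With OpenMP enabled (for example -fopenmp or /openmp) the iterations are
   divided among threads; without it the pragmas are ignored and the loop
   runs serially with the same result. */
void FillStreamingParallel(__m128 *data, long count, __m128 value)
{
    assert(count >= 0);
    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait
        for (long i = 0; i < count; ++i) {
            _mm_stream_ps((float *)&data[i], value);  /* compiles to movntps */
        }
        /* Each thread fences its own streaming stores before the
           implicit barrier at the end of the parallel region. */
        _mm_sfence();
    }
}
```

The fence sits inside the parallel region because _mm_sfence only orders stores issued by the thread that executes it.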