
Analysis of 128-bit Streaming store codes vs. Non Streaming store codes

SergeyKostrov
Valued Contributor II
5,162 Views
*** Analysis of 128-bit Streaming store codes vs. Non Streaming store codes ***
41 Replies
SergeyKostrov
Valued Contributor II
3,309 Views
[ Abstract ]
I recently completed an analysis of some C codes that initialize a large 3-D data set with dimensions 8192 x 4 x 8192 ( X-Y-Z ). Overall, the data set has 268,435,456 elements of a Single Precision Floating Point data type. Since there are only 4 elements in the Y direction, the 128-bit Streaming store Intel intrinsic function _mm_stream_ps was used ( Test-case 2 ) instead of primitive assignments ( Test-case 1 ) in an Unrolled For-Loop with a 4-in-1 schema.

Three C++ compilers were used and their versions are as follows:
Microsoft C++ compiler: 14.00.50727.762 ( default in VS 2005 )
Intel C++ compiler: 12.1.7.371
MinGW C++ compiler: 4.9.0

I would rate all of them as legacy C++ compilers since they were released about 5 to 10 years ago. Take into account that the main purpose of the analysis was to investigate whether Streaming stores make initialization of the data set faster regardless of the C++ compiler used.
SergeyKostrov
Valued Contributor II
3,309 Views
[ Test-case 1 ]
[ C Source codes of Test-Case - without 128-bit Streaming Stores ]
...
RTssize_t i;
for( i = 0; i < m_iSize4; i += 4 )
{
    m_ptData1D[i  ] = ( T )rtValue;
    m_ptData1D[i+1] = ( T )rtValue;
    m_ptData1D[i+2] = ( T )rtValue;
    m_ptData1D[i+3] = ( T )rtValue;
}
...

[ Test-case 2 ]
[ C Source codes of Test-Case - with 128-bit Streaming Stores ]
...
RTssize_t i;
for( i = 0; i < m_iSize4; i += 4 )
{
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i  ], rtValue );
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+1], rtValue );
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+2], rtValue );
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+3], rtValue );
}
...

Note 1: rtValue is declared as a variable of __m128 type, that is, it has 4 members of type float ( Single Precision Floating Point ).
Note 2: CrtStreamPs128 function is a portable wrapper around Intel _mm_stream_ps intrinsic function.
SergeyKostrov
Valued Contributor II
3,309 Views
[ MinGW C++ compiler - Generated almost perfect assembler codes ]
I also looked at assembler codes generated by these C++ compilers and I was very impressed by how the MinGW C++ compiler generated almost perfect codes. It used the same schema for both cases, without Streaming stores and with Streaming stores, and they differ only in which store instruction was used:

- In case of codes without Streaming stores movaps instruction was used
...
00403520  movaps xmmword ptr [eax], xmm5
00403523  add eax, 40h
00403526  movaps xmmword ptr [eax-30h], xmm5
0040352A  movaps xmmword ptr [eax-20h], xmm5
0040352E  movaps xmmword ptr [eax-10h], xmm5
00403532  cmp eax, ecx
00403534  jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

- In case of codes with Streaming stores movntps instruction was used
...
00403520  movntps xmmword ptr [eax], xmm5
00403523  add eax, 40h
00403526  movntps xmmword ptr [eax-30h], xmm5
0040352A  movntps xmmword ptr [eax-20h], xmm5
0040352E  movntps xmmword ptr [eax-10h], xmm5
00403532  cmp eax, ecx
00403534  jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

As you can see, apart from the store instruction, assembler codes for the main processing of a C For-Loop are identical!
SergeyKostrov
Valued Contributor II
3,309 Views
[ Test-case 1 - without 128-bit Streaming Stores ]
[ C Source codes of Test-Case - without 128-bit Streaming Stores ]
...
RTssize_t i;
for( i = 0; i < m_iSize4; i += 4 )
{
    m_ptData1D[i  ] = ( T )rtValue;
    m_ptData1D[i+1] = ( T )rtValue;
    m_ptData1D[i+2] = ( T )rtValue;
    m_ptData1D[i+3] = ( T )rtValue;
}
...

Note 1: rtValue is declared as a variable of __m128 type, that is, it has 4 members of type float ( Single Precision Floating Point ).
SergeyKostrov
Valued Contributor II
3,309 Views
[ Microsoft C++ compiler - without 128-bit Streaming Stores ]
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MSC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 23.625 secs
> Test0001 End <
Tests: Completed
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ Intel C++ compiler - without 128-bit Streaming Stores ]
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_ICC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 26.216 secs
> Test0001 End <
Tests: Completed
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ MinGW C++ compiler - without 128-bit Streaming Stores ]
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MGW ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 21.735 secs
> Test0001 End <
Tests: Completed
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ Microsoft C++ compiler assembler codes - without 128-bit Streaming Stores ]
...
00243690  mov edx, dword ptr [esi+80h]
00243696  movaps xmmword ptr [edx+eax], xmm0
0024369A  mov edx, dword ptr [esi+80h]
002436A0  movaps xmmword ptr [eax+edx+10h], xmm0
002436A5  mov edx, dword ptr [esi+80h]
002436AB  movaps xmmword ptr [eax+edx+20h], xmm0
002436B0  mov edx, dword ptr [esi+80h]
002436B6  movaps xmmword ptr [edx+eax+30h], xmm0
002436BB  add ecx, 4
002436BE  add eax, 40h
002436C1  cmp ecx, dword ptr [esi+0D0h]
002436C7  jl CDataSet::RunTest+310h (243690h)
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ Intel C++ compiler assembler codes - without 128-bit Streaming Stores ]
...
0040143D  movaps xmm0, xmmword ptr [ebp-358h]
00401444  inc edx
00401445  movaps xmmword ptr [ecx+esi], xmm0
00401449  movaps xmmword ptr [ecx+esi+10h], xmm0
0040144E  movaps xmmword ptr [ecx+esi+20h], xmm0
00401453  movaps xmmword ptr [ecx+esi+30h], xmm0
00401458  add ecx, 40h
0040145B  cmp edx, eax
0040145D  jb CDataSet::RunTest+28Dh (40143Dh)
...
SergeyKostrov
Valued Contributor II
3,308 Views
[ MinGW C++ compiler assembler codes - without 128-bit Streaming Stores ]
...
00403520  movaps xmmword ptr [eax], xmm5
00403523  add eax, 40h
00403526  movaps xmmword ptr [eax-30h], xmm5
0040352A  movaps xmmword ptr [eax-20h], xmm5
0040352E  movaps xmmword ptr [eax-10h], xmm5
00403532  cmp eax, ecx
00403534  jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ Test-case 2 - with 128-bit Streaming Stores ]
[ C Source codes of Test-Case - with 128-bit Streaming Stores ]
...
RTssize_t i;
for( i = 0; i < m_iSize4; i += 4 )
{
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i  ], rtValue );
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+1], rtValue );
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+2], rtValue );
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+3], rtValue );
}
...

Note 1: rtValue is declared as a variable of __m128 type, that is, it has 4 members of type float ( Single Precision Floating Point ).
Note 2: CrtStreamPs128 function is a portable wrapper around Intel _mm_stream_ps intrinsic function.
SergeyKostrov
Valued Contributor II
3,308 Views
[ Microsoft C++ compiler - with 128-bit Streaming Stores ]
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MSC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 23.203 secs
> Test0001 End <
Tests: Completed
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ Intel C++ compiler - with 128-bit Streaming Stores ]
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_ICC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 25.766 secs
> Test0001 End <
Tests: Completed
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ MinGW C++ compiler - with 128-bit Streaming Stores ]
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MGW ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 21.516 secs
> Test0001 End <
Tests: Completed
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ Microsoft C++ compiler assembler codes - with 128-bit Streaming Stores ]
...
00243690  mov ecx, dword ptr [esi+80h]
00243696  movntps xmmword ptr [ecx+eax], xmm0
0024369A  add ecx, eax
0024369C  mov ecx, dword ptr [esi+80h]
002436A2  movntps xmmword ptr [eax+ecx+10h], xmm0
002436A7  mov ebx, dword ptr [esi+80h]
002436AD  lea ecx, [eax+30h]
002436B0  movntps xmmword ptr [ecx+ebx-10h], xmm0
002436B5  mov ebx, dword ptr [esi+80h]
002436BB  add ebx, ecx
002436BD  add edx, 4
002436C0  movntps xmmword ptr [ebx], xmm0
002436C3  add eax, 40h
002436C6  cmp edx, dword ptr [esi+0D0h]
002436CC  jl CDataSet::RunTest+310h (243690h)
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ Intel C++ compiler assembler codes - with 128-bit Streaming Stores ]
...
00401BBD  mov ecx, edx
00401BBF  add edx, 4
00401BC2  shl ecx, 4
00401BC5  movaps xmm0, xmmword ptr [ebp-358h]
00401BCC  cmp edx, eax
00401BCE  movntps xmmword ptr [ecx+esi], xmm0
00401BD2  movntps xmmword ptr [ecx+esi+10h], xmm0
00401BD7  movntps xmmword ptr [ecx+esi+20h], xmm0
00401BDC  movntps xmmword ptr [ecx+esi+30h], xmm0
00401BE1  jl CDataSet::RunTest+26Dh (401BBDh)
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ MinGW C++ compiler assembler codes - with 128-bit Streaming Stores ]
...
00403520  movntps xmmword ptr [eax], xmm5
00403523  add eax, 40h
00403526  movntps xmmword ptr [eax-30h], xmm5
0040352A  movntps xmmword ptr [eax-20h], xmm5
0040352E  movntps xmmword ptr [eax-10h], xmm5
00403532  cmp eax, ecx
00403534  jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

Note: By the way, all three C++ compilers use an interleave technique ( some call it alternating operations ) when generating binary codes, that is, index and address updates are placed between the stores to get the best from CPU pipelining.
SergeyKostrov
Valued Contributor II
3,309 Views
[ Summary of Performance evaluation 128-bit Streaming store codes - 1 ]
1. Codes generated by MinGW C++ compiler with 128-bit Streaming stores were faster by 7.3% than codes generated by Microsoft C++ compiler.
2. Codes generated by MinGW C++ compiler with 128-bit Streaming stores were faster by 16.5% than codes generated by Intel C++ compiler.
3. Without 128-bit Streaming Stores
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Test Completed in 23.625 secs
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Test Completed in 26.216 secs
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Test Completed in 21.735 secs
...
4. With 128-bit Streaming Stores
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Test Completed in 23.203 secs
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Test Completed in 25.766 secs
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Test Completed in 21.516 secs
...
SergeyKostrov
Valued Contributor II
3,309 Views
[ Summary of Performance evaluation 128-bit Streaming store codes - 2 ]
Or in another form:

Microsoft C++ compiler: 23.625 secs ( without Streaming store ) vs. 23.203 secs ( with Streaming store )
Summary: With Streaming store initialization of the data set is ~1.8% faster.

Intel C++ compiler: 26.216 secs ( without Streaming store ) vs. 25.766 secs ( with Streaming store )
Summary: With Streaming store initialization of the data set is ~1.7% faster.

MinGW C++ compiler: 21.735 secs ( without Streaming store ) vs. 21.516 secs ( with Streaming store )
Summary: With Streaming store initialization of the data set is ~1.0% faster.
SergeyKostrov
Valued Contributor II
5,151 Views
[ Conclusion ]
To my surprise, the performance improvement, in a range from ~1.0% to ~1.8%, was insignificant. Additional investigation is needed on how initialization of the data set could be made faster by 5%, or 7%, or even more, and I think OpenMP threading needs to be considered first of all.