*** Analysis of 128-bit Streaming store codes vs. Non Streaming store codes ***
[ Abstract ]
I recently completed an analysis of some C codes that initialize a large 3-D data set with dimensions 8192 x 4 x 8192 ( X-Y-Z ). Overall, the data set has 268,435,456 elements of the Single Precision Floating Point data type.
Since there are only 4 elements in the Y direction, the 128-bit Streaming store Intel intrinsic function _mm_stream_ps was used ( Test-case 2 ) instead of primitive assignments ( Test-case 1 ) in a For-Loop unrolled with a 4-in-1 schema.
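As a quick sanity check of the element count quoted above, here is a minimal sketch of the arithmetic. It assumes a 4-byte float; the constant and function names are made up for illustration and are not taken from the tested library. The thread also does not say whether m_iSize4 counts floats or __m128 elements, so the sketch reports both counts.

```cpp
#include <cstdint>

// Dimensions quoted in the abstract ( X-Y-Z ).
constexpr std::uint64_t kDimX = 8192;
constexpr std::uint64_t kDimY = 4;
constexpr std::uint64_t kDimZ = 8192;

// Total number of single-precision elements in the data set.
constexpr std::uint64_t ElementCount()
{
    return kDimX * kDimY * kDimZ;
}

// Memory footprint in bytes ( assumes sizeof( float ) == 4 ).
constexpr std::uint64_t FootprintBytes()
{
    return ElementCount() * sizeof( float );
}

// Number of 128-bit ( 4 x float ) stores needed to cover the data set once.
constexpr std::uint64_t Store128Count()
{
    return ElementCount() / 4;
}

static_assert( ElementCount() == 268435456ULL, "matches the count in the abstract" );
static_assert( FootprintBytes() == ( 1ULL << 30 ), "1 GiB with a 4-byte float" );
```

Under these assumptions a single pass over the data set writes 1 GiB with 67,108,864 128-bit stores.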
Three C++ compilers were used and their versions are as follows:
Microsoft C++ compiler: 14.00.50727.762 ( default in VS 2005 )
Intel C++ compiler: 12.1.7.371
MinGW C++ compiler: 4.9.0
I would rate all of them as legacy C++ compilers since they were released about 5 to 10 years ago.
Take into account that the main purpose of the analysis was to investigate whether Streaming stores make initialization of the data set faster regardless of the C++ compiler used.
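The thread does not show how the timings were taken. Purely as an illustration, the sketch below shows one common way to time a piece of work and keep the fastest of several runs, which helps when the differences being compared are only a few percent. TimeSeconds and BestOfSeconds are hypothetical helper names, not functions of the tested library.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs 'work' once and returns the elapsed wall-clock time in seconds.
template< typename Work >
double TimeSeconds( Work work )
{
    const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    work();
    const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    return std::chrono::duration< double >( t1 - t0 ).count();
}

// Repeats the measurement 'runs' times and keeps the fastest run, which
// filters out some of the noise caused by other activity on the machine.
// Returns +infinity when 'runs' is zero or negative.
template< typename Work >
double BestOfSeconds( int runs, Work work )
{
    assert( runs >= 0 );
    double best = std::numeric_limits< double >::infinity();
    for( int r = 0; r < runs; ++r )
        best = std::min( best, TimeSeconds( work ) );
    return best;
}
```

A call such as BestOfSeconds( 5, [&]{ /* initialize the data set */ } ) could then be made once for each test-case and the two results compared.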
[ Test-case 1 ]
[ C Source codes of Test-Case - without 128-bit Streaming Stores ]
...
RTssize_t i;
for( i = 0; i < m_iSize4; i += 4 )
{
    m_ptData1D[i  ] = ( T )rtValue;
    m_ptData1D[i+1] = ( T )rtValue;
    m_ptData1D[i+2] = ( T )rtValue;
    m_ptData1D[i+3] = ( T )rtValue;
}
...
[ Test-case 2 ]
[ C Source codes of Test-Case - with 128-bit Streaming Stores ]
...
RTssize_t i;
for( i = 0; i < m_iSize4; i += 4 )
{
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i  ], rtValue );
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+1], rtValue );
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+2], rtValue );
    CrtStreamPs128( ( RTfloat * )&m_ptData1D[i+3], rtValue );
}
...
Note 1: rtValue is declared as a variable of type __m128, that is, it holds 4 values of type float ( Single Precision Floating Point ).
Note 2: The CrtStreamPs128 function is a portable wrapper around the Intel _mm_stream_ps intrinsic function.
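The implementation of CrtStreamPs128 is not shown in the thread. The sketch below is only a guess at its shape, based on the call sites above: the RTfloat and RTm128 typedefs are stand-ins for the library's own types, and IsAligned16 and InitStreaming are hypothetical helpers added for illustration. It relies on two documented properties of _mm_stream_ps: the destination must be 16-byte aligned, and non-temporal stores are weakly ordered, so an _mm_sfence is commonly issued after the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_stream_ps, _mm_sfence

// Hypothetical stand-ins for the library's own typedefs.
typedef float  RTfloat;
typedef __m128 RTm128;

// True when 'p' sits on a 16-byte boundary, which _mm_stream_ps requires.
inline bool IsAligned16( const void *p )
{
    return ( reinterpret_cast< std::uintptr_t >( p ) & 15u ) == 0;
}

// Assumed shape of the wrapper: one non-temporal 128-bit store ( movntps ).
inline void CrtStreamPs128( RTfloat *pDst, RTm128 value )
{
    assert( IsAligned16( pDst ) );
    _mm_stream_ps( pDst, value );
}

// Test-case 2 in miniature: four streaming stores per iteration.
// 'size4' counts RTm128 elements and must be a multiple of 4.
inline void InitStreaming( RTm128 *pData, std::ptrdiff_t size4, RTm128 value )
{
    for( std::ptrdiff_t i = 0; i < size4; i += 4 )
    {
        CrtStreamPs128( ( RTfloat * )&pData[i  ], value );
        CrtStreamPs128( ( RTfloat * )&pData[i+1], value );
        CrtStreamPs128( ( RTfloat * )&pData[i+2], value );
        CrtStreamPs128( ( RTfloat * )&pData[i+3], value );
    }
    _mm_sfence();   // order the non-temporal stores before any later stores
}
```

The alignment check is only active in debug builds, so it should not affect Release timings.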
[ MinGW C++ compiler - Generated almost perfect assembler codes ]
I also looked at the assembler codes generated by these C++ compilers and I was very impressed that the MinGW C++ compiler generated almost perfect codes. It used the same schema for both cases, without and with Streaming stores, and the two differ only in which store instruction was used:
- Without Streaming stores, the movaps instruction was used
...
00403520 movaps xmmword ptr [eax], xmm5
00403523 add eax, 40h
00403526 movaps xmmword ptr [eax-30h], xmm5
0040352A movaps xmmword ptr [eax-20h], xmm5
0040352E movaps xmmword ptr [eax-10h], xmm5
00403532 cmp eax, ecx
00403534 jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...
- With Streaming stores, the movntps instruction was used
...
00403520 movntps xmmword ptr [eax], xmm5
00403523 add eax, 40h
00403526 movntps xmmword ptr [eax-30h], xmm5
0040352A movntps xmmword ptr [eax-20h], xmm5
0040352E movntps xmmword ptr [eax-10h], xmm5
00403532 cmp eax, ecx
00403534 jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...
As you can see, apart from the store instruction ( movaps vs. movntps ), the assembler codes for the main processing of the C For-Loop are identical!
[ Test-case 1 - without 128-bit Streaming Stores ]
[ Microsoft C++ compiler - without 128-bit Streaming Stores ]
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MSC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 23.625 secs
> Test0001 End <
Tests: Completed
...
[ Intel C++ compiler - without 128-bit Streaming Stores ]
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_ICC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 26.216 secs
> Test0001 End <
Tests: Completed
...
[ MinGW C++ compiler - without 128-bit Streaming Stores ]
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MGW ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 21.735 secs
> Test0001 End <
Tests: Completed
...
[ Microsoft C++ compiler assembler codes - without 128-bit Streaming Stores ]
...
00243690 mov edx, dword ptr [esi+80h]
00243696 movaps xmmword ptr [edx+eax], xmm0
0024369A mov edx, dword ptr [esi+80h]
002436A0 movaps xmmword ptr [eax+edx+10h], xmm0
002436A5 mov edx, dword ptr [esi+80h]
002436AB movaps xmmword ptr [eax+edx+20h], xmm0
002436B0 mov edx, dword ptr [esi+80h]
002436B6 movaps xmmword ptr [edx+eax+30h], xmm0
002436BB add ecx, 4
002436BE add eax, 40h
002436C1 cmp ecx, dword ptr [esi+0D0h]
002436C7 jl CDataSet::RunTest+310h (243690h)
...
[ Intel C++ compiler assembler codes - without 128-bit Streaming Stores ]
...
0040143D movaps xmm0, xmmword ptr [ebp-358h]
00401444 inc edx
00401445 movaps xmmword ptr [ecx+esi], xmm0
00401449 movaps xmmword ptr [ecx+esi+10h], xmm0
0040144E movaps xmmword ptr [ecx+esi+20h], xmm0
00401453 movaps xmmword ptr [ecx+esi+30h], xmm0
00401458 add ecx, 40h
0040145B cmp edx, eax
0040145D jb CDataSet::RunTest+28Dh (40143Dh)
...
[ MinGW C++ compiler assembler codes - without 128-bit Streaming Stores ]
...
00403520 movaps xmmword ptr [eax], xmm5
00403523 add eax, 40h
00403526 movaps xmmword ptr [eax-30h], xmm5
0040352A movaps xmmword ptr [eax-20h], xmm5
0040352E movaps xmmword ptr [eax-10h], xmm5
00403532 cmp eax, ecx
00403534 jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...
[ Test-case 2 - with 128-bit Streaming Stores ]
[ Microsoft C++ compiler - with 128-bit Streaming Stores ]
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MSC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 23.203 secs
> Test0001 End <
Tests: Completed
...
[ Intel C++ compiler - with 128-bit Streaming Stores ]
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_ICC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 25.766 secs
> Test0001 End <
Tests: Completed
...
[ MinGW C++ compiler - with 128-bit Streaming Stores ]
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
*****************************************************************************
Configuration - WIN32_MGW ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed
* CDataSet Start *
> TDataSet Methods <
DataSet::< RTm128 > - Passed
> CDataSet Methods <
> CDataSet Algorithms <
* CDataSet End *
Test Completed in 21.516 secs
> Test0001 End <
Tests: Completed
...
[ Microsoft C++ compiler assembler codes - with 128-bit Streaming Stores ]
...
00243690 mov ecx, dword ptr [esi+80h]
00243696 movntps xmmword ptr [ecx+eax], xmm0
0024369A add ecx, eax
0024369C mov ecx, dword ptr [esi+80h]
002436A2 movntps xmmword ptr [eax+ecx+10h], xmm0
002436A7 mov ebx, dword ptr [esi+80h]
002436AD lea ecx, [eax+30h]
002436B0 movntps xmmword ptr [ecx+ebx-10h], xmm0
002436B5 mov ebx, dword ptr [esi+80h]
002436BB add ebx, ecx
002436BD add edx, 4
002436C0 movntps xmmword ptr [ebx], xmm0
002436C3 add eax, 40h
002436C6 cmp edx, dword ptr [esi+0D0h]
002436CC jl CDataSet::RunTest+310h (243690h)
...
[ Intel C++ compiler assembler codes - with 128-bit Streaming Stores ]
...
00401BBD mov ecx, edx
00401BBF add edx, 4
00401BC2 shl ecx, 4
00401BC5 movaps xmm0, xmmword ptr [ebp-358h]
00401BCC cmp edx, eax
00401BCE movntps xmmword ptr [ecx+esi], xmm0
00401BD2 movntps xmmword ptr [ecx+esi+10h], xmm0
00401BD7 movntps xmmword ptr [ecx+esi+20h], xmm0
00401BDC movntps xmmword ptr [ecx+esi+30h], xmm0
00401BE1 jl CDataSet::RunTest+26Dh (401BBDh)
...
[ MinGW C++ compiler assembler codes - with 128-bit Streaming Stores ]
...
00403520 movntps xmmword ptr [eax], xmm5
00403523 add eax, 40h
00403526 movntps xmmword ptr [eax-30h], xmm5
0040352A movntps xmmword ptr [eax-20h], xmm5
0040352E movntps xmmword ptr [eax-10h], xmm5
00403532 cmp eax, ecx
00403534 jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...
Note: By the way, all three C++ compilers use an interleave technique ( some call it alternating operations ) when generating binary codes, that is, the store instructions are mixed with address and loop-counter arithmetic to get the best from CPU pipelining.
[ Summary of Performance evaluation 128-bit Streaming store codes - 1 ]
1. Codes generated by the MinGW C++ compiler with 128-bit Streaming stores were 7.3% faster than the corresponding codes generated by the Microsoft C++ compiler ( 21.516 secs vs. 23.203 secs ).
2. Codes generated by the MinGW C++ compiler with 128-bit Streaming stores were 16.5% faster than the corresponding codes generated by the Intel C++ compiler ( 21.516 secs vs. 25.766 secs ).
3. Without 128-bit Streaming Stores
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Test Completed in 23.625 secs
...
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Test Completed in 26.216 secs
...
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Test Completed in 21.735 secs
...
4. With 128-bit Streaming Stores
...
Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Test Completed in 23.203 secs
...
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Test Completed in 25.766 secs
...
...
Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Test Completed in 21.516 secs
...
[ Summary of Performance evaluation 128-bit Streaming store codes - 2 ]
Or in another form:
Microsoft C++ compiler: 23.625 secs ( without Streaming stores ) vs. 23.203 secs ( with Streaming stores )
Summary: With Streaming stores, initialization of the data set is ~1.8% faster.
Intel C++ compiler: 26.216 secs ( without Streaming stores ) vs. 25.766 secs ( with Streaming stores )
Summary: With Streaming stores, initialization of the data set is ~1.7% faster.
MinGW C++ compiler: 21.735 secs ( without Streaming stores ) vs. 21.516 secs ( with Streaming stores )
Summary: With Streaming stores, initialization of the data set is ~1.0% faster.
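The percentages in both summaries can be reproduced from the reported timings. The sketch below assumes they were computed as ( slower - faster ) / slower * 100, which agrees with every figure quoted in the thread; PercentFaster, Round1 and PrintSummary are illustrative names only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Percentage by which a run of 'fasterSecs' improves on a run of 'slowerSecs'.
inline double PercentFaster( double slowerSecs, double fasterSecs )
{
    assert( slowerSecs > 0.0 );
    return ( slowerSecs - fasterSecs ) / slowerSecs * 100.0;
}

// Rounds to one decimal place, the precision used in the summaries.
inline double Round1( double x )
{
    return std::floor( x * 10.0 + 0.5 ) / 10.0;
}

// Prints every comparison quoted in the two summaries.
inline void PrintSummary()
{
    // Without vs. with Streaming stores, per compiler.
    std::printf( "Microsoft: ~%.1f%%\n", PercentFaster( 23.625, 23.203 ) );
    std::printf( "Intel    : ~%.1f%%\n", PercentFaster( 26.216, 25.766 ) );
    std::printf( "MinGW    : ~%.1f%%\n", PercentFaster( 21.735, 21.516 ) );
    // MinGW vs. the other two compilers, all with Streaming stores.
    std::printf( "MinGW vs. Microsoft: %.1f%%\n", PercentFaster( 23.203, 21.516 ) );
    std::printf( "MinGW vs. Intel    : %.1f%%\n", PercentFaster( 25.766, 21.516 ) );
}
```

In absolute terms the gain from Streaming stores is between roughly 0.2 and 0.45 seconds per test run, while the choice of compiler accounts for up to about 4.5 seconds.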
[ Conclusion ]
To my surprise, the performance improvement, in a range from ~1.0% to ~1.8%, was insignificant. Additional investigation is needed on how initialization of the data set could be improved by 5%, or 7%, or even more, and I think OpenMP threading needs to be considered first of all.
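As a starting point for that investigation, here is a minimal sketch of how the Streaming store loop might be split across OpenMP threads. It was not part of the measurements above, InitStreamingOmp is a hypothetical name, and whether threading helps at all is an open question, since a fill like this may be limited by memory bandwidth rather than by the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_stream_ps, _mm_sfence

// Fills 'count' floats with 'value' using streaming stores, with the loop
// split across OpenMP threads. 'dst' must be 16-byte aligned and 'count'
// a multiple of 4. When OpenMP is not enabled at compile time the pragmas
// are ignored and the loop runs on a single thread.
inline void InitStreamingOmp( float *dst, std::ptrdiff_t count, float value )
{
    assert( count % 4 == 0 );
    const __m128 v = _mm_set1_ps( value );
    const std::ptrdiff_t blocks = count / 4;   // one 128-bit store per block

    #pragma omp parallel
    {
        #pragma omp for schedule( static )
        for( std::ptrdiff_t b = 0; b < blocks; ++b )
            _mm_stream_ps( dst + b * 4, v );

        _mm_sfence();   // each thread fences its own non-temporal stores
    }
}
```

OpenMP has to be enabled explicitly: -fopenmp for MinGW, /openmp for the Microsoft compiler and /Qopenmp for the Intel compiler on Windows. The static schedule gives every thread one contiguous slice of the data set.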