SergeyKostrov
Valued Contributor II
135 Views

Analysis of 128-bit Streaming store codes vs. Non Streaming store codes

*** Analysis of 128-bit Streaming store codes vs. Non Streaming store codes ***
0 Kudos
41 Replies
SergeyKostrov
Valued Contributor II
123 Views
[ Conclusion ] To my surprise, the performance improvement was insignificant, in a range from ~1.0% to ~1.8%. Additional investigation is needed into how initialization of the data set could be improved by 5%, or 7%, or even more, and I think OpenMP threading needs to be considered first of all.
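A minimal sketch of what OpenMP-threaded initialization with streaming stores could look like. This is not the TDataSet code: FillStreamingParallel is a hypothetical free function, and the buffer is assumed to be 16-byte aligned with a length that is a multiple of 4 floats. Initialization of a large buffer is usually limited by memory bandwidth, so extra threads may help only up to the point where the bandwidth is saturated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_stream_ps, _mm_sfence

// Hypothetical helper (not part of TDataSet): fills `count` floats with the
// 4-float pattern in `value`, splitting the range across OpenMP threads.
// Without OpenMP enabled the pragmas are ignored and the loop runs serially.
void FillStreamingParallel(float *dst, std::ptrdiff_t count, __m128 value)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);  // _mm_stream_ps needs 16-byte alignment
    assert(count % 4 == 0);                                    // whole 128-bit elements only

    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (std::ptrdiff_t i = 0; i < count; i += 4)
            _mm_stream_ps(dst + i, value);  // non-temporal 16-byte store

        // Each thread fences its own weakly ordered streaming stores.
        _mm_sfence();
    }
}
```

A static schedule gives every thread one contiguous chunk, which keeps each thread writing sequential addresses.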


0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
This is what the main body of the test case looks like:

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
RTm128float *pfDataA = ( RTm128float * )tdsA.GetData1D();
RTm128float **ppfDataA = ( RTm128float ** )tdsA.GetData2D();
...

Streaming and Non Streaming functionality is implemented in different versions of TDataSet::SetValue method of the TDataSet C++ template class.
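The source of TDataSet::SetValue is not shown in this thread, so the following is only a sketch of the difference between the two versions, written with standard SSE intrinsics. FillCached and FillStreaming are hypothetical names; both assume a 16-byte aligned buffer whose length is a multiple of 4 floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_store_ps, _mm_stream_ps, _mm_sfence

// Hypothetical non-streaming version: regular stores that go through the cache.
void FillCached(float *dst, std::size_t count, __m128 value)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);  // aligned store
    assert(count % 4 == 0);                                    // whole 128-bit elements only
    for (std::size_t i = 0; i < count; i += 4)
        _mm_store_ps(dst + i, value);   // typically compiles to movaps
}

// Hypothetical streaming version: non-temporal stores that bypass the cache.
void FillStreaming(float *dst, std::size_t count, __m128 value)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4)
        _mm_stream_ps(dst + i, value);  // compiles to movntps
    _mm_sfence();  // order the streaming stores before any later stores
}
```

Both functions leave the same data in memory; only the path the stores take through the memory hierarchy differs, which is what the timings in this thread compare.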
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
As part of the future investigation, a possible negative impact of virtual memory ( since the data set is very big ) needs to be taken into account.
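To put numbers on "very big": each element is one 128-bit value ( 16 bytes ), so an N x N data set needs N * N * 16 bytes. A small sketch of that arithmetic follows; DataSetBytes and kElementBytes are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Size of one 128-bit element (four single-precision floats).
const std::uint64_t kElementBytes = 16;

// Hypothetical helper: bytes needed for a rows x cols data set of 128-bit elements.
std::uint64_t DataSetBytes(std::uint64_t rows, std::uint64_t cols)
{
    assert(rows > 0 && cols > 0);
    return rows * cols * kElementBytes;
}

// Convert bytes to binary megabytes (1 MB = 1024 * 1024 bytes).
double ToMegabytes(std::uint64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

For example, DataSetBytes( 8192, 8192 ) is exactly 1GB and DataSetBytes( 8704, 8704 ) is 1156MB. On a 32-bit platform the whole address space is at most 4GB, and the part available to one process is usually smaller, so a single allocation of this size is already a large fraction of it.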
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 256MB ]
[ Processing Completed in: 0.578 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 4096, 4096 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 400MB ]
[ Processing Completed in: 0.890 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 5120, 5120 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 576MB ]
[ Processing Completed in: 1.281 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 6144, 6144 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 784MB ]
[ Processing Completed in: 1.750 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7168, 7168 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 900MB ]
[ Processing Completed in: 5.844 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7680, 7680 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.0GB ]
[ Processing Completed in: 22.047 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 256MB ]
[ Processing Completed in: 0.359 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 4096, 4096 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 400MB ]
[ Processing Completed in: 0.563 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 5120, 5120 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 576MB ]
[ Processing Completed in: 0.812 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 6144, 6144 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 784MB ]
[ Processing Completed in: 1.109 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7168, 7168 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 900MB ]
[ Processing Completed in: 5.188 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7680, 7680 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.0GB ]
[ Processing Completed in: 21.797 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
...
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.1GB ]
[ Processing Completed in: 39.922 secs ]

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8704, 8704 );
tdsA.SetValue( v );
...
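The relative improvement for each data set size can be worked out directly from the timings posted in this thread. The sketch below only restates those numbers; Timing, kTimings, ImprovementPercent and PrintImprovements are hypothetical names. The 1.1GB run is left out because only the streaming time was posted for it.

```cpp
#include <cassert>
#include <cstdio>

// One row per data set size; times in seconds, copied from the posts in this thread.
struct Timing {
    const char *size;
    double nonStreamingSecs;
    double streamingSecs;
};

const Timing kTimings[] = {
    { "256MB",  0.578,  0.359 },
    { "400MB",  0.890,  0.563 },
    { "576MB",  1.281,  0.812 },
    { "784MB",  1.750,  1.109 },
    { "900MB",  5.844,  5.188 },
    { "1.0GB", 22.047, 21.797 },
};

// Percentage of the baseline time saved by the new version.
double ImprovementPercent(double baselineSecs, double newSecs)
{
    assert(baselineSecs > 0.0);
    return 100.0 * (baselineSecs - newSecs) / baselineSecs;
}

// Prints one line per data set size with both times and the improvement.
void PrintImprovements()
{
    for (const Timing &t : kTimings)
        std::printf("%-6s %7.3f -> %7.3f secs  %5.1f%%\n",
                    t.size, t.nonStreamingSecs, t.streamingSecs,
                    ImprovementPercent(t.nonStreamingSecs, t.streamingSecs));
}
```

With these numbers the improvement is roughly 37% to 38% for 256MB through 784MB, about 11% at 900MB, and about 1.1% at 1.0GB. The ~1.0% to ~1.8% range quoted in the conclusion therefore seems to describe the largest data set, where total time also jumps sharply.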
0 Kudos
Bernard
Black Belt
34 Views

Sergey Kostrov wrote:

[ MinGW C++ compiler assembler codes - with 128-bit Streaming Stores ]

...
00403520 movntps xmmword ptr [eax], xmm5
00403523 add eax, 40h
00403526 movntps xmmword ptr [eax-30h], xmm5
0040352A movntps xmmword ptr [eax-20h], xmm5
0040352E movntps xmmword ptr [eax-10h], xmm5
00403532 cmp eax, ecx
00403534 jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

Note: By the way, all C++ compilers use an interleaving technique ( some call it alternating operations ) when generating binary code to get the best from CPU pipelining.

Regarding this example, I think that the CPU scheduler/dispatcher will dispatch the fused cmp/jne uop to Port 5 for branch evaluation. It seems that this will be recognized as a backward branch, which is usually taken, so at the same time one non-temporal 16-byte memory store can be issued. What is really interesting is how the AGU is involved internally in address computation. In this example I think that the DTLB can hold the recent virtual-to-physical memory mapping, so the AGU can access that cache and calculate the address directly, maybe without waiting for the branch evaluation.
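For readers who prefer intrinsics to the disassembly quoted above, here is a sketch of a C++ loop with the same shape: four movntps stores per iteration and a pointer that advances by 40h ( 64 bytes ). It is not the source that produced that listing. FillStreamingUnrolled4 is a hypothetical name, and the buffer is assumed to be 16-byte aligned with a length that is a multiple of 16 floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_stream_ps, _mm_sfence

// Hypothetical helper: streaming fill unrolled by four, i.e. 64 bytes per iteration.
void FillStreamingUnrolled4(float *dst, std::size_t count, __m128 value)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);  // movntps needs 16-byte alignment
    assert(count % 16 == 0);                                   // 4 stores x 4 floats per iteration

    float *end = dst + count;
    while (dst != end) {                 // corresponds to cmp eax, ecx / jne
        _mm_stream_ps(dst,      value);  // bytes  0..15
        _mm_stream_ps(dst + 4,  value);  // bytes 16..31
        _mm_stream_ps(dst + 8,  value);  // bytes 32..47
        _mm_stream_ps(dst + 12, value);  // bytes 48..63
        dst += 16;                       // 16 floats = 64 bytes = 40h
    }
    _mm_sfence();  // order the streaming stores before any later stores
}
```

Sixty-four bytes is one cache line on typical x86 processors, so each iteration writes one full line when the buffer itself starts on a 64-byte boundary.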

0 Kudos
Bernard
Black Belt
34 Views

>>>Streaming and Non Streaming functionality is implemented in different versions of TDataSet::SetValue method of the TDataSet C++ template class>>>

Interesting test cases.

I have one question related to using TDataSet class templated on type argument and on data size. First did you try to use blaze or blitz++ library (blaze being faster) for your data containers? Second did you encounter any problems related to optimization of that TDataSet templated class when compared to free standing dynamically allocated arrays? Third did you try to partially specialize TDataSet class for SSE and for AVX data types by using manual vectorization?

Here is a short example of the main computation loop of a Gaussian-like noise generator being vectorized. It is part of a class which is partially specialized on the __m256d union template argument. The primary template accepts a scalar typename T argument and is followed by two partial specializations, for the __m256d and for the __m128 unions.

*Note: std::printf in the loop is only for debugging purposes.

for (std::size_t i{ 0 }; i != vecLength; i += 4)
	{
		// Draw four uniform samples and redraw while any lane is exactly 0.0, because log(0) is -inf.
		// Assumed intent: the original test of the .m256d_f64 array was always false, so it never redrew.
		do
			vrv1 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
		while (_mm256_movemask_pd(_mm256_cmp_pd(vrv1, _mm256_setzero_pd(), _CMP_EQ_OQ)) != 0);

		// Second set of uniform samples, passed through the waveform generator lane by lane.
		// The .m256d_f64 member access is MSVC-specific.
		__m256d vrv2 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
		__m256d temp1 = _mm256_set_pd(this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[0]), this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[1]),
			this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[2]), this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[3]));
		// sqrt(-2 * log(u1)) * waveform(2 * PI * u2); _mm256_log_pd is an SVML intrinsic.
		__m256d vvu1 = _mm256_mul_pd(_mm256_sqrt_pd(_mm256_mul_pd(_mm256_set1_pd(-2.0), _mm256_log_pd(vrv1))), temp1);
		__m256d vvr2 = _mm256_mul_pd(_mm256_add_pd(_mm256_set1_pd(this->m_dMean), _mm256_sqrt_pd(_mm256_set1_pd(this->m_dVariance))), vvu1);
		// Debug output only.
		std::printf("v0=%.9f,v1=%.9f,v2=%.9f,v3=%.9f\n", vvr2.m256d_f64[0], vvr2.m256d_f64[1], vvr2.m256d_f64[2], vvr2.m256d_f64[3]);
		_mm256_storeu_pd(&this->m_oAWGNModSine.operator[](i).first, vvu1);
		_mm256_storeu_pd(&this->m_oAWGNModSine.operator[](i).second, vvr2);
	}
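The idea of a scalar primary template followed by a specialization for a SIMD type can be sketched as follows. This is an illustration only, not the TDataSet class or the noise generator class: Filler and Vec4f are hypothetical names, and Vec4f is a small wrapper standing in for a type such as __m128 or RTm128. Strictly speaking, a specialization for one concrete type is an explicit ( full ) specialization.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE: __m128, _mm_stream_ps, _mm_sfence

// Hypothetical wrapper around one 128-bit SSE value (four floats).
struct Vec4f {
    __m128 v;
};

// Primary template: any scalar element type, plain element-by-element loop.
template <typename T>
struct Filler {
    static void Fill(T *dst, std::size_t count, T value)
    {
        for (std::size_t i = 0; i < count; ++i)
            dst[i] = value;
    }
};

// Explicit specialization for the SSE wrapper: manual vectorization with
// one non-temporal 16-byte store per element.
template <>
struct Filler<Vec4f> {
    static void Fill(Vec4f *dst, std::size_t count, Vec4f value)
    {
        for (std::size_t i = 0; i < count; ++i)
            _mm_stream_ps(reinterpret_cast<float *>(&dst[i].v), value.v);
        _mm_sfence();  // order the streaming stores before any later stores
    }
};
```

Calling code stays the same for both cases, for example Filler<float>::Fill( p, n, 1.5f ) or Filler<Vec4f>::Fill( pv, n, v ), and the compiler selects the matching version at compile time.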

 

0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
To Ilyapolak. Here are a couple of answers to your questions...

>>I have one question related to using TDataSet class templated on type argument and on data size. First did you
>>try to use blaze or blitz++ library (blaze being faster) for your data containers?

No, and I do not even use STL in the project's code because all these C++ libraries create additional overhead. When you have more than 1 giga single-precision elements in a data set, pure C code works faster.
0 Kudos
SergeyKostrov
Valued Contributor II
34 Views
>>Second did you encounter any problems related to optimization of that TDataSet templated class when
>>compared to free standing dynamically allocated arrays?

No. Everything looks good and no optimization problems were detected.
0 Kudos