
Analysis of 128-bit Streaming store codes vs. Non Streaming store codes

SergeyKostrov
Valued Contributor II
*** Analysis of 128-bit Streaming store codes vs. Non Streaming store codes ***
1 Solution
SergeyKostrov
[ Conclusion ] To my surprise, the performance improvement was insignificant, in the range of ~1.0% to ~1.8%. Additional investigation is needed into how initialization of the data set could be improved by 5%, 7%, or even more, and I think OpenMP threading needs to be considered first.
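The OpenMP idea mentioned in the conclusion could be explored with something like the sketch below. It is not the code used in the tests: `fill_parallel` is a hypothetical name (not part of TDataSet), the buffer is assumed to be 16-byte aligned, and the element count is assumed to be a multiple of 4 floats. Whether threading helps at all depends on how close a single core already gets to the memory bandwidth limit.

```cpp
#include <xmmintrin.h>  // _mm_stream_ps, _mm_sfence
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: fills dst with the 4-float pattern v using
// non-temporal (streaming) stores. When the compiler has OpenMP enabled,
// the iterations are divided between threads; otherwise it runs serially.
void fill_parallel(float* dst, std::ptrdiff_t count, __m128 v)
{
    assert(count % 4 == 0);                                   // whole 128-bit values only
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);  // movntps needs 16-byte alignment

#if defined(_OPENMP)
    #pragma omp parallel
#endif
    {
#if defined(_OPENMP)
        #pragma omp for
#endif
        for (std::ptrdiff_t i = 0; i < count; i += 4)
            _mm_stream_ps(dst + i, v);

        // Each thread fences its own streaming stores before leaving the region.
        _mm_sfence();
    }
}
```

A call such as `fill_parallel(p, rows * cols * 4, _mm_setr_ps(10.0f, 11.0f, 12.0f, 13.0f))` would cover a data set of `rows * cols` 128-bit values.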

41 Replies
SergeyKostrov
This is what the main body of the test case looks like:

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
RTm128float *pfDataA = ( RTm128float * )tdsA.GetData1D();
RTm128float **ppfDataA = ( RTm128float ** )tdsA.GetData2D();
...

Streaming and non-streaming functionality is implemented in different versions of the TDataSet::SetValue method of the TDataSet C++ template class.
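The TDataSet::SetValue source is not shown in the thread, so the following is only a minimal sketch of what the two variants might boil down to. The function names `fill_regular` and `fill_streaming` are hypothetical, and the buffer is assumed to be 16-byte aligned with a length that is a multiple of 4 floats.

```cpp
#include <xmmintrin.h>  // _mm_store_ps, _mm_stream_ps, _mm_sfence
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical non-streaming variant: regular aligned 128-bit stores (movaps),
// which go through the cache hierarchy.
void fill_regular(float* dst, std::size_t count, __m128 v)
{
    assert(count % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    for (std::size_t i = 0; i < count; i += 4)
        _mm_store_ps(dst + i, v);
}

// Hypothetical streaming variant: non-temporal 128-bit stores (movntps),
// which are written to memory without being kept in the caches.
void fill_streaming(float* dst, std::size_t count, __m128 v)
{
    assert(count % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    for (std::size_t i = 0; i < count; i += 4)
        _mm_stream_ps(dst + i, v);
    _mm_sfence();  // order the streaming stores before any later stores
}
```

Both functions leave the same values in memory; only the store instruction differs, which is what the timings in the following posts compare.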
SergeyKostrov
As part of the future investigation, a possible negative impact of virtual memory ( since the data set is very big ) needs to be taken into account.
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 256MB ]
[ Processing Completed in: 0.578 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 4096, 4096 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 400MB ]
[ Processing Completed in: 0.890 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 5120, 5120 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 576MB ]
[ Processing Completed in: 1.281 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 6144, 6144 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 784MB ]
[ Processing Completed in: 1.750 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7168, 7168 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 900MB ]
[ Processing Completed in: 5.844 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7680, 7680 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.0GB ]
[ Processing Completed in: 22.047 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 256MB ]
[ Processing Completed in: 0.359 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 4096, 4096 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 400MB ]
[ Processing Completed in: 0.563 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 5120, 5120 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 576MB ]
[ Processing Completed in: 0.812 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 6144, 6144 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 784MB ]
[ Processing Completed in: 1.109 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7168, 7168 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 900MB ]
[ Processing Completed in: 5.188 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7680, 7680 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.0GB ]
[ Processing Completed in: 21.797 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.1GB ]
[ Processing Completed in: 39.922 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8704, 8704 );
tdsA.SetValue( v );
...
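To compare the two series of timings, the relative difference can be computed directly from the posted numbers. The sketch below does only that arithmetic. The helper names are hypothetical, and the size formula assumes each element is one 16-byte ( 128-bit ) value, which matches the sizes quoted in the posts. The 1.1GB run is left out because only a streaming time was posted for it.

```cpp
#include <cstdio>

// Relative improvement of the streaming time over the non-streaming time, in percent.
double improvement_percent(double non_streaming_secs, double streaming_secs)
{
    return (non_streaming_secs - streaming_secs) / non_streaming_secs * 100.0;
}

// Data set size in MB for rows x cols elements, assuming 16 bytes per element.
double data_set_mb(unsigned long long rows, unsigned long long cols)
{
    return static_cast<double>(rows * cols * 16ULL) / (1024.0 * 1024.0);
}

struct Run {
    const char* size;
    double non_streaming_secs;
    double streaming_secs;
};

// Timings copied from the posts above.
const Run kRuns[] = {
    { "256MB",  0.578,  0.359 },
    { "400MB",  0.890,  0.563 },
    { "576MB",  1.281,  0.812 },
    { "784MB",  1.750,  1.109 },
    { "900MB",  5.844,  5.188 },
    { "1.0GB", 22.047, 21.797 },
};

// Prints one line per data set size with both times and the relative difference.
void print_improvements()
{
    for (const Run& r : kRuns)
        std::printf("%-6s %7.3f s -> %7.3f s  ( %.1f%% )\n",
                    r.size, r.non_streaming_secs, r.streaming_secs,
                    improvement_percent(r.non_streaming_secs, r.streaming_secs));
}
```

For the 1.0GB data set the difference comes out at roughly 1%, which is in the range quoted in the conclusion, while the posted timings for the smaller data sets differ by a much larger margin.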
Bernard
Valued Contributor I

Sergey Kostrov wrote:

[ MinGW C++ compiler assembler codes - with 128-bit Streaming Stores ]

...
00403520 movntps xmmword ptr [eax], xmm5
00403523 add eax, 40h
00403526 movntps xmmword ptr [eax-30h], xmm5
0040352A movntps xmmword ptr [eax-20h], xmm5
0040352E movntps xmmword ptr [eax-10h], xmm5
00403532 cmp eax, ecx
00403534 jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

Note: By the way, C++ compilers commonly use an interleaving technique ( some call it alternating operations ) when generating binary code to get the best from CPU pipelining.

Regarding this example, I think the CPU scheduler/dispatcher will dispatch the fused cmp/jne uop to Port 5 for branch evaluation. It will probably be recognized as a backward branch, which is usually taken, so at the same time one 16-byte non-temporal store can be issued. What is really interesting is how the AGU is involved internally in the address computation. In this example I think the DTLB can hold the recent virtual-to-physical mapping, so the AGU can access that cache and calculate the address directly, maybe without waiting for the branch evaluation.
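For readers who prefer source to disassembly, the quoted loop corresponds roughly to a fill loop unrolled four times: four 16-byte movntps stores ( 64 bytes, the 40h added to eax ) per iteration, followed by a pointer compare against the end address. The sketch below is an assumption about the shape of the source, not the original code. The name `fill_streaming_unrolled` is hypothetical, and a compiler is free to generate different code for it.

```cpp
#include <xmmintrin.h>  // _mm_stream_ps, _mm_sfence
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: writes 64 bytes per iteration with four non-temporal stores.
// count is the number of floats and must be a multiple of 16 ( 64 bytes ).
void fill_streaming_unrolled(float* dst, std::size_t count, __m128 v)
{
    assert(count % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);

    float* p = dst;
    float* const end = dst + count;   // plays the role of ecx in the listing
    while (p != end) {
        _mm_stream_ps(p,      v);     // movntps [eax],     xmm5
        _mm_stream_ps(p + 4,  v);     // movntps [eax+10h], xmm5
        _mm_stream_ps(p + 8,  v);     // movntps [eax+20h], xmm5
        _mm_stream_ps(p + 12, v);     // movntps [eax+30h], xmm5
        p += 16;                      // add eax, 40h
    }
    _mm_sfence();                     // order the streaming stores
}
```

In the listing the pointer increment is scheduled right after the first store, which is why the remaining three stores use negative offsets ( eax-30h, eax-20h, eax-10h ).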

Bernard

>>>Streaming and Non Streaming functionality is implemented in different versions of TDataSet::SetValue method of the TDataSet C++ template class>>>

Interesting test cases.

I have one question about using the TDataSet class templated on a type argument and on data size. First, did you try the blaze or blitz++ library (blaze being faster) for your data containers? Second, did you encounter any optimization problems with the templated TDataSet class compared to free-standing dynamically allocated arrays? Third, did you try to partially specialize the TDataSet class for SSE and AVX data types using manual vectorization?

Here is a short example of the vectorized main computation loop for Gaussian-like noise. It is part of a class that is partially specialized on the __m256d union template argument. The primary template accepts a scalar typename T argument and is followed by two partial specializations, for the __m256d and __m128 unions.

*Note: std::printf in the loop is for debugging purposes only.

for (std::size_t i{ 0 }; i != vecLength; i += 4)
	{
		// Draw four random values; redraw while any lane is exactly 0.0, since log(0) is -infinity.
		do
			vrv1 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
		while (_mm256_movemask_pd(_mm256_cmp_pd(vrv1, _mm256_setzero_pd(), _CMP_EQ_OQ)) != 0);

		// A second set of random values feeds the waveform generator lane by lane.
		__m256d vrv2 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
		__m256d temp1 = _mm256_set_pd(this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[0]), this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[1]),
			this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[2]), this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[3]));
		// sqrt(-2 * log(vrv1)) scaled by the waveform samples.
		__m256d vvu1 = _mm256_mul_pd(_mm256_sqrt_pd(_mm256_mul_pd(_mm256_set1_pd(-2.0), _mm256_log_pd(vrv1))), temp1);
		__m256d vvr2 = _mm256_mul_pd(_mm256_add_pd(_mm256_set1_pd(this->m_dMean), _mm256_sqrt_pd(_mm256_set1_pd(this->m_dVariance))), vvu1);
		// Debug output only.
		std::printf("v0=%.9f,v1=%.9f,v2=%.9f,v3=%.9f\n", vvr2.m256d_f64[0], vvr2.m256d_f64[1], vvr2.m256d_f64[2], vvr2.m256d_f64[3]);
		_mm256_storeu_pd(&this->m_oAWGNModSine.operator[](i).first, vvu1);
		_mm256_storeu_pd(&this->m_oAWGNModSine.operator[](i).second, vvr2);

	}
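The class layout described above ( a scalar primary template followed by specializations for SIMD types ) can be sketched as follows. This is an illustration under assumptions, not the actual class from either project: `Filler` and `Vec4f` are hypothetical names, with `Vec4f` standing in for a 128-bit wrapper type such as RTm128. With a single template parameter this is, strictly speaking, an explicit ( full ) specialization rather than a partial one.

```cpp
#include <xmmintrin.h>  // __m128, _mm_store_ps
#include <cstddef>

// Hypothetical 128-bit wrapper type; its __m128 member gives it 16-byte alignment.
struct Vec4f {
    __m128 v;
};

// Primary template: plain scalar fill for any copyable type T.
template <typename T>
struct Filler {
    static void fill(T* dst, std::size_t n, const T& value)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = value;
    }
};

// Specialization for the 128-bit type: one aligned SSE store per element.
template <>
struct Filler<Vec4f> {
    static void fill(Vec4f* dst, std::size_t n, const Vec4f& value)
    {
        for (std::size_t i = 0; i < n; ++i)
            _mm_store_ps(reinterpret_cast<float*>(&dst[i].v), value.v);
    }
};
```

Code that calls `Filler<T>::fill` stays the same for scalar and SIMD element types, and the compiler selects the matching version from the type.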

 

SergeyKostrov
To Ilyapolak. Here are a couple of answers to your questions...

>>I have one question related to using TDataSet class templated on type argument and on data size. First did you try to use blaze or blitz++ library (blaze being faster) for your data containers?

No, and I do not even use STL in the project's code because all these C++ libraries create additional overhead. When you have more than 1 giga single-precision elements in a data set, pure C code works faster.
SergeyKostrov
>>Second did you encounter any problems related to optimization of that TDataSet templated class when compared to free standing dynamically allocated arrays?

No. Everything looks good and no optimization problems were detected.
SergeyKostrov
>>Third did you try to partially specialize TDataSet class for SSE and for AVX data types by using manual vectorization?

Partially, yes. But there are so many Intel intrinsic domains, like MMX, SSE, SSE2, SSE4, AVX, AVX2, etc., that full support of all of them is absolutely useless. For example, nobody is interested in MMX or SSE at the moment. Also, I would consider the current state of the Intel intrinsic domains very messy, with lots of inconsistencies. In my thread devoted to integration of the Watcom C++ compiler I have already expressed my point of view that direct usage of Intel intrinsics does not solve all performance problems.
SergeyKostrov
[ Computer System used for performance evaluations ]

** Dell Precision Mobile M4700 **
Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768