Solved: Analysis of 128-bit Streaming store codes vs. Non Streaming store codes - Page 2

SergeyKostrov · ‎02-06-2016

*** Analysis of 128-bit Streaming store codes vs. Non Streaming store codes ***

SergeyKostrov · ‎02-07-2016

[ Conclusion ] To my surprise, the performance improvement, in a range from ~1.0% to ~1.8%, was insignificant. Additional investigation is needed into how initialization of the data set could be improved by 5%, 7%, or even more, and I think OpenMP threading needs to be considered first of all.
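The OpenMP idea mentioned in the conclusion could look roughly like the sketch below. It is not code from the project: the function name FillStreamingParallel is hypothetical, and it assumes a 16-byte aligned buffer whose length is a multiple of 4 floats. Build with -fopenmp ( GCC ) or /openmp ( MSVC ); without the flag the pragmas are ignored and the loop runs on one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_stream_ps, _mm_sfence

// Hypothetical helper (not part of TDataSet): fills `count` floats with the
// 4-float pattern `v`, splitting the 16-byte blocks across OpenMP threads.
inline void FillStreamingParallel(float* dst, std::size_t count, __m128 v)
{
    // Assumption: dst is 16-byte aligned and count is a multiple of 4.
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0 && count % 4 == 0);
    const std::ptrdiff_t blocks = static_cast<std::ptrdiff_t>(count / 4);

    #pragma omp parallel
    {
        // Signed loop index keeps the loop valid for older OpenMP versions.
        #pragma omp for schedule(static)
        for (std::ptrdiff_t b = 0; b < blocks; ++b)
            _mm_stream_ps(dst + 4 * b, v);  // non-temporal 16-byte store

        _mm_sfence();  // each thread fences its own non-temporal stores
    }
}
```

Whether this helps depends on memory bandwidth: a single core may already come close to saturating it for a pure fill, so any gain would have to be measured.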


SergeyKostrov · ‎02-09-2016

This is what the main body of the test case looks like:

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
RTm128float *pfDataA = ( RTm128float * )tdsA.GetData1D();
RTm128float **ppfDataA = ( RTm128float ** )tdsA.GetData2D();
...

Streaming and Non Streaming functionality is implemented in different versions of the TDataSet::SetValue method of the TDataSet C++ template class.
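The two SetValue versions are not shown in the thread. As a minimal sketch of the difference, assuming a 16-byte aligned float buffer, the two fills could be written with SSE intrinsics as follows. FillCached and FillStreaming are hypothetical names, not TDataSet members.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_store_ps, _mm_stream_ps, _mm_sfence

// Hypothetical stand-ins for the two TDataSet::SetValue versions.

// Non-streaming fill: regular aligned stores (movaps) that go through the caches.
inline void FillCached(float* dst, std::size_t count, __m128 v)
{
    // Assumption: dst is 16-byte aligned and count is a multiple of 4.
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0 && count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4)
        _mm_store_ps(dst + i, v);
}

// Streaming fill: non-temporal stores (movntps) that use write-combining
// buffers and are hinted not to pollute the caches.
inline void FillStreaming(float* dst, std::size_t count, __m128 v)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0 && count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4)
        _mm_stream_ps(dst + i, v);
    _mm_sfence();  // order the non-temporal stores before any later stores
}
```

Both functions leave the same values in memory; only the path the data takes to memory differs, which is what the timings in this thread compare.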

SergeyKostrov · ‎02-09-2016

As part of the future investigation, a possible negative impact of Virtual Memory ( since the data set is very big ) needs to be taken into account.

SergeyKostrov · ‎02-09-2016

[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 256MB ]
[ Processing Completed in: 0.578 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 4096, 4096 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-09-2016

[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 400MB ]
[ Processing Completed in: 0.890 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 5120, 5120 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-09-2016

[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 576MB ]
[ Processing Completed in: 1.281 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 6144, 6144 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-09-2016

[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 784MB ]
[ Processing Completed in: 1.750 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7168, 7168 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-09-2016

[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 900MB ]
[ Processing Completed in: 5.844 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7680, 7680 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-09-2016

[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.0GB ]
[ Processing Completed in: 22.047 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-11-2016

[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 256MB ]
[ Processing Completed in: 0.359 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 4096, 4096 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-11-2016

[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 400MB ]
[ Processing Completed in: 0.563 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 5120, 5120 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-11-2016

[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 576MB ]
[ Processing Completed in: 0.812 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 6144, 6144 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-11-2016

[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 784MB ]
[ Processing Completed in: 1.109 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7168, 7168 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-11-2016

[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 900MB ]
[ Processing Completed in: 5.188 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7680, 7680 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-11-2016

[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.0GB ]
[ Processing Completed in: 21.797 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
...

SergeyKostrov · ‎02-11-2016

[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.1GB ]
[ Processing Completed in: 39.922 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8704, 8704 );
tdsA.SetValue( v );
...
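For comparing the two series of timings above, one possible measure is the time saved by streaming stores as a percentage of the non-streaming time. The thread does not say how its percentages were derived, so this definition is an assumption, and TimingPair, kTimings and ImprovementPercent are illustrative names. With this definition the 256MB pair gives roughly 38% and the 1.0GB pair roughly 1.1%.

```cpp
#include <cassert>
#include <cstddef>

// One row per data set size; values copied from the posts above (32-bit platform).
// The 1.1GB run is left out because only a streaming time was posted for it.
struct TimingPair {
    const char* dataSetSize;
    double nonStreamingSecs;
    double streamingSecs;
};

const TimingPair kTimings[] = {
    { "256MB",  0.578,  0.359 },
    { "400MB",  0.890,  0.563 },
    { "576MB",  1.281,  0.812 },
    { "784MB",  1.750,  1.109 },
    { "900MB",  5.844,  5.188 },
    { "1.0GB", 22.047, 21.797 },
};

// Assumed definition: time saved by the streaming version, as a percentage
// of the non-streaming time.
inline double ImprovementPercent(double nonStreamingSecs, double streamingSecs)
{
    assert(nonStreamingSecs > 0.0);
    return 100.0 * (nonStreamingSecs - streamingSecs) / nonStreamingSecs;
}
```

Looping over kTimings and printing ImprovementPercent for each row shows how the gain changes with data set size.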

Bernard · ‎03-04-2016

Sergey Kostrov wrote:

[ MinGW C++ compiler assembler codes - with 128-bit Streaming Stores ]

...
00403520 movntps xmmword ptr [eax], xmm5
00403523 add eax, 40h
00403526 movntps xmmword ptr [eax-30h], xmm5
0040352A movntps xmmword ptr [eax-20h], xmm5
0040352E movntps xmmword ptr [eax-10h], xmm5
00403532 cmp eax, ecx
00403534 jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

Note: By the way, all C++ compilers use interleave technique ( some call it as alternating operations ) when generating binary codes to get the best from CPU pipelining.

Regarding this example, I think that the CPU scheduler/dispatcher will dispatch the fused cmp/jne uop to Port 5 for branch evaluation. This will probably be recognized as a backward branch, which is usually taken, so at the same time one non-temporal 16-byte memory store can be issued. What is really interesting is how the AGU is involved internally in address computation. In this example I think the DTLB can hold the recent virtual-to-physical memory mapping, so the AGU can access that cache and calculate the address directly, maybe without waiting for the branch evaluation.
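For readers who prefer source over disassembly: the quoted loop issues four movntps stores and advances the pointer by 40h ( 64 bytes ) per iteration, then compares it against an end pointer. A source-level loop with the same shape might look like the sketch below. The original C++ is not shown in the thread, so FillStreamingUnrolled4 is a hypothetical reconstruction, and a compiler is free to generate different code from it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_stream_ps, _mm_sfence

// Hypothetical reconstruction of the unrolled loop: four 16-byte
// non-temporal stores (64 bytes in total) per iteration.
inline void FillStreamingUnrolled4(float* begin, float* end, __m128 v)
{
    // Assumptions: begin is 16-byte aligned and the range holds a whole
    // number of 64-byte chunks (16 floats each).
    assert(reinterpret_cast<std::uintptr_t>(begin) % 16 == 0);
    assert(begin <= end && (end - begin) % 16 == 0);

    for (float* p = begin; p != end; p += 16) {  // add eax, 40h / cmp eax, ecx / jne
        _mm_stream_ps(p,      v);                // movntps [eax], xmm5
        _mm_stream_ps(p + 4,  v);                // movntps [eax-30h], xmm5 (after the add)
        _mm_stream_ps(p + 8,  v);                // movntps [eax-20h], xmm5
        _mm_stream_ps(p + 12, v);                // movntps [eax-10h], xmm5
    }
    _mm_sfence();  // order the non-temporal stores before any later stores
}
```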

Bernard · ‎03-04-2016

>>>Streaming and Non Streaming functionality is implemented in different versions of TDataSet::SetValue method of the TDataSet C++ template class>>>

Interesting test cases.

I have one question related to using TDataSet class templated on type argument and on data size. First did you try to use blaze or blitz++ library (blaze being faster) for your data containers? Second did you encounter any problems related to optimization of that TDataSet templated class when compared to free standing dynamically allocated arrays? Third did you try to partially specialize TDataSet class for SSE and for AVX data types by using manual vectorization?

Here is a short example of the main computation loop of a Gaussian-like noise generator, vectorized. It is part of a class that is partially specialized on the __m256d union template argument: the primary template accepts a scalar typename T argument and is followed by two partial specializations, for the __m256d and __m128 unions.

*Note: std::printf in the loop is only for debugging purposes.

for (std::size_t i{ 0 }; i != vecLength; i += 4)
	{
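		do
		vrv1 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
		// Redraw while any lane is exactly 0.0, because log(0) is -inf; the original test negated the m256d_f64 array, which is always false.
		while (_mm256_movemask_pd(_mm256_cmp_pd(vrv1, _mm256_setzero_pd(), _CMP_EQ_OQ)) != 0);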

		__m256d vrv2 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
		__m256d temp1 = _mm256_set_pd(this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[0]), this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[1]),
			this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[2]), this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[3]));
		__m256d vvu1 = _mm256_mul_pd(_mm256_sqrt_pd(_mm256_mul_pd(_mm256_set1_pd(-2.0), _mm256_log_pd(vrv1))), temp1);
		__m256d vvr2 = _mm256_mul_pd(_mm256_add_pd(_mm256_set1_pd(this->m_dMean), _mm256_sqrt_pd(_mm256_set1_pd(this->m_dVariance))), vvu1);
		std::printf("v0=%.9f,v1=%.9f,v2=%.9f,v3=%.9f\n", vvr2.m256d_f64[0], vvr2.m256d_f64[1], vvr2.m256d_f64[2], vvr2.m256d_f64[3]);
		_mm256_storeu_pd(&this->m_oAWGNModSine.operator[](i).first, vvu1);
		_mm256_storeu_pd(&this->m_oAWGNModSine.operator[](i).second, vvr2);

	}

SergeyKostrov · ‎05-19-2016

To Ilyapolak. Here are a couple of answers to your questions...

>>I have one question related to using TDataSet class templated on type argument and on data size. First did you try to use blaze or blitz++ library (blaze being faster) for your data containers?

No, and I do not even use STL in the project's code because all these C++ libraries create additional overhead. When you have more than 1 giga single-precision elements in a data set, pure C code works faster.

SergeyKostrov · ‎05-19-2016

>>Second did you encounter any problems related to optimization of that TDataSet templated class when compared to free standing dynamically allocated arrays?

No. Everything looks good and no optimization problems were detected.

SergeyKostrov · ‎05-19-2016

>>Third did you try to partially specialize TDataSet class for SSE and for AVX data types by using manual vectorization?

Partially, yes. But there are so many Intel intrinsic domains, like MMX, SSE, SSE2, SSE4, AVX, AVX2, etc., that full support of all of them is absolutely useless. For example, nobody is interested in MMX or SSE at the moment. Also, I would consider the current state of the Intel intrinsic domains very messy, and there are lots of inconsistencies. In my thread devoted to Integration of the Watcom C++ compiler I've already expressed my point of view that direct usage of Intel intrinsics does not solve all performance problems.

SergeyKostrov · ‎08-05-2016

[ Computer System used for performance evaluations ]
** Dell Precision Mobile M4700 **
Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768