
Analysis of 128-bit Streaming store codes vs. Non Streaming store codes

SergeyKostrov
Valued Contributor II
4,879 views
*** Analysis of 128-bit Streaming store codes vs. Non Streaming store codes ***
1 Solution
SergeyKostrov
[ Conclusion ] To my surprise, the performance improvement, in a range from ~1.0% to ~1.8%, was insignificant. Additional investigation is needed on how initialization of the data set could be improved by 5%, or 7%, or even more, and I think OpenMP threading needs to be considered first of all.
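As a rough illustration of the OpenMP idea mentioned in the conclusion, here is a minimal sketch of a threaded fill with 128-bit streaming stores. The function name fill_streaming_parallel and its signature are hypothetical and are not taken from the TDataSet class; the destination is assumed to be 16-byte aligned. Whether threading actually helps has to be measured, since a fill like this is usually limited by memory bandwidth rather than by the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_stream_ps, _mm_sfence

// Hypothetical helper, not part of the original test case.
// dst must be 16-byte aligned; count is the number of 128-bit elements.
// Build with OpenMP enabled (for example -fopenmp with GCC/MinGW);
// without it the pragmas are ignored and the loop runs on one thread.
inline void fill_streaming_parallel(float* dst, std::ptrdiff_t count, __m128 v)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);

    #pragma omp parallel
    {
        // A static schedule gives every thread one contiguous chunk.
        #pragma omp for schedule(static)
        for (std::ptrdiff_t i = 0; i < count; ++i)
            _mm_stream_ps(dst + 4 * i, v);  // non-temporal 16-byte store

        // Each thread fences its own non-temporal stores.
        _mm_sfence();
    }
}
```

A call such as fill_streaming_parallel(p, rows * cols, v) would then replace the single-threaded initialization loop.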


41 Replies
SergeyKostrov
This is what the main body of the test case looks like:

...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
RTm128float *pfDataA = ( RTm128float * )tdsA.GetData1D();
RTm128float **ppfDataA = ( RTm128float ** )tdsA.GetData2D();
...

Streaming and Non Streaming functionality is implemented in different versions of the TDataSet::SetValue method of the TDataSet C++ template class.
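The source of TDataSet::SetValue is not shown in the thread, so the following is only a sketch of what the two variants could look like with SSE intrinsics. The names fill_regular and fill_streaming are hypothetical, and the buffer is assumed to be 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_store_ps, _mm_stream_ps, _mm_sfence

// Hypothetical helpers, not the author's TDataSet::SetValue.
// dst must be 16-byte aligned; count is the number of 128-bit elements.

// Non Streaming: regular aligned stores (movaps) that go through the caches.
inline void fill_regular(float* dst, std::size_t count, __m128 v)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    for (std::size_t i = 0; i < count; ++i)
        _mm_store_ps(dst + 4 * i, v);
}

// Streaming: non-temporal stores (movntps) that use write combining
// and avoid filling the caches with the written data.
inline void fill_streaming(float* dst, std::size_t count, __m128 v)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    for (std::size_t i = 0; i < count; ++i)
        _mm_stream_ps(dst + 4 * i, v);
    _mm_sfence();  // order the non-temporal stores before later stores
}
```

Both functions write the same values; only the store instruction differs, which is what the timings in this thread compare.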
SergeyKostrov
As part of the future investigation, a possible negative impact of Virtual Memory ( since the data set is very big ) needs to be taken into account.
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 256MB ]
[ Processing Completed in: 0.578 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 4096, 4096 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 400MB ]
[ Processing Completed in: 0.890 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 5120, 5120 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 576MB ]
[ Processing Completed in: 1.281 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 6144, 6144 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 784MB ]
[ Processing Completed in: 1.750 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7168, 7168 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 900MB ]
[ Processing Completed in: 5.844 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7680, 7680 );
tdsA.SetValue( v );
...
SergeyKostrov
[ Non Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.0GB ]
[ Processing Completed in: 22.047 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 256MB ]
[ Processing Completed in: 0.359 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 4096, 4096 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 400MB ]
[ Processing Completed in: 0.563 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 5120, 5120 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 576MB ]
[ Processing Completed in: 0.812 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 6144, 6144 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 784MB ]
[ Processing Completed in: 1.109 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7168, 7168 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 900MB ]
[ Processing Completed in: 5.188 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 7680, 7680 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.0GB ]
[ Processing Completed in: 21.797 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8192, 8192 );
tdsA.SetValue( v );
...
SergeyKostrov
[ 128-bit Streaming Stores processing ( 32-bit platform ) ]
[ Data Set Size = 1.1GB ]
[ Processing Completed in: 39.922 secs ]
...
RTm128 v = { 10.0f, 11.0f, 12.0f, 13.0f };
TDataSet< RTm128, DATATYPE_RTFLOAT > tdsA;
tdsA.SetSize( 8704, 8704 );
tdsA.SetValue( v );
...
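The data set sizes quoted in these results follow from the SetSize arguments if each element is one 128-bit value, that is, 16 bytes. That element size is an assumption based on the RTm128 initializer with four floats; the helper names below are hypothetical.

```cpp
#include <cstdint>

// Assumption: one RTm128 element holds four floats, i.e. 16 bytes.
constexpr std::uint64_t kBytesPerElement = 16;

// Total size in bytes of a rows x cols data set of 128-bit elements.
constexpr std::uint64_t dataset_bytes(std::uint64_t rows, std::uint64_t cols)
{
    return rows * cols * kBytesPerElement;
}

// The same size expressed in MB (1 MB = 1024 * 1024 bytes).
constexpr std::uint64_t dataset_mb(std::uint64_t rows, std::uint64_t cols)
{
    return dataset_bytes(rows, cols) / (1024ull * 1024ull);
}

static_assert(dataset_mb(4096, 4096) == 256,  "4096 x 4096 is 256MB");
static_assert(dataset_mb(8192, 8192) == 1024, "8192 x 8192 is 1.0GB");
```

The other sizes check out the same way, for example 7680 x 7680 gives 900MB and 8704 x 8704 gives 1156MB, which the post rounds to 1.1GB.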
Bernard
Valued Contributor I
1,697 views

Sergey Kostrov wrote:

[ MinGW C++ compiler assembler codes - with 128-bit Streaming Stores ]

...
00403520 movntps xmmword ptr [eax], xmm5
00403523 add eax, 40h
00403526 movntps xmmword ptr [eax-30h], xmm5
0040352A movntps xmmword ptr [eax-20h], xmm5
0040352E movntps xmmword ptr [eax-10h], xmm5
00403532 cmp eax, ecx
00403534 jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...

Note: By the way, all C++ compilers use an interleaving technique ( some call it alternating operations ) when generating binary code to get the best from CPU pipelining.

Regarding this example, I think that the CPU scheduler/dispatcher will dispatch the fused cmp/jne uop(s) to Port 5 for branch evaluation. It seems that this will be recognized as a backward branch, which is usually taken, so at the same time one non-temporal 16-byte memory store can be issued. What is really interesting is how the AGU is involved internally in the address computation. In this example I think that the DTLB can hold the recent virtual-to-physical memory mapping, so the AGU can access that cache and calculate the address directly, maybe without waiting for the branch evaluation.
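For readers who prefer source code to disassembly, the loop above can be approximated in C++ as follows. This is a hypothetical reconstruction, not the code that MinGW compiled; it only mirrors the visible structure: four 16-byte movntps stores and a pointer that advances by 40h (64 bytes) per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_stream_ps, _mm_sfence

// Hypothetical reconstruction of the unrolled loop in the disassembly.
// dst must be 16-byte aligned; count is the number of 128-bit elements
// and must be a multiple of 4 so that every iteration writes 64 bytes.
inline void fill_streaming_unrolled4(float* dst, std::size_t count, __m128 v)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(count % 4 == 0);

    float* const end = dst + 4 * count;  // plays the role of ecx
    while (dst != end) {                 // cmp eax, ecx / jne
        _mm_stream_ps(dst,      v);      // movntps [eax],     xmm5
        _mm_stream_ps(dst + 4,  v);      // movntps [eax+10h], xmm5
        _mm_stream_ps(dst + 8,  v);      // movntps [eax+20h], xmm5
        _mm_stream_ps(dst + 12, v);      // movntps [eax+30h], xmm5
        dst += 16;                       // 16 floats = 64 bytes = 40h
    }
    _mm_sfence();  // order the non-temporal stores before later stores
}
```

In the compiled code the pointer increment sits between the first and second store, which is why the later stores use negative offsets ( eax-30h, eax-20h, eax-10h ).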

Bernard

>>>Streaming and Non Streaming functionality is implemented in different versions of TDataSet::SetValue method of the TDataSet C++ template class>>>

Interesting test cases.

I have a few questions related to using the TDataSet class templated on type argument and on data size. First, did you try to use the blaze or blitz++ library (blaze being faster) for your data containers? Second, did you encounter any problems related to optimization of that templated TDataSet class when compared to free-standing dynamically allocated arrays? Third, did you try to partially specialize the TDataSet class for SSE and for AVX data types by using manual vectorization?

Here is a short example of the vectorized main computation loop for Gaussian-like noise. It is part of a class which is partially specialized on the __m256d union template argument. So the primary template accepts a scalar typename T argument and is followed by two partial specializations, for the __m256d and for the __m128 unions.

*Note: std::printf in the loop is only for debugging purposes.

for (std::size_t i{ 0 }; i != vecLength; i += 4)
	{
		// Redraw until no lane is exactly zero, because log(0) is -inf.
		// The comparison mask is tested with movemask; testing the
		// m256d_f64 array itself would always evaluate as non-null.
		do
			vrv1 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
		while (_mm256_movemask_pd(_mm256_cmp_pd(vrv1, _mm256_setzero_pd(), _CMP_EQ_OQ)) != 0);

		// Second set of uniform random values, fed to the waveform generator lane by lane.
		__m256d vrv2 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
		__m256d temp1 = _mm256_set_pd(this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[0]), this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[1]),
			this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[2]), this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[3]));
		// sqrt(-2 * log(u1)) * waveform(2 * PI * u2)
		__m256d vvu1 = _mm256_mul_pd(_mm256_sqrt_pd(_mm256_mul_pd(_mm256_set1_pd(-2.0), _mm256_log_pd(vrv1))), temp1);
		__m256d vvr2 = _mm256_mul_pd(_mm256_add_pd(_mm256_set1_pd(this->m_dMean), _mm256_sqrt_pd(_mm256_set1_pd(this->m_dVariance))), vvu1);
		std::printf("v0=%.9f,v1=%.9f,v2=%.9f,v3=%.9f\n", vvr2.m256d_f64[0], vvr2.m256d_f64[1], vvr2.m256d_f64[2], vvr2.m256d_f64[3]);
		_mm256_storeu_pd(&this->m_oAWGNModSine.operator[](i).first, vvu1);
		_mm256_storeu_pd(&this->m_oAWGNModSine.operator[](i).second, vvr2);

	}

 

SergeyKostrov
To Ilyapolak. Here are a couple of answers to your questions...

>>I have one question related to using TDataSet class templated on type argument and on data size. First did you try to use blaze or blitz++ library (blaze being faster) for your data containers?

No, and I do not even use STL in the project's code because all these C++ libraries create additional overhead. When you have more than 1 giga single-precision elements in a data set, pure C code works faster.
SergeyKostrov
>>Second did you encounter any problems related to optimization of that TDataSet templated class when compared to free standing dynamically allocated arrays?

No. Everything looks good and no optimization problems were detected.
SergeyKostrov
>>Third did you try to partially specialize TDataSet class for SSE and for AVX data types by using manual vectorization?

Partially, yes. But there are so many Intel Intrinsic domains, like MMX, SSE, SSE2, SSE4, AVX, AVX2, etc., that full support of all of them is absolutely useless. For example, nobody is interested in MMX or SSE at the moment. Also, I would consider the current state of the Intel Intrinsic domains very messy, and there are lots of inconsistencies. In my thread devoted to the integration of the Watcom C++ compiler I've already expressed my point of view that direct usage of Intel Intrinsics does not solve all performance problems.
SergeyKostrov
[ Computer System used for performance evaluations ]

** Dell Precision Mobile M4700 **
Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768