Sergey Kostrov wrote:
[ MinGW C++ compiler assembly code - with 128-bit Streaming Stores ]
...
00403520 movntps xmmword ptr [eax], xmm5
00403523 add eax, 40h
00403526 movntps xmmword ptr [eax-30h], xmm5
0040352A movntps xmmword ptr [eax-20h], xmm5
0040352E movntps xmmword ptr [eax-10h], xmm5
00403532 cmp eax, ecx
00403534 jne _ZN8CDataSet7RunTestEv+2D0h (403520h)
...
Note: By the way, C++ compilers generally use an interleaving technique (some call it alternating operations) when generating binary code, to get the best out of CPU pipelining.
Regarding this example, I think the CPU scheduler/dispatcher will dispatch the fused cmp/jne uop to Port 5 for branch evaluation. Since jne jumps back to the start of the loop (403520h), it will likely be recognized as a backward branch, which is usually taken, so one 16-byte non-temporal store can be issued at the same time. What I find really interesting is how the AGU is involved internally in the address computation. In this example I think the DTLB can hold recent virtual-to-physical mappings, so the AGU may be able to use that cache and compute the address directly, perhaps without waiting for the branch evaluation.
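For anyone who wants to reproduce a loop like the one quoted above, a minimal C++ sketch with SSE intrinsics (x86 only) might look as follows. The function name FillStreaming and its preconditions (16-byte aligned destination, element count a multiple of 16 floats) are assumptions, not taken from the post, and whether a given compiler emits exactly the movntps/add/cmp/jne sequence shown depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h> // SSE: __m128, _mm_set1_ps, _mm_stream_ps, _mm_sfence

// Hypothetical helper: fills dst[0..count) with `value` using 128-bit
// non-temporal (streaming) stores, four per iteration (64 bytes), which is
// the same shape as the movntps loop in the listing.
// Assumes dst is 16-byte aligned and count is a multiple of 16 floats.
void FillStreaming(float* dst, std::size_t count, float value)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0); // movntps needs 16-byte alignment
    assert(count % 16 == 0);

    const __m128 v = _mm_set1_ps(value); // one register holds the fill value (xmm5 in the listing)
    for (std::size_t i = 0; i < count; i += 16) // 16 floats = 64 bytes, cf. "add eax, 40h"
    {
        _mm_stream_ps(dst + i, v);
        _mm_stream_ps(dst + i + 4, v);
        _mm_stream_ps(dst + i + 8, v);
        _mm_stream_ps(dst + i + 12, v);
    }
    _mm_sfence(); // orders the weakly ordered streaming stores before any later stores
}
```

The sfence at the end matters when another thread or device will consume the data, because non-temporal stores are weakly ordered with respect to ordinary stores.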
>>>Streaming and non-streaming functionality is implemented in different versions of the TDataSet::SetValue method of the TDataSet C++ template class>>>
Interesting test cases.
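As I understand the quoted description, one class offers both a regular and a streaming SetValue. A rough sketch of that split, assuming float elements and a compile-time size, could be the following; the class name TStreamingDataSet and the method name SetValueStreaming are hypothetical and are not the actual TDataSet code.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE: __m128, _mm_set1_ps, _mm_stream_ps, _mm_sfence

// Hypothetical sketch (names are assumptions, not the TDataSet class from
// the thread): one class, two versions of SetValue.
template <std::size_t N>
class TStreamingDataSet
{
    static_assert(N % 4 == 0, "N must be a multiple of 4 floats (16 bytes)");

public:
    // Non-streaming version: ordinary stores through the cache hierarchy.
    void SetValue(float value)
    {
        for (std::size_t i = 0; i < N; ++i)
            m_data[i] = value;
    }

    // Streaming version: non-temporal stores that write around the caches,
    // which may help when the data set is much larger than the last-level cache.
    void SetValueStreaming(float value)
    {
        const __m128 v = _mm_set1_ps(value);
        for (std::size_t i = 0; i < N; i += 4)
            _mm_stream_ps(&m_data[i], v); // m_data is declared alignas(16)
        _mm_sfence();
    }

    float operator[](std::size_t i) const
    {
        assert(i < N);
        return m_data[i];
    }

private:
    alignas(16) float m_data[N]; // movntps requires 16-byte alignment
};
```

Keeping both versions side by side makes it easy to benchmark the same data set with and without streaming stores.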
I have a few questions about the TDataSet class templated on a type argument and on data size. First, did you try using the Blaze or Blitz++ library (Blaze being the faster of the two) for your data containers? Second, did you run into any optimization problems with the templated TDataSet class compared with free-standing dynamically allocated arrays? Third, did you try partially specializing TDataSet for SSE and AVX data types and vectorizing it manually?
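To make the third question concrete, here is a minimal sketch of a class templated on element type and size, with a partial specialization that fixes the element type to the SSE2 vector type __m128d while leaving the size free. The name TVecDataSet and its Fill/Sum members are hypothetical; an AVX version would specialize on __m256d in the same way, using <immintrin.h> and AVX code generation enabled (e.g. -mavx), which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2: __m128d, _mm_set1_pd, _mm_add_pd, _mm_setzero_pd, _mm_storeu_pd

// Primary template: scalar element type T, N elements (hypothetical names).
template <typename T, std::size_t N>
class TVecDataSet
{
public:
    void Fill(T value) { for (std::size_t i = 0; i < N; ++i) m_data[i] = value; }
    T Sum() const
    {
        T s{};
        for (std::size_t i = 0; i < N; ++i) s += m_data[i];
        return s;
    }
    T operator[](std::size_t i) const { assert(i < N); return m_data[i]; }

private:
    T m_data[N];
};

// Partial specialization: element type fixed to __m128d (2 doubles per
// element), N is still a template parameter. The loop is vectorized by hand.
template <std::size_t N>
class TVecDataSet<__m128d, N>
{
public:
    void Fill(double value) { for (std::size_t i = 0; i < N; ++i) m_data[i] = _mm_set1_pd(value); }
    double Sum() const
    {
        __m128d acc = _mm_setzero_pd();
        for (std::size_t i = 0; i < N; ++i) acc = _mm_add_pd(acc, m_data[i]); // 2 lanes per add
        double lanes[2];
        _mm_storeu_pd(lanes, acc); // horizontal sum of the two lanes
        return lanes[0] + lanes[1];
    }

private:
    __m128d m_data[N]; // __m128d is 16-byte aligned by type
};
```

For example, TVecDataSet<double, 8> and TVecDataSet<__m128d, 4> both hold 8 doubles, and after Fill(1.5) both Sum() calls return 12.0.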
Here is a short example of the main computation loop of a vectorized Gaussian-like noise generator. It is part of a class that is partially specialized on the __m256d union template argument: the primary template accepts a scalar typename T argument and is followed by two partial specializations, for the __m256d and __m128 unions.
*Note: the std::printf call in the loop is for debugging purposes only.
for (std::size_t i{ 0 }; i != vecLength; i += 4) // vecLength must be a multiple of 4
{
    // u1: redraw while any lane is 0.0, because log(0) is -inf
    __m256d vrv1;
    do
        vrv1 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
    while (_mm256_movemask_pd(_mm256_cmp_pd(vrv1, _mm256_setzero_pd(), _CMP_EQ_OQ)) != 0);
    // u2: argument for the waveform generator
    __m256d vrv2 = _mm256_set_pd(rand_gen(), rand_gen(), rand_gen(), rand_gen());
    __m256d temp1 = _mm256_set_pd(this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[0]),
                                  this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[1]),
                                  this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[2]),
                                  this->m_oWaveformGenerator(2.0 * PI * vrv2.m256d_f64[3]));
    // z = sqrt(-2 * ln(u1)) * f(2 * PI * u2)  (_mm256_log_pd is an SVML intrinsic)
    __m256d vvu1 = _mm256_mul_pd(_mm256_sqrt_pd(_mm256_mul_pd(_mm256_set1_pd(-2.0), _mm256_log_pd(vrv1))), temp1);
    // x = mean + sqrt(variance) * z
    __m256d vvr2 = _mm256_add_pd(_mm256_set1_pd(this->m_dMean),
                                 _mm256_mul_pd(_mm256_sqrt_pd(_mm256_set1_pd(this->m_dVariance)), vvu1));
    std::printf("v0=%.9f,v1=%.9f,v2=%.9f,v3=%.9f\n", vvr2.m256d_f64[0], vvr2.m256d_f64[1], vvr2.m256d_f64[2], vvr2.m256d_f64[3]);
    // Assuming elements are std::pair<double, double>: a 4-double _mm256_storeu_pd
    // at &[i].first would interleave .first/.second of two pairs, so store per lane
    for (int k = 0; k < 4; ++k)
    {
        this->m_oAWGNModSine[i + k].first = vvu1.m256d_f64[k];
        this->m_oAWGNModSine[i + k].second = vvr2.m256d_f64[k];
    }
}
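When checking a hand-vectorized loop like this one, a scalar reference is handy. The sketch below assumes the loop is meant to implement the Box-Muller transform, with m_oWaveformGenerator acting as a cosine and m_dVariance being a variance (so the scale is its square root). The function name GaussianNoise, the std::mt19937_64 engine and the seed parameter are illustrative choices, not the poster's rand_gen().

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical scalar reference for the Box-Muller transform.
// Returns pairs {z, x}: z ~ N(0, 1) and x = mean + sqrt(variance) * z,
// mirroring the {vvu1, vvr2} pairs stored by the vectorized loop.
std::vector<std::pair<double, double>> GaussianNoise(std::size_t n, double mean, double variance, unsigned seed)
{
    assert(variance >= 0.0);
    const double kTwoPi = 6.283185307179586;
    std::mt19937_64 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0); // [0, 1)
    std::vector<std::pair<double, double>> out(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        double u1;
        do { u1 = uniform(gen); } while (u1 == 0.0); // log(0) would be -inf
        const double u2 = uniform(gen);
        const double z = std::sqrt(-2.0 * std::log(u1)) * std::cos(kTwoPi * u2);
        out[i] = { z, mean + std::sqrt(variance) * z };
    }
    return out;
}
```

Over a large n, the second members should average close to `mean` and their sample variance should be close to `variance`; a vectorized version can be compared against these statistics.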