- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Hi Ravi,
You're actually measuring the time of sprintf() and rand(),
since they are *much* slower than my_cmp64() or memcmp().
In the case of memcmp(), the compiler does not call it at all -- it knows the result in advance.
That's why memcmp() appears to be "faster".
I've tested my_cmp64() with the code below. It is about 10 times as fast as memcmp().
Please note the line:
char * volatile src = _src;
The volatile forces the compiler to really call my_cmp64() or memcmp() instead of optimizing them out.
Call with, say, (src + 1) to test unaligned access.
===================================================
#include <immintrin.h>

__inline int my_cmp64(const void* src_1, const void* src_2)
{
const __m256i* src1 = (const __m256i*)src_1;
const __m256i* src2 = (const __m256i*)src_2;
__m256i mm11 = _mm256_lddqu_si256(src1);
__m256i mm12 = _mm256_lddqu_si256(src1 + 1);
__m256i mm21 = _mm256_lddqu_si256(src2);
__m256i mm22 = _mm256_lddqu_si256(src2 + 1);
__m256i mm1 = _mm256_xor_si256(mm11, mm21);
__m256i mm2 = _mm256_xor_si256(mm12, mm22);
__m256i mm = _mm256_or_si256(mm1, mm2);
return !_mm256_testz_si256(mm, mm);
}
void test_cmp()
{
__declspec(align(64)) char _src[200];
char * volatile src = _src;
double time;
int i, result = 0;
for (i = 0; i < sizeof(_src); i++)
_src[i] = i;
time = vx_time();
for (i = 0; i < 100 * 1000 * 1000; i++)
result += my_cmp64(src, src);
// result += memcmp(src, src, 64);
pl("time: %.3f, %d", vx_time(time), result);
}
If you tend to compare large regions, I would recommend trying to amortize the vptest latency: you can perform multiple XORs, combine the results with ORs, and perform a single test on the combined result.
int my_memcmp(const void* src_1, const void* src_2, size_t size)
{
    const __m256i* src1 = (const __m256i*)src_1;
    const __m256i* src2 = (const __m256i*)src_2;
    const size_t n = size / 64u;
    for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
    {
        __m256i mm11 = _mm256_lddqu_si256(src1);
        __m256i mm12 = _mm256_lddqu_si256(src1 + 1);
        __m256i mm21 = _mm256_lddqu_si256(src2);
        __m256i mm22 = _mm256_lddqu_si256(src2 + 1);
        __m256i mm1 = _mm256_xor_si256(mm11, mm21);
        __m256i mm2 = _mm256_xor_si256(mm12, mm22);
        __m256i mm = _mm256_or_si256(mm1, mm2);
        if (unlikely(!_mm256_testz_si256(mm, mm)))
        {
            // Find out which of the two 32-byte blocks are different
            if (_mm256_testz_si256(mm1, mm1))
            {
                mm11 = mm12;
                mm21 = mm22;
                mm1 = mm2;
            }
            // Produce the comparison result
            __m256i mm_cmp = _mm256_cmpgt_epi8(mm21, mm11);
            __m256i mm_rcmp = _mm256_cmpgt_epi8(mm11, mm21);
            mm_cmp = _mm256_xor_si256(mm1, mm_cmp);
            mm_rcmp = _mm256_xor_si256(mm1, mm_rcmp);
            uint32_t cmp = _mm256_movemask_epi8(mm_cmp);
            uint32_t rcmp = _mm256_movemask_epi8(mm_rcmp);
            cmp = (cmp - 1u) ^ cmp;
            rcmp = (rcmp - 1u) ^ rcmp;
            return (int32_t)rcmp - (int32_t)cmp;
        }
    }
    // Compare tail bytes, if needed
    return 0;
}
Ravi,
In the initial post's example, the return value indicates:
0 == all equal
1 == all greater
-1 == don't know whether the value is greater or lesser (there is a mix of greater, lesser, and equal).
What are the probabilities of the returned values? You might want to tune your tests and return conditions based on this.
Note that arbitrary string matches (unsorted text) would almost always return -1. Then you would need to perform additional tests to determine whether the text string is greater or lesser.
If this is the case (a preponderance of -1 results leading to additional tests), then I recommend performing the cmpgt and cmplt tests, running movemask on each, and then performing a single unsigned 32-bit compare: the greater value indicates which cmp?t won (if the two masks are equal, i.e. both zero, the blocks were equal). This would eliminate the cmpeq test as well as the additional processing you may be doing now when -1 is returned.
Jim Dempsey
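A minimal sketch of the cmpgt/cmplt + movemask idea described above (illustrative only, not Jim's code; AVX2 assumed, cmp32_direction is a hypothetical name). Note that _mm256_cmpgt_epi8 compares signed bytes, and the single unsigned compare of the two masks is decided by the highest-addressed differing byte:
#include <immintrin.h>
#include <stdint.h>

static int cmp32_direction(const void* a, const void* b)
{
    __m256i va = _mm256_lddqu_si256((const __m256i*)a);
    __m256i vb = _mm256_lddqu_si256((const __m256i*)b);
    /* one mask per direction: bytes where a > b and bytes where a < b */
    uint32_t gt = (uint32_t)_mm256_movemask_epi8(_mm256_cmpgt_epi8(va, vb));
    uint32_t lt = (uint32_t)_mm256_movemask_epi8(_mm256_cmpgt_epi8(vb, va));
    if (gt == lt)               /* masks are disjoint, so equal means both are 0 */
        return 0;               /* all 32 bytes are equal */
    return (gt > lt) ? 1 : -1;  /* single unsigned compare replaces the cmpeq test */
}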
Vladimir is spot on.
If you want to test the performance of a function, a) do not include the setup/initialization in the timing, and b) make sure that the compiler does not outsmart you by eliminating code that appears to produce unused results.
Situation a), when not observed, is particularly puzzling when testing parallel programs, because most newcomers are not aware that random number generators are serializing (they contain a critical section), and this throws off any timing runs.
Jim Dempsey
As to random numbers: I always create small test vectors, fill them with random data, and then start the test. Results can be stored, as you said, in volatile variables. I have never heard of a compiler eliminating volatile results, though the other way around it could happen. Secondly, check the generated assembly, regardless of whether the results are expected or unexpected. The interesting part, the code you wrote, should not be too hard to recognize with some training.
It might also be a good idea to run the test several times, especially if that matches the real usage scenario later. Use the same data again; this way you get some cache warm-up, and if the data is small enough to fit in L1 you eliminate memory bandwidth effects, at least to a certain extent.
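A minimal sketch of such a harness (not the poster's code; standard clock() stands in for the vx_time()/pl() helpers used earlier in the thread):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define REPEATS 5
#define ITERS   (100 * 1000 * 1000)

void test_cmp_warm(void)
{
    static char buf[200];                      /* small enough to stay in L1 */
    char * volatile src = buf;                 /* volatile pointer keeps the calls alive */
    int r, i, result = 0;

    for (i = 0; i < (int)sizeof(buf); i++)     /* setup: random fill, not timed */
        buf[i] = (char)rand();

    for (r = 0; r < REPEATS; r++)              /* repeat on the same data: */
    {                                          /* run 0 warms the cache */
        clock_t t0 = clock();
        for (i = 0; i < ITERS; i++)
            result += memcmp(src, src, 64);
        printf("run %d: %.3f s, %d\n",
               r, (double)(clock() - t0) / CLOCKS_PER_SEC, result);
    }
}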
Christian,
It's important that volatile is an attribute of the pointer, not of the variable or array it points to.
The pointer itself might actually never change.
This is much better because one doesn't need to care much about the attributes and contents of the variable.
Besides that, VC doesn't allow volatile __mXXX variables.
The compiler cannot skip an access through such a pointer, since it can't predict the pointer's value.
The only remaining requirement is that you print some result of the calculation.
Unfortunately, I figured that out only recently. Before that, I also used volatile variables.
Vladimir,
thanks for the info!
I read my post again and realized I expressed what I meant a bit unclearly: I assumed a case where you have several values, or a whole result array -- still small for the test. Then I always generate a random index in the range of the output array and read that item into a volatile variable. This variable is of type float or double, depending on the preceding AVX or SSE code. This way you need no printing. And the compiler can't omit any write to the result array, since there is at least one read afterwards, but at an uncertain position. At least, that is what I think, and I have never had problems this way.
What you mentioned sounds interesting, too. You use a pointer to __m128 or __m256 types and make the pointer itself volatile, right? This seems more universal. But instead of printing you could try declaring a volatile float variable and filling it with a _mm_store_ss/sd. I mean, if the pointer is volatile the compiler can't optimize it away, and if you store one element of the vector type into a normal volatile variable, that can't be optimized away either. What do you think?
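A minimal sketch of that combination (illustrative only; consume_result and the squaring step are hypothetical stand-ins, AVX assumed):
#include <immintrin.h>

void consume_result(const float* data)        /* data: at least 8 floats */
{
    const float * volatile src = data;        /* volatile pointer: loads must happen */
    __m256 v = _mm256_loadu_ps(src);
    v = _mm256_mul_ps(v, v);                  /* stand-in for the code being timed */

    float tmp;
    _mm_store_ss(&tmp, _mm256_castps256_ps128(v));  /* extract one lane */
    volatile float sink = tmp;                /* the volatile write can't be removed */
    (void)sink;
}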
Christian,
You can replace any random access by an access through a volatile pointer (or with a volatile index) to avoid the optimization.
If you're storing to a variable (even a volatile one) in a loop, a smart compiler (e.g. LLVM) could skip all steps except the last one.
If you're storing (even a constant) through a volatile pointer to a variable (even a non-volatile one), the compiler has to really store, because it does not know the value of the pointer in advance. It has to assume the pointer could point anywhere in memory.
In other words, the compiler never omits a write through a volatile pointer, no matter how easy it would otherwise be to optimize away.
The only obvious exception is when you don't use the result.
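For illustration, a tiny sketch of that pattern (foo() and bench() are hypothetical names):
int foo(int i);                        /* the function under test */

void bench(int* out, int n)
{
    int * volatile dst = out;          /* volatile pointer, ordinary int target */
    for (int i = 0; i < n; i++)
        dst[0] = foo(i);               /* the store, and therefore the call, stays */
}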
Another approach could be:
for (...)
x ^= foo(...);
print(x);
The compiler must do all the calls in order to combine all the results.
Random access is not a good idea, since it takes time to call rand(), and random access is much slower than regular sequential access.
As a result you could get completely unrealistic timings.
Vladimir,
that's a good point. The thing with the loop is something I'm aware of; that is why I added the result array, where each result has its own location, so the compiler can't skip any iterations. And yes, this adds memory bandwidth issues, but if that is the way the code is used in reality, or in your program, it is definitely legitimate. The next point is that the random access happens after timing the test, and only once, so I think the additional effort at least did not affect the test timings. And you can replace print(x) with volatile tmp = x; the compiler can't optimize that away, as it is the last and only write.
But I like your volatile pointer approach; it seems a good idea to use from now on.
> If you're storing to a variable (even a volatile one) in a loop, a smart compiler (e.g. LLVM) could skip all steps except the last one.
Volatile stores/loads cannot be omitted. It doesn't matter whether the variable itself is volatile or the pointer points to a volatile.
andysem wrote:
> If you're storing to a variable (even a volatile one) in a loop, a smart compiler (e.g. LLVM) could skip all steps except the last one.
Volatile stores/loads cannot be omitted. It doesn't matter whether the variable itself is volatile or the pointer points to a volatile.
This is also what I thought.
So the question is whether there is a real example demonstrating the opposite for volatile variables.