Beginner

Slightly OT, but maybe somebody has an idea.

    Hi,

(My question about ISA extensions is near the bottom of the post.)

Today I found an old piece of code, did some time profiling, and ... nearly fell off my chair.

What happened? This very small piece of code consumes a lot of time, much more than I expected, which is why it was never profiled before.

for (int i = 0; i < n; i++)
{
 objectLabels[i] = labelsMap[objectLabels[i]];
}

int* objectLabels and int* labelsMap are 32-byte aligned arrays declared as object members and allocated with malloc during object construction.

Parallelizing the loop with OpenMP makes it really very slow, which surprised me yet again. But in order to do that, I "renamed" some variables so I could declare them shared. Because OpenMP did not help (in the end I did it manually with parallel_invoke and got somewhat better execution times), I went back to my original sequential loop, but with the "renamed" variables:

int* _sh_objectLabels = objectLabels;
int* _sh_labelsMap = labelsMap;
for (int i = 0; i < n; i++)
{
 _sh_objectLabels[i] = _sh_labelsMap[_sh_objectLabels[i]];
}

NOW a small miracle happens: this version performs 50%-60% better than the version above, even though it is exactly the same! How is this possible?

Btw., this version performs nearly the same:

    int* _sh_objectLabels = objectLabels;
    int* _sh_labelsMap = labelsMap;
    register int* _sh_objectLabels_ptr = _sh_objectLabels;
    register int _sh_objectLabels_val;
    register int _sh_labelsMap_val;
    for (int i = 0; i < n; i++)
    {
        _sh_objectLabels_val = *_sh_objectLabels_ptr;
        _sh_labelsMap_val = *(_sh_labelsMap + _sh_objectLabels_val);
        *_sh_objectLabels_ptr = _sh_labelsMap_val;
        ++_sh_objectLabels_ptr;
    }


[ Compiler: Microsoft VS 2015.3 ]

 

My two questions:

1. Why is it generally so exorbitantly slow? Does this have something to do with the processor's branch logic (there is not a single jump here!), memory access, cache, chipset, etc.?
2. Is it possible to speed up this code with SSE/AVX/AVX2? Can the new gather/scatter instructions help here in some way?
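For reference, the gather idea from question 2 could be sketched like this. This is only a hypothetical `relabel` helper, not the poster's code: it assumes the two arrays do not overlap, and it falls back to the plain scalar loop when AVX2 is not enabled at compile time.

```cpp
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical sketch: apply labelsMap to objectLabels using an AVX2
// gather, 8 ints per iteration. Assumes the arrays do not overlap.
void relabel(int* objectLabels, const int* labelsMap, int n)
{
    int i = 0;
#if defined(__AVX2__)
    for (; i + 8 <= n; i += 8) {
        // Load 8 indices, gather labelsMap[idx] for each, store back.
        __m256i idx = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(objectLabels + i));
        __m256i val = _mm256_i32gather_epi32(labelsMap, idx, 4); // scale: 4 bytes
        _mm256_storeu_si256(
            reinterpret_cast<__m256i*>(objectLabels + i), val);
    }
#endif
    for (; i < n; ++i)  // scalar tail (or the whole loop without AVX2)
        objectLabels[i] = labelsMap[objectLabels[i]];
}
```

Note that MSVC only defines `__AVX2__` when building with /arch:AVX2; as the replies below suggest, gather may simplify the coding more than it helps the actual memory-bound performance.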

Many thanks in advance!

7 Replies
Highlighted
Employee
17 Views

How large is the objectLabels array? Since you want to parallelize it, I hope it's large. If it *is* large, and you just allocated it, it's likely that the real cost here is in page-faults to populate it. (At least, that would be true on Linux :-)) If that is the case the details of the instructions in the loop are utterly irrelevant.
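If page faults are indeed the culprit, one common mitigation is to touch the freshly malloc'ed memory once up front so the faults are paid outside the timed region. A minimal sketch, using a hypothetical `prefault` helper (not from the thread):

```cpp
#include <cstring>
#include <cstddef>

// Hypothetical helper: write every byte once so the OS maps all pages
// of a freshly allocated buffer before the timed loop runs.
void prefault(int* p, std::size_t n)
{
    std::memset(p, 0, n * sizeof(int));
}
```

With this, a later profiling run measures only the loop itself, not the first-touch cost of the allocation.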

Beginner

The array is large, but not huge. Why should page faults be triggered on each memory access and not only when crossing page boundaries? I have some algorithms with 5 input arrays, all of them much bigger, but their execution time is lower. Something very strange is going on here.

New Contributor III

I think the problem can be caused by poor predictability of the memory accesses. Depending on the values in objectLabels (before the loop), the loop may result in an effectively random access pattern over labelsMap, which would hamper hardware prefetching.

The loop also can't be easily parallelized because objectLabels and labelsMap may overlap. Pre-loop checks cannot resolve this problem because accesses through labelsMap are not bounded.

If the data in labelsMap can be represented as a function, I think it would be beneficial to use that function here instead of the array. If that's not an option, maybe you can try changing the data to ensure a more predictable memory access pattern. The problem with a possible overlap can be tackled by using different types for the arrays (in C++ that engages strict aliasing rules, which prohibit such arrays from overlapping). Using gather instructions from AVX2 might help, but I wouldn't expect any significant speedup from them.

 

Black Belt

You chose a compiler which doesn't implement strict aliasing in the usual way, but apparently you disposed of that problem by introducing local pointers. By the way, strict aliasing is covered by the original C++ standard. Even with assertions and without the nasty dependency, your compiler would never attempt auto-vectorization of the gather. You should see an advantage if you can invoke a SIMD store, provided that doesn't violate the dependency pattern. That might reduce memory bandwidth enough for threading to help. AVX2 gather instructions would simplify the low-level coding but possibly have less effect on performance.
Beginner

Hello,

thanks for your suggestions! objectLabels and labelsMap do not overlap, but the content is nearly random and can't be represented by a function. What is meant by "strict aliasing in the usual way"? As far as I have seen, the most time-consuming part is labelsMap[objectLabels[i]], so a SIMD stream store may only help a little, but it is a good suggestion to try if the other (right-hand) part can be improved in some significant way.

Black Belt

The typed aliasing provision of C and C++ allows a compiler to assume that pointers to different types (such as int and long) don't point to overlapping storage. According to the docs, Intel C++ on Windows doesn't take advantage of this unless /Qansi-alias is set, because Microsoft C++ provides for possible violations of this standard.
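Independent of type-based aliasing, the no-overlap assumption can also be stated explicitly. A sketch using the non-standard but widely supported `__restrict` qualifier (accepted by MSVC, GCC, and Clang), which promises the compiler the two arrays never alias; the function name is hypothetical:

```cpp
// Sketch: __restrict promises the compiler that objectLabels and
// labelsMap never overlap, removing the aliasing obstacle to
// vectorizing the store side of the loop.
void relabel_restrict(int* __restrict objectLabels,
                      const int* __restrict labelsMap, int n)
{
    for (int i = 0; i < n; ++i)
        objectLabels[i] = labelsMap[objectLabels[i]];
}
```

Whether the compiler then vectorizes the gather side is a separate question, but the store and loop-carried-dependence analysis at least stop being blocked by a possible overlap.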
New Contributor III

Tim P. wrote:

You chose a compiler which doesn't implement strict aliasing in the usual way, but apparently you disposed of that problem by introducing local pointers.

Local pointers have no effect on the strict aliasing rules or on the possibility of overlap. Let me illustrate with examples:

int foo(int* p1, int* p2)
{
  *p1 = 10;
  return *p2;
}

The compiler cannot know whether p1 and p2 overlap (i.e., in this case, point to the same int variable). Introducing proxy pointers does not change that:

int foo(int* p1, int* p2)
{
  int* q1 = p1;
  int* q2 = p2;
  *q1 = 10;
  return *q2;
}

Here, all four pointers can point at the same location, so the compiler must assume the read from q2 can be affected by the prior write to q1. The situation changes when the pointers have different types:

long foo(int* p1, long* p2)
{
  *p1 = 10;
  return *p2;
}

Here, p1 and p2 cannot point to the same location because that is prohibited by the C++ aliasing rules: two objects of the distinct types int and long cannot occupy the same storage location. Hence the compiler can be certain that the read from p2 is not affected by the write to p1 and may reorder these operations at will.

It is true that MSVC does not employ the strict aliasing rule for optimizations (at least, not to my knowledge), but other compilers do.
