- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
(My question abot ISA-Extension is near the bottom of post)
today I has found a old piece of code, done time profiling and ....was nearly fallen from the chair.
What's happen? This very smal piece of code consumes a lot of time, much more as was expected by me and therefore was never profiled before.
for (int i = 0; i < n; i++)
{
objectLabels = labelsMap[objectLabels];
}
int* objectLabels and int* labelsMap are 32-byte aligned arrays declared as object members, and created with malloc during object construction.
Parallelizing loop with OpenMP makes it reayll very slow, what wonders me too again. But, in order to do that, I've "renamed" some variables in order to declare it shared. Because OpenMP does not help (at the end I done it withs parallel_invoke manually and get some better execution timing) eiher I has my originall sequential loop with "renamed" varibales:
int* _sh_objectLabels = objectLabels;
int* _sh_labelsMap = labelsMap;
for (int i = 0; i < n; i++)
{
_sh_objectLabels = _sh_labelsMap[_sh_objectLabels];
}
NOW, some wonder happens - this version performs 50%-60% better as version above - but this is exactly the same! How is this possible?
Btw. This version performs with nearly same performance:
int* _sh_objectLabels = objectLabels;
int* _sh_labelsMap = labelsMap;
register int* _sh_objectLabels_ptr = _sh_objectLabels;
register int _sh_objectLabels_val;
register int _sh_labelsMap_val;
for (int i = 0; i < N; i++)
{
_sh_objectLabels_val = *_sh_objectLabels_ptr;
_sh_labelsMap_val = *(_sh_labelsMap + _sh_objectLabels_val);
*_sh_objectLabels_ptr = _sh_labelsMap_val;
++_sh_objectLabels_ptr;
}
[ Compiler: Microsoft VS 2015.3 ]
My two questions:
1. Why it is generally so exorbital slow? Does this has something with processor jump-logic (here is no one jump!), memory access, cache, chipset, etc?
2. Is it possible to speedup this code with SSE/AVX/AVX2? Can new gather/scatter help here in some way?
Many thanks in advance!
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
How large is the objectLabels array? Since you want to parallelize it, I hope it's large. If it *is* large, and you just allocated it, it's likely that the real cost here is in page-faults to populate it. (At least, that would be true on Linux :-)) If that is the case the details of the instructions in the loop are utterly irrelevant.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The array is large, but not too huge. Why shouuld page faults be transmitted on each memory access and not only on boundaries crossing? I've some algorithms with 5 input arrays, which are all much bigger, but execution time is lower. Something here is very strange.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I think the problem can be caused by poor predictability of the memory accesses. Depending on the values in objectLabels (before the loop), the loop may result in an effectively random access pattern over labelsMap, which would hamper hardware prefetching.
The loop also can't be easily parallelized because objectLabels and labelsMap may overlap. Pre-loop checks cannot resolve this problem because accesses through labelsMap are not bounded.
If the data in labelsMap can be represented as a function, I think it would be beneficial to use that function here instead of the array. If that's not an option, maybe you can try changing the data to ensure a more predictable memory access pattern. The problem with a possible overlap can be tackled by using different types for the arrays (in C++ that engages strict aliasing rules, which prohibit such arrays from overlapping). Using gather instructions from AVX2 might help, but I wouldn't expect any significant speedup from them.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello ,
thanks for your suggestions! objectLabels and labelsMap does not overlap but the content is nearly random and can't be represented by an function. What is mean by "strict aliasing in the usual way"? As I seen, the most time consumption part is labelsMap[objectLabels], so simd stream store may help only a little bit, but it is a good suggestion to make this, if other (right) part may be improved in some significant way.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tim P. wrote:
You chose a compiler which doesn't implement strict aliasing in the usual way, but apparently you disposed of that problem by introducing local pointers.
Local pointers have no effect on the strict aliasing rules or the possibility of overlap. Let me put it in examples:
int foo(int* p1, int* p2) { *p1 = 10; return *p2; }
The compiler cannot know if p1 and p2 overlap (i.e. in this case, point to the same int variable). Introducing proxy pointers does not change that:
int foo(int* p1, int* p2) { int* q1 = p1; int* q2 = p2; *q1 = 10; return *q2; }
Here, all four pointers can point at the same location, so the compiler must assume the read from q2 can be affected by the prior write to q1. The situation changes when the pointers have different types:
long foo(int* p1, long* p2) { *p1 = 10; return *p2; }
Here, p1 and p2 cannot point to the same location because that is prohibited by the C++ aliasing rule: two objects of different types int and long cannot occupy the same storage location. Hence the compiler can be certain that the read from p2 is not affected by write to p1 and may reorder these operations at will.
It is true that MSVC does not employ strict aliasing rule for optimizations (at least, it doesn't according to my knowledge), but other compilers do.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page