Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Dmitry_Vyukov
Valued Contributor I
49 Views

Possible optimization of concurrent_vector::operator[]()

When I was working out details of distributed reference counting scheme (http://groups.google.com/group/lock-free/browse_frm/thread/46d00a0f543a758e) I needed a component which is basically similar to tbb::concurrent_vector, i.e. concurrent subscripting and growth. Self-established goal was to achieve lowest possible overhead and maximum possibility of inlining of subscripting, because subscripting is invocked on every reference counting operation.

The variant I came out is:
[bash]int subscript1(int** root, size_t idx)
{
    unsigned long high_bit;
    _BitScanReverse(&high_bit, idx);
    return root[high_bit][idx];
}[/bash]


While tbb::concurrent_vector implementation is:
[bash]int subscript2(int** root, size_t idx)
{
    unsigned long high_bit;
    _BitScanReverse(&high_bit, idx);
    idx -= (1 << high_bit) & ~1;
    return root[high_bit][idx];
}[/bash]


According compiler code is:
[bash]    while (x)
        x += subscript1(root, y);
00401014 0F BD D0         bsr         edx,eax 
00401017 89 54 24 08      mov         dword ptr [esp+8],edx 
0040101B 8B 14 96         mov         edx,dword ptr [esi+edx*4] 
0040101E 03 0C 82         add         ecx,dword ptr [edx+eax*4] 
00401021 75 F1            jne         main+14h (401014h) [/bash]


[bash]    while (x)
        x += subscript2(root, y);
00401015 0F BD C8         bsr         ecx,eax 
00401018 BF 01 00 00 00   mov         edi,1 
0040101D D3 E7            shl         edi,cl 
0040101F 8B D8            mov         ebx,eax 
00401021 89 4C 24 0C      mov         dword ptr [esp+0Ch],ecx 
00401025 8B 0C 8E         mov         ecx,dword ptr [esi+ecx*4] 
00401028 83 E7 FE         and         edi,0FFFFFFFEh 
0040102B 2B DF            sub         ebx,edi 
0040102D 03 14 99         add         edx,dword ptr [ecx+ebx*4] 
00401030 75 E3            jne         main+15h (401015h)[/bash]


As you may see, my variant cuts shift, bitwise and and substraction instructions, as well increases possibility of inlining, plus a bit smaller preassure on register file.
The trick is to direct segment pointer N elements before actual memory block. This way one does not need to do that "(1 << high_bit) & ~1" offset every time. Each segment pointer points to where array would begin if all the data would situated in single continous memory block. I.e. if first N segments allocated within single memory block, first N segment pointers will be equal.

Are you interested in incorporating this into TBB? However, I suspect there may be some problems with binary interface compatibility, right?

0 Kudos
3 Replies
Anton_M_Intel
Employee
49 Views

Quoting Dmitriy Vyukov
The trick is to direct segment pointer N elements before actual memory block. This way one does not need to do that "(1 << high_bit) & ~1" offset every time. Each segment pointer points to where array would begin if all the data would situated in single continous memory block. I.e. if first N segments allocated within single memory block, first N segment pointers will be equal.

Are you interested in incorporating this into TBB? However, I suspect there may be some problems with binary interface compatibility, right?

Yes, it is a good idea, thank you! And yes, binary compatibility may be a problem but we know how to solve it :) Probably, we could incorporate it some time after 3.0.

ARCH_R_Intel
Employee
49 Views

Nice trick! As Anton noted, we won't have time to get it into our next release but we're surely look into adding it later.

Dmitry_Vyukov
Valued Contributor I
49 Views

I am happy you like it. ConcRT guys just rejected my comment with the suggestion. They copied concurrent_vector from TBB, right? Ok, it's their problem. I think it's nice to have primary method of primary data structure in fundamental library as fast as possible, even if we are talking about a handful of instructions.

Reply