Solved: Request For Concurrent Dynamic bitset

ksgokul · ‎11-05-2010

Hi,

I have a requirement to implement a concurrent bitset which should should be dynamically extensible. I am currently planning to use tbb::concurrent_vector in conjuction with tbb::atomic as a dynamic bitset and setting the bit using fetch_and_increment operation ( increment based on the bit position ).

I would like to make a request to consider a container like concurrent_bitset for a future release, so that similar implementations can benefit from it.

Thanks,

Gokul.

jimdempseyatthecove · ‎11-09-2010

The compare and swap will work, but it is heavier weight than necessary for these bit functions.
Plus you would like compatability with the standard integer operator functions.
( <<=, >>=, *=, /= will require using compare and swap)

Look in the TBB package for the header containing the atomic class. In there you will find

value_type

operator+=( D addend ) {

return fetch_and_add(addend)+addend;

}

This will be some amount of work on your part to add

fetch_and_or
fetch_and_and
fetch_and_xor

And the associated operator functions

operator|=()
operator&=()
operator^=()

This will also require extending the appropriate file referenced by tbb_machine.h (e.g. windows_intel64.h)

The windows_intel64.h has

__TBB_machine_OR

__TBB_machine_AND

but not

__TBB_machine_XOR

The above are defined as returning void, but they should be returning the original value.

Frankly I am surprised these operators are not in the TBB headers already.

The reason you do not want to use compare and swap (which will work) is these functions do not require it.
Note, you could use compare and swap to perform an ADD, but it is not necessary to do so.

For:

ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG

you can use the instructions containing a LOCK prefix. And as a result, the code will always succeed on the first attempt.

Jim Dempsey

View solution in original post

jimdempseyatthecove · ‎11-07-2010

Check your

tbb::atomic

to see if it has

|= or
^= xor
&= and

If it does not, then add these functions.

Bit set uses OR= with bit in position
Bit clear uses AND= with ~(with bit in position)
Bit add uses XOR= with bit in position
Bit test uses non-modifying AND with bit in position

Jim Dempsey

ksgokul · ‎11-08-2010

Hi Jim,

tbb::atomic is not my data structure. It is the atomic data structure that comes with the TBB package. It doesn't have any documented |= and ^= and &= functions. currently i have implemented it using 'compare and swap' functions that come with atomic package.

Was i not clear?

Thanks,

Gokul.

jimdempseyatthecove · ‎11-09-2010

The compare and swap will work, but it is heavier weight than necessary for these bit functions.
Plus you would like compatability with the standard integer operator functions.
( <<=, >>=, *=, /= will require using compare and swap)

Look in the TBB package for the header containing the atomic class. In there you will find

value_type

operator+=( D addend ) {

return fetch_and_add(addend)+addend;

}

This will be some amount of work on your part to add

fetch_and_or
fetch_and_and
fetch_and_xor

And the associated operator functions

operator|=()
operator&=()
operator^=()

This will also require extending the appropriate file referenced by tbb_machine.h (e.g. windows_intel64.h)

The windows_intel64.h has

__TBB_machine_OR

__TBB_machine_AND

but not

__TBB_machine_XOR

The above are defined as returning void, but they should be returning the original value.

Frankly I am surprised these operators are not in the TBB headers already.

The reason you do not want to use compare and swap (which will work) is these functions do not require it.
Note, you could use compare and swap to perform an ADD, but it is not necessary to do so.

For:

ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG

you can use the instructions containing a LOCK prefix. And as a result, the code will always succeed on the first attempt.

Jim Dempsey

ksgokul · ‎11-09-2010

Hi Jim,

Thanks a lot for the guidance. I never knew they existed and my knowledge is restricted to the TBB tutorials. Can you please point me to some article/paper which i can read, before starting to use it ?

If you can provide me with the code, that would be even great. :)

Thanks,

Gokul.

jimdempseyatthecove · ‎11-10-2010

Gokul,

There is nothing in the tutorials that go to this dept of discussion. I suggest you read the header files and start with the AND or OR function (because the ...\machine\*.h files already contain the necessary underlaying functions. When those are working, then implement the XOR. Note, MSVC has _InterlockedXor and _InterlockedXor64 for 32-bit and 64-bit operands. A full implementation would handle BYTE and SHORT too.

>>If you can provide me with the code, that would be even great. :)

Ah grasshopper..., but then you would not have this excellentlearning experience....

Note, a fist attempt hack could add the operator??() functions and use the atomic compare and swap (less code to write). While this would work, it would also exemplify your coding style. A proper implementation would be something you can be proud of. Or usefule in your next job interview.

Jim Dempsey

RafSchietekat · ‎11-10-2010

You could even use template metaprogramming to select the right implementation for your processor.;-)

I wonder how much difference a CAS loop makes compared to a locked operation on x86. They're both very heavy compared to normal processing, and avoiding unnecessary use might amortise both implementations equally. Perhaps the difference only shows up in high levels of contention, where a CAS loop might need a relatively smart backoff strategy to guarantee progress for everybody. But that's for a toolkit writer to worry about, or for somebody who has to assure continuous proper operation in the field, probably not for a casual user who can always go in later to fix any problem when it arises.

jimdempseyatthecove · ‎11-10-2010

Raf,

The CAS has more overhead to setup and will retry when there is a concurrancy issue
***> pseudo code <***

// returns old value
longGetSet(long* p,longbit)
{
long old;
long mask = (1 << bit);
do {
old = *p;
long new = old | mask;
} while(_InterlockedCompareExchange((volatile long*)p,new, old) != old);

return old&mask; // 0=was not set, x=was set
}

verses

voidSet(long* p,longbit)
{
long mask = (1 << bit);
__asm {
mov eax, p
mov edxmask
LOCK
or dword ptr [eax], edx
}
}

Using the LOCK (_Interlocked...) you eliminate reading the old value and a potential retry on failure of CAS.
The consideration here is if you need the old value (e.g. were you the first thread to set the bit?)

Jim

RafSchietekat · ‎11-10-2010

I'm not convinced that the difference is significant in actual measurements.

Dmitry_Vyukov · ‎11-10-2010

In a concurrent environment bitsets inevitably have associated costs of atomic RMW operations. So in a concurrent environment you can consider *byte*sets as an alternative. Byteset will increase memory consumption 8 times, but at the same time decrease execution time ~30-100 times on IA-32 (I am assuming that you have moderate contention and the data fits into cache).

ksgokul · ‎11-12-2010

Just out of curiosity, if i use a Byteset, won't i be still facing the problem of contention because of load-store problems?

Thanks,

Gokul

Dmitry_Vyukov · ‎11-12-2010

Yes, it will face contention problem. However, false-sharing will be reduced 8 times. In the limit, you can use cache-line-set.

jimdempseyatthecove · ‎11-13-2010

Quoting ksgokul

Just out of curiosity, if i use a Byteset, won't i be still facing the problem of contention because of load-store problems?
Thanks,
Gokul

This depends on your coding requirements...

When your requirements are to only

set
clear (reset)
read

Then use of byte field in multi-threaded shared flags is best because a single write or single read can be done without LOCK prefex.

However, when your requirements are to

oldValue = set(NewValue)

Then you can use

LOCK XCHG al, byte ptr [ecx] ; ecx pointing to location (rcx if x64)

or

LOCK BTS dword ptr [ecx], eax ; ecx has location, eax has bit position 0:31
SBB eax, eax ; BTS set CF with old value of bit, produces eax-eax-CF (0/-1)

(use BTC for ^ (xor) and BTR for set to 0)
Your compiler may or may not have locking intrinsics for these operations.
If not, now is the time to learn how to write these yourself...
Or fall back onusing interlocked compare and exchange.

The XCHG implicitly has a LOCK so the prefix is not required.
Both XCHG and BTx with LOCK are thread safe and will not require a compare and its potential for having to retry.

assuming you have an intrinsic

char interlockedXCHG(char* p, char v);

char resourceInUse = 0; // 1 = in use
...
while(interlockedXCHG(&resourceInUse, 1))
_mm_pause();
... use the resource
resourceInUse = 0; // free the resource

or

int interlockedBTS(__int32* p, __int32 b);
int interlockedBTR(__int32* p, __int32 b);
__int32 resourcesInUse = 0;
const __int32 resourceX = 0;
const __int32 resourceY = 1;
...
while(interlockedBTS(&resourcesInUse, resourceX))
_mm_pause();
...
interlockedBTR(&resourcesInUse, resourceX);

Note, using CHAR requires 1 LOCKed instruction, using BTx requires 2 LOCKed instructions.

Size verses overhead.

Jim Dempsey

Jim Dempsey

RafSchietekat · ‎11-14-2010

But wasn't Dmitriy's suggestion meant to avoid expensive locked instructions for non-RMW operations?

ksgokul · ‎11-15-2010

Thanks for all the suggestions. We have finally decided to maintain a 16 byte set ( 16 bytes in the place of 1 bit ) for handling concurrent requests and a single thread maintained bitset for storage purposes. We could find a lot more data to be put in along with the bit to be shared and accessed. So the problem is solved for me right now.

I especially thank Jim and Dmitriy for their valuable suggestions.

Thanks,

Gokul.