
Low contention access to large data collection

ClayB
New Contributor I
An article whose URL was posted to the lock-free thread caught my eye. It's at MSDN Concurrent Access.

The basic idea is one I "discovered" on my own several years ago. The two extremes of ensuring exclusive access to data items within a collection are 1) lock the entire collection for a single thread, or 2) lock individual elements as threads need them. The former can leave threads waiting idle, as the one thread allowed access to the data holds up other threads that may not interfere with each other; the latter can use a lot of memory, since each item in the data set has to have a lock object as part of its memory allocation (i.e., 10000 locks for 10000 items to be used by 8 threads).

A good compromise is something in the middle. Specifically, use a fixed number of locks (256 in the paper cited above), then hash a key (or other unique identifier of data items) to determine the corresponding lock that must be held by a thread to access the data. The probability of two threads contending on the same lock will be the reciprocal of the number of locks used (assuming a good hash function).
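
As a concrete illustration, here is a minimal C sketch of the scheme, assuming Win32 critical sections; the table size, hash, and function names are mine, not the article's:

#include <windows.h>

#define NUM_LOCKS 256   /* fixed lock count; a power of two keeps the hash cheap */

static CRITICAL_SECTION g_locks[NUM_LOCKS];

/* Call once at startup, before any threads touch the collection. */
void InitLockTable(void)
{
    for (int i = 0; i < NUM_LOCKS; i++)
        InitializeCriticalSection(&g_locks[i]);
}

/* Map a key to its lock; any reasonably uniform hash will do
   (this one is Knuth's multiplicative hash). */
static CRITICAL_SECTION *LockForKey(unsigned int key)
{
    return &g_locks[(key * 2654435761u) & (NUM_LOCKS - 1)];
}

void UpdateRecord(unsigned int key /*, new values... */)
{
    CRITICAL_SECTION *cs = LockForKey(key);
    EnterCriticalSection(cs);
    /* ... read or modify the record(s) identified by key ... */
    LeaveCriticalSection(cs);
}

With a uniform hash, two threads touching different records collide on a lock with probability about 1/256, as noted above.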

For those fans of lock-free algorithms, this scheme can be easily implemented as a stopgap measure to quickly implement mutual exclusion until a lock-free solution can be devised. But does this scheme work in all data access cases? Are there some operations of data access that might not be correctly or adequately protected with this low contention scheme?

Anyone else have any thoughts or ideas on the utility of using a fixed number of locks for protecting access to large data collections?

-- clay
Intel_C_Intel
Employee
> For those fans of lock-free algorithms, this scheme
> can be easily implemented as a stopgap measure to
> quickly implement mutual exclusion until a lock-free
> solution can be devised.

Yep.


> But does this scheme work
> in all data access cases?

Nope, but there are workarounds. ;)


> Are there some operations
> of data access that might not be correctly or
> adequately protected with this low contention
> scheme?

If you "need" to adhere to a strict locking order, then you might experience deadlock.


For instance, if you need to go through "two or more" locks to get at your data.

Let's say the locking rules for accessing a generic child object go like this:


Lock rules for parent access.

1. LockTheParent()
2. LockTheChild()
3. UnlockTheChild()
4. UnlockTheParent()

Lock rules for child access.

1. LockTheChild()
2. UnlockTheChild()

Each parent and child is supposed to use different critical sections.

If the hash function happens to collide two different pointers onto the same lock... well, that would deadlock those simple rules real fast! =)

However, there is a workaround...

The Parents are completely "separate" objects from their Children; they only know of each other through their relationships. Since they are separate, they will "each" need their own 256-lock table protecting them in order to avoid deadlock.
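
In code, the workaround amounts to giving each class its own table, so a parent lock and a child lock can never be the same critical section. A sketch under those assumptions (table and function names are mine; initialization as in the earlier sketch):

#include <windows.h>

#define NUM_LOCKS 256

/* One table per object class: the parent-then-child ordering above
   can no longer collide onto a single lock. */
static CRITICAL_SECTION g_parentLocks[NUM_LOCKS];
static CRITICAL_SECTION g_childLocks[NUM_LOCKS];

static CRITICAL_SECTION *ParentLock(const void *p)
{
    return &g_parentLocks[((UINT_PTR)p >> 4) & (NUM_LOCKS - 1)];
}

static CRITICAL_SECTION *ChildLock(const void *c)
{
    return &g_childLocks[((UINT_PTR)c >> 4) & (NUM_LOCKS - 1)];
}

void TouchChild(const void *parent, const void *child)
{
    EnterCriticalSection(ParentLock(parent));   /* 1. LockTheParent   */
    EnterCriticalSection(ChildLock(child));     /* 2. LockTheChild    */
    /* ... work on the child ... */
    LeaveCriticalSection(ChildLock(child));     /* 3. UnlockTheChild  */
    LeaveCriticalSection(ParentLock(parent));   /* 4. UnlockTheParent */
}

(Pointers are shifted right 4 bits before hashing so alignment doesn't waste table slots.)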


>
> Anyone else have any thoughts or ideas on the utility
> of using a fixed number of locks for protecting
> access to large data collections?


Yep, look at my queue:

Queue.c


The internal queue storage is broken down into "sub-queues". Each sub-queue allows for a single concurrent enqueue and dequeue. Testing on my sub-queue design shows that it enters the kernel on critical section contention 30+ times less often than all the "normal" queues out there. I've seen it avoid 30,000,000+ calls into the kernel under heavy load runs.

It performs almost as well as lock-free. It also uses critical section tables to protect the front and back pointers.
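
Queue.c itself isn't reproduced in the thread, but a minimal sketch of the striping idea might look like the following; the stripe count, the two-locks-per-sub-queue layout, and all names are my guesses at the design, not the actual file:

#include <stdlib.h>
#include <windows.h>

#define NUM_SUBQUEUES 16          /* illustrative stripe count */

typedef struct Node { struct Node *next; void *item; } Node;

typedef struct SubQueue {
    CRITICAL_SECTION frontLock;   /* guards dequeues only */
    CRITICAL_SECTION backLock;    /* guards enqueues only */
    Node *front, *back;           /* a dummy node keeps front and back
                                     operations off each other's fields */
} SubQueue;

static SubQueue g_subs[NUM_SUBQUEUES];
static volatile LONG g_enqTicket;

/* Call once at startup. */
void InitSubQueues(void)
{
    for (int i = 0; i < NUM_SUBQUEUES; i++) {
        Node *dummy = malloc(sizeof *dummy);
        dummy->next = NULL;
        dummy->item = NULL;
        g_subs[i].front = g_subs[i].back = dummy;
        InitializeCriticalSection(&g_subs[i].frontLock);
        InitializeCriticalSection(&g_subs[i].backLock);
    }
}

/* Spread producers across stripes so two concurrent enqueues rarely
   contend on the same backLock. */
void Enqueue(void *item)
{
    SubQueue *q = &g_subs[(ULONG)InterlockedIncrement(&g_enqTicket) % NUM_SUBQUEUES];
    Node *n = malloc(sizeof *n);
    n->next = NULL;
    n->item = item;
    EnterCriticalSection(&q->backLock);
    q->back->next = n;            /* append after the current back node */
    q->back = n;
    LeaveCriticalSection(&q->backLock);
}

Dequeue is symmetric on frontLock with its own ticket counter; a real version must also handle an empty stripe (skip to the next one) and only guarantees FIFO order within a stripe, which is part of why this trades so well against a single-lock queue.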


By the way, here are some more clever uses for lock-free:

A "very fast" semaphore:

CountedSection.c


A "very fast" read/write lock:

ReadWriteSection.c


Think of a fast read/write-lock hash table!

What do you think of those speedy algos?

;)
daradder
Beginner
An interesting technique. However, I am puzzled as to why the memory overhead of the alternatives is thought so forbidding. Use a bitmap to hold an individual lock for each member of the collection; then the overhead is only one bit per member.
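
For instance, with MSVC's interlocked bit intrinsics (which compile to LOCK BTS / LOCK BTR on x86), a one-bit-per-member lock is only a few lines; the names and the 100,000-member figure are illustrative:

#include <intrin.h>

/* One lock bit per member: 100,000 members need only about 12.5 KB. */
#define NUM_MEMBERS 100000
static volatile long g_lockBits[(NUM_MEMBERS + 31) / 32];

/* Nonzero if we acquired the lock (the bit was previously clear). */
int TryLockMember(long i)
{
    return !_interlockedbittestandset(&g_lockBits[i >> 5], i & 31);
}

void UnlockMember(long i)
{
    _interlockedbittestandreset(&g_lockBits[i >> 5], i & 31);
}

A blocking acquire would simply spin (ideally with backoff) on TryLockMember.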

Bitmaps are powerful. With a cascading bitmap data structure you can, for hardly any greater space requirement, rapidly find and lock the first unlocked item in the collection (lock-free, even though I keep using the word lock!). The algorithm for that is described at:

LeapHeap bitmap algorithm

The bitmap technique may not fit neatly with the specific example in the MSDN magazine article originally referenced, but locking doesn't HAVE to be as complex as they make it there.

Cheers,
DarAdder

ClayB
New Contributor I
I think the ratio of sync-object memory to data memory is one of the sticking points. If a CRITICAL_SECTION is 24 bytes, then an object or struct that also has six integers (e.g., XYZ coordinates, RGB, Stargate addresses with home symbol understood) will spend 50% of its memory just on the sync object.

Secondly, if your application uses 100,000 objects, 2.3 MB of memory is devoted to sync objects. But if only 8 threads are executing over this data, that seems like an awful waste of space, since at most 192 (= 8 x 24) bytes would be in use at any one time. Blocking a thread to provide exclusive access only makes sense at the times when threads are actually likely to contend for a single object.

Certainly using bit-locks could solve the memory size issue: 100,000 objects would only need about 12 KB of memory for the entire set of bit-locks. Only tier-1 locks are needed, since the code should be able to compute the address of the corresponding lock-bit from the address of the object. (No need to search, as is done in the algorithm DarAdder cited, where the allocator must find a free block for the calling thread.)
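
The no-search property falls out of simple pointer arithmetic when the objects sit in one contiguous array (a sketch; the record type and array names are mine):

/* Objects live in one contiguous array, so the element index is the
   lock-bit index; nothing has to be searched. */
typedef struct Record { int x, y, z, r, g, b; } Record;

static Record g_records[100000];

long LockBitIndex(const Record *rec)
{
    return (long)(rec - g_records);   /* element index == bit index */
}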

Still, at some point there needs to be some atomic operation. The article cited mentions 9 and 15 instructions, which I assume is to show the speed at which the searching and setting of bits can be done. However, the longer such operations take to locate and switch even a bit, the more likely it is that other threads will contend for the same resource and cause problems. At one extreme you can lock the whole bit-lock data, but this would serialize access. You could set up a series of sync objects to protect, say, each 4-byte collection of bit-locks, but this is just the original scheme with another layer of "indirection".

The longer I sit and ruminate over this, the more I waffle as to whether or not it can be done with a lock-free algorithm. It does seem possible, but it would not be "lock-free" in the strictest sense (depending on your definition), since other threads not using the desired object could keep a thread from making progress. That is, within whatever smallest addressable clump of bits holds the desired lock-bit, another thread could change some other bit, invalidating the check-and-set test even though the object needed by the first thread was never locked.
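
That interference is easy to see in a compare-and-swap formulation of the bit lock (a sketch using InterlockedCompareExchange): the CAS covers the whole 32-bit word, so a neighbor flipping any other bit forces a retry even though "our" bit never changed.

#include <windows.h>

/* Spin until we set bit `bit` of *word.  The retry is triggered by
   ANY change to the word, not just to our bit -- exactly the
   interference described above, which is why this is not lock-free
   in the strictest sense. */
void LockBit(volatile LONG *word, int bit)
{
    const LONG mask = 1L << bit;
    for (;;) {
        LONG old = *word;
        if ((old & mask) == 0 &&
            InterlockedCompareExchange(word, old | mask, old) == old)
            return;                   /* the bit is now ours */
        /* either the bit was held, or a neighboring bit changed
           between the read and the CAS; try again */
    }
}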

-- clay
bronx
Beginner
> and switch even a bit, the more likely it is that
> other threads can contend for the same resource and
> cause problems. At one extreme you can lock the

sure, with 128 flags per L2 cache line we can expect some major slowdowns due to false line sharing on SMP Xeons for example

although I have no practical experience with massive numbers of locks, I guess that some hybrid algorithm between the hash-code lookup (using "macro locks" stored 128 bytes apart) and the bitmaps (to resolve hash-code collisions) would give us the best of both ideas
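
The padding half of that hybrid is straightforward to sketch in MSVC; the struct layout and names are my guess at the idea, with the 128-byte spacing matching the figure above:

#include <windows.h>

/* Each "macro lock" gets its own 128-byte line, so two hot locks never
   falsely share a line; the per-slot bit word can resolve hash
   collisions among the records mapped to this slot. */
typedef struct __declspec(align(128)) MacroLock {
    CRITICAL_SECTION cs;        /* the hash-selected lock           */
    volatile long    bits;      /* bit-locks for colliding records  */
    char pad[128 - sizeof(CRITICAL_SECTION) - sizeof(long)];
} MacroLock;

static MacroLock g_locks[256];  /* 256 slots = 32 KB, contention ~1/256 */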
ClayB
New Contributor I
> sure, with 128 flags per L2 cache line we can expect
> some major slowdowns due to false line sharing on SMP
> Xeons for example

I hadn't thought of false sharing being a problem, but it certainly can be a factor. I don't think I'd say "major slowdowns", though: with 100,000 records at 128 flags per cache line, the lock bits span roughly 800 lines, so two threads have about a 1 in 800 chance of hitting the same L2 cache line. That assumes random access, of course.

If there is some predictable or computable pattern to the access of locks (e.g. threads do linear scan through list and process next unseen object), proper planning and distribution of lock bits could reduce the risk of false sharing to a near impossibility.
ClayB
New Contributor I
> Anyone else have any thoughts or ideas on the utility
> of using a fixed number of locks for protecting
> access to large data collections?

I think I was being too vague in my original post (or I've been a teacher too long and expect hands to go up when I finish with a question). What I was driving towards was a "discussion" on how access methods are supported by certain choices of data structure implementation. (2 Gold stars if you realized this already.)

Data structures are defined by the operations that access the data: push and pop for stacks, enqueue and dequeue for queues, etc. Different implementations give better or worse performance, and the object of good design is to find an implementation that balances performance against the relative frequency of each expected operation. For example, a singly linked list with only a head pointer can dequeue an item in O(1) time, but would take O(n) to walk the whole list to find the end for an enqueue. Add a "tail" pointer and enqueue becomes an O(1) operation, as the sketch below shows.
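
A minimal single-threaded C sketch of the tail-pointer point (locking deliberately omitted; that is the next paragraph's problem):

#include <stdlib.h>

typedef struct Node  { struct Node *next; void *item; } Node;
typedef struct Queue { Node *head, *tail; } Queue;   /* init both to NULL */

/* O(1): the tail pointer removes the O(n) walk to the end. */
void Enqueue(Queue *q, void *item)
{
    Node *n = malloc(sizeof *n);
    n->next = NULL;
    n->item = item;
    if (q->tail) q->tail->next = n;
    else         q->head = n;
    q->tail = n;
}

/* O(1): pop from the head. */
void *Dequeue(Queue *q)
{
    Node *n = q->head;
    if (!n) return NULL;
    q->head = n->next;
    if (!q->head) q->tail = NULL;
    void *item = n->item;
    free(n);
    return item;
}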

With the fixed-number-of-locks method, which operations can be implemented efficiently? In this case, the efficiency metric would be low thread contention. Updating a record would have low contention, since only a single lock is needed to exclude other threads from the chosen data. The contention of adding or deleting records would depend on how the collection of data was "held together". In a static array, there might be low contention to use an unfilled slot or erase values in a slot. With a linked structure, since the pointers in other records will need to be modified, more than one lock might need to be held, and this could easily lead to deadlock if proper ordering was not observed between all threads (see the sketch below).
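
The usual cure for that multi-lock deadlock is a global acquisition order over the lock table; a sketch (function names mine):

#include <windows.h>

#define NUM_LOCKS 256
static CRITICAL_SECTION g_locks[NUM_LOCKS];

/* Always take stripe locks in ascending index order, so no cycle of
   waiting threads -- and hence no deadlock -- can form.  Also handles
   the case where both records hash onto the same lock. */
void LockPair(int a, int b)
{
    if (a > b) { int t = a; a = b; b = t; }
    EnterCriticalSection(&g_locks[a]);
    if (b != a)
        EnterCriticalSection(&g_locks[b]);
}

void UnlockPair(int a, int b)
{
    if (a > b) { int t = a; a = b; b = t; }
    if (b != a)
        LeaveCriticalSection(&g_locks[b]);
    LeaveCriticalSection(&g_locks[a]);
}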

For the add/delete problem, you might implement an overall "add/delete" lock that is granted to a requesting thread only after all other threads are guaranteed not to be holding any of the other locks. This is like the Readers/Writers locking schemes. Since holding such a lock prevents all other threads from accessing the data, it still entails high thread contention; but if adding/deleting records is a very infrequent operation, the contention may be acceptable.
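
One way to sketch that gate is with a readers/writer lock around the stripe locks; the SRW lock used here is a Vista-era Win32 primitive, so it postdates this discussion and is purely illustrative. Record operations take the gate shared plus their stripe lock; add/delete takes it exclusive, which by construction waits until no other thread holds any stripe lock under it:

#include <windows.h>

static SRWLOCK g_structureGate = SRWLOCK_INIT;   /* "add/delete" lock */
static CRITICAL_SECTION g_locks[256];            /* stripe locks      */

void UpdateRecord(unsigned int key)
{
    AcquireSRWLockShared(&g_structureGate);      /* many updaters at once */
    EnterCriticalSection(&g_locks[key & 255]);
    /* ... modify the record in place ... */
    LeaveCriticalSection(&g_locks[key & 255]);
    ReleaseSRWLockShared(&g_structureGate);
}

void AddOrDeleteRecord(void)
{
    AcquireSRWLockExclusive(&g_structureGate);   /* drains all updaters */
    /* ... relink or move records freely; no stripe locks are needed
       because no updater can be inside the structure right now ... */
    ReleaseSRWLockExclusive(&g_structureGate);
}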

I have enjoyed the discussions that have come from the original post. Not what I was expecting, but very instructive nonetheless.

-- clay
bronx
Beginner
> I hadn't thought of false sharing being a problem,
> but it certainly can be a factor. I don't think I'd
> say "major slowdowns", though: with 100,000 records
> at 128 flags per cache line, the lock bits span
> roughly 800 lines, so two threads have about a 1 in
> 800 chance of hitting the same L2 cache line. That
> assumes random access, of course.

yes, provided that the access pattern is evenly distributed, which is often not the case. It's quite common to see threads working on the same problem, as with team algorithms / blackboards. In these cases the probability of accessing the same objects with temporal locality is higher. The threads can visit the whole database while still accessing nearby locations at a given time, much as a swarm of bees travels over long distances without much distance between individual bees.

Anyway, sending the line away and getting it back will require at least 16 bus transactions (a 128-byte line over a 128-bit memory interface takes 8 transfers each way). Even at 1/800 probability, the impact can be significant with a 533/3066 (FSB/core) clock ratio.



> If there is some predictable or computable pattern to
> the access of locks (e.g. threads do linear scan
> through list and process next unseen object), proper
> planning and distribution of lock bits could reduce
> the risk of false sharing to a near impossibility.

It looks very hard to implement: it will require special-purpose code (to be maintained, and kept in sync with changes in the main algorithm) for devising an efficient distribution, and the bitmap <=> address mapping will be more complex.

daradder
Beginner
I may have strayed off-topic with my previous post, causing some confusion. Yes, if you must lock one specific datum (rather than any one of a set of suitable data) you can use a simple bitmap with overhead of one bit per lock. A lock is seized (Pentium processor) with the single instruction LOCK BTS. From a programmer's perspective this is unaffected by other threads concurrently accessing nearby bits; from a broader perspective it can induce cache-sloshing, as other posters have pointed out.

The hashlocking technique seems most useful in situations where each lockable datum does not have an index or address which can speedily be related to a bit offset within a bitmap. As Mr LockFreeX has cleverly spotted, it risks deadlock if a thread holds several locks at once.

I think the point I was trying to make in my first post was about inverse abstraction. Locking seems so complex and mysterious when expressed in the 'C' or C++ languages. OTOH a complete routine to atomically seize a specific bit within a specified bitmap is about six opcodes of inline assembler (albeit non-portable).
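
For the curious, such a routine in 32-bit MSVC inline assembler might look like this (a sketch of the idea; daradder's actual routine is not shown in the thread):

/* Try to seize bit `bit` of the bitmap at `base`.
   Returns 1 if the bit was free and is now ours, 0 if already held. */
__declspec(naked) int __fastcall TrySeizeBit(long *base, long bit)
{
    __asm {
        xor  eax, eax            ; presume failure
        lock bts [ecx], edx      ; atomically test-and-set bit edx of [ecx]
        setnc al                 ; CF = previous bit; clear means we own it
        ret                      ; __fastcall: args arrived in ecx/edx
    }
}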

Dar