Re: Locking elements in a matrix

mikedeskevich · ‎01-22-2008

I have a question about how to handle locking parts of a matrix. First, I guess I should use the term 2d-Array rather than matrix so that people aren't getting the mathematical idea of a matrix. My data structure is just a 2d-Array of numbers and not being treated like a mathematical matrix.

Lets say in my matrix I need to update one element and I need it to be locked while updating (I cannot use an atomic operation here), but I don't want to lock the entire matrix because I want another thread to be able to update another element in a different part of the matrix. Does it make sense for me to have a matrix of mutexs in parallel with the matrix and then just grab a spin lock for the element when I need to work with it? Or is there a smarter way to do what I want to do?

Mike

Alexey-Kukanov · ‎01-23-2008

I would suggest you lock whole rows, columns, or blocks, depending on your processing pattern(s). It's not even necessary to have a lock per row or column; practically, for low contention the number of locks should exceed the number of threads (assuming that threads usually process data protected by different locks - this is why data processing pattern is an important consideration in choosing a locking pattern). Quitea common schema is to have an array of M locks and use modulo M calculations to determine the exact lock protecting a particular data element (or row, or column).

Per-element locking is also possible without the special array of locks. One might add a special byte into the data structure for sake of locking. For instance, you could try tbb::atomic for that. Still, it will require more space than the previous approach. Sometimes, spare bits in data fields might be used for locking; but this approach requires understanding of platform-specific low-level machinery of atomic operations, memory fences to prevent reordering by HW or compiler, etc., andI would not recommend using it in an application.

Also I want to add that protecting only updates is usually not enough in presense of non-updating reads. In general case, for deterministic results data reads should also be protected by the same lock as data updates. If reads are concurrent and their ratio to updates is high enough, then read-write locks might decrease contention to some extent.

mikedeskevich · ‎01-23-2008

Alexey, thanks for the advice. I looked into the atomic operations a little more closely than the first time, and I think I can get away with using compare_and_swap without locks. It looks like from the documentation (the O'reilly book) that the default memory fences should work for me.

I think this will work the best for me, as there's no processing pattern than I can follow. And it's also recursive (the recursive nature is ideal for me to use some type of tbb::task idiom) so having a fixed number of locks will be hard to keep track of: updating one element can cause anywhere from 0 to the all of the other elements to be updated.

Alexey-Kukanov · ‎01-23-2008

mikedeskevich:
Alexey, thanks for the advice. I looked into the atomic operations a little more closely than the first time, and I think I can get away with using compare_and_swap without locks. It looks like from the documentation (the O'reilly book) that the default memory fences should work for me.

If you use TBB atomic template, then we already cared about right placing of memory fences, at least for supported platforms and compilers. The default full fence for compare_and_swap is reasonable, at least if you do not target platforms with weak consistency memory model such as Intel Itanium processor-based systems.

mikedeskevich:
...updating one element can cause anywhere from 0 to the all of the other elements to be updated.

Be careful to not fall into a deadlock then :)

Alexey-Kukanov · ‎01-23-2008

MADakukanov:
Be careful to not fall into a deadlock then :)

This my comment is valid if you plan using tbb::atomic as a locking mechanism.

If you update data with atomic operations, there will be no deadlock of course; but you will need to care that data remain consistent while concurrently updated.

mikedeskevich · ‎02-05-2008

I'm have another question relating to my first posting of updating a 2d-array. I've been working on this for a while and I can't seem to use the tbb::atomic operations like I wanted to (I'm running into an A-B-A type problem). This seems like a simple operation to me, but I'm drawing a blank, and I'm hoping that someone can give me a hint on where to go from here.

Let's say that the elements in my matrix are 32bit integer bitmasks and when I need to update an element, I'm just flipping a bit. In a serial approach, I can just use "|=" or "&=~" to turn on or off the bit, respectively. Is there an atomic operation that I can use to do that? I only see atomic addition (and incrementing), but I've got to imagine that bitwise operations are also simple enough that they could be atomic.

Thanks,
Mike

robert-reed · ‎02-08-2008

There's nothing in the Intel Architecture that prevents locking of AND and OR instructions, although to implement locks, you might well prefer to use something like BTS, Bit Test and Set, which is also lockable. If it were to work, each thread wanting to lock an element of the matrix would BTS the appropriate bit and check what they got--if they got a 1, some other thread already had the lock; a zero would indicate that it was previously unlocked and the thread getting the zero could assume the lock and zero the bit when they unlock.

There are a couple of problems here, though. The practical problem, it seems you've noticed, is that TBB does not support atomic semantics for operator|= or operator&=, not at the level of atomic.h, nor down in the machine code implementations where the lock prefixes are specified (you can explore this in the TBB open source). Implementing AND and OR as these operators would require some work and gro. There's no canonical operator to overload for BTS, but some function-based implementation is certainly conceivable.

But there's a bigger problem with this, because of cache lines. Current machines use 64 byte cache lines, or 512 bits worth of locks. This poses a massive problem of false sharing. Any time a thread locks an element, it will need to take ownership of the cache line containing the lock. Any other threads needing to lock any other element whose lock bitis in the same cache line will have to wait until the first thread can write the cache line to memory. Because of the number of locks linked together on a cache line, all threads will have big feet and will step on each other for no logical reason. And as you add more threads, you get more feet and more potential for thrash. You could reduce the coupling by increasing the size of the lock array, say use a byte or a short for each lock. There would still be coupling and the chance for false sharing, but the frequency would be reduced.

robert-reed · ‎02-08-2008

p.s., oops. Looks like I didn't finish a sentence in my previous post. I was suggesting that some work would be involved to implement atomic semantics for more operators, and it would increase the number of functions implmented in the machine specific headers, increasing the effort it would take to port TBB.