Lets say in my matrix I need to update one element and I need it to be locked while updating (I cannot use an atomic operation here), but I don't want to lock the entire matrix because I want another thread to be able to update another element in a different part of the matrix. Does it make sense for me to have a matrix of mutexs in parallel with the matrix and then just grab a spin lock for the element when I need to work with it? Or is there a smarter way to do what I want to do?
I would suggest you lock whole rows, columns, or blocks, depending on your processing pattern(s). It's not even necessary to have a lock per row or column; practically, for low contention the number of locks should exceed the number of threads (assuming that threads usually process data protected by different locks - this is why data processing pattern is an important consideration in choosing a locking pattern). Quitea common schema is to have an array of M locks and use modulo M calculations to determine the exact lock protecting a particular data element (or row, or column).
Per-element locking is also possible without the special array of locks. One might add a special byte into the data structure for sake of locking. For instance, you could try tbb::atomic
Also I want to add that protecting only updates is usually not enough in presense of non-updating reads. In general case, for deterministic results data reads should also be protected by the same lock as data updates. If reads are concurrent and their ratio to updates is high enough, then read-write locks might decrease contention to some extent.
I think this will work the best for me, as there's no processing pattern than I can follow. And it's also recursive (the recursive nature is ideal for me to use some type of tbb::task idiom) so having a fixed number of locks will be hard to keep track of: updating one element can cause anywhere from 0 to the all of the other elements to be updated.
mikedeskevich:Alexey, thanks for the advice. I looked into the atomic operations a little more closely than the first time, and I think I can get away with using compare_and_swap without locks. It looks like from the documentation (the O'reilly book) that the default memory fences should work for me.
If you use TBB atomic template, then we already cared about right placing of memory fences, at least for supported platforms and compilers. The default full fence for compare_and_swap is reasonable, at least if you do not target platforms with weak consistency memory model such as Intel Itanium processor-based systems.
mikedeskevich:...updating one element can cause anywhere from 0 to the all of the other elements to be updated.
Be careful to not fall into a deadlock then :)
MADakukanov:Be careful to not fall into a deadlock then :)
This my comment is valid if you plan using tbb::atomic as a locking mechanism.
If you update data with atomic operations, there will be no deadlock of course; but you will need to care that data remain consistent while concurrently updated.
Let's say that the elements in my matrix are 32bit integer bitmasks and when I need to update an element, I'm just flipping a bit. In a serial approach, I can just use "|=" or "&=~" to turn on or off the bit, respectively. Is there an atomic operation that I can use to do that? I only see atomic addition (and incrementing), but I've got to imagine that bitwise operations are also simple enough that they could be atomic.
There's nothing in the Intel Architecture that prevents locking of AND and OR instructions, although to implement locks, you might well prefer to use something like BTS, Bit Test and Set, which is also lockable. If it were to work, each thread wanting to lock an element of the matrix would BTS the appropriate bit and check what they got--if they got a 1, some other thread already had the lock; a zero would indicate that it was previously unlocked and the thread getting the zero could assume the lock and zero the bit when they unlock.
There are a couple of problems here, though. The practical problem, it seems you've noticed, is that TBB does not support atomic semantics for operator|= or operator&=, not at the level of atomic.h, nor down in the machine code implementations where the lock prefixes are specified (you can explore this in the TBB open source). Implementing AND and OR as these operators would require some work and gro. There's no canonical operator to overload for BTS, but some function-based implementation is certainly conceivable.
But there's a bigger problem with this, because of cache lines. Current machines use 64 byte cache lines, or 512 bits worth of locks. This poses a massive problem of false sharing. Any time a thread locks an element, it will need to take ownership of the cache line containing the lock. Any other threads needing to lock any other element whose lock bitis in the same cache line will have to wait until the first thread can write the cache line to memory. Because of the number of locks linked together on a cache line, all threads will have big feet and will step on each other for no logical reason. And as you add more threads, you get more feet and more potential for thrash. You could reduce the coupling by increasing the size of the lock array, say use a byte or a short for each lock. There would still be coupling and the chance for false sharing, but the frequency would be reduced.