I would suggest you lock whole rows, columns, or blocks, depending on your processing pattern(s). It's not even necessary to have a lock per row or column; practically, for low contention the number of locks should exceed the number of threads (assuming that threads usually process data protected by different locks - this is why data processing pattern is an important consideration in choosing a locking pattern). Quitea common schema is to have an array of M locks and use modulo M calculations to determine the exact lock protecting a particular data element (or row, or column).
Per-element locking is also possible without the special array of locks. One might add a special byte into the data structure for sake of locking. For instance, you could try tbb::atomic
Also I want to add that protecting only updates is usually not enough in presense of non-updating reads. In general case, for deterministic results data reads should also be protected by the same lock as data updates. If reads are concurrent and their ratio to updates is high enough, then read-write locks might decrease contention to some extent.
mikedeskevich:Alexey, thanks for the advice. I looked into the atomic operations a little more closely than the first time, and I think I can get away with using compare_and_swap without locks. It looks like from the documentation (the O'reilly book) that the default memory fences should work for me.
If you use TBB atomic template, then we already cared about right placing of memory fences, at least for supported platforms and compilers. The default full fence for compare_and_swap is reasonable, at least if you do not target platforms with weak consistency memory model such as Intel Itanium processor-based systems.
mikedeskevich:...updating one element can cause anywhere from 0 to the all of the other elements to be updated.
Be careful to not fall into a deadlock then :)
MADakukanov:Be careful to not fall into a deadlock then :)
This my comment is valid if you plan using tbb::atomic as a locking mechanism.
If you update data with atomic operations, there will be no deadlock of course; but you will need to care that data remain consistent while concurrently updated.
There's nothing in the Intel Architecture that prevents locking of AND and OR instructions, although to implement locks, you might well prefer to use something like BTS, Bit Test and Set, which is also lockable. If it were to work, each thread wanting to lock an element of the matrix would BTS the appropriate bit and check what they got--if they got a 1, some other thread already had the lock; a zero would indicate that it was previously unlocked and the thread getting the zero could assume the lock and zero the bit when they unlock.
There are a couple of problems here, though. The practical problem, it seems you've noticed, is that TBB does not support atomic semantics for operator|= or operator&=, not at the level of atomic.h, nor down in the machine code implementations where the lock prefixes are specified (you can explore this in the TBB open source). Implementing AND and OR as these operators would require some work and gro. There's no canonical operator to overload for BTS, but some function-based implementation is certainly conceivable.
But there's a bigger problem with this, because of cache lines. Current machines use 64 byte cache lines, or 512 bits worth of locks. This poses a massive problem of false sharing. Any time a thread locks an element, it will need to take ownership of the cache line containing the lock. Any other threads needing to lock any other element whose lock bitis in the same cache line will have to wait until the first thread can write the cache line to memory. Because of the number of locks linked together on a cache line, all threads will have big feet and will step on each other for no logical reason. And as you add more threads, you get more feet and more potential for thrash. You could reduce the coupling by increasing the size of the lock array, say use a byte or a short for each lock. There would still be coupling and the chance for false sharing, but the frequency would be reduced.