I've been reading "The Art of Multiprocessor Programming", which seems to have nothing but good things to say about atomic operations. They sound nice, but as a pessimist I ran a benchmark today (the code is at the bottom of this post).
I did a benchmark with a global atomic<long long> variable, which is incremented by N-1 worker threads infinitely, while the Nth thread performs a fixed number of increments (100,000,000) and times how long this takes. This should demonstrate the worst-case performance for a highly contended atomic variable. The result is very poor, and I'm trying to understand why... and what to do to improve performance. The test was done on a Linux system, compiled with icc, on a dual quad-core Xeon with 64-bit extensions, using TBB 2.1. Here are the results:
#cores | Time(s) | Final Counter Value | #increments/s | Speedup over 1 Core |
1 | 1.43 | 100000000 | 70141475.36 | 1 |
2 | 5.98 | 847055237 | 141611090.92 | 2.02 |
3 | 11.8 | 901812169 | 76405982.34 | 1.09 |
4 | 12.06 | 750338611 | 62196502.9 | 0.89 |
5 | 15.18 | 836765059 | 55139935.22 | 0.79 |
6 | 18.73 | 945415180 | 50477870.5 | 0.72 |
7 | 18.01 | 948627422 | 52678696.01 | 0.75 |
8 | 54.63 | 2379871615 | 43564811.92 | 0.62 |
I should say that past two cores the atomic operations do not scale well. I conducted another test using distributed atomic operations, where private atomic<long long> variables were allocated with each object to be threaded, and each object was allocated with the cache_aligned_allocator to ensure the atomic objects landed on different cache lines. Timing did not seem to improve much; I did not perform a detailed analysis as above. I remember reading something about atomic operations locking the bus, which would explain the numbers above.
It seems that highly contended atomic operations will have poor performance; worse yet, distributed atomic operations (as in many atomic variables on separate cache lines) could suffer from the same problem.
Comments? It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware-wise probably pretty hard, unless non-colliding atomic operations were OK).
Also, I love having tbb::tbb_thread for these kinds of benchmarks!
Source code here:
// This tests atomic increment throughput by spawning N threads (supplied
// by the user), with each thread incrementing the same memory address.
#include <iostream>
#include <vector>
#include <utility>
#include <cstdlib>
#include <tbb/atomic.h>
#include <tbb/tick_count.h>
#include <tbb/tbb_thread.h>

const long long numIncrementsForTrial = 100000000;
tbb::atomic<long long> atomicRegister;

// Class that infinitely increments the global atomic value
class InfinitelyIncrement
{
private:
    int _id;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            atomicRegister.fetch_and_increment();
        }
    }
};

// Class that increments the global atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer
        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );
        std::cout << "Starting timer..." << std::endl;
        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            atomicRegister.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;
    }
};

int main(int argc, char** argv)
{
    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;
    }
    int numThreads = atoi(argv[1]);
    std::cout << "Running with " << numThreads << " threads." << std::endl;
    std::vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;
    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement) );
    }
    AtomicIncrementTimer timerObject(numIncrementsForTrial);
    tbb::tbb_thread timerThread( timerObject );
    timerThread.join();
    std::cout << "Final Value: " << atomicRegister << std::endl;
    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.
    return 0;
}
Of course this might colour the results of any benchmarks that compare locking and lock-free algorithms... so it would be interesting to see further test results (and/or informed opinions).
What I'm specifically looking at is the throughput of atomic operations, since most (if not all) wait-free algorithms that I know of rely upon them. Realistically atomic operations would be distributed across processors, and used quite frequently if wait-free algorithms depending upon them were used.
Quoting from the Intel manual "Volume 3A: System Programming Guide, Part 1"...
Section 7.1.4
"For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation
as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to insure that the operation is carried out atomically."
This seems to imply that the bus might lock even if the memory being operated upon is in the processor's cache. The bus locking / unlocking could describe the performance that has been observed.
Bus locking went out in the 90's.
Try the example with a separate atomic variable, on separate cache lines, for each thread and I think you'll see it scale just fine.
Besides cache line transfers, there are other costs for atomic operations when they involve the x86 LOCK prefix (e.g. as for atomic increment), because the LOCK prefix takes away the efficiencies of out-of-order pipelined execution:
- LOCKed instructions are serializing instructions. As Section 7.4 of the previously mentioned Volume 3A says, these instructions cannot execute until all prior instructions have executed and pending stores are written.
- LOCKed instructions imply a full memory fence. Hence subsequent loads cannot be done ahead of time.
As for efficiencies of LOCKed instructions: are the results of these instructions supposed to be visible to any observer, not just those participating in the coherent cache? That might make matters worse than otherwise, it seems. I think I have seen examples of architectures providing both coherent-cache-only fences and ones that go all the way to memory, and/or that distinguish between data-only traffic and the intricacies of self-modifying code.
The times slowdown is calculated as the execution time for n threads / execution time for 1 thread.
Here are the results for a single shared atomic<long long>:
#cores | Time(s) | Times Slower than 1 Core |
1 | 1.42 | 1 |
2 | 5.84 | 4.12 |
3 | 11.75 | 8.29 |
4 | 10.26 | 7.23 |
5 | 18.29 | 12.89 |
6 | 16.57 | 11.68 |
7 | 36.29 | 25.58 |
8 | 14.88 | 10.49 |
Note that usually the time for 8 cores is much longer (around 45 - 50s), in this particular execution it was substantially lower for some reason.
I rewrote the benchmark with the same idea in mind, but this time I gave each thread a copy of a class which contained its own atomic<long long>:
#cores | Time(s) | Times Slower than 1 Core |
1 | 1.42 | 1 |
2 | 5.91 | 4.16 |
3 | 15.08 | 10.62 |
4 | 22.62 | 15.94 |
5 | 1.43 | 1 |
6 | 5.14 | 3.62 |
7 | 8.24 | 5.81 |
8 | 26.27 | 18.51 |
These results seem to imply that, since the last thread is always the one measured, the measuring thread gets its own core when there are 5 threads. There is an interesting trend here, since there are two quad-core processors. This is still not the performance I was expecting: I was hoping for scaled performance where each core took the same amount of time to perform a fixed number of atomic increments on atomic values on its own cache line.
Here is the code for the new test:
#include <iostream>
#include <utility>
#include <cstdlib>
#include <tbb/atomic.h>
#include <tbb/tick_count.h>
#include <tbb/tbb_thread.h>
#include <tbb/concurrent_vector.h>

const long long numIncrementsForTrial = 100000000;
typedef tbb::atomic<long long> atomic_t;

// Class that infinitely increments a local atomic value
class InfinitelyIncrement
{
private:
    int _id;
    atomic_t _register;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            _register.fetch_and_increment();
        }
    }
};

// Class that increments a local atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
    atomic_t _register;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer
        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );
        std::cout << "Starting timer..." << std::endl;
        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            _register.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;
    }
};

int main(int argc, char** argv)
{
    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;
    }
    int numThreads = atoi(argv[1]);
    std::cout << "Running with " << numThreads << " threads." << std::endl;
    tbb::concurrent_vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;
    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement) );
    }
    AtomicIncrementTimer timerObject(numIncrementsForTrial);
    tbb::tbb_thread timerThread( timerObject );
    timerThread.join();
    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.
    return 0;
}
Here's the latest code... it shows that atomic operations on different cache lines scale beautifully... which means there is a lot of hope for wait-free data structures and algorithms.
// This tests atomic increment throughput by spawning N threads (supplied
// by the user), with each thread incrementing its own padded memory address.
#include <iostream>
#include <vector>
#include <utility>
#include <cstdlib>
#include <tbb/atomic.h>
#include <tbb/tick_count.h>
#include <tbb/tbb_thread.h>

const long long numIncrementsForTrial = 100000000;
typedef tbb::atomic<long long> atomic_t;

// Class that infinitely increments a local atomic value
class InfinitelyIncrement
{
private:
    int _id;
    atomic_t _register;
    int padding[8000];
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            _register.fetch_and_increment();
        }
    }
};

// Class that increments a local atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
    atomic_t _register;
    int padding[8000];
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer
        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );
        std::cout << "Starting timer..." << std::endl;
        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            _register.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;
    }
};

int main(int argc, char** argv)
{
    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;
    }
    int numThreads = atoi(argv[1]);
    std::cout << "Running with " << numThreads << " threads." << std::endl;
    std::vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;
    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement) );
    }
    AtomicIncrementTimer timerObject(numIncrementsForTrial);
    tbb::tbb_thread timerThread( timerObject );
    timerThread.join();
    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.
    return 0;
}
I'm wondering how this could possibly be "perfectly" scalable on an architecture that fully orders atomic operations, because there will have to be some form of communication between the cores to make that happen, even without bus locking.
If two trees fall in a sequentially consistent forest, and no one sees them fall, did they fall in order?
The example scales perfectly because no communication is required. Each thread holds its cacheline in the 'Modified' state, meaning no other processor can see it. Lock-free algorithms do require communication.
A more representative, yet still simple, test would be "Counting Philosophers". Arrange the N threads around a table and have pairs of adjacent threads communicate using a lock-free algorithm. The simplest such algorithm is fetch-and-add on a shared counter. So let each thread continuously increment the counter on its left and decrement the counter on its right. See if that scales.
I recommend the increment/decrement combination because it has a simple invariant. Assuming the counters are zero to start with, after all threads are done the total sum must be zero.
aj.guillon@gmail.com:
I did a benchmark with a global atomic<long long> variable, which is incremented by N-1 worker threads infinitely, and the Nth thread performs an increment a fixed number of times (100,000,000) and times how long this takes. This should demonstrate the worst performance for a highly contended atomic variable. The result is very poor, and I'm trying to understand why... and what to do to improve performance. The test was done on a Linux system, source was compiled with icc, and it's a dual quad-core Xeon with 64-bit extensions, and TBB 2.1. Here are the results:
...
I should say that past two cores the atomic operations do not scale well. I conducted another test, using distributed atomic operations, where private atomic<long long> variables were allocated with each object to be threaded, and each object was allocated with the cache_aligned_allocator to ensure that the atomic objects were on different cache lines. Timing did not seem to improve much; I did not perform a detailed analysis as above. I remember reading something about atomic operations locking the bus, which would explain the numbers above.
It seems that highly contended atomic operations will have poor performance; worse yet, distributed atomic operations (as in many atomic variables on separate cache lines) could suffer from the same problem.
Comments? It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware-wise probably pretty hard, unless non-colliding atomic operations were OK).
Atomic operations have some fixed price, but this price is not very high, and they scale linearly.
SHARING! Sharing is what has very high price and super-linear negative scalability.
Watch out for SHARING, not for atomic operations.
If you construct a test with plain loads and stores to the same cache line, you will get results similar to those of atomic operations, i.e. a heavy price and negative scalability. On the other hand, atomic operations on private locations have linear scalability.
aj.guillon@gmail.com:
Comments? It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware wise probably pretty hard, unless non-colliding atomic operations were ok).
Wait-free (lock-free) data structures are not about performance/scalability; they are only about forward-progress guarantees. Thus they inevitably use atomic RMW operations everywhere.
There is another class of synchronization algorithms - atomic-free algorithms - which are exactly about performance/scalability. They can provide lock-free, wait-free, obstruction-free, or blocking forward-progress guarantees; they can use atomic RMW operations, atomic loads/stores, mutexes, and whatever they want. They are not primarily concerned with forward-progress guarantees, but often they also provide lock-free, or even wait-free, guarantees... at least on the fast path.
aj.guillon@gmail.com:
Comments? It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware wise probably pretty hard, unless non-colliding atomic operations were ok).
Future hardware won't help! If you have a highly contended centralized design, no hardware can help you make it scalable. Sharing is not scalable by definition!
If you need scalability, you must design zero-sharing algorithms that are mindful of cache coherence and data layout. And they will scale perfectly even on current hardware.
Future hardware could help in some circumstances by making some forms of sharing scalable. Specifically, atomic fetch-and-associative-op on a single shared location can be implemented in a scalable way. See "Efficient synchronization of multiprocessors with shared memory", TOPLAS 10(4), 1988.
Whether future hardware will help is a separate question.
MADadrobiso: Future hardware could help in some circumstances by making some forms of sharing scalable. Specifically, atomic fetch-and-associative-op on a single shared location can be implemented in a scalable way. See "Efficient synchronization of multiprocessors with shared memory", TOPLAS 10(4), 1988.
Whether future hardware will help is a separate question.
Unfortunately I don't have an ACM account... I can surmise that they are talking about "atomic fetch-and-associative-op on a single shared location which doesn't issue any memory fences". Memory fences are usually needed for straightforward synchronization algorithms, and memory fences can be quite expensive. For example, on the Pentium 4 mfence is about 100 cycles, but on the Core architecture mfence is significantly cheaper.
Also, assume we have a multi-producer/multi-consumer FIFO queue. Let's even assume that we make it lightning fast. The algorithm still inherently requires physical movement of user data from thread to thread (from core to core), and this movement won't be fast. The same story with mutexes: even if the mutex itself is lightning fast, what about the user data?
Good hardware primitives can make *good* synchronization algorithms even better, but I don't think they can help bad synchronization algorithms. Anyway, one has to carefully design the synchronization algorithm in the first place, keeping in mind the set of available hardware primitives.
CiteSeer has the tech report version of the paper I cited. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.6013. There's a "DOWNLOAD" link in the upper right portion. The paper comes from an age when interconnect could keep up with execution units. Hypercubes, perfect-shuffle networks, and other topologies were hot back then. But even then, in the mid 80's, I recall the prof. commenting that the computational power that can be crammed into a sphere is proportional to the cube of its radius, but the bandwidth grows as the square. So in the long run locality would matter in our 3D universe. It didn't take long to get there :-)
That said, a scalable fetch-and-add primitive would be nice to have. It would make concurrent reference counting and related algorithms cheap and easy to implement.
Uzi Vishkin's XMT has some interesting notions for getting back to machines where locality is not so critical, at least on a single chip.
MADadrobiso: CiteSeer has the tech report version of the paper I cited. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.6013. There's a "DOWNLOAD" link in the upper right portion. The paper comes from an age when interconnect could keep up with execution units. Hypercubes, perfect-shuffle networks, and other topologies were hot back then. But even then, in the mid 80's, I recall the prof. commenting that the computational power that can be crammed into a sphere is proportional to the cube of its radius, but the bandwidth grows as the square. So in the long run locality would matter in our 3D universe. It didn't take long to get there :-)
That said, a scalable fetch-and-add primitive would be nice to have. It would make concurrent reference counting and related algorithms cheap and easy to implement.
Uzi Vishkin's XMT has some interesting notions for getting back to machines where locality is not so critical, at least on a single chip.
Thank you for the links. I need some time to study them.
I am not a hardware guy, but my understanding is that local processing will always be faster than remote communication, no matter how we implement the communication. If communication becomes faster, then local processing will become faster too. It's an endless race, and we are already almost bounded by the speed of light wrt communication.
As for new hardware primitives: well, it would be interesting to see what HTM can offer us. It seems that HTM is the nearest-approaching addition to hardware wrt synchronization. Sun is due to push HTM to the market in 2008, and it seems that AMD is seriously working on HTM too. Will it solve all problems wrt synchronization? Will it be suitable for highly scalable algorithms?
I have a thought for one scalable data structure which could use atomic instructions very rarely: unordered_vector
aj.guillon@gmail.com: I thought that it was proven impossible to scale algorithms to n processors without the use of an atomic operation somewhere? It doesn't have to be highly contested, it just has to exist somewhere.
It's perfectly OK to create a synchronization algorithm w/o any atomic RMW operations.
As for algorithms which require 'strong competition resolving' between threads - for example a mutex (threads have to decide in concord who is the winner, i.e. who will enter the critical section), or a multi-consumer queue (threads have to decide in concord who is the winner, i.e. who will consume this particular element) - there are 2 ways to do 'strong competition resolving':
1. Threads execute an atomic RMW operation on one memory location. I hope it's clear; no comments here.
2. Threads execute a critical 'store X -> #StoreLoad-style membar -> load Y' sequence of operations. One thread stores to X and loads Y, and another thread stores to Y and loads X. Examples are Peterson's mutex algorithm and the SMR (safe memory reclamation) algorithm. It must be noted that 'false positives' are possible here (i.e. both threads see the other's store), but 'false negatives' are not possible (i.e. both threads miss the other's store).
And there are also synchronization algorithms which do not require 'strong competition resolving', for example a single-producer/single-consumer FIFO queue. Such algorithms are based on weak causal relations, and do not require any atomic RMW operations or #StoreLoad-style membars. Usually they are based only on load-acquire/store-release operations, and in the limit they do not require any memory fences at all, only atomic stores and loads.