Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Atomic Increment Benchmark on Intel Hardware

AJ13
New Contributor I
Hey all,

I've been reading "The Art of Multiprocessor Programming" which seems to have nothing but good things to say about atomic operations. They sound nice, but as a pessimist I did a benchmark today (the code is at the bottom of this post).

I did a benchmark with a global atomic variable, which is incremented by N-1 worker threads infinitely, while the Nth thread performs a fixed number of increments (100,000,000) and times how long this takes. This should demonstrate the worst-case performance for a highly contended atomic variable. The result is very poor, and I'm trying to understand why... and what to do to improve performance. The test was done on a Linux system, the source was compiled with icc, the machine is a dual quad-core Xeon with 64-bit extensions, and the library is TBB 2.1. Here are the results:


#cores   Time(s)   Final Counter Value   #increments/s    Speedup over 1 Core
1        1.43      100000000             70141475.36      1
2        5.98      847055237             141611090.92     2.02
3        11.8      901812169             76405982.34      1.09
4        12.06     750338611             62196502.9       0.89
5        15.18     836765059             55139935.22      0.79
6        18.73     945415180             50477870.5       0.72
7        18.01     948627422             52678696.01      0.75
8        54.63     2379871615            43564811.92      0.62


I should really say that past two cores the atomic operations do not scale well. I conducted another test, using distributed atomic operations, where private atomic variables were allocated with each object to be threaded, and each object was allocated with the cache_aligned_allocator to ensure that the atomic objects were on different cache lines. Timing did not seem to improve much; I did not perform a detailed analysis as above. I remember reading something about atomic operations locking the bus, and this would explain the numbers above.

It seems that highly contended atomic operations will have poor performance; worse yet, distributed atomic operations (as in many atomic variables on separate cache lines) could suffer from the same problem.

Comments? It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware-wise probably pretty hard, unless non-colliding atomic operations were ok).

Also, I love having tbb::tbb_thread for these kinds of benchmarks!

Source code here:

// This tests atomic increment throughput by spawning N threads (supplied
// by the user), with each thread incrementing the same memory address.

#include <iostream>
#include <vector>
#include <utility>
#include <cstdlib>

#include "tbb/atomic.h"
#include "tbb/tbb_thread.h"
#include "tbb/tick_count.h"

const long long numIncrementsForTrial = 100000000;

tbb::atomic<long long> atomicRegister;


// Increment atomically for numIncrementsForTrial and time it, displaying the time taken
// (unused placeholder; the timing lives in AtomicIncrementTimer below)
void timedIncrementer()
{
}


// Class that infinitely increments the global atomic value
class InfinitelyIncrement
{
private:
    int _id;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            atomicRegister.fetch_and_increment();
        }
    }
};

// Class that increments the global atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer
        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );

        std::cout << "Starting timer..." << std::endl;

        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            atomicRegister.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;
    }
};


int main(int argc, char** argv)
{
    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;
    }
    int numThreads = atoi(argv[1]);

    std::cout << "Running with " << numThreads << " threads." << std::endl;

    std::vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;

    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }

    AtomicIncrementTimer timerObject(numIncrementsForTrial);

    tbb::tbb_thread timerThread( timerObject );

    timerThread.join();

    std::cout << "Final Value: " << atomicRegister << std::endl;

    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.
    return 0;
}

21 Replies
RafSchietekat
Valued Contributor III
Maybe this is the wrong place to say this, but have you tested this on another processor that's not in the future? It seems only x86(_64) uses bus locking.

Of course this might colour the results of any benchmarks that compare locking and lock-free algorithms... so it would be interesting to see further test results (and/or informed opinions).

AJ13
New Contributor I
I'm not sure what you mean by "...on another processor that's not in the future". When I said future hardware I meant that perhaps better hardware could be designed to support atomic operations.

What I'm specifically looking at is the throughput of atomic operations, since most (if not all) wait-free algorithms that I know of rely upon them. Realistically atomic operations would be distributed across processors, and used quite frequently if wait-free algorithms depending upon them were used.

Quoting from the Intel manual "Volume 3A: System Programming Guide, Part 1"...


Section 7.1.4
"For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation
as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to insure that the operation is carried out atomically."

This seems to imply that the bus might still lock even if the memory being operated upon is in the processor's cache. The bus locking / unlocking could explain the performance that has been observed.
ARCH_R_Intel
Employee

Bus locking went out in the 90's.

Try the example with a separate atomic variable, on separate cache lines, for each thread and I think you'll see it scale just fine.

Besides cache line transfers, there are other costs for atomic operations when they involve the x86 LOCK prefix (e.g. as for atomic increment), because the LOCK prefix takes away efficiencies of out-of-order pipelined execution:

  1. LOCKed instructions are serializing instructions. As Section 7.4 of the previously mentioned Volume 3A says, these instructions cannot execute until all prior instructions have executed and pending stores are written.
  2. LOCKed instructions imply a full memory fence. Hence subsequent loads cannot be done ahead of time.

A rough single-thread comparison of a plain increment against a LOCKed one is sketched below.
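The following is a minimal sketch of that comparison (not code from the thread): the same loop timed once with a plain increment and once with a LOCKed (tbb::atomic) increment. The loop count is an arbitrary choice and absolute numbers will vary by processor and compiler flags.

#include <iostream>

#include "tbb/atomic.h"
#include "tbb/tick_count.h"

int main()
{
    const long long n = 100000000;

    // Plain increment; volatile keeps the compiler from collapsing the loop.
    volatile long long plain = 0;
    tbb::tick_count p0 = tbb::tick_count::now();
    for(long long i = 0; i < n; ++i)
        plain = plain + 1;
    tbb::tick_count p1 = tbb::tick_count::now();

    // Same loop through tbb::atomic, which uses a LOCKed increment on x86.
    tbb::atomic<long long> locked;
    locked = 0;
    tbb::tick_count a0 = tbb::tick_count::now();
    for(long long i = 0; i < n; ++i)
        locked.fetch_and_increment();
    tbb::tick_count a1 = tbb::tick_count::now();

    std::cout << "plain:  " << (p1 - p0).seconds() << " s\n";
    std::cout << "LOCKed: " << (a1 - a0).seconds() << " s\n";
    return 0;
}

Even uncontended, the LOCKed loop should come out noticeably slower; that difference is the fixed price referred to above, separate from any cache line traffic.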
RafSchietekat
Valued Contributor III
With "on another processor that's not in the future" I meant that you need not wait for "future hardware to have better (faster) support for [atomic operations]": several instruction sets that do not imply locking/serialisation can be tested right now, including from Intel (Itanium makes you choose between acquire and release semantics). I have not examined the test program like Arch, who however made a suggestion to test with separate cache lines, which you wrote did not seem to improve timing much but without providing code, so you appear to be out of synch with each other on that point.

As for efficiencies of LOCKed instructions: are the results of these instructions supposed to be visible to any observer, not just those participating in the coherent cache? That might make matters worse than otherwise, it seems. I think I have seen examples of architectures providing both coherent-cache-only fences and ones that go all the way to memory, and/or that distinguish between data-only traffic and the intricacies of self-modifying code.

AJ13
New Contributor I
I have performed the tests again. The table in my first post is in error, so this time I will better explain my measures and experiments. The charts below show the number of cores involved in the execution. For a number of threads N, N-1 threads will be spawned which increment an atomic counter indefinitely. Another thread will be spawned which will delay for 5.0 seconds before incrementing the same atomic counter for 100,000,000 iterations. Each thread is presumed to be mapped to its own core, and there is no load on the system during execution (I'm the only user).

The "times slower" figure is calculated as the execution time for n threads divided by the execution time for 1 thread.

Here are the results for a single atomic value:

#cores Time(s) Times Slower than 1 Core
1 1.42 1
2 5.84 4.12
3 11.75 8.29
4 10.26 7.23
5 18.29 12.89
6 16.57 11.68
7 36.29 25.58
8 14.88 10.49

Note that usually the time for 8 cores is much longer (around 45 - 50 s); in this particular execution it was substantially lower for some reason.

I rewrote the benchmark with the same idea in mind; however, this time I gave each thread a copy of a class which contained its own atomic value to increment. The classes were allocated into a concurrent_vector, which by default uses the cache_aligned_allocator. Unless I'm mistaken, this should result in the atomic values being placed onto separate cache lines. Here are the results when multiple atomic values are on different cache lines:

#cores Time(s) Times Slower than 1 Core
1 1.42 1
2 5.91 4.16
3 15.08 10.62
4 22.62 15.94
5 1.43 1
6 5.14 3.62
7 8.24 5.81
8 26.27 18.51


These results seem to imply that, since the last thread is always the one measured, the measuring thread has its own core when there are 5 threads. There is an interesting trend here, since there are two quad-core processors. This is still not the performance I was expecting; I was hoping for scaled performance, where each core takes the same amount of time to perform a fixed number of atomic increments on atomic values on its own cache line.

Here is the code for the new test:
#include <iostream>
#include <utility>
#include <cstdlib>

#include "tbb/atomic.h"
#include "tbb/tbb_thread.h"
#include "tbb/tick_count.h"
#include "tbb/concurrent_vector.h"

const long long numIncrementsForTrial = 100000000;

typedef tbb::atomic<long long> atomic_t;


// Class that infinitely increments a local atomic value
class InfinitelyIncrement
{
private:
    int _id;
    atomic_t _register;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            _register.fetch_and_increment();
        }
    }
};

// Class that increments a local atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
    atomic_t _register;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer
        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );

        std::cout << "Starting timer..." << std::endl;

        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            _register.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;
    }
};


int main(int argc, char** argv)
{
    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;
    }
    int numThreads = atoi(argv[1]);

    std::cout << "Running with " << numThreads << " threads." << std::endl;

    tbb::concurrent_vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;

    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }

    AtomicIncrementTimer timerObject(numIncrementsForTrial);

    tbb::tbb_thread timerThread( timerObject );

    timerThread.join();

    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.
    return 0;
}


RafSchietekat
Valued Contributor III
Doesn't cache_aligned_allocator mean that segments are aligned, not elements (unless each is an integral number of cache lines long)? You would need to pad the elements to get what you need, I suppose. Plus, in your code, the concurrent_vector contains pairs of pointers, while the structures containing the atomics are allocated by plain-vanilla "new".
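For illustration (not code from the thread): padding each per-thread structure out to at least a whole cache line keeps two threads' atomics off the same line regardless of which allocator hands out the memory. The 64-byte line size is an assumption about this particular Xeon.

#include "tbb/atomic.h"

// One counter per thread, padded so that the struct occupies a full
// (assumed 64-byte) cache line; adjacent array elements or separately
// new-ed objects then cannot place two counters on the same line.
struct PaddedCounter
{
    tbb::atomic<long long> value;
    char padding[64 - sizeof(tbb::atomic<long long>)];
};

// Hypothetical usage, one element per thread:
//   PaddedCounter* perThread = new PaddedCounter[numThreads];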

AJ13
New Contributor I
Raf.. you're right... it scales perfectly now! I'm not sure about cache_aligned_allocator actually, what I think it does changes from time to time....

Here's the latest code... it shows that atomic operations on different cache lines scale beautifully... which means there is a lot of hope for wait-free data structures and algorithms.

// This tests atomic increment throughput by spawning N threads (supplied
// by the user), with each thread incrementing its own atomic value, padded
// away from the others.

#include <iostream>
#include <vector>
#include <utility>
#include <cstdlib>

#include "tbb/atomic.h"
#include "tbb/tbb_thread.h"
#include "tbb/tick_count.h"

const long long numIncrementsForTrial = 100000000;

typedef tbb::atomic<long long> atomic_t;


// Class that infinitely increments its own atomic value
class InfinitelyIncrement
{
private:
    int _id;
    atomic_t _register;
    int padding[8000];
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            _register.fetch_and_increment();
        }
    }
};

// Class that increments its own atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
    atomic_t _register;
    int padding[8000];
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer
        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );

        std::cout << "Starting timer..." << std::endl;

        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            _register.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;
    }
};


int main(int argc, char** argv)
{
    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;
    }
    int numThreads = atoi(argv[1]);

    std::cout << "Running with " << numThreads << " threads." << std::endl;

    std::vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;

    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }

    AtomicIncrementTimer timerObject(numIncrementsForTrial);

    tbb::tbb_thread timerThread( timerObject );

    timerThread.join();

    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.
    return 0;
}
RafSchietekat
Valued Contributor III
That's a lot of padding! You only need to pad up to a cache line. If you don't want to waste any memory, keep things neatly aligned (C++ will require you to jump through a few hoops to do that in a portable way); if you want quick-and-easy benchmark results, a fixed 64 bytes of padding should be enough (assuming ownership is based on lines and lines are 64 bytes), without any need for cache-related alignment.

I'm wondering how this could possibly be "perfectly" scalable on an architecture that fully orders atomic operations, because there will have to be some form of communication between the cores to make that happen, even without bus locking.

ARCH_R_Intel
Employee

If two trees fall in a sequentially consistent forest, and no one sees them fall, did they fall in order?

The example scales perfectly because no communication is required. Each thread holds its cacheline in the 'Modified' state, meaning no other processor can see it. Lock-free algorithms do require communication.

A more representative, yet still simple, test would be "Counting Philosophers". Arrange the N threads around a table and have pairs of adjacent threads communicate using a lock-free algorithm. The simplest such algorithm is fetch-and-add on a shared counter. So let each thread continuously increment the counter on its left and decrement the counter on its right. See if that scales.

I recommend the increment/decrement combination because it has a simple invariant. Assuming the counters are zero to start with, after all threads are done the total sum must be zero.
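Here is one way that test could look with the same TBB primitives used above (thread count, iteration count, and all names are illustrative choices, not code from the thread). Counter i sits between thread i and thread i+1, so every counter receives exactly as many increments as decrements and must end at zero.

#include <iostream>
#include <vector>
#include <cstdlib>

#include "tbb/atomic.h"
#include "tbb/tbb_thread.h"
#include "tbb/tick_count.h"

const long long itersPerThread = 10000000;

// Pad each counter so that neighbouring counters do not share a cache line
// (64-byte lines assumed).
struct PaddedCounter
{
    tbb::atomic<long long> value;
    char padding[64];
};

static PaddedCounter* counters;   // counters[i] sits between thread i and thread i+1
static int numThreads;

class Philosopher
{
private:
    int _id;
public:
    Philosopher(int id) : _id(id) { }

    void operator()()
    {
        tbb::atomic<long long>& left  = counters[(_id + numThreads - 1) % numThreads].value;
        tbb::atomic<long long>& right = counters[_id].value;
        for(long long i = 0; i < itersPerThread; ++i)
        {
            left.fetch_and_increment();    // shared only with the left neighbour
            right.fetch_and_decrement();   // shared only with the right neighbour
        }
    }
};

int main(int argc, char** argv)
{
    numThreads = (argc == 2) ? atoi(argv[1]) : 4;

    counters = new PaddedCounter[numThreads];
    for(int i = 0; i < numThreads; ++i)
        counters[i].value = 0;

    std::vector<tbb::tbb_thread*> threads;
    tbb::tick_count start = tbb::tick_count::now();
    for(int i = 0; i < numThreads; ++i)
        threads.push_back( new tbb::tbb_thread( Philosopher(i) ) );
    for(int i = 0; i < numThreads; ++i)
        threads[i]->join();
    tbb::tick_count end = tbb::tick_count::now();

    long long sum = 0;
    for(int i = 0; i < numThreads; ++i)
        sum += counters[i].value;

    std::cout << "Time: " << (end - start).seconds()
              << "  Sum (must be 0): " << sum << std::endl;
    return 0;
}

Unlike the fully private counters, each cache line here is still shared, but only ever by two neighbouring threads, which is what makes it a more representative test.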

RafSchietekat
Valued Contributor III
By Schrödinger, you're right!
Dmitry_Vyukov
Valued Contributor I
aj.guillon@gmail.com:

I did a benchmark with a global atomic variable, which is incremented by N-1 worker threads infinitely, and the Nth thread performs an increment a fixed number of times (100,000,000) and times how long this takes. This should demonstrate the worst performance for a highly-contested atomic variable. The result is very poor, and I'm trying to understand why.... and what to do to improve performance. The test was done on a Linux system, source was compiled with icc, and it's a dual-quad-core xeon with 64-bit extensions, and TBB 2.1. Here are the results:

...
I should really say that past two cores the atomic operations do not scale well. I conducted another test, using distributed atomic operations, where private atomic variables were allocated with each object to be threaded, and each object was allocated with the cache_aligned_allocator to ensure that the atomic objects were on different cache lines. Timing did not seem to improve much, I did not perform a detailed analysis as above. I remember reading something about atomic operations locking the bus, and this would explain the numbers above.

It seems that highly contested atomic operations will have poor performance, worst yet, distributed atomic operations (as in many atomic variables on separate cache lines) could suffer from the same problem.

Comments? It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware wise probably pretty hard, unless non-colliding atomic operations were ok).



Atomic operations have some fixed price, but this price is not very high, and they scale linearly.

SHARING! Sharing is what has a very high price and super-linear negative scalability.
Watch out for SHARING, not for atomic operations.
If you construct a test with plain loads and stores to the same cache line, you will get results similar to those of atomic operations, i.e. a heavy price and negative scalability. On the other hand, atomic operations on private locations have linear scalability.
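To make that concrete, here is a sketch (illustrative only, not from the thread; thread and iteration counts are arbitrary) contrasting the two effects with the same TBB primitives used above: plain stores from several threads into slots packed on one cache line versus LOCKed increments on per-thread counters padded onto separate lines.

#include <iostream>
#include <vector>

#include "tbb/atomic.h"
#include "tbb/tbb_thread.h"
#include "tbb/tick_count.h"

const long long iters = 50000000;
const int numThreads = 4;

// Case 1: plain (non-atomic) stores, but all slots packed together,
// so several slots end up on the same cache line.
volatile long long sharedLine[numThreads];

// Case 2: LOCKed increments, but each counter padded onto its own line
// (64-byte lines assumed).
struct Padded { tbb::atomic<long long> value; char pad[64]; };
Padded privateLines[numThreads];

class SharedWriter
{
    int _id;
public:
    SharedWriter(int id) : _id(id) { }
    void operator()()
    {
        for(long long i = 0; i < iters; ++i)
            sharedLine[_id] = i;                               // plain store, shared line
    }
};

class PrivateIncrementer
{
    int _id;
public:
    PrivateIncrementer(int id) : _id(id) { }
    void operator()()
    {
        for(long long i = 0; i < iters; ++i)
            privateLines[_id].value.fetch_and_increment();     // LOCKed, private line
    }
};

// Run numThreads copies of Body and return the elapsed time.
// (Thread objects are leaked, as in the benchmarks above.)
template<typename Body>
double run()
{
    std::vector<tbb::tbb_thread*> threads;
    tbb::tick_count t0 = tbb::tick_count::now();
    for(int i = 0; i < numThreads; ++i)
        threads.push_back( new tbb::tbb_thread( Body(i) ) );
    for(int i = 0; i < numThreads; ++i)
        threads[i]->join();
    return (tbb::tick_count::now() - t0).seconds();
}

int main()
{
    std::cout << "plain stores, one shared line:    " << run<SharedWriter>()       << " s\n";
    std::cout << "atomic increments, private lines: " << run<PrivateIncrementer>() << " s\n";
    return 0;
}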


Dmitry_Vyukov
Valued Contributor I
aj.guillon@gmail.com:

Comments? It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware wise probably pretty hard, unless non-colliding atomic operations were ok).


Wait-free (lock-free) data structures are not about performance/scalability; they are only about forward-progress guarantees. Thus they inevitably use atomic RMW operations everywhere.

There is another class of synchronization algorithms - atomic-free algorithms - which are exactly about performance/scalability. They can provide lock-free, wait-free, obstruction-free, or blocking forward-progress guarantees; they can use atomic RMW operations, atomic loads/stores, mutexes, and whatever they want. They are not concerned with forward-progress guarantees, but often they also provide lock-free, or even wait-free, guarantees... at least on the fast path.

Dmitry_Vyukov
Valued Contributor I
aj.guillon@gmail.com:

Comments? It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware wise probably pretty hard, unless non-colliding atomic operations were ok).



Future hardware won't help! If you have a highly-contended centralized design, no hardware can help you make it scalable. Sharing is not scalable by definition!
If you need scalability, you must design zero-sharing algorithms that are solicitous of cache coherence and data layout. And they will scale perfectly even on current hardware.


ARCH_R_Intel
Employee

Future hardware could help in some circumstances by making some forms of sharing scalable. Specifically, atomic fetch-and-associative-op on a single shared location can be implemented in a scalable way. See "Efficient synchronization of multiprocessors with shared memory", TOPLAS 10(4), 1988.

Whether future hardware will help is a separate question.

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

Future hardware could help in some circumstances by making some forms of sharing scalable. Specifically, atomic fetch-and-associative-op on a single shared location can be implemented in a scalable way. See "Efficient synchronization of multiprocessors with shared memory", TOPLAS 10(4), 1988.

Whether future hardware will help is a separate question.




Unfortunately I don't have an ACM account... I can surmise that they are talking about "atomic fetch-and-associative-op on a single shared location which doesn't issue any memory fences". Memory fences are usually needed for straightforward synchronization algorithms, and memory fences can be quite expensive. For example, on the Pentium 4 an mfence is about 100 cycles. But on the Core architecture mfence is significantly cheaper.
Also assume we have a multi-producer/multi-consumer fifo queue. Let's even assume that we make it lightning fast. But the algorithm inherently requires physical movement of user data from thread to thread (from core to core). And this movement won't be fast. The same story with mutexes. Even if the mutex itself is lightning fast, what about the user data?

Good hardware primitives can make *good* synchronization algorithms even better. But I don't think they can help bad synchronization algorithms. Anyway, one has to carefully design the synchronization algorithm in the first place, keeping in mind the set of available hardware primitives.

ARCH_R_Intel
Employee

CiteSeer has the tech report version of the paper I cited. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.6013. There's a "DOWNLOAD" link in the upper right portion. The paper comes from an age when interconnect could keep up with execution units. Hypercubes, perfect-shuffle networks, and other topologies were hot back then. But even then in the mid 80's I recall the prof. commenting that the computational power that can be crammed in a sphere is proportional to the cube of its radius, but the bandwidth grows as the square. So in the long run locality would matter in our 3D universe. It didn't take long to get there :-)

That said, a scalable fetch-and-add primitive would be nice to have. It would make concurrent reference counting and related algorithms cheap and easy to implement.
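As a small illustration (mine, not from the paper or the thread) of why that matters: a concurrent reference count is essentially nothing but fetch-and-add on one location that every thread touches, so a scalable fetch-and-add would speed up exactly this kind of object.

#include "tbb/atomic.h"

// Minimal intrusive reference count: every acquire/release from every thread
// is a LOCKed fetch-and-add on the same shared location.
template<typename T>
class RefCounted
{
    T _payload;
    tbb::atomic<long> _refs;
public:
    RefCounted() { _refs = 1; }                       // creator holds one reference

    T& payload() { return _payload; }

    void acquire() { _refs.fetch_and_increment(); }

    void release()
    {
        if(_refs.fetch_and_decrement() == 1)          // we dropped the last reference
            delete this;                              // assumes heap allocation
    }
};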

Uzi Vishkin's XMT has some interesting notions for getting back to machines where locality is not so critical, at least on a single chip.

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

CiteSeer has the tech report version of the paper I cited. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.6013. There's a "DOWNLOAD" link in the upper right portion. The paper comes from an age when interconnect could keep up with execution units. Hypercubes, perfect-shuffle networks, and other topologies were hot back then. But even then in the mid 80's I recall the prof. commenting that the computational power that can be crammed in a sphere is proportional to the cube of its radius, but the bandwidth grows as the square. So in the long run locality would matter in our 3D universe. It didn't take long to get there :-)

That said, a scalable fetch-and-add primitive would be nice to have. It would make concurrent reference counting and related algorithms cheap and easy to implement.

Uzi Vishkin's XMT has some interesting notions for getting back to machines where locality is not so critical, at least on a single chip.



Thank you for the links. I need some time to study them.

I am not a hardware guy, but my understanding is that local processing will always be faster than remote communication, no matter how we implement communication. If communication becomes faster, then local processing will become faster too. It's an endless race. And we are already almost bounded by the speed of light wrt communication.

As for new hardware primitives: well, it would be interesting to see what HTM can afford us. It seems that HTM is the nearest-approaching addition to hardware wrt synchronization. Sun is supposed to push HTM to the market in 2008. And it seems that AMD is seriously working on HTM too. Will it solve all problems wrt synchronization? Will it be suitable for highly scalable algorithms?


AJ13
New Contributor I
I thought that it was proven impossible to scale algorithms to n processors without the use of an atomic operation somewhere? It doesn't have to be highly contended; it just has to exist somewhere.
AJ13
New Contributor I
The idea of these benchmarks was to understand how good atomic operations are in real life. They have wonderful promises and properties in theory! The topics addressed in this post are very complex, and are likely to become the subject of my master's thesis. I think that we are having some very interesting and productive conversations.

I have a thought for one scalable data structure which could use atomic instructions very rarely: unordered_vector. The idea would be that each processor gets a handle to the data structure, and that handle gives it a region of memory that it can access freely without contention. Atomic instructions are used only to allocate new regions of memory. In reality, this type of container still suffers from the problem that mutexes might be required to traverse the data structure, i.e. when a processor is writing to its memory region while another processor is attempting to traverse. However, the atomic operations would be used very sparingly. A rough sketch of the idea follows below.
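Purely as a sketch of that idea (all names and sizes below are hypothetical, not an existing TBB container): one atomic counter hands out fixed-size regions of a preallocated array, and after that each thread pushes into its own region with no synchronization at all. Growth beyond the preallocated space, traversal, and the mutex issue mentioned above are all ignored here.

#include <cstddef>

#include "tbb/atomic.h"

const std::size_t REGION_SIZE = 1024;   // elements handed to a thread at a time
const std::size_t MAX_REGIONS = 1024;   // fixed capacity for the sketch

template<typename T>
class UnorderedVectorSketch
{
    T _storage[REGION_SIZE * MAX_REGIONS];   // large, so the object itself should live on the heap
    tbb::atomic<std::size_t> _regionsUsed;
public:
    UnorderedVectorSketch() { _regionsUsed = 0; }

    // Per-thread handle: writes into its own region need no atomics at all.
    class Handle
    {
        T* _begin;
        std::size_t _used;
    public:
        Handle(T* begin) : _begin(begin), _used(0) { }
        void push_back(const T& x) { _begin[_used++] = x; }   // no bounds check in the sketch
    };

    // The only contended operation: claim a fresh region with one atomic op.
    Handle acquire_region()
    {
        std::size_t index = _regionsUsed.fetch_and_increment();
        return Handle(_storage + index * REGION_SIZE);
    }
};

A thread would call acquire_region() once (or whenever its region fills up) and then use the handle privately; only those calls touch shared state.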
Dmitry_Vyukov
Valued Contributor I
aj.guillon@gmail.com:
I thought that it was proven impossible to scale algorithms to n processors without the use of an atomic operation somewhere? It doesn't have to be highly contested, it just has to exist somewhere.


It's perfectly OK to create a synchronization algorithm w/o any atomic RMW operations.

As for algorithms which require 'strong competition resolving' between threads - for example a mutex (threads have to decide in concord who is the winner, i.e. who will enter the critical section), or a multi-consumer queue (threads have to decide in concord who is the winner, i.e. who will consume this particular element) - there are 2 ways to do 'strong competition resolving':
1. Threads execute an atomic RMW operation on one memory location. I hope it's clear, no comments here.
2. Threads execute a critical 'store X -> #StoreLoad style membar -> load Y' sequence of operations. One thread stores to X and loads Y, and another thread stores to Y and loads X. Examples are Peterson's mutex algorithm, or the SMR (safe memory reclamation) algorithm. It must be noted that 'false positives' are possible here (i.e. both threads see the other's store), but 'false negatives' are not possible (i.e. both threads miss the other's store). A sketch of this pattern is shown below.
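Here is a sketch of pattern (2), written with C++11 std::atomic for brevity rather than the TBB 2.1 interfaces used elsewhere in this thread; the seq_cst operations supply the #StoreLoad ordering between each thread's store and its subsequent load.

#include <atomic>

// Peterson's mutex for two threads (ids 0 and 1). Each lock() does
// store(interested[id]) ... #StoreLoad ... load(interested[other]):
// both threads may see each other's store (a harmless "false positive",
// resolved by 'victim'), but they can never both miss it.
class PetersonLock
{
    std::atomic<bool> _interested[2];
    std::atomic<int> _victim;
public:
    PetersonLock() : _victim(0)
    {
        _interested[0].store(false);
        _interested[1].store(false);
    }

    void lock(int id)
    {
        _interested[id].store(true, std::memory_order_seq_cst);    // store X
        _victim.store(id, std::memory_order_seq_cst);               // volunteer to wait
        while(_interested[1 - id].load(std::memory_order_seq_cst)   // load Y
              && _victim.load(std::memory_order_seq_cst) == id)
        {
            // spin: the other thread is interested and we are the victim
        }
    }

    void unlock(int id)
    {
        _interested[id].store(false, std::memory_order_seq_cst);
    }
};

Note that no read-modify-write instruction appears anywhere; the mutual exclusion comes entirely from the ordered store/load exchange.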

And there are also synchronization algorithms which do not require 'strong competition resolving', for example a single-producer/single-consumer fifo queue. Such algorithms are based on weak causal relations and do not require any atomic RMW operations or #StoreLoad style membars. Usually they are based only on load-acquire/store-release operations, and in the limit they do not require any memory fences at all, only atomic stores and loads.

