Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Atomic Increment Benchmark on Intel Hardware

AJ13
New Contributor I
Hey all,

I've been reading "The Art of Multiprocessor Programming" which seems to have nothing but good things to say about atomic operations. They sound nice, but as a pessimist I did a benchmark today (the code is at the bottom of this post).

I did a benchmark with a global atomic variable, which is incremented by N-1 worker threads indefinitely, while the Nth thread performs a fixed number of increments (100,000,000) and times how long this takes. This should demonstrate the worst-case performance for a highly contended atomic variable. The result is very poor, and I'm trying to understand why... and what to do to improve performance. The test was done on a Linux system, the source was compiled with icc, and the machine is a dual quad-core Xeon with 64-bit extensions, running TBB 2.1. Here are the results:


#cores   Time (s)   Final Counter Value   #increments/s   Speedup over 1 Core
1        1.43       100000000             70141475.36     1
2        5.98       847055237             141611090.92    2.02
3        11.8       901812169             76405982.34     1.09
4        12.06      750338611             62196502.9      0.89
5        15.18      836765059             55139935.22     0.79
6        18.73      945415180             50477870.5      0.72
7        18.01      948627422             52678696.01     0.75
8        54.63      2379871615            43564811.92     0.62


I should really say that past two cores the atomic operations do not scale well. I conducted another test using distributed atomic operations, where a private atomic variable was allocated with each object to be threaded, and each object was allocated with the cache_aligned_allocator to ensure that the atomic objects were on different cache lines. Timing did not seem to improve much; I did not perform a detailed analysis as above. I remember reading something about atomic operations locking the bus, which would explain the numbers above.

It seems that highly contended atomic operations will have poor performance; worse yet, distributed atomic operations (as in many atomic variables on separate cache lines) could suffer from the same problem.

Comments? It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware-wise it's probably pretty hard, unless non-colliding atomic operations were OK).

Also, I love having tbb::tbb_thread for these kinds of benchmarks!

Source code here:

// This tests atomic increment throughput by spawning N threads (supplied
// by the user), and each thread incrementing the same memory address

#include <cstdlib>
#include <iostream>
#include <utility>
#include <vector>

#include "tbb/atomic.h"
#include "tbb/tbb_thread.h"
#include "tbb/tick_count.h"

const long long numIncrementsForTrial = 100000000;

// 64-bit counter: the final values in the table above exceed 2^31
tbb::atomic<long long> atomicRegister;

// Increment atomically for numIncrementsForTrial and time it, displaying
// the time taken (unused stub; the work is done by AtomicIncrementTimer)
void timedIncrementer()
{
}

// Class that infinitely increments the global atomic value
class InfinitelyIncrement
{
private:
    int _id;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            atomicRegister.fetch_and_increment();
        }
    }
};

// Class that increments the global atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the other threads a bit of time to warm up, then start the
        // atomic increment timer
        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );

        std::cout << "Starting timer..." << std::endl;

        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            atomicRegister.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;
    }
};

int main(int argc, char** argv)
{
    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;
    }
    int numThreads = atoi(argv[1]);

    std::cout << "Running with " << numThreads << " threads." << std::endl;

    std::vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;

    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair( new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }

    AtomicIncrementTimer timerObject(numIncrementsForTrial);
    tbb::tbb_thread timerThread( timerObject );
    timerThread.join();

    std::cout << "Final Value: " << atomicRegister << std::endl;

    // Don't bother reclaiming memory, because there is no way to terminate
    // the infinitely-incrementing threads.
    return 0;
}

21 Replies
Dmitry_Vyukov
Valued Contributor I
aj.guillon@gmail.com:
The idea of these benchmarks was to understand how good atomic operations are in real life. They have wonderful promises and properties in theory!


Atomic operations have good properties in practice too... at least since we got cache-line locking instead of bus locking.
SHARING is what has terrible properties in practice... and it is usually passed over in silence in theory.


aj.guillon@gmail.com:

I have a thought for one scalable data structure which could use atomic instructions very rarely: unordered_vector. The idea is that each processor gets a handle to the data structure, and that handle gives it a region of memory it can access freely without contention. Atomic instructions are used only to allocate new regions of memory. In reality, this type of container still suffers from the problem that mutexes might be required to traverse the data structure, i.e. when a processor is writing to its memory region and another processor is attempting to traverse it. However, the atomic operations would be used very sparingly.


If you use mutexes for reading, then you can see benchmark results of your future algorithm here:
http://software.intel.com/en-us/forums//topic/60494
(just replace words 'TBB concurrent_hash_map' with 'unordered_vector')

In order to get a scalable solution you have to figure out how to eliminate *ANY* modifications of shared data from the fast path of the most prioritized operation (usually that's reading). Even if it sounds unrealistic, it's possible even for writing.
