Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

CPU performance w/ a simple gather utility

Petros_Mamales
Beginner
Hi,
I have an i7 965 CPU running on Win7 and am using TBB 3.0.
The attached program is designed to gather, into a (column-format) matrix, elements that live in different buffers
(in this case a bigger matrix).
I am trying to see how many processing units (out of 8?) are engaged in the calculation. From Task Manager
I see at most 4.
[cpp]//stl
#include <vector>
using std::vector;
//tbb
#include "tbb/parallel_for.h"
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include "tbb/blocked_range2d.h"

// function to gather data into a single matrix as columns
template< typename _M, typename _T>
void gather_data_into_matrix_columns( _M & m, vector<_T*> const & dataStart, vector<size_t> const & dataStride ){

	struct column_gatherer{
		_M & m_;
		vector<_T*> const & dataStart_;
		vector<size_t> const & dataStride_;
		
		column_gatherer( _M & m, vector<_T*> const & dataStart, vector<size_t> const & dataStride )
			:m_(m), dataStart_(dataStart), dataStride_(dataStride) 
		{}
		void operator()( const tbb::blocked_range2d< size_t > & r ) const {
			size_t rowStart = r.rows().begin();
			size_t rowEnd = r.rows().end();
			size_t colStart = r.cols().begin();
			size_t colEnd = r.cols().end();

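			// Each task gathers a sub-block: for each assigned column, walk the
			// source buffer with that column's stride and copy into m.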
			for ( size_t colIdx = colStart; colIdx != colEnd; ++colIdx ){
				size_t stride = dataStride_[colIdx];
				_T const * ptr = dataStart_[colIdx] + rowStart * stride;
				for ( size_t rowIdx = rowStart; rowIdx != rowEnd; ++rowIdx ){
					m_( rowIdx, colIdx ) = *ptr;
					ptr += stride;
				}
			}
		}
	};

	static const size_t GRAIN_SIZE_ROW = 10;
	static const size_t GRAIN_SIZE_COLUMN = 10;

	column_gatherer cg( m, dataStart, dataStride );
	tbb::parallel_for( 
			tbb::blocked_range2d<size_t>( 
				0, m.size1(), GRAIN_SIZE_ROW, 
				0, m.size2(), GRAIN_SIZE_COLUMN
			),
			cg, 
			tbb::auto_partitioner() );
}

#include <iostream>
using std::cout;
using std::endl;

#include <cstdlib> // for rand() and RAND_MAX

#include <boost/numeric/ublas/matrix.hpp>
using boost::numeric::ublas::matrix;

bool test_GatherScatter(){
	const size_t nRows = 80;
	const size_t nCols = 20; 
	matrix< double, boost::numeric::ublas::column_major> m(nRows,nCols);

	const size_t nRowsBig = 800;
	const size_t nColsBig = 100; 
	matrix< double, boost::numeric::ublas::column_major> mBig(nRowsBig,nColsBig);
	for ( size_t i = 0; i != nRowsBig; ++i ){
		for ( size_t j = 0; j != nColsBig; ++j ){
			mBig(i,j) = double(rand())/double(RAND_MAX) - 0.5;
		}
	}

	vector< double *> colStart( nCols );
	vector< size_t > strides( nCols );
	for ( size_t i = 0; i != nCols; ++i ) {
		colStart[i] = &(mBig(0,3*i));
		strides[i] = 5;
	}

	gather_data_into_matrix_columns( m, colStart, strides );

	for( size_t i = 0; i != nRows; ++i ){
		for( size_t j = 0; j != nCols; ++j ){
			if ( m(i,j) != mBig( 5*i, 3*j) ){
				cout << "(i, j) = (" << i << ", " << j << ")" << endl;
				return false;
			}
		}
	}
	return true;

}

int main(){
	bool ok = true;
	for ( size_t i = 0; i != 1000; ++i )
		if ( !test_GatherScatter() )
			ok = false;
	cout << ( ok ? "PASS" : "FAIL" ) << endl;
	return 0;
}

[/cpp]
Is this a reliable test? Is there a better one than eyeballing Task Manager ;)) ? (A sketch of one alternative is at the end of this post.)
The code is a concatenation of various files; conceptually it gives the right result. Please let me know if there is anything more you might need.
Thank you very much for your help,
Petros

ps: the motivation for this is to bring into one place (typically a matrix) data that will constitute the rhs of lapack linear equation solver.
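
Here is a minimal sketch (my addition, not from the original program) of a less ad-hoc check than Task Manager: count the distinct worker threads that actually execute chunks of the loop, using tbb::enumerable_thread_specific. The body and the loop bounds are illustrative stand-ins.
[cpp]#include <iostream>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/enumerable_thread_specific.h"

typedef tbb::enumerable_thread_specific<int> ThreadMarks;

struct marking_body {
	ThreadMarks * marks_;
	void operator()( const tbb::blocked_range<size_t> & r ) const {
		marks_->local() = 1; // local() creates one slot per participating thread
		for ( size_t i = r.begin(); i != r.end(); ++i ) {
			/* real per-element work would go here */
		}
	}
};

int main(){
	ThreadMarks marks;
	marking_body body = { &marks };
	tbb::parallel_for( tbb::blocked_range<size_t>( 0, 1000000, 1000 ), body );
	// size() == number of threads that ran at least one chunk
	std::cout << "threads engaged: " << marks.size() << std::endl;
	return 0;
}
[/cpp]
Note also that the whole 80x20 gather above is a very small amount of work; with so little to do, the scheduler may legitimately engage fewer threads than there are hardware threads, which could be part of what Task Manager is showing.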
RafSchietekat
Valued Contributor III
"The Fibbonacci sample program does this. **** but you would not want to solve the series using this method *** (there is an orders of magnitude faster serial method)."
You can choose anything from exponential down to logarithmic time while staying fully in the integral domain, and Binet's formula looks even faster (although I suspect that in practice it isn't).
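
For concreteness, here is a minimal sketch (my own, not from the Fibonacci sample) of the logarithmic integral-domain method, using the fast-doubling identities F(2k) = F(k)*(2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2:
[cpp]// Logarithmic-time Fibonacci via fast doubling; illustrative only,
// overflows unsigned 64-bit for n > 93.
#include <utility>

typedef unsigned long long u64;

// Returns the pair ( F(n), F(n+1) ).
std::pair<u64, u64> fib_pair( u64 n ) {
	if ( n == 0 ) return std::make_pair( 0ULL, 1ULL );
	std::pair<u64, u64> p = fib_pair( n / 2 );        // ( F(k), F(k+1) ), k = n/2
	u64 c = p.first * ( 2 * p.second - p.first );     // F(2k)
	u64 d = p.first * p.first + p.second * p.second;  // F(2k+1)
	if ( n % 2 == 0 ) return std::make_pair( c, d );  // n = 2k
	return std::make_pair( d, c + d );                // n = 2k+1
}

u64 fib( u64 n ) { return fib_pair( n ).first; }
[/cpp]
Each call halves n, so the recursion depth (and the number of multiplications) is O(log n).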

But I mainly added this reply because I just remembered strings as an example of where it is often faster to copy than to use copy-on-write. That, though, is because of the cost of the atomic operations that maintain the reference count on a certain very popular processor architecture (it eludes me why weaker but cheaper operations have not been added yet), not because of memory access characteristics (either).

Jim, I repeat my question: would at least NUMA sometimes mandate explicit copying? I expect there should be at least some overhead related to the chatter of keeping caches coherent when not using private copies even for read-only memory regions (which the processors wouldn't generally know without heavy manipulations of virtual-memory tables), but I have no idea how significant it would be and whether it would pay off to avoid it.
jimdempseyatthecove
Honored Contributor III
>>Jim, I repeat my question: would at least NUMA sometimes mandate explicit copying?

When your program performs a write to memory, the write alters cache and RAM (except for a streaming store, which alters only L1 and RAM rather than the last-level cache). NUMA systems can be hierarchical, and as such the memory your program accesses might be within the current NUMA node, one node distant, two nodes distant, ... The further the NUMA node is from the requesting processor, the longer the request takes.

When your data access is predominantly repeated read-only access in cache-sized regions,
then copying is of little value (unless the copy is from scattered to contiguous).

When your data access is predominantly repeated read/modify/write,
then copying may have value.

This does impose the requirement that other threads avoid using this data while it is not residing in its "usual" location.
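
To make the read/modify/write case concrete, here is a minimal sketch (mine, not Jim's code) of copying a scattered, strided region into a private contiguous buffer, working on it repeatedly, and scattering it back once; the function name and the iteration count are illustrative:
[cpp]// "Copy scattered to contiguous" before repeated read/modify/write.
// The repeated updates stay in the owning thread's cache (and on its
// NUMA node) instead of touching strided locations in the shared buffer.
#include <vector>

void process_strided_column( double * base, size_t stride, size_t n,
                             size_t iterations /* illustrative */ ) {
	// 1) gather: scattered -> contiguous private copy
	std::vector<double> local( n );
	for ( size_t i = 0; i != n; ++i )
		local[i] = base[i * stride];

	// 2) repeated read/modify/write on the contiguous copy
	for ( size_t it = 0; it != iterations; ++it )
		for ( size_t i = 0; i != n; ++i )
			local[i] = local[i] * 0.5 + 1.0; // stand-in for real work

	// 3) scatter the result back once
	for ( size_t i = 0; i != n; ++i )
		base[i * stride] = local[i];
}
[/cpp]
As the caveat above says, other threads must leave the strided region alone while the private copy is the working version.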

RE: which the processors wouldn't generally know without heavy manipulations of virtual-memory tables

This is usually not material, since the virtual-memory tables are process-related as opposed to processor-related.

Jim Dempsey