Performance of C++ 2D array iterator dereferencing

tj1 · ‎06-29-2012

I have a row-major iterator to a 2D array, with dereference operator as follows:

int& Iterator::operator*()
{
return matrix_[y_][x_]; //matrix_ has type int**
}

The (prefix) increment operator is as follows:

Iterator& Iterator::operator++()
{
//ys_, xs_ are dimensions of matrix
if((++x_ == xs_) && (++y_ != ys_)) x_ = 0;
return *this;
}

I can use this iterator with an optimised version of std::transform (mine doesn't return an un-needed result, in order to save a few instructions)

template < class InputIterator, class OutputIterator, class UnaryOperator >
inline void MyTransform( InputIterator first1, InputIterator last1,OutputIterator result,
UnaryOperator op )
{
for (; first1 != last1; ++first1, ++result)
*result = op(*first1);
}

calling it thus:

MyTransform(matrix1.begin(),matrix1.end(),matrix2.begin(), MyFunctor());

However, when I compare the performance to a classic, nested for-loop, to wit:

MyFunctor() f;

for (int y=0; y for (int x=0; x matrix2. = f(matrix1.);

the iterator-based solution is approx. 25% slower than the nested for-loop solution.

Now the problem does not seem to be with the iterator increment operator, as if I do the following (ugly) hybrid solution combining iterator-traversal and raw array access (the arrays being indexed using the iterators' internal counts):

MyFunctor func;

for (; mat1Begin != mat1End; ++mat1Begin, ++mat2Begin)
{
//mat1 and mat2 are type int**
mat2[mat2Begin.y_][mat2Begin.x_] =
func(mat1[mat1Begin.y_][mat1Begin.x_]);
}

this hybrid is actually a little faster than the nested for-loop solution, suggesting the actual iterator traversal is fast. This suggests to me that the performance hit is in the iterator's dereferencing when doing the assignment.

My question is, why does dereferencing the iterators in the assignment

*result = op(*first1);

incur such a massive performance hit, relative to raw array access? Is there any technique I can use for this simple design, to get performance (almost) equivalent to the raw array version? I much prefer to use iterators where possible but this is for a very performance-sensitive piece of code, for which such a large hit is not tolerable.

Any help much appreciated.

(I'm using evaluation version of Intel Parallel Studio XE 2011 with Visual Studio 10, with default release configuration optimisations.)

Arthur_Moroz · ‎07-02-2012

I suppose that performance lose is here:

if((++x_ == xs_) && (++y_ != ys_)) x_ = 0;

You do this check twice inside each loop pass.

I would suggest to represent 2D array as simple int* pointing to xs_*ys_*sizeof(int) allocated memory. Than for increment you just increment iterator's (aggregated) pointer. You will save time for checks and calculating offset twice.

SergeyKostrov · ‎07-02-2012

Quoting tj1

...
for (int y=0; y for (int x=0; x matrix2. = f(matrix1.);
the iterator-based solution is approx. 25% slower than the nested for-loop solution...

That performance hit is not a surprise for me because if you really want to achieve as better as possible
performance it is notrecommended to use any C++ operators. I usetwo pointers in order to access
2-D data sets and they are initialized as follows:

[cpp] ... T *m_ptData1D; T **m_ptData2D; ... ... m_ptData1D = ( T * )CrtMalloc( m_uiSize * sizeof( T ) ); if( m_ptData1D == RTnull ) return ( RTbool )RTfalse; m_ptData2D = ( T ** )CrtMalloc( m_uiRows * sizeof( T * ) ); if( m_ptData2D == RTnull ) return ( RTbool )RTfalse; T *ptData = m_ptData1D; for( RTuint i = 0; i < m_uiRows; i++ ) { m_ptData2D = ptData; ptData += m_uiCols; } ... [/cpp]

tj1 · ‎07-03-2012

Hi, Arthur,

The code in your response short-circuits if the first Boolean evaluates to false, so actually is fairly optimal. In any case, that code is from the incrementing operator.As stated in my original post, that is not where the problem lies, it is with the peformance of the dereferencing operator.

I have modified the code so that the outer counter of the loop is cached, so the code now looks as follows:

int& Iterator::operator*()
{
return column_[x_];
}

//prefix incr.
Iterator& Iterator::operator++()
{
if(++x_ == xs_)
{
if(++y_ != ys_)
{
x_ = 0;
column_ = matrix_[y_];
}
}
return *this;
}

This improves performance to ~85% of the raw 2D array performance.

When I convert the code to using pointer arithmetic (again caching the column) as follows, the performance is significantly worse than the raw 2D array (~ 55%):

//dereference
int& Image32Iterator::operator*()
{
return *ptr_;
}

//prefix
Image32Iterator& Image32Iterator::operator++()
{
if(++ptr_ == ptrEnd_)
{

if(++row_ != rowEnd_)
{
ptrEnd_ = (ptr_ = *row_) + xs_;
}
}
return *this;
}

I am surprised that the pointer arithmetic solution performs so much worse, and don't quite understand. I was trying to use pointer arithmetic to see if I could get > 85%!

tj1 · ‎07-03-2012

Hi, Sergey,
Thanks, that is actually the structure I use internally for my 2D array, however I wanted to use a row-majoriterator pattern as then it can be used with STL algorithms such as std::transform. As stated in my original post, the C++ increment operator I implemented performed well, it was just the dereferencing operator which underperformed. I am prepared to sacrifice a little performance, to make code moregeneric and typesafe, the question is how much should I expect to have to sacrifice, which is what my investigations are aiming at finding out.

As discussed in my response to Arthur's post, I improved performance by caching the outer row counter, which improved performance to ~85%.

Strangely, however, when I use pointer arithmetic (plus caching outer counter) in my iterator, the performance is actually worse, at around 55 - 60%.

Arthur_Moroz · ‎07-06-2012

Actually, my main idea was to use plain continuous memory, you don't need to allocate an array per column (or row). Just allocate Xs * Ys elements.

class Array2D
{
int* pArray;
int size, width;
public:
Array2D(int xs, int ys)
: width(xs), size(xs*ys), pArray(new int[xs*ys])
{}
. . .
class iterator
{
int* ptr;
public:
iterator(Array2D& arr, int offs=0)
: ptr( arr.pArray + offs )
{}
inline iterator& operator++() { ++ptr; return *this; }
inline int& operator*() { return *ptr; }
};

friend class iterator;
inline iterator begin() { return iterator(*this, 0); }
inline iterator end() { return iterator(*this, size); }
};

It's just a scheme, you will need to add ctors, dtors and validity checks. But I hope this scheme will explain better what I've tried to say in words.

SergeyKostrov · ‎07-06-2012

A whileago I've provided a prototype of C++ class that allows to do transforms with data sets. Please take a look:

IPP forum: Reformatting data using strides

http://software.intel.com/en-us/forums/showpost.php?p=174994

The C++ class has a different kind of 'transform' methodcompared to STL 'transform' method and it simply
demonstrates how a data set is initialized, transformed, released, etc.

SergeyKostrov · ‎07-06-2012

Quoting Sergey Kostrov

A whileago I've provided a prototype of C++ class that allows to do transforms with data sets...

// Simple 2D Data Set class ( allows transforms )
//
// Notes:
// - This is a prototype I used for a template based 2Ddata set class ( it's avery different but idea is the same )
// - An underlying 1D array for a 2D array is a CONTIGUOUS
// - A Transform functionality assumes that the underlying 1D array is not reallocated
// - You could easily add methods like 'SetValue', 'Clear', 'LoadData', C++ operators, etc

class CDataSet2D
{
public:
CDataSet2D()
{
Init();
};

virtual ~CDataSet2D()
{
Free();
};

private:
void Init( void )
{
m_iRows = 0;
m_iCols = 0;

m_piData1D = NULL;
m_piData2D = NULL;
};

public:
int Allocate( int iRows, int iCols )
{
if( iRows <= 0 || iCols <= 0 )
return ( int )0;

m_iRows = iRows;
m_iCols = iCols;

m_piData1D = ( int * )malloc( ( m_iRows * m_iCols ) * sizeof( int ) );
if( m_piData1D == NULL )
return ( int )0;

memset( m_piData1D, 0x0, ( m_iRows * m_iCols ) * sizeof( int ) );

m_piData2D = ( int ** )malloc( m_iRows * sizeof( int * ) );
if( m_piData2D == NULL )
return ( int )0;

int *piData = m_piData1D;

for( int i = 0; i < m_iRows; i++ )
{
m_piData2D = piData;
piData += m_iCols;
}

return ( int )1;
};

void Free( void )
{
if( m_piData2D != NULL )
{
free( m_piData2D );
m_piData2D = NULL;
}

if( m_piData1D != NULL )
{
free( m_piData1D );
m_piData1D = NULL;
}
};

int Transform( int iRows, int iCols )
{
if( iRows <= 0 || iCols <= 0 )
return ( int )0;
if( m_iRows <= 0 || m_iCols <= 0 )
return ( int )0;

if( ( m_iRows * m_iCols ) != ( iRows * iCols ) )
return ( int )0;

if( m_piData1D == NULL )
return ( int )0;
if( m_piData2D == NULL )
return ( int )0;

m_iRows = iRows;
m_iCols = iCols;

if( m_piData2D != NULL )
{
free( m_piData2D );
m_piData2D = NULL;
}

m_piData2D = ( int ** )malloc( m_iRows * sizeof( int * ) );
if( m_piData2D == NULL )
return ( int )0;

int *piData = m_piData1D;

for( int i = 0; i < m_iRows; i++ )
{
m_piData2D = piData;
piData += m_iCols;
}

return ( int )1;
};

protected:
int m_iRows;
int m_iCols;

int *m_piData1D;
int **m_piData2D;
};

void main( void )
{
int iRetCode = -1;

CDataSet2D ds2D;

iRetCode = ds2D.Allocate( 5, 5 ); // Initialized as Matrix 5x5
iRetCode = ds2D.Transform( 1, 25 ); // Transformed to Array 1x25
iRetCode = ds2D.Transform( 25, 1 ); // Transformed to Vector 25x1
iRetCode = ds2D.Transform( 7, 7 ); // Attempt to Transform to Matrix 7x7 - Invalid input
}