Slow run-time

wolfpackNC · ‎10-24-2011

I ran into a really slow run-time issue that popped up recently. I have a subroutine that gets called a lot, and I recently added the do loop shown below to it. Before that, I was not pulling data from this large array, cvtrkp(j,k,1,i,isfc,igss,iobj),

where,

real*8 cvtrkp(6,6,3,maxtap,maxsfc,maxrdr,maxobj)

maxtap=100

maxsfc=85

maxrdr=40

maxobj=2

do j=1,6
  do k=1,6
    covar(j,k,1)=cvtrkp(j,k,1,i,isfc,igss,iobj)
    covar(j,k,2)=cvtrkp(j,k,2,i,isfc,igss,iobj)
    covar(j,k,3)=cvtrkp(j,k,3,i,isfc,igss,iobj)
  enddo
enddo

So I guess my question is: is there an inherent run-time penalty when accessing data from large arrays like this and storing it in a local temp array?

Thanks

TimP · ‎10-24-2011

Any reason for nesting the loops backwards? With the right options, the compiler may take care of that, but you should check what it did, if you don't want to make it easy.

wolfpackNC · ‎10-24-2011

Wait. Sorry, what do you mean by nesting backwards? So it's preferred to have the innermost loop run over the leftmost index? That makes sense given how Fortran orders arrays in memory.

Also, I don't follow "but you should check what it did, if you don't want to make it easy."

Sorry, please explain.

Thanks!

No, there was no good reason for me to do that.
OK, this makes perfect sense to me now. I had never thought about it, but it makes sense. I ran a little test program to confirm, and I will now begin changing all my loops.
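
A test along these lines might look like the sketch below. It is not the program that was actually run: the array size n, the array a, and the program name are illustrative assumptions. It sums the same 2-D array twice, once with the first index in the outer loop and once with the first index in the inner loop, and prints the CPU time for each.

```fortran
program loop_order_timing
  implicit none
  integer, parameter :: n = 2000          ! assumed size, chosen only for illustration
  double precision, allocatable :: a(:,:)
  double precision :: s1, s2, t0, t1, t2
  integer :: i, j

  allocate(a(n,n))
  a = 1.0d0

  call cpu_time(t0)

  ! "Backwards" nesting: the first index i is in the OUTER loop,
  ! so consecutive inner iterations are n elements apart in memory.
  s1 = 0.0d0
  do i = 1, n
    do j = 1, n
      s1 = s1 + a(i,j)
    end do
  end do

  call cpu_time(t1)

  ! Preferred nesting: the first index i is in the INNER loop,
  ! so consecutive inner iterations touch adjacent memory.
  s2 = 0.0d0
  do j = 1, n
    do i = 1, n
      s2 = s2 + a(i,j)
    end do
  end do

  call cpu_time(t2)

  print *, 'first index in outer loop (s): ', t1 - t0
  print *, 'first index in inner loop (s): ', t2 - t1
  ! Printing the sums keeps the compiler from discarding the loops.
  print *, 'sums: ', s1, s2
end program loop_order_timing
```

The two timings may come out nearly equal when optimization is turned on, because the compiler is allowed to interchange the loops itself; building without optimization usually makes the difference easier to see.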

Thanks!

mecej4 · ‎10-24-2011

The ideal nesting of nested DO loops has the first index (in left-to-right order) of a multi-dimensional array varying in the innermost loop, the second index in the second innermost loop, etc.
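
Applied to the loop from the original post, that rule could look like the sketch below. The array extents here are deliberately much smaller than the maxtap, maxsfc, maxrdr and maxobj values in the thread, and the index values, the loop variable m and the second array covar2 are assumptions made only so the example runs on its own. Since the first three indices are the ones being copied and the remaining four are fixed, the copy can also be written as a single array-section assignment; the program checks that both forms give the same result.

```fortran
program copy_order
  implicit none
  ! Illustrative extents only (the thread uses 100, 85, 40 and 2).
  integer, parameter :: maxtap = 4, maxsfc = 3, maxrdr = 2, maxobj = 2
  ! double precision stands in for the thread's real*8.
  double precision :: cvtrkp(6,6,3,maxtap,maxsfc,maxrdr,maxobj)
  double precision :: covar(6,6,3), covar2(6,6,3)
  integer :: i, isfc, igss, iobj, j, k, m

  call random_number(cvtrkp)

  ! Arbitrary fixed values for the trailing indices.
  i = 2
  isfc = 3
  igss = 1
  iobj = 2

  ! First index j in the innermost loop, second index k in the next
  ! loop out, third index m in the outermost loop.
  do m = 1, 3
    do k = 1, 6
      do j = 1, 6
        covar(j,k,m) = cvtrkp(j,k,m,i,isfc,igss,iobj)
      end do
    end do
  end do

  ! The same copy as one array-section assignment.
  covar2 = cvtrkp(:,:,:,i,isfc,igss,iobj)

  if (any(covar /= covar2)) error stop 'copies differ'
  print *, 'copies match'
end program copy_order
```

With the four trailing indices fixed, the 6 x 6 x 3 elements being copied sit next to each other in memory, so the section form leaves the loop order entirely to the compiler.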

The compiler can reorder loops, if it can determine that the reordering is safe and if you have specified an optimization level that requests such reordering.

wolfpackNC · ‎10-24-2011

Thanks everyone for the quick response! This helps a lot!

SergeyKostrov · ‎10-24-2011

...is there an inherent run-time penalty when accessing data from large arrays like this and storing it in local temp array...

That's an amazing question!

My answer is based on my own optimization problems today, because I ran into performance issues with some template-based C++ code.

An application needs to do some processing with a large 2-D data set of floats declared locally on the stack.

Just out of interest I changed the declaration to 'static', that is, global and allocated only once, and there was a performance degradation. It was almost twice as slow to calculate the Kronecker product of two matrices.

Actually, I expected some performance gain, but the result was the opposite!

In general, I would strongly recommend testing your application as thoroughly as possible; in my case I clearly had more problems (cache misses).

Best regards,
Sergey

PS: An example of the code is here. I bolded and underlined the two lines of code where I had some issues:

...

inline RTbool Kronecker( const TMatrixSet< T, iDataType > &rtMs )
{
    if( TDataSet< T, iDataType >::m_ptData1D == RTnull ) // [ MxN ] * [ RxK ] = [ MRxNK ]
        return ( RTbool )RTfalse;

    if( TDataSet< T, iDataType >::m_ptData2D == RTnull )
        return ( RTbool )RTfalse;

    RTint iM = ( RTint )TDataSet< T, iDataType >::m_uiRows;
    RTint iN = ( RTint )TDataSet< T, iDataType >::m_uiCols;
    RTint iR = ( RTint )rtMs.m_uiRows;
    RTint iK = ( RTint )rtMs.m_uiCols;

    if( iM == 0 || iN == 0 || iR == 0 || iK == 0 )
        return ( RTbool )RTfalse;

    TMatrixSet< T, iDataType > tMsTmp;
    tMsTmp.SetSize( ( iM * iR ), ( iN * iK ) );

    tMsTmp.m_enMatrixTranspose = m_enMatrixTranspose;

    RTint m,n,r,k;
    RTint mr = 0;
    RTint nk = 0;

    for( m = 0; m < iM; m++ )
    {
        for( r = 0; r < iR; r++ )
        {
            nk = 0;
            for( n = 0; n < iN; n++ )
            {
                for( k = 0; k < iK; k++ )
                {
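                    // Element ( mr, nk ) of the product is element ( m, n ) of this matrix times element ( r, k ) of rtMs
                    tMsTmp.m_ptData2D[mr][nk] = TDataSet< T, iDataType >::m_ptData2D[m][n] *
                        rtMs.m_ptData2D[r][k];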
                    nk++;
                }
            }
            mr++;
        }
    }

    *this = tMsTmp;

    return ( RTbool )RTtrue;
};
...