Intel® Integrated Performance Primitives

Most optimized way to perform operations on the columns of a matrix

Itzhak_B_
Beginner
575 Views

Hi All.

I have a matrix m(r,c).

I need to perform operations on each column of the matrix.

I need to calculate the mean of each column, subtract it from the column, and then run an FFT on the column.

What is the most optimized way to do this using IPP?

Pseudocode of what I need to do:

{
    static const unsigned ROWS = 8192, COLUMNS = 4*4096*64;

    float m[ROWS][COLUMNS];
    float sum = 0, mean[COLUMNS];
    unsigned r, c;

    // calculate the mean of each column
    for (c = 0; c < COLUMNS; ++c) {
        sum = 0;
        for (r = 0; r < ROWS; ++r)
            sum += m[r][c];
        mean[c] = sum / ROWS;
    }

    // subtract the mean from each column
    for (c = 0; c < COLUMNS; ++c) {
        for (r = 0; r < ROWS; ++r)
            m[r][c] -= mean[c];
    }

    // calculate the FFT on each column of the matrix
    ...
}

It was simple to do all of that on a row of the matrix, because the IPP functions take the input as an array of floats.
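(For reference, the row-wise version maps onto the 1D ipps calls roughly as in the sketch below. It is only a sketch: the helper name is made up, it assumes a recent IPP version with the FFTGetSize/FFTInit API, and error checking is omitted; in real code the FFT spec and work buffers would be created once and reused for every row.)

    #include "ipp.h"

    /* Sketch: mean, mean removal and forward FFT on ONE row of length n = 2^order. */
    static void process_row(Ipp32f* row, int order)
    {
        int n = 1 << order;
        Ipp32f mean;

        ippsMean_32f(row, n, &mean, ippAlgHintFast);   /* mean of the row      */
        ippsSubC_32f_I(mean, row, n);                  /* subtract it in place */

        /* Real forward FFT of the mean-removed row, result in CCS packed format. */
        int specSize, initSize, workSize;
        ippsFFTGetSize_R_32f(order, IPP_FFT_DIV_INV_BY_N, ippAlgHintNone,
                             &specSize, &initSize, &workSize);

        Ipp8u*  specMem  = ippsMalloc_8u(specSize);
        Ipp8u*  initBuf  = ippsMalloc_8u(initSize);
        Ipp8u*  workBuf  = ippsMalloc_8u(workSize);
        Ipp32f* spectrum = ippsMalloc_32f(n + 2);      /* CCS output: n + 2 floats */

        IppsFFTSpec_R_32f* spec = 0;
        ippsFFTInit_R_32f(&spec, order, IPP_FFT_DIV_INV_BY_N, ippAlgHintNone,
                          specMem, initBuf);
        ippsFFTFwd_RToCCS_32f(row, spectrum, spec, workBuf);

        ippsFree(spectrum); ippsFree(workBuf); ippsFree(initBuf); ippsFree(specMem);
    }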

So one way to implement the code above is to simply transpose the matrix and perform all of the operations (mean, subtract, and FFT) on the transposed matrix.

But that seems like a heavy operation.

Are there any IPP functions that can perform these operations (mean, subtract, and FFT) directly on a column of a matrix?

Thank you,

Itzhak

SergeyKostrov
Valued Contributor II
Transpose the matrix and performance will be many times better (the best example is transpose-based matrix multiplication).
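A rough sketch of that suggestion (the helper name is made up, ippiTranspose_32f_C1R comes from the ippi image-processing domain, and a second buffer the size of the whole matrix is assumed; error checking and the per-column FFT are omitted):

    #include <stddef.h>   /* size_t */
    #include "ipp.h"

    /* Sketch: transpose the whole rows x cols matrix into mt (cols x rows),
       then treat every original column as a contiguous row of length rows. */
    static void process_by_transpose(const Ipp32f* m, Ipp32f* mt, int rows, int cols)
    {
        IppiSize roi = { cols, rows };                 /* source ROI: width, height */
        ippiTranspose_32f_C1R(m,  (int)(cols * sizeof(Ipp32f)),
                              mt, (int)(rows * sizeof(Ipp32f)), roi);

        for (int c = 0; c < cols; ++c) {
            Ipp32f* column = mt + (size_t)c * rows;    /* one original column */
            Ipp32f  mean;
            ippsMean_32f(column, rows, &mean, ippAlgHintFast);
            ippsSubC_32f_I(mean, column, rows);
            /* ...forward FFT of column, as in the row-wise sketch above... */
        }
    }

Note that the ippi interfaces take int steps, so for a matrix as large as the one quoted above the transpose would probably have to be done in vertical slices, and the second buffer is as large as the matrix itself.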
Itzhak_B_
Beginner

OK. Thank you.

I will try to do it by transposing the matrix.

Igor_A_Intel
Employee

Hi Itzhak,

It's better (from the performance point of view) to transpose 8 or 16 columns at a time into a temporary buffer and then execute all required operations in that buffer - this way you guarantee data locality in L1. Whether it is 8 or 16 depends on your data - complex or real - so that 64 aligned bytes are loaded from each row, which is the L1 cache line width. The IPP implementation of the 2D FFT uses this technique internally.
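A minimal sketch of this blocked approach (the helper name and BLOCK value are illustrative; ippiTranspose_32f_C1R gathers a group of columns into a small contiguous buffer, the 1D primitives run on the buffer rows, and a second transpose scatters the results back; buffer allocation, the FFT stage and error checking are omitted, and cols is assumed to be a multiple of BLOCK):

    #include <stddef.h>   /* size_t */
    #include "ipp.h"

    #define BLOCK 16   /* columns gathered per step (8 or 16, as described above) */

    /* Sketch: process BLOCK columns at a time through a rows x BLOCK
       transposed buffer so the working set stays cache-resident. */
    static void process_blocked(Ipp32f* m, int rows, int cols, Ipp32f* blockBuf)
    {
        for (int c0 = 0; c0 < cols; c0 += BLOCK) {
            /* Gather: source ROI is BLOCK columns wide and rows high. */
            IppiSize gather = { BLOCK, rows };
            ippiTranspose_32f_C1R(m + c0,   (int)(cols * sizeof(Ipp32f)),
                                  blockBuf, (int)(rows * sizeof(Ipp32f)), gather);

            /* Each of the BLOCK buffer rows is now one original column. */
            for (int b = 0; b < BLOCK; ++b) {
                Ipp32f* column = blockBuf + (size_t)b * rows;
                Ipp32f  mean;
                ippsMean_32f(column, rows, &mean, ippAlgHintFast);
                ippsSubC_32f_I(mean, column, rows);
                /* ...in-place FFT of column here... */
            }

            /* Scatter: transpose the processed block back into the matrix. */
            IppiSize scatter = { rows, BLOCK };
            ippiTranspose_32f_C1R(blockBuf, (int)(rows * sizeof(Ipp32f)),
                                  m + c0,   (int)(cols * sizeof(Ipp32f)), scatter);
        }
    }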

regards, Igor

SergeyKostrov
Valued Contributor II
>> ...I will try to do it by transposing the matrix...

Since the input matrix size is 8192x1048576, you could also use the loop-blocking optimization technique, with unrolling of iterations, to improve the performance of the calculations after the transpose is completed.
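A sketch of what loop blocking plus unrolling could look like for the mean-removal pass over the transposed matrix (plain C, no IPP; the helper name, BLK and the 4-way unroll are illustrative, and rows is assumed to be a multiple of both):

    #include <stddef.h>   /* size_t */

    #define BLK 256   /* elements processed per cache block (illustrative) */

    /* Sketch: blocked, 4-way-unrolled mean removal over the transposed matrix.
       mt is cols x rows, i.e. each row of mt is one original column. */
    static void remove_mean_blocked(float* mt, int cols, int rows)
    {
        for (int c = 0; c < cols; ++c) {
            float* col = mt + (size_t)c * rows;

            /* Pass 1: blocked, unrolled summation with four accumulators. */
            float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
            for (int b = 0; b < rows; b += BLK)
                for (int i = b; i < b + BLK; i += 4) {
                    s0 += col[i];     s1 += col[i + 1];
                    s2 += col[i + 2]; s3 += col[i + 3];
                }
            float mean = (s0 + s1 + s2 + s3) / (float)rows;

            /* Pass 2: subtract the mean in place. */
            for (int i = 0; i < rows; ++i)
                col[i] -= mean;
        }
    }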
Itzhak_B_
Beginner

Igor Astakhov (Intel) wrote:

It's better (from the performance point of view) to transpose 8 or 16 columns at a time into a temporary buffer and then execute all required operations in that buffer - this way you guarantee data locality in L1. Whether it is 8 or 16 depends on your data - complex or real - so that 64 aligned bytes are loaded from each row, which is the L1 cache line width. The IPP implementation of the 2D FFT uses this technique internally.

Igor, thank you.

It will not improve performance a lot, because I can use the temporary buffer only for the transpose and for calculating the mean of 16 columns.

In order to subtract the mean and calculate the FFT, I need to use all of the columns.

Regards,

Itzhak
