Itzhak_B_
Beginner
110 Views

Most optimized way to perform operations on the columns of a matrix

Hi All.

I have a matrix m(r,c).

I need to operate on each column of the matrix: calculate the mean of the column, subtract it from the column, and then compute the FFT of the column.

What is the most efficient way to do this using IPP?

Here is pseudocode for what I need to do:

{
    static const unsigned ROWS = 8192, COLUMNS = 4*4096*64;
    float m[ROWS][COLUMNS];
    float sum = 0, mean[COLUMNS];
    unsigned r, c;

    // calculate the mean of each column
    for (c = 0; c < COLUMNS; ++c) {
        sum = 0;
        for (r = 0; r < ROWS; ++r)
            sum += m[r][c];
        mean[c] = sum / ROWS;
    }

    // subtract the mean from each column
    for (c = 0; c < COLUMNS; ++c) {
        for (r = 0; r < ROWS; ++r)
            m[r][c] -= mean[c];
    }

    // calculate the FFT of each column of the matrix
    ...
}

All of this is simple to do on the rows of a matrix, because the IPP functions take an array of float as input.

So one way to implement the code above is to transpose the matrix and perform all the operations (mean, subtraction, and FFT) on the transposed matrix.

But that seems like a heavy operation.

Are there any IPP functions that can perform these operations (mean, subtraction, and FFT) directly on the columns of a matrix?

Thank you,

Itzhak

5 Replies
SergeyKostrov
Valued Contributor II
110 Views

Transpose the matrix and performance will be many times better (the best example is transpose-based matrix multiplication).
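The transpose-first approach can be sketched in plain C as follows. This is a minimal illustration, not IPP code: the function names (`transpose_f32`, `remove_mean_f32`, `remove_column_means`) are hypothetical, the matrix is assumed to be row-major, and the output buffer `t` is assumed to be a separate allocation of the same size. In a real IPP build, the per-row steps would presumably map to 1D calls such as `ippsMean_32f` and `ippsSubC_32f_I`, followed by a 1D FFT.

```c
#include <assert.h>
#include <stddef.h>

/* Out-of-place transpose: src is rows x cols (row-major), dst is cols x rows. */
void transpose_f32(const float *src, float *dst, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
}

/* Subtract the mean from one contiguous vector in place; returns the mean. */
float remove_mean_f32(float *v, size_t len)
{
    assert(len > 0);
    float sum = 0.0f;
    for (size_t i = 0; i < len; ++i)
        sum += v[i];
    const float mean = sum / (float)len;
    for (size_t i = 0; i < len; ++i)
        v[i] -= mean;
    return mean;
}

/* After the transpose, each original column is a contiguous row of t,
 * so every per-column operation becomes a plain 1D array operation. */
void remove_column_means(const float *m, float *t, size_t rows, size_t cols)
{
    transpose_f32(m, t, rows, cols);
    for (size_t c = 0; c < cols; ++c)
        remove_mean_f32(t + c * rows, rows); /* a 1D FFT call would follow here */
}
```

The cost of this variant is a second full-size buffer and one extra pass over the whole matrix.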
Itzhak_B_
Beginner
110 Views

OK. Thank you.

I will try doing it by transposing the matrix.

Igor_A_Intel
Employee
110 Views

Hi Itzhak,

From a performance point of view, it is better to transpose 8 or 16 columns at a time into a temporary buffer and then perform all the required operations in that buffer. That way you guarantee data locality in L1. Whether it is 8 or 16 depends on your data (complex or real): the goal is to load 64 aligned bytes from each row, which is the L1 cache line width. The IPP implementation of the 2D FFT uses this technique internally.

Regards, Igor
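The strip-wise technique Igor describes could look roughly like the sketch below, written in plain C rather than with IPP calls. The names `process_columns_tiled` and `TILE_COLS` are hypothetical, the matrix is assumed to be real-valued and row-major, and writing the mean-removed columns back into the matrix is an assumption made only to keep the example self-checking. All per-column work (mean, subtraction, and the FFT where marked) happens while the column sits contiguously in the scratch buffer.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE_COLS = 16 }; /* 16 floats = 64 bytes read from each row */

/* Processes the columns of a rows x cols row-major matrix in groups of
 * TILE_COLS. buf is caller-provided scratch of TILE_COLS * rows floats. */
void process_columns_tiled(float *m, size_t rows, size_t cols, float *buf)
{
    assert(rows > 0);
    for (size_t c0 = 0; c0 < cols; c0 += TILE_COLS) {
        /* Width of this strip; the last strip may be narrower. */
        const size_t w = (cols - c0 < TILE_COLS) ? cols - c0 : TILE_COLS;

        /* Gather: transpose the rows x w strip into buf (w x rows). */
        for (size_t r = 0; r < rows; ++r)
            for (size_t j = 0; j < w; ++j)
                buf[j * rows + r] = m[r * cols + c0 + j];

        /* Each former column is now a contiguous vector in buf. */
        for (size_t j = 0; j < w; ++j) {
            float *col = buf + j * rows;
            float sum = 0.0f;
            for (size_t r = 0; r < rows; ++r)
                sum += col[r];
            const float mean = sum / (float)rows;
            for (size_t r = 0; r < rows; ++r)
                col[r] -= mean;
            /* a 1D FFT of col[0..rows) would go here */
        }

        /* Scatter the results back into the matrix (assumption). */
        for (size_t r = 0; r < rows; ++r)
            for (size_t j = 0; j < w; ++j)
                m[r * cols + c0 + j] = buf[j * rows + r];
    }
}
```

For 8192 rows the scratch buffer is 16 x 8192 x 4 bytes = 512 KiB, which is larger than a typical L1 data cache, so how much of the work stays in L1 depends on the CPU and on the order in which the columns in the buffer are processed.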

SergeyKostrov
Valued Contributor II
110 Views

>>...I will try do it by transposing the matrix...

Since the input matrix size is 8192x1048576, you could also use the loop-blocking optimization technique, with unrolling of iterations, to improve the performance of the calculations after the transpose is completed.
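The post does not spell out the loops, so the following is only one possible reading of the advice, sketched in plain C. `transpose_blocked` shows loop blocking applied to the transpose itself, and `mean_unrolled4` shows a 4-way unrolled reduction over a contiguous (already transposed) vector. Both function names and the `BLOCK` tile size are hypothetical, and the tile size would need tuning per machine.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK = 32 }; /* hypothetical tile edge, in elements */

/* Loop-blocked out-of-place transpose: src is rows x cols (row-major),
 * dst is cols x rows. Each BLOCK x BLOCK tile is finished before moving on,
 * so the strided writes to dst stay within a small region. */
void transpose_blocked(const float *src, float *dst, size_t rows, size_t cols)
{
    for (size_t r0 = 0; r0 < rows; r0 += BLOCK) {
        const size_t r1 = (rows - r0 < BLOCK) ? rows : r0 + BLOCK;
        for (size_t c0 = 0; c0 < cols; c0 += BLOCK) {
            const size_t c1 = (cols - c0 < BLOCK) ? cols : c0 + BLOCK;
            for (size_t r = r0; r < r1; ++r)
                for (size_t c = c0; c < c1; ++c)
                    dst[c * rows + r] = src[r * cols + c];
        }
    }
}

/* Mean of a contiguous vector using four independent accumulators. */
float mean_unrolled4(const float *v, size_t len)
{
    assert(len > 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < len; ++i) /* leftover elements when len is not a multiple of 4 */
        sum += v[i];
    return sum / (float)len;
}
```

Note that the unrolled reduction adds the elements in a different order than a simple loop, so the floating-point result can differ slightly from the straightforward sum.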
Itzhak_B_
Beginner
110 Views

Igor Astakhov (Intel) wrote:

Hi Itzhak,

From a performance point of view, it is better to transpose 8 or 16 columns at a time into a temporary buffer and then perform all the required operations in that buffer. That way you guarantee data locality in L1. Whether it is 8 or 16 depends on your data (complex or real): the goal is to load 64 aligned bytes from each row, which is the L1 cache line width. The IPP implementation of the 2D FFT uses this technique internally.

Regards, Igor

Igor, thank you.

It will not improve performance much, because I can use the temporary buffer only for the transpose and for calculating the mean of 16 columns.

In order to subtract the mean and calculate the FFT, I need to use all the columns.

Regards,

Itzhak
