Intel® Integrated Performance Primitives

Most optimized way to perform operations on the columns of a matrix

Itzhak_B_
Beginner

Hi All.

I have a matrix m(r,c).

I need to perform operations on each column of the matrix.

I need to calculate the mean of each column, subtract it from the column, and compute an FFT of the column.

What is the most optimized way to do this using IPP?

Pseudocode of what I need to do:

{
    static const unsigned ROWS = 8192, COLUMNS = 4*4096*64;

    float m[ROWS][COLUMNS];
    float sum = 0, mean[COLUMNS];
    unsigned r, c;

    // calculate the mean of each column
    for (c = 0; c < COLUMNS; ++c) {
        sum = 0;
        for (r = 0; r < ROWS; ++r)
            sum += m[r][c];
        mean[c] = sum / ROWS;
    }

    // subtract the mean from each column
    for (c = 0; c < COLUMNS; ++c) {
        for (r = 0; r < ROWS; ++r)
            m[r][c] -= mean[c];
    }

    // calculate the FFT of each column of the matrix
    ...
}

It was simple to do all of that on a row of the matrix, because the IPP functions take the input as an array of floats.
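For reference, the per-row version is presumably something like the sketch below, using ippsMean_32f, ippsSubC_32f_I and the ippsFFT*_R_32f API (the helper names process_row and make_fft_spec are only illustrative, and error handling is omitted):

#include <ipp.h>

/* One row of length n = 2^order: compute the mean, subtract it in place,
 * then run a real forward FFT. The CCS-packed output needs n + 2 floats. */
static void process_row(Ipp32f* row, int n,
                        const IppsFFTSpec_R_32f* spec,
                        Ipp8u* work, Ipp32f* fftOut)
{
    Ipp32f mean;
    ippsMean_32f(row, n, &mean, ippAlgHintFast);     /* mean of the row  */
    ippsSubC_32f_I(mean, row, n);                    /* row[i] -= mean   */
    ippsFFTFwd_RToCCS_32f(row, fftOut, spec, work);  /* forward real FFT */
}

/* One-time FFT setup for rows of length 2^order. */
static IppsFFTSpec_R_32f* make_fft_spec(int order, Ipp8u** specMem, Ipp8u** workBuf)
{
    int specSize = 0, initSize = 0, workSize = 0;
    ippsFFTGetSize_R_32f(order, IPP_FFT_NODIV_BY_ANY, ippAlgHintFast,
                         &specSize, &initSize, &workSize);

    *specMem    = ippsMalloc_8u(specSize);
    Ipp8u* init = ippsMalloc_8u(initSize);
    *workBuf    = ippsMalloc_8u(workSize);

    IppsFFTSpec_R_32f* spec = NULL;
    ippsFFTInit_R_32f(&spec, order, IPP_FFT_NODIV_BY_ANY, ippAlgHintFast,
                      *specMem, init);
    ippsFree(init);
    return spec;
}

The same process_row routine could then be applied to each column once the column data is made contiguous.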

So one way to implement the code above is to simply transpose the matrix and perform all the operations (mean, subtract and FFT) on the rows of the transposed matrix.

But that seems like a heavy operation.

Are there IPP functions that can perform these operations (mean, subtract and FFT) directly on a column of a matrix?

Thank you,

Itzhak

SergeyKostrov
Valued Contributor II
Transpose the matrix and performance will be many times better (the best example is transpose-based matrix multiplication).
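A rough sketch of this transpose-first approach, assuming a row-major Ipp32f matrix and ippiTranspose_32f_C1R from the ippi domain (sizes are deliberately reduced here, since the original 8192x1048576 float matrix is about 32 GB, and the function name transpose_then_process is only illustrative):

#include <ipp.h>

/* After the transpose every original column is a contiguous row of
 * length 'rows', so the 1D ippsMean/ippsSubC/ippsFFT calls apply directly. */
void transpose_then_process(const Ipp32f* m, int rows, int cols)
{
    Ipp32f* mt = ippsMalloc_32f(cols * rows);        /* transposed copy   */

    IppiSize roi = { cols, rows };                   /* ROI of the source */
    ippiTranspose_32f_C1R(m,  cols * (int)sizeof(Ipp32f),
                          mt, rows * (int)sizeof(Ipp32f), roi);

    for (int c = 0; c < cols; ++c) {
        Ipp32f* col = mt + (size_t)c * rows;         /* former column c   */
        Ipp32f mean;
        ippsMean_32f(col, rows, &mean, ippAlgHintFast);
        ippsSubC_32f_I(mean, col, rows);
        /* ippsFFTFwd_RToCCS_32f(col, ..., spec, work);  per-column FFT   */
    }

    ippsFree(mt);
}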
Itzhak_B_
Beginner

OK. Thank you.

I will try to do it by transposing the matrix.

Igor_A_Intel
Employee

hi Itzhak,

From the performance point of view it is better to transpose 8 or 16 columns at a time into a temporary buffer and then execute all the required operations in that buffer; this way you guarantee data locality in L1. Whether to use 8 or 16 depends on your data (complex or real), so that you load 64 aligned bytes from each row, which is the L1 cache line width. The IPP implementation of the 2D FFT uses this technique internally.

regards, Igor
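A sketch of the blocked scheme described above might look roughly like this, again assuming a row-major Ipp32f matrix; the gather/scatter via ippiTranspose_32f_C1R, the write-back step, and the function name are assumptions:

#include <ipp.h>

enum { BLOCK = 16 };   /* 8 or 16 columns per block, as suggested above */

/* Pull BLOCK columns into a rows x BLOCK scratch buffer (transposed, so
 * each column becomes a contiguous row), process them there, then write
 * the results back into the original layout. */
void process_columns_blocked(Ipp32f* m, int rows, int cols)
{
    Ipp32f* tmp = ippsMalloc_32f(rows * BLOCK);

    for (int c0 = 0; c0 < cols; c0 += BLOCK) {
        int bw = (cols - c0 < BLOCK) ? (cols - c0) : BLOCK;

        /* gather: transpose a rows x bw slice of m into tmp (bw x rows) */
        IppiSize roiIn = { bw, rows };
        ippiTranspose_32f_C1R(m + c0, cols * (int)sizeof(Ipp32f),
                              tmp,    rows * (int)sizeof(Ipp32f), roiIn);

        for (int j = 0; j < bw; ++j) {
            Ipp32f* col = tmp + (size_t)j * rows;    /* former column c0 + j */
            Ipp32f mean;
            ippsMean_32f(col, rows, &mean, ippAlgHintFast);
            ippsSubC_32f_I(mean, col, rows);
            /* ...1D FFT on 'col' while it is still warm in the cache... */
        }

        /* scatter: transpose back if the results must end up column-wise */
        IppiSize roiOut = { rows, bw };
        ippiTranspose_32f_C1R(tmp,    rows * (int)sizeof(Ipp32f),
                              m + c0, cols * (int)sizeof(Ipp32f), roiOut);
    }

    ippsFree(tmp);
}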

SergeyKostrov
Valued Contributor II
>> ...I will try do it by transposing the matrix...

Since the input matrix size is 8192x1048576, you could also use the loop-blocking optimization technique with unrolling of iterations to improve the performance of the calculations after the transpose is completed.
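For what it's worth, a bare illustration of the loop-blocking-plus-unrolling pattern referred to here, shown on the mean-subtraction pass over the transposed matrix (ROW_BLOCK and the unroll factor of 4 are tuning assumptions, not measured recommendations):

#include <ipp.h>

enum { ROW_BLOCK = 256 };

/* Blocking over the row index plus a 4x unrolled inner loop, applied to
 * the mean-subtraction step of the transposed (column-contiguous) matrix. */
void subtract_means_blocked(Ipp32f* mt, const Ipp32f* mean, int cols, int rows)
{
    for (int c = 0; c < cols; ++c) {
        Ipp32f* col = mt + (size_t)c * rows;   /* contiguous after transpose */
        Ipp32f  mc  = mean[c];

        for (int r0 = 0; r0 < rows; r0 += ROW_BLOCK) {
            int end = (r0 + ROW_BLOCK < rows) ? r0 + ROW_BLOCK : rows;
            int r = r0;
            for (; r + 4 <= end; r += 4) {     /* unrolled by 4 */
                col[r]     -= mc;
                col[r + 1] -= mc;
                col[r + 2] -= mc;
                col[r + 3] -= mc;
            }
            for (; r < end; ++r)               /* remainder */
                col[r] -= mc;
        }
    }
}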
Itzhak_B_
Beginner

Igor Astakhov (Intel) wrote:

From the performance point of view it is better to transpose 8 or 16 columns at a time into a temporary buffer and then execute all the required operations in that buffer...

Igor, Thank you.

It will not improve performance a lot, because I can use the temporary buffer only for the transpose and for calculating the mean of 16 columns.

In order to subtract the mean and calculate the FFT I need to use all of the columns.

Regards,

Itzhak
