Hi all,
I have a matrix m(r,c).
I need to operate on each column of the matrix: calculate the mean of the column, subtract it from the column, and run an FFT on the column.
What is the most efficient way to do this using IPP?
Here is pseudocode of what I need to do:
{
    static const unsigned ROWS = 8192, COLUMNS = 4*4096*64;
    float m[ROWS][COLUMNS];
    float sum = 0, mean[COLUMNS];
    unsigned r, c;
    // calculate the mean of each column
    for (c = 0; c < COLUMNS; ++c) {
        sum = 0;
        for (r = 0; r < ROWS; ++r)
            sum += m[r][c];
        mean[c] = sum / ROWS;
    }
    // subtract the mean from each column
    for (c = 0; c < COLUMNS; ++c) {
        for (r = 0; r < ROWS; ++r)
            m[r][c] -= mean[c];
    }
    // calculate the FFT of each column of the matrix
    ...
}
This was simple to do on the rows of the matrix, because the IPP functions take an array of float as input.
So one way to do the above is to transpose the matrix and perform all the operations (mean, subtraction and FFT) on the transposed matrix.
But that seems like a heavy operation.
Are there any IPP functions that can perform these operations (mean, subtraction and FFT) directly on the columns of a matrix?
Thank you,
Itzhak
OK. Thank you.
I will try to do it by transposing the matrix.
Hi Itzhak,
From a performance point of view, it is better to transpose 8 or 16 columns at a time into a temporary buffer and then perform all the required operations in that buffer; this way you guarantee data locality in L1. Whether to use 8 or 16 depends on your data (complex or real): the goal is to load 64 aligned bytes from each row, which is the L1 cache line width. The IPP implementation of the 2D FFT uses this technique internally.
Regards, Igor
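The blocked approach described above can be sketched roughly as follows. This is only a minimal illustration in plain C, not IPP code and not how IPP does it internally: the function name `demean_columns_blocked`, the `BLOCK` constant, and the flat row-major layout are all assumptions made for the example, and the FFT step is left as a comment.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical block width: 16 floats = 64 bytes read from each row. */
enum { BLOCK = 16 };

/* Subtract the column mean from every column of a row-major
 * rows x cols matrix, handling BLOCK columns at a time through a
 * small transposed scratch buffer. Returns 0 on success, -1 if the
 * scratch buffer cannot be allocated. */
static int demean_columns_blocked(float *m, size_t rows, size_t cols)
{
    assert(m != NULL && rows > 0 && cols > 0);

    float *buf = malloc((size_t)BLOCK * rows * sizeof *buf);
    if (buf == NULL)
        return -1;

    for (size_t c0 = 0; c0 < cols; c0 += BLOCK) {
        /* The last block may be narrower than BLOCK. */
        size_t nb = (cols - c0 < BLOCK) ? cols - c0 : BLOCK;

        /* Transpose the block: column c0 + j becomes the contiguous
         * run buf[j * rows .. j * rows + rows - 1]. */
        for (size_t r = 0; r < rows; ++r)
            for (size_t j = 0; j < nb; ++j)
                buf[j * rows + r] = m[r * cols + c0 + j];

        for (size_t j = 0; j < nb; ++j) {
            float *col = buf + j * rows;

            /* Mean of the column, accumulated in double. */
            double sum = 0.0;
            for (size_t r = 0; r < rows; ++r)
                sum += col[r];
            float mean = (float)(sum / (double)rows);

            /* Subtract the mean in place. */
            for (size_t r = 0; r < rows; ++r)
                col[r] -= mean;

            /* col is now a contiguous array of `rows` floats, so a
             * 1D FFT routine could be called on it here. */
        }

        /* Copy the processed block back into the matrix. */
        for (size_t r = 0; r < rows; ++r)
            for (size_t j = 0; j < nb; ++j)
                m[r * cols + c0 + j] = buf[j * rows + r];
    }

    free(buf);
    return 0;
}
```

Since each column sits in a contiguous run of the scratch buffer, the two inner loops could presumably be replaced by the usual 1D IPP calls on `col` (for example `ippsMean_32f` and `ippsSubC_32f_I`), followed by a 1D FFT; check the exact signatures against the IPP reference for your version.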
Igor Astakhov (Intel) wrote:
Hi Itzhak,
From a performance point of view, it is better to transpose 8 or 16 columns at a time into a temporary buffer and then perform all the required operations in that buffer; this way you guarantee data locality in L1. Whether to use 8 or 16 depends on your data (complex or real): the goal is to load 64 aligned bytes from each row, which is the L1 cache line width. The IPP implementation of the 2D FFT uses this technique internally.
Regards, Igor
Igor, thank you.
It will not improve performance much, because I can use the temporary buffer only for transposing and calculating the mean of 16 columns.
In order to subtract the mean and calculate the FFT I need to use all the columns.
Regards,
Itzhak