Hi
I'm starting to develop and improve a low-level image processing library that is highly optimized for speed. The goal is to replace the low-level C/ASM version with a better design in C++, so it is very important that auto-vectorization works well. I created a small test program implementing a simple point operator. The C++ version uses a function (filter) representing the operator, which is called in the inner loop:
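[cpp]
// Note: the header names, the loop conditions, the index expressions in
// filter(), and the body of the low-level C loop were stripped when the post
// was rendered. They are reconstructed here from the surrounding code.
#include <iostream>
#include <cstdlib>          // malloc
#include <stdint.h>         // uint8_t, int64_t
#include <time.h>           // struct timespec
#include <boost/timer.hpp>  // boost::timer

using namespace std;

// Difference between two timestamps in nanoseconds (unused below).
int64_t timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
    return ((timeA_p->tv_sec * 1000000000) + timeA_p->tv_nsec) -
           ((timeB_p->tv_sec * 1000000000) + timeB_p->tv_nsec);
}

class Image
{
private:
    int width;
    int height;
    uint8_t *data;

public:
    Image(int w, int h)
    {
        width = w;
        height = h;
        data = new uint8_t[w*h];
    }
    int getWidth() { return width; }
    int getHeight() { return height; }
    // Returns a pointer to the first pixel of the given row.
    uint8_t *operator[] (int row) { return &data[row * width]; }
    ~Image() { delete [] data; }
};

// The point operator: increments a single pixel.
void filter(Image &img, int x, int y)
{
    img[y][x] = img[y][x] + 1;
}

int main(int argc, char *argv[])
{
    const int WIDTH = 8192;
    const int HEIGHT = 20000;
    Image *img = new Image(WIDTH, HEIGHT);
    boost::timer timer = boost::timer();

    //C++ Version:
    timer.restart();
    int sum = 0;
    for (int y=0; y < img->getHeight(); y++) {
        for (int x=0; x < img->getWidth(); x++)
            filter(*img, x, y);
    }
    sum = *img[0][0];  // parses as *((img[0])[0]): pixel (0,0) of the image
    double stop = timer.elapsed();
    cout << "Time C++ Version: " << stop << "s, MPixel/s: "
         << (img->getWidth() * img->getHeight()) / stop / 1024.0 / 1024.0
         << ", Sum: " << sum << endl;
    delete img;

    //Low-level C Version:
    uint8_t *data = (uint8_t *) malloc(sizeof(uint8_t) * WIDTH * HEIGHT);
    timer.restart();
    // Loop body reconstructed: the same point operator on the raw buffer.
    for (int y=0; y < HEIGHT; y++) {
        for (int x=0; x < WIDTH; x++)
            data[y * WIDTH + x] = data[y * WIDTH + x] + 1;
    }
    sum = data[0];
    stop = timer.elapsed();
    cout << "Time C Version: " << stop << "s, MPixel/s: "
         << (WIDTH * HEIGHT) / stop / 1024.0 / 1024.0
         << ", Sum: " << sum << endl;
    return 0;
}
[/cpp]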
This inner loop is not vectorized in the C++ version, so it is slower (CPU: Intel Pentium M processor 1.80GHz):
Time C++ Version: 0.65s, MPixel/s: 240.385, Sum: 1
Time C Version: 0.28s, MPixel/s: 558.036, Sum: 1
The compiler says "filter.cpp(61): (col. 30) remark: loop was not vectorized: unsupported loop structure". When I replace "img->getWidth()" with a constant, it is also not vectorized: "filter.cpp(61): (col. 9) remark: loop was not vectorized: existence of vector dependence".
Is there a solution for this problem without falling back to the low level version? It is important to have a filter function that can be easily exchanged and is easy to write.
Simon
Compiler options: -funroll-loops -mtune=pentium-m -msse2 -O3 -unroll-aggressive -fomit-frame-pointer -ffunction-sections -ipo -vec-report3
Compiler version: 11.0
5 Replies
As an experiment, what happens with:
int sum = 0;
int Height = img->getHeight();
int Width = img->getWidth();
for (int y=0; y<Height; y++) {
  for (int x=0; x<Width; x++)
    filter(*img, x, y);
}
sum = *img[0][0];
then
int sum = 0;
int Height = img->getHeight();
int Width = img->getWidth();
for (int y=0; y<Height; y++) {
  uint8_t *row = (*img)[y];
  for (int x=0; x<Width; x++)
    row[x] = row[x] + 1;
}
sum = *img[0][0];
Sometimes you have to help the compiler optimization by hand.
Jim Dempsey
The last version works:
Time C++ Version: 0.25s, MPixel/s: 625, Sum: 1
Time C Version: 0.25s, MPixel/s: 625, Sum: 1
But the problem is that without a function call that has access to the whole image (not just one row), it is not possible to develop a flexible framework. I want to separate the filter (development) from the code that calls the filter.
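A minimal sketch of one way to keep the filter separate from the calling loop while still giving it access to the whole image. It assumes the filter can be a function object passed to a templated driver, so the call can be inlined, and that the image is handed over as a small by-value view. The names `ImageView`, `AddOne`, and `applyFilter` are hypothetical and do not appear in the test program. Whether the inner loop is actually vectorized still depends on the compiler, so the vectorization report should be checked.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical lightweight view: copies of the pointer and the dimensions,
// passed by value so the compiler can keep them in registers.
struct ImageView {
    std::uint8_t* data;
    int width;
    int height;

    // Access to any pixel of the whole image, not just the current row.
    std::uint8_t& at(int x, int y) const {
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// Hypothetical point operator, written independently of the driver loop.
struct AddOne {
    void operator()(ImageView img, int x, int y) const {
        img.at(x, y) = static_cast<std::uint8_t>(img.at(x, y) + 1);
    }
};

// Generic driver: the filter type is a template parameter, so the call in
// the inner loop is a direct call that the compiler is able to inline.
template <typename Filter>
void applyFilter(ImageView img, Filter filter) {
    const int width = img.width;    // loop bounds hoisted into locals
    const int height = img.height;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            filter(img, x, y);
}
```

Usage would look like `applyFilter(view, AddOne())`, where `view` wraps the existing pixel buffer. Exchanging the filter then means writing another small function object with the same call signature.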
Could you set -ansi-alias? It's difficult for a compiler to optimize such things if you tell it not to assume your code complies with standards on aliasing. In this case, the only way that could be a legitimate issue is due to the "array of arrays" implications of your code, and for the -ansi-alias option to help, it would have to distinguish data type int from data type int *.
In case this doesn't help, I've noticed also that g++ sometimes does a better job than icpc of vectorizing based on in-lining a function inside the inner loop, and I've worn out my welcome filing compiler performance issues on this. Customer interest in vectorization under such circumstances is often assumed to be low, so it wouldn't hurt to file an issue on premier.intel.com explaining the advantages of optimizing here.
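To illustrate the aliasing concern, here is a rough sketch under some assumptions. On mainstream platforms `uint8_t` is a typedef for `unsigned char`, and character types are allowed to alias any object, so a store to a pixel could, as far as the compiler can prove, also modify the image's own `width` or `data` fields. Copying those fields into locals before the loop removes that doubt. The struct `Img` and both function names are hypothetical, and this is only one possible explanation for the "vector dependence" remark.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the image class, reduced to plain fields.
struct Img {
    std::uint8_t* data;
    int width;
    int height;
};

// Fields are read through the reference on every iteration. Each pixel store
// goes through a character-type pointer, so the compiler may have to assume
// that img.width, img.height, or img.data changed and reload them.
void incrementViaMembers(Img& img) {
    for (int y = 0; y < img.height; ++y)
        for (int x = 0; x < img.width; ++x)
            img.data[static_cast<std::size_t>(y) * img.width + x] += 1;
}

// Fields are copied into locals first. Stores to pixels cannot modify the
// locals, so the inner loop has fixed bounds and a fixed base pointer.
void incrementViaLocals(Img& img) {
    std::uint8_t* const data = img.data;
    const int width = img.width;
    const int height = img.height;
    for (int y = 0; y < height; ++y) {
        std::uint8_t* row = data + static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x)
            row[x] = static_cast<std::uint8_t>(row[x] + 1);
    }
}
```

Both functions compute the same result. The difference is only in how much the compiler is able to prove about the loop.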
Have you considered having the filter function operate on the array of objects, as opposed to being called once for each object? That is, the filter function is passed the array of objects.
To put it in other words, the filter is passed a vector of objects. This would be problematic with opaque objects.
Jim
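A small sketch of that idea, assuming the "array of objects" is one row of pixels and the filter contains the inner loop itself. The names `AddOneRow` and `forEachRow` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical row-level filter: it receives a whole row and loops over it,
// so the loop to be vectorized is a plain pointer-and-count loop.
struct AddOneRow {
    void operator()(std::uint8_t* row, int width) const {
        for (int x = 0; x < width; ++x)
            row[x] = static_cast<std::uint8_t>(row[x] + 1);
    }
};

// Driver: calls the filter once per row instead of once per pixel.
template <typename RowFilter>
void forEachRow(std::uint8_t* data, int width, int height, RowFilter filter) {
    for (int y = 0; y < height; ++y)
        filter(data + static_cast<std::size_t>(y) * width, width);
}
```

The trade-off is that the filter now sees the raw pixel pointer, so the storage layout is no longer hidden, which is presumably why this is awkward with opaque objects.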
There are reasons why C or ASM are used in some cases and for some tasks.
C++ is, in my opinion, ill-suited for this kind of work, not to mention that you risk regressions in performance-critical code because the next compiler version might not vectorize/optimize the code in the same way the previous version did.
If you really want to trust the performance-critical parts of the code to an external, uncontrollable factor, then go ahead. Otherwise, you need to rethink your approach and determine what you are trying to accomplish and how much gain can be expected from doing it.
Finally, if the above code sample can be taken as a measure of overall code performance, I would dare to say that you are wasting your time. A library that uses new[] to allocate (unaligned) memory for image processing doesn't sound too optimized to me.
If I were you, I would consider using IPP for image processing rather than reinventing the wheel.
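On the alignment point, a minimal sketch of allocating the pixel buffer on a 16-byte boundary, which is the alignment that aligned SSE2 loads and stores require. It assumes a POSIX system where `posix_memalign` is available. The helper names `allocAlignedPixels` and `isAligned` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical helper: allocates `count` bytes of pixel storage aligned to
// `alignment`, which must be a power of two and a multiple of sizeof(void*).
// Returns a null pointer on failure. Release the buffer with free().
std::uint8_t* allocAlignedPixels(std::size_t count, std::size_t alignment = 16) {
    void* p = nullptr;
    if (posix_memalign(&p, alignment, count) != 0)
        return nullptr;
    return static_cast<std::uint8_t*>(p);
}

// Hypothetical helper: checks whether a pointer sits on the given boundary.
bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With a row width of 8192 bytes, as in the test program, every row also starts on a 16-byte boundary once the buffer itself is aligned. For widths that are not a multiple of 16, the rows would need padding to keep that property.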