Re: C++ Vectorization

lrep · ‎05-26-2009

Hi

I'm starting to develop/improve a low-level image processing library that is highly optimized for speed. The goal is to replace the low-level C/ASM version with a better design using C++, so it is very important that auto-vectorization works well. I created a small test program implementing a simple point operator. The C++ version uses a function (filter) representing the operator, which is called in the inner loop:

[cpp]#include <iostream>
#include <cstdlib>
#include <stdint.h>
#include <time.h>
#include <boost/timer.hpp>
using namespace std;

int64_t timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
  return ((timeA_p->tv_sec * 1000000000) + timeA_p->tv_nsec) -
           ((timeB_p->tv_sec * 1000000000) + timeB_p->tv_nsec);
}


class Image
{
    private:
        int width;
        int height;
        uint8_t *data;

    public:
        Image(int w, int h)
        {
            width = w;
            height = h;
            data = new uint8_t[w*h];
        }

        int getWidth() { return width; }
        int getHeight() { return height; }

        uint8_t *operator[] (int row)
        {
            return &data[row * width];
        }


        ~Image()
        {
            delete [] data;
        }
};


void filter(Image &img, int x, int y)
{
    img[y][x] = img[y][x] + 1;
}

int main(int argc, char *argv[])
{
    const int WIDTH = 8192;
    const int HEIGHT = 20000;
    Image *img = new Image(WIDTH, HEIGHT);
    boost::timer timer = boost::timer();

    //C++ Version:
    timer.restart();
    int sum = 0;
    for (int y=0; y<img->getHeight(); y++)
    {
        for (int x=0; x<img->getWidth(); x++)
            filter(*img, x, y);
    }
    sum = *img[0][0];

    double stop = timer.elapsed();
    cout << "Time C++ Version: " << stop << "s, MPixel/s: " << (img->getWidth() * img->getHeight()) / stop / 1024.0 / 1024.0 << ", Sum: " << sum << endl;

    delete img;

    //Low-level C Version:
    uint8_t *data = (uint8_t *) malloc(sizeof(uint8_t) * WIDTH * HEIGHT);
    timer.restart();
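    // Assumed loop body: mirrors the C++ version on the raw buffer.
    for (int y=0; y<HEIGHT; y++)
    {
        for (int x=0; x<WIDTH; x++)
            data[y*WIDTH + x] = data[y*WIDTH + x] + 1;
    }
    sum = data[0];

    stop = timer.elapsed();
    cout << "Time C   Version: " << stop << "s, MPixel/s: " << (WIDTH * HEIGHT) / stop / 1024.0 / 1024.0 << ", Sum: " << sum << endl;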

    return 0;
}
[/cpp]

This inner loop is not vectorized in the C++ version, so it is slower (CPU: Intel Pentium M processor 1.80GHz):
Time C++ Version: 0.65s, MPixel/s: 240.385, Sum: 1
Time C Version: 0.28s, MPixel/s: 558.036, Sum: 1

The compiler says "filter.cpp(61): (col. 30) remark: loop was not vectorized: unsupported loop structure". When I replace "img->getWidth()" with a constant, it is also not vectorized: "filter.cpp(61): (col. 9) remark: loop was not vectorized: existence of vector dependence".

Is there a solution for this problem without falling back to the low level version? It is important to have a filter function that can be easily exchanged and is easy to write.

Simon

Compiler options: -funroll-loops -mtune=pentium-m -msse2 -O3 -unroll-aggressive -fomit-frame-pointer -ffunction-sections -ipo -vec-report3
Compiler version: 11.0

jimdempseyatthecove · ‎05-26-2009

as an experiment, what happens with:

int sum = 0;
int Height = img->getHeight();
int Width = img->getWidth();
for (int y=0; y<Height; y++) {
    for (int x=0; x<Width; x++) (*img)[y][x] = (*img)[y][x] + 1;
}
sum = (*img)[0][0];

then

int sum = 0;
int Height = img->getHeight();
int Width = img->getWidth();
for (int y=0; y<Height; y++) {
    uint8_t *row = (*img)[y];
    for (int x=0; x<Width; x++) row[x] = row[x] + 1;
}
sum = (*img)[0][0];

Sometimes you have to help the compiler optimization by hand

Jim Dempsey

lrep · ‎05-26-2009

The last version works:
Time C++ Version: 0.25s, MPixel/s: 625, Sum: 1
Time C Version: 0.25s, MPixel/s: 625, Sum: 1

But the problem is that without a function call that has access to the whole image (not just one row), it is not possible to develop a flexible framework. I want to separate the filter (development) from the calling of the filter.
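
One possible way to keep the filter separate from the loop that calls it is to make the filter a functor and the driver a template, so the call can be inlined into the inner loop. The sketch below is only an illustration: `ImageView`, `AddOne`, and `applyFilter` are hypothetical names that do not appear in the code above, and whether a particular compiler version vectorizes the resulting loop would still have to be checked with its vectorization report.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical lightweight view of the image. It is passed by value, so the
// loops work on local copies of the pointer and the dimensions.
struct ImageView {
    std::uint8_t *data;
    int width;
    int height;

    // Pixel access by (x, y); the filter can reach any pixel, not just one row.
    std::uint8_t &at(int x, int y) const {
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// Example filter (hypothetical): the point operator from the test program.
struct AddOne {
    void operator()(ImageView v, int x, int y) const {
        v.at(x, y) = static_cast<std::uint8_t>(v.at(x, y) + 1);
    }
};

// Generic driver: the filter type is a template parameter, so the compiler
// sees the filter body at the call site and may inline it.
template <typename Filter>
void applyFilter(ImageView v, Filter f) {
    const int w = v.width;   // loop bounds hoisted into locals
    const int h = v.height;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            f(v, x, y);
}
```

A new filter is then just another small struct with the same `operator()` signature, and the calling code stays unchanged.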

TimP · ‎05-26-2009

Could you set -ansi-alias? It's difficult for a compiler to optimize such things if you tell it not to assume your code complies with standards on aliasing. In this case, the only way that could be a legitimate issue is due to the "array of arrays" implications of your code, and for the -ansi-alias option to help, it would have to distinguish data type int from data type int *.
In case this doesn't help, I've noticed also that g++ sometimes does a better job than icpc of vectorizing based on in-lining a function inside the inner loop, and I've worn out my welcome filing compiler performance issues on this. Customer interest in vectorization under such circumstances is often assumed to be low, so it wouldn't hurt to file an issue on premier.intel.com explaining the advantages of optimizing here.

jimdempseyatthecove · ‎05-26-2009

Have you considered having the filter function called on the array of objects, as opposed to being called for each object? That is, the filter function is passed the array of objects.

To put it in other words, the filter is passed a vector of objects. This would be problematic with opaque objects.
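
A minimal sketch of this idea, assuming the "vector of objects" is one image row: the filter receives a pointer to the row and its length, and contains the inner loop itself. The names `AddOneRow` and `applyRows` are made up for illustration and are not from the original test program.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical row-wise filter: it is handed a whole row instead of one pixel,
// so the inner loop is a plain loop over a local pointer.
struct AddOneRow {
    void operator()(std::uint8_t *row, int n) const {
        for (int x = 0; x < n; ++x)
            row[x] = static_cast<std::uint8_t>(row[x] + 1);
    }
};

// Driver: walks the rows of a width*height buffer and hands each row to the filter.
template <typename RowFilter>
void applyRows(std::uint8_t *data, int width, int height, RowFilter f) {
    for (int y = 0; y < height; ++y)
        f(data + static_cast<std::size_t>(y) * width, width);
}
```

The trade-off is the one already raised in the thread: a filter written this way only sees one row at a time.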

Jim

levicki · ‎05-28-2009

There are reasons why C or ASM are used in some cases and for some tasks.

C++ is in my opinion ill-suited for this kind of work, not to mention that you will be risking regressions in performance critical code because the next compiler version might not vectorize/optimize the code in the same way the previous compiler version did.

If you really want to trust the performance-critical parts of the code to an external, uncontrollable factor, then go ahead. Otherwise, you need to rethink your approach and determine what you are trying to accomplish and how much gain can be expected from doing it.

Finally, if the above code sample can be taken as a measure of overall code performance, I would dare to say that you are wasting your time. A library that uses new[] to allocate (unaligned) memory for image processing doesn't sound too optimized to me.
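
For illustration, an aligned pixel buffer could be allocated roughly as follows. This is a sketch under assumptions: it relies on `std::aligned_alloc` (C++17, not available with every compiler), it picks 16 bytes because that matches SSE2 register width, and the helper names `allocAlignedPixels` and `freeAlignedPixels` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocate width*height bytes starting on a 16-byte boundary.
inline std::uint8_t *allocAlignedPixels(int width, int height) {
    const std::size_t alignment = 16;
    std::size_t bytes = static_cast<std::size_t>(width) * height;
    // Round the size up to a multiple of the alignment, which is the safe
    // way to call std::aligned_alloc.
    bytes = (bytes + alignment - 1) / alignment * alignment;
    return static_cast<std::uint8_t *>(std::aligned_alloc(alignment, bytes));
}

// Memory from std::aligned_alloc is released with std::free.
inline void freeAlignedPixels(std::uint8_t *p) {
    std::free(p);
}
```

Note that only the start of the buffer is aligned here; each row starts on a 16-byte boundary only if the width is itself a multiple of 16, as it is for the 8192-pixel rows in the test program.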

If I were you, I would consider using IPP for image processing rather than reinventing the wheel.