Parallel Image Processing in OpenMP - Image Blocks

Royi · ‎03-29-2015

Hello,
I'm taking my first steps in the OpenMP world.

I have an image I want to apply a filter to.
Since the image is large, I wanted to break it into non-overlapping parts and apply the filter to each one independently, in parallel.
Namely, I'm creating 4 sub-images that I want different threads to process.

I'm using Intel IPP to handle the images and for the function applied to each sub-image.

I described the code here:

http://stackoverflow.com/questions/29319226/parallel-image-processing-in-openmp-splitting-image

The problem is that I tried both sections and parallel for, and got only a 20% improvement.

What am I doing wrong?
How can I tell each "Worker" that, although the data is taken from the same array, it is safe to read (the data won't change) and to write (each worker has exclusive access to its part of the result image)?

Thank You.

jimdempseyatthecove · ‎03-30-2015

The "usual suspects" for poor parallelization improvement:

1) Your test code is set up to time one pass through one parallel region. IOW, the runtime includes the time to initialize the OpenMP thread pool and to perform any first-touch memory mapping. To avoid this, place a loop around your test code and gather the runtime for each iteration. Reject any timing that is out of line with the others.

2) The amount of work in the parallel region is too small to cover the overhead. If, for example, you are processing multiple images, it may be more effective for each thread to process a whole image by itself.

3) The way you partition the work is not favorable to processing it. In the example code at your link, it might be more efficient to partition into four horizontal stripes as opposed to four quadrants. Keep in mind that your processor can (if you program correctly) process small vectors using SSE/AVX/... You will want to structure the computation so that the innermost loop can use SIMD and so that the SIMD loop has the most favorable run length. What that run length is depends on what you do with the data (with respect to cache reuse).
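Points 1) and 3) can be sketched together roughly as follows. This is only an illustration under assumptions: blurRows is a hypothetical 3-tap box blur standing in for the real IPP call, and blurStriped and timePasses are names made up for this example. The input is only read and each stripe writes only its own rows of the output, so the stripes need no locking.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real filter: 3-tap horizontal box blur
// applied to rows [rowBegin, rowEnd). src is read-only; only rows in the
// given range of dst are written.
void blurRows(const std::uint8_t* src, std::uint8_t* dst,
              int width, int rowBegin, int rowEnd)
{
    for (int y = rowBegin; y < rowEnd; ++y) {
        const std::uint8_t* in = src + static_cast<std::size_t>(y) * width;
        std::uint8_t* out = dst + static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x) {          // unit-stride inner loop
            int left  = in[std::max(x - 1, 0)];
            int right = in[std::min(x + 1, width - 1)];
            out[x] = static_cast<std::uint8_t>((left + in[x] + right) / 3);
        }
    }
}

// One pass over the image, split into nStripes horizontal stripes.
void blurStriped(const std::vector<std::uint8_t>& src,
                 std::vector<std::uint8_t>& dst,
                 int width, int height, int nStripes)
{
    const std::uint8_t* in = src.data();
    std::uint8_t* out = dst.data();
    #pragma omp parallel for schedule(static)
    for (int s = 0; s < nStripes; ++s) {
        int rowBegin = static_cast<int>(static_cast<long long>(height) * s / nStripes);
        int rowEnd   = static_cast<int>(static_cast<long long>(height) * (s + 1) / nStripes);
        blurRows(in, out, width, rowBegin, rowEnd);
    }
}

// Runs the pass `passes` times and returns the wall time of each pass in
// seconds. The first pass includes thread pool start-up and first-touch
// page mapping, so it is expected to be an outlier.
std::vector<double> timePasses(const std::vector<std::uint8_t>& src,
                               std::vector<std::uint8_t>& dst,
                               int width, int height, int nStripes, int passes)
{
    std::vector<double> seconds;
    for (int p = 0; p < passes; ++p) {
        auto t0 = std::chrono::steady_clock::now();
        blurStriped(src, dst, width, height, nStripes);
        auto t1 = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    return seconds;
}
```

Calling timePasses(src, dst, width, height, 4, 10) and comparing the first entry with the rest should show how much of a single-pass measurement is start-up cost. Build with OpenMP enabled (for example -fopenmp or /openmp); without it the pragma is ignored and the code runs serially.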

Jim Dempsey

Royi · ‎03-30-2015

Hi Jim,

I'm using an Intel IPP function to filter the image; it is optimized to use SSE / AVX.

We tried partitioning row-wise and column-wise, with the same result.

We want to time the whole operation.
From getting an image as input to creating the output.

Some suggested this is a memory-bound operation.
How can I confirm that?

Is there a more efficient way to create threads using OpenMP?

jimdempseyatthecove · ‎03-30-2015

>>Some suggested this is a memory-bound operation. How can I confirm that?

VTune

Memory-bound operations can sometimes be processed in a different order that is friendlier to memory layout and cache utilization. Can you disclose your filter function?
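Short of a VTune run, a rough sanity check is to compare the bandwidth the filter actually achieves with that of a plain copy of the same size. The sketch below is an assumption-laden yardstick, not a replacement for a profiler: effectiveGBps and bestCopySeconds are names invented here, and a single-threaded memcpy only approximates what the memory system can deliver.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <limits>
#include <vector>

// Effective bandwidth in GB/s of a pass that reads bytesRead and writes
// bytesWritten in the given number of seconds.
double effectiveGBps(std::size_t bytesRead, std::size_t bytesWritten, double seconds)
{
    return (static_cast<double>(bytesRead) + static_cast<double>(bytesWritten))
           / seconds / 1e9;
}

// Best-of-N wall time, in seconds, of copying `bytes` bytes (bytes > 0).
// Taking the minimum discards warm-up and first-touch effects.
double bestCopySeconds(std::size_t bytes, int repeats)
{
    std::vector<unsigned char> src(bytes, 1), dst(bytes, 0);
    volatile unsigned char sink = 0;   // keeps the compiler from discarding the copy
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        std::memcpy(dst.data(), src.data(), bytes);
        auto t1 = std::chrono::steady_clock::now();
        sink = dst[bytes - 1];
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    (void)sink;
    return best;
}
```

If effectiveGBps for the filter pass (image bytes read, image bytes written, measured seconds) comes out close to effectiveGBps(bytes, bytes, bestCopySeconds(bytes, 10)), the filter is probably limited by memory traffic, and adding threads is unlikely to help much.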

>>Is there a more efficient way to create threads using OpenMP?

This depends on the work that needs to be performed. For example, if you have a series of frames to filter:

#include <cstddef> // size_t
#include <cstdint> // uint8_t

struct Frame
{
  int bufferNumber; // 0, 1, ... nFrames-1
  uint8_t* buffer; // buffer[bufferSize];
  ...
  Frame(int n, size_t size)
  {
    bufferNumber = n;
    buffer = new uint8_t[size];
  }
  ~Frame() { delete [] buffer; } // delete [] on a null pointer is a no-op
};

int nFrames = 0; // You determine number of buffers (should be > nThreads)
Frame** Frames = NULL; // [nFrames], array of pointers to Frame
...
nFrames = omp_get_max_threads() + someMore;
Frames = new Frame*[nFrames];
for(int i=0; i < nFrames; ++i)
  Frames[i] = new Frame(i, YourFrameSize);

volatile int frame_empty_index = 0;
volatile int frame_fill_index = 0;
#pragma omp parallel
{
  #pragma omp master
  {
    for(Frame* frame=get_next_frame(); frame;  frame=get_next_frame())
    {
      #pragma omp task firstprivate(frame)
      {
         doWork(frame);
         // write_frame waits for this frame's turn, writes it,
         // and advances frame_fill_index
         write_frame(frame);
      }
    } // for
  } // master
}


//
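Frame* get_next_frame()
{
   // wait for a free buffer (the ring is full when the next slot is the one still to be written)
   while((frame_empty_index + 1) % nFrames == frame_fill_index)
     Sleep(0); // release time slice
   Frame* frame = Frames[frame_empty_index];
   frame_empty_index = (frame_empty_index + 1) % nFrames;
   // Assumption: read_frame returns false once the input is exhausted.
   // Returning NULL is what ends the for loop in the master section.
   if(!read_frame(frame))
     return NULL;
   return frame;
}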
void write_frame(Frame* frame)
{
  while(frame->bufferNumber != frame_fill_index)
    Sleep(0);
  write(frame);
  frame_fill_index = (frame_fill_index + 1) % nFrames;
}

void doWork(Frame* frame)
{
...
}

Something like the above

Jim Dempsey

Royi · ‎03-31-2015

The filter is a separable convolution, implemented with Intel IPP's separable convolution function.
The code is here:

http://stackoverflow.com/questions/29319226/parallel-image-processing-in-openmp-splitting-image

jimdempseyatthecove · ‎04-01-2015

>>From getting an image as input to creating the output.

Case 1:

Your application processes a single image. Or
your application processes multiple images, where frames come in at long intervals.

In either case, you need the lowest latency between image input and image output.

Case 2:

Multiple images are coming in quick succession and you require the highest throughput.

For case 1, if your image comes in a row at a time, partition your image space by rows, then spawn a task as each partition comes in. Experiment with the number of rows per partition.
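A minimal sketch of the case 1 idea, under stated assumptions: receiveRows is a hypothetical stand-in for whatever delivers the rows (here it just fills each row with its row index), and filterRows is a hypothetical point filter that inverts pixels. A real convolution would also need the neighbouring rows above and below each partition to have arrived before its task is spawned.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical acquisition step: makes rows [rowBegin, rowEnd) available.
// Here it fills each row with its own row index.
void receiveRows(std::uint8_t* image, int width, int rowBegin, int rowEnd)
{
    for (int y = rowBegin; y < rowEnd; ++y)
        for (int x = 0; x < width; ++x)
            image[static_cast<std::size_t>(y) * width + x] = static_cast<std::uint8_t>(y);
}

// Hypothetical per-partition filter: inverts the pixels of rows [rowBegin, rowEnd).
void filterRows(const std::uint8_t* src, std::uint8_t* dst,
                int width, int rowBegin, int rowEnd)
{
    for (int y = rowBegin; y < rowEnd; ++y)
        for (int x = 0; x < width; ++x) {
            std::size_t i = static_cast<std::size_t>(y) * width + x;
            dst[i] = static_cast<std::uint8_t>(255 - src[i]);
        }
}

// One thread receives partitions in order and spawns a task for each one;
// the other threads in the team pick the tasks up as they appear.
void processAsRowsArrive(std::vector<std::uint8_t>& src,
                         std::vector<std::uint8_t>& dst,
                         int width, int height, int rowsPerPartition)
{
    std::uint8_t* in = src.data();
    std::uint8_t* out = dst.data();
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int rowBegin = 0; rowBegin < height; rowBegin += rowsPerPartition) {
                int rowEnd = std::min(rowBegin + rowsPerPartition, height);
                receiveRows(in, width, rowBegin, rowEnd);
                #pragma omp task firstprivate(rowBegin, rowEnd)
                filterRows(in, out, width, rowBegin, rowEnd);
            }
        }
    }   // all tasks have completed when the parallel region ends
}
```

rowsPerPartition is the knob to experiment with: small partitions start filtering sooner but pay more task overhead, large ones do the opposite.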

For case 2, use my suggestion in #4 as a starting point.

Jim Dempsey