Re:Blur kernel - interop - performance

RN1 · ‎11-13-2020

Hi,

I have two questions.

Question A) Why this kernel is slower than the opencl version? It does not matter if is the CPU or the GPU, but I get something like (OCL vs ONE):

- size 1024: CPU 5s vs 11s

- size 3072: CPU 45s vs 82s

- size 1024: GPU 0.5s vs 1.1s

- size 3072: GPU 4s vs 7s

One interesting thing is that the kernel computes in something close to the OCL version (not as fast, but near), but the whole program is much slower. For example, for a 1024 size in GPU:

init containers
init filter
Device: Intel(R) Graphics Gen9 [0x5912]
Time taken by data init + queue is: 1.3584
Time taken by queue is : 1.054370 sec
Time taken by kernel is : 1.054352 sec
Time taken from init : 1.358406 sec
profiling queue ek 0.615206
res: 0x5100000 => [0](48,56,52,0), [1048575](54,42,54,0)
1.19user 0.32system 0:01.53elapsed 99%CPU (0avgtext+0avgdata 158636maxresident)k
0inputs+12304outputs (0major+50631minor)pagefaults 0swaps

The profiling queue ek is 0.6s, but the program is 1.35s.
On the other side, in OpenCL I get 0.4s for the kernel execution and 0.9s to run the whole program.

==> Am I doing something wrong? Is the kernel properly ported?

Because the conclusion here is shocking: raw OpenCL is faster both in kernel computing and total program execution than OneAPI, for this specific case and kernel type (blur).

#include <CL/sycl.hpp>
#include <iostream>
#include <iomanip>
#include <ctime>
#include <sys/time.h>
#include <omp.h>
#include <cstdlib>


// using namespace std;
using sycl::uchar;
using sycl::uchar4;
using sycl::float4;
// using sycl::queue;
using sycl::buffer;
using sycl::accessor;
using sycl::range;

using std::ios_base;
using std::cout;
using std::setprecision;
using std::fixed;
using std::ostream;

namespace dpc {
// this exception handler with catch async exceptions
static auto exception_handler = [](cl::sycl::exception_list eList) {
  for (std::exception_ptr const &e : eList) {
    try {
      std::rethrow_exception(e);
    } catch (std::exception const &e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};

static double ReportTime(sycl::event e) {
  cl_ulong time_start =
      e.get_profiling_info<sycl::info::event_profiling::command_start>();

  cl_ulong time_end =
      e.get_profiling_info<sycl::info::event_profiling::command_end>();

  double elapsed = (time_end - time_start) / 1e6;
  return elapsed;
}

}

// No matter, i will use USM here
#if !defined(USM)
#error define USM to 0 or 1
#endif

#ifdef __SYCL_DEVICE_ONLY__
#define CONSTANT __attribute__((opencl_constant))
#else
#define CONSTANT
#endif

typedef float op_type;

constexpr int size = 1024 + 2048; //2048;
//constexpr int size = 1024; //2048;
constexpr int filter_size = 61;

const int its = 3;

inline ostream &
operator<<(ostream &os, uchar4 &t) {
  os << "(" << (int) t.x() << "," << (int) t.y() << "," << (int) t.z() << "," << (int) t.w() << ")";
  return os;
}

void fill_filter(float* filter, int filter_width) {
  const float sigma = 2.f;

  const int half = filter_width / 2;
  float sum = 0.f;

  int r;
  for (r = -half; r <= half; ++r) {
    int c;
    for (c = -half; c <= half; ++c) {
      float weight = expf(-(float) (c * c + r * r) / (2.0f * sigma * sigma));
      int idx = (r + half) * filter_width + c + half;

      filter[idx] = weight;

      sum += weight;
    }
  }

  float normal = 1.0f / sum;

  for (r = -half; r <= half; ++r) {
    int c;
    for (c = -half; c <= half; ++c) {
      int idx = (r + half) * filter_width + c + half;

      filter[idx] *= normal;
      // res[idx] *= normal;
    }
  }
}

int main(int argc, char *argv[]) {
  struct timeval init;
  gettimeofday(&init, NULL);

  std::chrono::high_resolution_clock::time_point tStart = std::chrono::high_resolution_clock::now();

  const int N = size * size;
  const int Nfilter = filter_size * filter_size;

  cl::sycl::property_list prop_list = cl::sycl::property_list{cl::sycl::property::queue::enable_profiling()};
  cl::sycl::queue q(sycl::gpu_selector{}, dpc::exception_handler, prop_list);

  uchar4* input = (uchar4*)malloc_shared(N * sizeof(uchar4), q);
  float* filter = (float*)malloc_shared(N * sizeof(float), q);
  uchar4* blurred = (uchar4*)malloc_shared(N * sizeof(uchar4), q);


  srand(0);
  for (auto i = 0; i<N; ++i) {
    // vinput[i] = uchar4{55};
    input[i] = uchar4{rand() % 256, rand() % 256, rand() % 256, 0};
    blurred[i] = uchar4{0};
  }

  cout << "init containers\n";

  fill_filter(filter, filter_size);

  cout << "init filter\n";

  auto R = sycl::range<1>(N);

  struct timeval start, compute, end;
  gettimeofday(&start, NULL);
  sycl::event er, ew, ek;
  try {

    cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

    range<1> Rinput(N);

    range<1> workitems(N);

    gettimeofday(&compute, NULL);
    ios_base::sync_with_stdio(false);
    const int cols = size;
    const int rows = size;

    ek = q.submit([&](auto &h){

      h.parallel_for(workitems, [=](auto index){
        auto tid = index.get_linear_id();

        int r = tid / cols;
        int c = tid % cols;

        int middle = filter_size / 2;
        float4 blur{0.f};

        int width = cols - 1;
        int height = rows - 1;

        for (int i = -middle; i <= middle; ++i) // rows
        {
          for (int j = -middle; j <= middle; ++j) // columns
          {

            int h = r + i;
            int w = c + j;
            if (h > height || h < 0 || w > width || w < 0) {
              continue;
            }

            int idx = w + cols * h; // current pixel index

            float4 pixel = input[idx].template convert<float>();

            idx = (i + middle) * filter_size + j + middle;
            float weight = filter[idx];

            blur += pixel * weight;
          }
        }

        blurred[tid] = (cl::sycl::round(blur)).convert<uchar>();
      });
    });


    ek.wait();

  } catch (sycl::exception const &e) {
    cout << "sycl exception: " << e.what() << "\n";
    std::terminate();
  }

  gettimeofday(&end, NULL);
  auto tTemp = std::chrono::high_resolution_clock::now();
  auto diffTemp = (tTemp - tStart).count();
  auto diffTempS = diffTemp / 1e9;
  printf("Time taken by data init + queue is: %6.4f\n", diffTempS);
  double time_taken;
  time_taken = (end.tv_sec - start.tv_sec) * 1e6;
  time_taken = (time_taken + (end.tv_usec - start.tv_usec)) * 1e-6;
  cout << "Time taken by queue is : " << fixed << time_taken << setprecision(6) << " sec " << "\n";
  time_taken = (end.tv_sec - compute.tv_sec) * 1e6;
  time_taken = (time_taken + (end.tv_usec - compute.tv_usec)) * 1e-6;
  cout << "Time taken by kernel is : " << fixed << time_taken << setprecision(6) << " sec " << "\n";
  time_taken = (end.tv_sec - init.tv_sec) * 1e6;
  time_taken = (time_taken + (end.tv_usec - init.tv_usec)) * 1e-6;
  cout << "Time taken from init : " << fixed << time_taken << setprecision(6) << " sec " << "\n";

  auto profilingQueueS = dpc::ReportTime(ek) / 1000.0;
  cout << "profiling queue ek " << profilingQueueS << "\n";

  cout << "res: " << blurred << " => [0]" << blurred[0] << ", [" << N-1 << "]" << blurred[N - 1] << "\n";

  sycl::free(input, q);
  sycl::free(filter, q);
  sycl::free(blurred, q);
  return 0;
}

The OpenCL Kernel can be seen also in the next section:

Question B) As another test, I tried to see what happens if I give the OpenCL kernel to do interoperation with OneAPI, but I get always CL_INVALID_OPERATION. What could be wrong?

    program p(q.get_context());
    p.build_with_source(R"CLC(
__kernel void
gaussian_blur(__global uchar4* blurred,
              __global uchar4* input,
              int rows,
              int cols,
              __global float* filterWeight,
              int filterWidth)
{
  int tid = get_global_id(0);

  if (tid < rows * cols) {
    int r = tid / cols; // current row
    int c = tid % cols; // current column

    int middle = filterWidth / 2;
    float blurX = 0.f; // will contained blurred value
    float blurY = 0.f; // will contained blurred value
    float blurZ = 0.f; // will contained blurred value
    int width = cols - 1;
    int height = rows - 1;

    for (int i = -middle; i <= middle; ++i) // rows
    {
      for (int j = -middle; j <= middle; ++j) // columns
      {
        // Clamp filter to the image border
        // int h=min(max(r+i, 0), height);
        // int w=min(max(c+j, 0), width);

        int h = r + i;
        int w = c + j;
        if (h > height || h < 0 || w > width || w < 0) {
          continue;
        }

        // Blur is a product of current pixel value and weight of that pixel.
        // Remember that sum of all weights equals to 1, so we are averaging sum
        // of all pixels by their weight.
        int idx = w + cols * h; // current pixel index
        float pixelX = (input[idx].x);
        float pixelY = (input[idx].y);
        float pixelZ = (input[idx].z);

        idx = (i + middle) * filterWidth + j + middle;
        float weight = filterWeight[idx];

        blurX += pixelX * weight;
        blurY += pixelY * weight;
        blurZ += pixelZ * weight;
      }
    }

    // if (tid == 2592){
    //    printf("%f %f %d %d\n", blurZ, round(blurZ), (unsigned char)round(blurZ),
    //    (int)round(blurZ)); printf("%f %d\n", blurZ, convert_uchar_rte(blurZ)); printf("%f %d\n",
    //    blurZ, convert_uchar_rtz(blurZ)); printf("%f %d\n", blurZ, convert_uchar_rtp(blurZ));
    // }

    // blurred[tid].x = (unsigned char)(blurX);
    // blurred[tid].y = (unsigned char)(blurY);
    // blurred[tid].z = (unsigned char)(blurZ);

    blurred[tid].x = (unsigned char)round(blurX);
    blurred[tid].y = (unsigned char)round(blurY);
    blurred[tid].z = (unsigned char)round(blurZ);
  }
}
)CLC",
                        "-cl-std=CL1.2");

// ...

    ek = q.submit([&](auto &h) {
      auto input = input_buf.get_access<sycl::access::mode::read>(h);
      auto filter = filter_buf.get_access<sycl::access::mode::read>(h);
      auto blurred = blurred_buf.get_access<sycl::access::mode::discard_write>(h);

      h.set_args(blurred, input, size, size, filter, filter_size);

      h.parallel_for(workitems, p.get_kernel("gaussian_blur"));
    });

Thanks

RN1 · ‎11-13-2020

Regarding Question B. I achieved to run it in the CPU, but not in the GPU (the previous error). Although it says it has a compiler and linker available.

What is important is that the OpenCL CPU version interop with OneAPI has the same performance as the raw OpenCL CPU version. This is fantastic, but we have here a problem: how to achieve that performance with pure OneAPI? What was wrong with the code shown in Question A?

I have seen the same behavior with other kernels... pure OneAPI degrades performance compared with pure OpenCL.

GouthamK_Intel · ‎11-16-2020

Hi,

Thanks for reaching out to us!

As you are comparing performance between raw OpenCL with Intel oneAPI, we are forwarding this thread to Subject Matter Expert who can address your queries better.

Thanks & Regards

Goutham

Sravani_K_Intel · ‎11-24-2020

Could you please mention the versions of OpenCL and oneAPI Tools used to demonstrate this behavior?

It would also help if you can upload the raw OpenCL and OpenCL interop versions of the programs.