Solved: Re: 2D array on GPU with USM

leilag · 07-03-2021

Hello,

I am porting my code to DPC++ but I have run into a problem. I have narrowed down the problem to this unit test.

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#if FPGA || FPGA_EMULATOR
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif

using namespace sycl;

#define M 4
#define N 5
#define M_LEN (M + 2)
#define N_LEN (N + 2)
#define DOMAIN_SIZE M_LEN*N_LEN
#define DIM 1


void VecAdd(queue &q, range<DIM> R, const int a[DOMAIN_SIZE], const int b[DOMAIN_SIZE], int sum[DOMAIN_SIZE]) {

  auto e = q.parallel_for(R, [=](auto i) { 
      sum[i] = a[i] + b[i]; 
  });

  e.wait();
}

int main() {
    auto R = range<1>{DOMAIN_SIZE};
    default_selector d_selector;
    queue q(d_selector);
    std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;
    
    int **u = malloc_shared<int *>(3*DOMAIN_SIZE, q);
    int **v = malloc_shared<int *>(3*DOMAIN_SIZE, q);
    int **p = malloc_shared<int *>(3*DOMAIN_SIZE, q);
    
    int u_[3][DOMAIN_SIZE]; int *_u_[3] = {u_[0], u_[1], u_[2]}; u = _u_;
    int v_[3][DOMAIN_SIZE]; int *_v_[3] = {v_[0], v_[1], v_[2]}; v = _v_;
    int p_[3][DOMAIN_SIZE]; int *_p_[3] = {p_[0], p_[1], p_[2]}; p = _p_;
    
    auto e = q.parallel_for(R, [=](auto i) { 
        u[0][i] = i;
        v[0][i] = 2*i;
    });
    
    VecAdd(q, R, u[0], v[0], p[0]);
    
    for (int i=0; i<DOMAIN_SIZE; i++)
      std::cout << "p[0][" << i << "] = " << p[0][i] << std::endl;
    
    free(u, q);
    free(v, q);
    free(p, q);
    
    return 0;
}

This code compiles but throws the following error:

terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Native API failed. Native API returns: -30 (CL_INVALID_VALUE) -30 (CL_INVALID_VALUE)
Aborted

As discussed previously here, I decided to switch from the buffer model to USM. This kind of array declaration was tested and worked fine with the buffer model. Moreover, on CPU this code prints the correct output but still throws the same error.

I don't understand what I am doing wrong here or what the error means.

Could you please help me with this?

Thanks,

Leila

@NoorjahanSk_Intel

NoorjahanSk_Intel · 07-05-2021

Hi,

Thanks for reaching out to us.

We are also able to reproduce the same issue on our end.

We are looking into your issue internally. We will get back to you soon.

Meanwhile, could you please provide the following environment details:

Compiler version

OS and its version

Thanks & Regards

Noorjahan.

leilag · 07-06-2021

Hi,

Thank you for looking into this.

I am running the code on Intel DevCloud. I don't know where to look up the versions.

Thanks,

Leila

NoorjahanSk_Intel · 07-12-2021

Hi,

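The main cause of your error is the way you are allocating memory. malloc_shared returns USM memory that both the host and the device can access, whereas u_ and _u_ are ordinary host arrays on the stack. The assignment u = _u_ overwrites the USM pointer with a stack address, so the kernel dereferences host-only memory, the original USM allocation is leaked, and free(u, q) is later called on a pointer that USM never allocated. You are trying to merge the two methods.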

Instead of this: >> int u_[3][DOMAIN_SIZE]; int *_u_[3] = {u_[0], u_[1], u_[2]}; u = _u_; allocate each row with USM: >> u[0] = malloc_shared<int>(DOMAIN_SIZE, q);
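The difference between repointing the table and filling it can be sketched in plain host C++. This is only an illustration: std::malloc stands in for malloc_shared so that it runs without a SYCL device, and the names ROWS, COLS, repoint_loses_allocation and fill_table_and_sum are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>

constexpr int ROWS = 3;
constexpr int COLS = 42;  // same value as DOMAIN_SIZE in this thread

// Broken pattern: the allocated table is overwritten by a stack address.
// Returns true if the pointer owned by the allocator was lost.
bool repoint_loses_allocation() {
    int **table = static_cast<int **>(std::malloc(ROWS * sizeof(int *)));
    assert(table != nullptr);
    int **owned = table;           // remember what the allocator returned
    int storage[ROWS][COLS];
    int *rows[ROWS] = {storage[0], storage[1], storage[2]};
    table = rows;                  // table now points at stack memory
    bool lost = (table != owned);  // freeing `table` here would be invalid
    std::free(owned);              // release the real allocation
    return lost;
}

// Fixed pattern: keep the table and give every slot its own allocation.
int fill_table_and_sum() {
    int **table = static_cast<int **>(std::malloc(ROWS * sizeof(int *)));
    assert(table != nullptr);
    for (int r = 0; r < ROWS; ++r) {
        table[r] = static_cast<int *>(std::malloc(COLS * sizeof(int)));
        assert(table[r] != nullptr);
    }
    for (int c = 0; c < COLS; ++c) table[0][c] = c;
    int sum = 0;
    for (int c = 0; c < COLS; ++c) sum += table[0][c];
    for (int r = 0; r < ROWS; ++r) std::free(table[r]);  // rows first
    std::free(table);                                    // then the table
    return sum;
}
```

With USM the same shape applies: every pointer the kernel dereferences, including the table itself, has to come from a USM allocation function, and each allocation is released with its own free(ptr, q).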

We also need to call e.wait() after every parallel_for. A SYCL queue is out of order by default, so without the wait neither the host nor a later kernel that depends on the data is guaranteed to see the finished results.

>> I don't know where to look up the versions.

You can check the compiler version with the --version flag, for example: dpcpp --version

If your input size is small, you can instead use a single 1D allocation per array and address element (row, column) as row*array_width+column.

The complete snippet is below:
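As a rough sketch of the 1D layout mentioned above, the following plain C++ stores three rows in one contiguous block and indexes it with row*width+column. It uses std::vector on the host so it runs without a device; flat_index, build_rows, ROWS and WIDTH are hypothetical names, and the row contents only loosely mirror the thread (row 0 holds i, row 1 holds 2*i, row 2 holds their sum).

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t ROWS = 3;
constexpr std::size_t WIDTH = 42;  // same value as DOMAIN_SIZE in this thread

// Map (row, column) to an offset inside one contiguous block.
inline std::size_t flat_index(std::size_t row, std::size_t col,
                              std::size_t width) {
    return row * width + col;
}

// Fill row 0 with i, row 1 with 2*i, and row 2 with their sum.
std::vector<int> build_rows() {
    std::vector<int> data(ROWS * WIDTH, 0);
    for (std::size_t i = 0; i < WIDTH; ++i) {
        const int value = static_cast<int>(i);
        data[flat_index(0, i, WIDTH)] = value;
        data[flat_index(1, i, WIDTH)] = 2 * value;
        data[flat_index(2, i, WIDTH)] =
            data[flat_index(0, i, WIDTH)] + data[flat_index(1, i, WIDTH)];
    }
    return data;
}
```

In a USM version the vector would presumably become one malloc_shared<int>(ROWS * WIDTH, q) allocation that is released with a single free, which removes the table of row pointers altogether.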

#include <CL/sycl.hpp>
#include <iostream>
#if FPGA || FPGA_EMULATOR
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif

using namespace sycl;

#define M 4
#define N 5
#define M_LEN (M + 2)
#define N_LEN (N + 2)
constexpr size_t DOMAIN_SIZE = M_LEN * N_LEN;
constexpr size_t NUM_ROWS = 3;  // rows per 2D array

// Element-wise sum of two USM arrays of length `size`.
void VecAdd(queue &q, size_t size, const int *a, const int *b, int *sum) {
    range<1> num_items{size};
    auto e = q.parallel_for(num_items, [=](auto i) {
        sum[i] = a[i] + b[i];
    });
    e.wait();
}

int main() {
    auto R = range<1>{DOMAIN_SIZE};
    default_selector d_selector;
    queue q(d_selector);
    std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

    // Tables of row pointers, allocated with USM so the device can read them.
    int **u = malloc_shared<int *>(NUM_ROWS, q);
    int **v = malloc_shared<int *>(NUM_ROWS, q);
    int **p = malloc_shared<int *>(NUM_ROWS, q);
    for (size_t i = 0; i < NUM_ROWS; i++) {
        // Each row is its own USM allocation.
        u[i] = malloc_shared<int>(DOMAIN_SIZE, q);
        v[i] = malloc_shared<int>(DOMAIN_SIZE, q);
        p[i] = malloc_shared<int>(DOMAIN_SIZE, q);
    }

    auto e = q.parallel_for(R, [=](auto i) {
        u[0][i] = i;
        v[0][i] = 2 * i;
    });
    e.wait();  // u[0] and v[0] must be complete before VecAdd reads them

    VecAdd(q, DOMAIN_SIZE, u[0], v[0], p[0]);

    for (size_t i = 0; i < DOMAIN_SIZE; i++)
        std::cout << "p[0][" << i << "] = " << p[0][i] << std::endl;

    // Free the rows first, then the pointer tables.
    for (size_t i = 0; i < NUM_ROWS; i++) {
        free(u[i], q);
        free(v[i], q);
        free(p[i], q);
    }
    free(u, q);
    free(v, q);
    free(p, q);
    return 0;
}

Let us know if it helps.

Thanks & Regards

Noorjahan

leilag · 07-13-2021

Hello Noorjahan,

Thank you for taking the time to debug my code. It did resolve the issue.

All the best,

Leila

NoorjahanSk_Intel · 07-13-2021

Hi,

Thank you for accepting this as a solution.

As this issue has been resolved, we will no longer respond to this thread.

If you require any additional assistance from Intel, please start a new thread.

Thanks & Regards

Noorjahan.