Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel ICX Compiler, Intel® DPC++ Compatibility Tool, and GDB*

2D array on GPU with USM

leilag
Novice
1,979 Views

Hello,

 

I am porting my code to DPC++ but I have run into a problem. I have narrowed down the problem to this unit test.

 

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#if FPGA || FPGA_EMULATOR
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif

using namespace sycl;

#define M 4
#define N 5
#define M_LEN (M + 2)
#define N_LEN (N + 2)
#define DOMAIN_SIZE M_LEN*N_LEN
#define DIM 1


void VecAdd(queue &q, range<DIM> R, const int a[DOMAIN_SIZE], const int b[DOMAIN_SIZE], int sum[DOMAIN_SIZE]) {

  auto e = q.parallel_for(R, [=](auto i) { 
      sum[i] = a[i] + b[i]; 
  });

  e.wait();
}

int main() {
    auto R = range<1>{DOMAIN_SIZE};
    default_selector d_selector;
    queue q(d_selector);
    std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;
    
    int **u = malloc_shared<int *>(3*DOMAIN_SIZE, q);
    int **v = malloc_shared<int *>(3*DOMAIN_SIZE, q);
    int **p = malloc_shared<int *>(3*DOMAIN_SIZE, q);
    
    int u_[3][DOMAIN_SIZE]; int *_u_[3] = {u_[0], u_[1], u_[2]}; u = _u_;
    int v_[3][DOMAIN_SIZE]; int *_v_[3] = {v_[0], v_[1], v_[2]}; v = _v_;
    int p_[3][DOMAIN_SIZE]; int *_p_[3] = {p_[0], p_[1], p_[2]}; p = _p_;
    
    auto e = q.parallel_for(R, [=](auto i) { 
        u[0][i] = i;
        v[0][i] = 2*i;
    });
    
    VecAdd(q, R, u[0], v[0], p[0]);
    
    for (int i=0; i<DOMAIN_SIZE; i++)
      std::cout << "p[0][" << i << "] = " << p[0][i] << std::endl;
    
    free(u, q);
    free(v, q);
    free(p, q);
    
    return 0;
}

 

 

This code compiles but throws the following error:

 

terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Native API failed. Native API returns: -30 (CL_INVALID_VALUE) -30 (CL_INVALID_VALUE)
Aborted

 

 

As discussed previously here, I decided to switch from the buffer model to USM. This kind of array declaration was tested and worked fine with the buffer model. Moreover, on CPU this code prints the correct output but then fails with the same error.

I don't understand what I am doing wrong here and what the error says.

Could you please help me with this?

 

Thanks,

Leila

 

@NoorjahanSk_Intel 

0 Kudos
1 Solution
NoorjahanSk_Intel
Moderator
1,848 Views

Hi,

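The main cause of your error is the way you are allocating memory. Dynamic allocation uses heap memory, whereas static allocation uses stack memory, and you are trying to merge both methods. After the assignment u = _u_; the pointer u no longer refers to the memory returned by malloc_shared. It refers to a local array on the host stack, which the GPU cannot access, and free(u, q) is then called on a pointer that was never allocated through USM.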

Instead of this line: int u_[3][DOMAIN_SIZE]; int *_u_[3] = {u_[0], u_[1], u_[2]}; u = _u_; you can allocate each row with USM, for example: u[0] = malloc_shared<int>(DOMAIN_SIZE, q);
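As a rough illustration of why the original assignment breaks things, here is a plain C++ analogy. It is only a sketch: std::malloc stands in for malloc_shared, and the names RebindResult and rebind_demo are hypothetical, not part of the original code or of SYCL. It shows that assigning a stack array to the pointer copies an address, not data, so the pointer stops referring to the dynamic allocation.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical result type for this illustration only.
struct RebindResult {
    bool points_to_heap;   // does u still refer to the dynamic allocation?
    bool points_to_stack;  // does u now refer to the local array?
};

// Plain C++ analogy: std::malloc plays the role of malloc_shared.
inline RebindResult rebind_demo() {
    int *heap_block = static_cast<int *>(std::malloc(4 * sizeof(int)));
    assert(heap_block != nullptr);

    int *u = heap_block;                 // u refers to the dynamic allocation
    int stack_block[4] = {0, 0, 0, 0};   // local array on the stack
    u = stack_block;                     // rebinds u; no data is copied

    RebindResult result{u == heap_block, u == stack_block};

    // The allocation can only be released through the saved original pointer;
    // calling free on u at this point would pass a stack address to free.
    std::free(heap_block);
    return result;
}
```

Calling rebind_demo() yields points_to_heap == false and points_to_stack == true. In the original post the same thing happens to the pointer returned by malloc_shared, except that no copy of the original pointer is kept, so the USM allocation can no longer be reached or freed.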

You also need to call e.wait(); after each parallel_for, because the kernel is submitted asynchronously. Waiting ensures that the kernel has finished writing the data before any other operation reads it.

>> I don't know where to look up the versions.

You can check the version with the compiler's --version option, for example: dpcpp --version

If your input size is small, you can instead allocate a single 1D array and index it as row*array_width+column.

You can find below complete snippet:

 

 

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#if FPGA || FPGA_EMULATOR
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif

using namespace sycl;

#define M 4
#define N 5
#define M_LEN (M + 2)
#define N_LEN (N + 2)
constexpr size_t DOMAIN_SIZE = M_LEN*N_LEN;
#define DIM 1

// Element-wise sum of two USM arrays; returns after the kernel has finished.
void VecAdd(queue &q, size_t size, const int a[DOMAIN_SIZE], const int b[DOMAIN_SIZE], int sum[DOMAIN_SIZE]) {
    range<1> num_items{size};
    auto e = q.parallel_for(num_items, [=](auto i) {
        sum[i] = a[i] + b[i];
    });
    e.wait();
}

int main() {
    auto R = range<1>{DOMAIN_SIZE};
    default_selector d_selector;
    queue q(d_selector);
    std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

    // Arrays of row pointers, allocated with USM so the device can read them.
    // Only the first 3 entries are used below.
    int **u = malloc_shared<int *>(DOMAIN_SIZE, q);
    int **v = malloc_shared<int *>(DOMAIN_SIZE, q);
    int **p = malloc_shared<int *>(DOMAIN_SIZE, q);

    // Each row is its own USM allocation.
    for (int i = 0; i < 3; i++) {
        u[i] = malloc_shared<int>(DOMAIN_SIZE, q);
        v[i] = malloc_shared<int>(DOMAIN_SIZE, q);
        p[i] = malloc_shared<int>(DOMAIN_SIZE, q);
    }

    // Initialize row 0 of u and v on the device.
    auto e = q.parallel_for(R, [=](auto i) {
        u[0][i] = i;
        v[0][i] = 2*i;
    });
    e.wait();  // the kernel must finish before VecAdd reads u[0] and v[0]
    VecAdd(q, DOMAIN_SIZE, u[0], v[0], p[0]);

    for (size_t i = 0; i < DOMAIN_SIZE; i++)
        std::cout << "p[0][" << i << "] = " << p[0][i] << std::endl;

    // Free each row first, then the arrays of row pointers.
    for (int i = 0; i < 3; i++) {
        free(u[i], q);
        free(v[i], q);
        free(p[i], q);
    }
    free(u, q);
    free(v, q);
    free(p, q);
    return 0;
}
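The 1D alternative mentioned above (row*array_width+column) can be sketched as follows. This is a host-only illustration under stated assumptions: std::vector stands in for a single USM allocation, and the helper names flat_index and add_first_rows are hypothetical, not taken from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: offset of (row, col) in a row-major 1D array
// whose rows are `width` elements long.
inline std::size_t flat_index(std::size_t row, std::size_t col, std::size_t width) {
    assert(col < width);
    return row * width + col;
}

// Host-only stand-in for the kernels in the snippet: fills row 0 of u and v,
// then stores their element-wise sum in row 0 of p. Each array is one flat block.
inline std::vector<int> add_first_rows(std::size_t rows, std::size_t width) {
    std::vector<int> u(rows * width, 0);
    std::vector<int> v(rows * width, 0);
    std::vector<int> p(rows * width, 0);

    for (std::size_t col = 0; col < width; ++col) {
        u[flat_index(0, col, width)] = static_cast<int>(col);
        v[flat_index(0, col, width)] = static_cast<int>(2 * col);
    }
    for (std::size_t col = 0; col < width; ++col) {
        const std::size_t k = flat_index(0, col, width);
        p[k] = u[k] + v[k];
    }
    return p;
}
```

With USM, each std::vector here would presumably become one call such as malloc_shared<int>(3 * DOMAIN_SIZE, q), released with a single free(ptr, q). That avoids the array of row pointers and the per-row allocations entirely.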

 

Let us know if it helps.

 

Thanks & Regards

Noorjahan


5 Replies
NoorjahanSk_Intel
Moderator
1,939 Views

Hi,

Thanks for reaching out to us.

We are also able to reproduce the same issue on our end.

 We are looking into your issue internally. We will get back to you soon.

Meanwhile, could you please provide the following environment details:

  Compiler version

  OS and its version

 

Thanks & Regards

Noorjahan.

 

0 Kudos
leilag
Novice
1,899 Views

Hi,

 

Thank you for looking into this.

I am running the code on Intel DevCloud. I don't know where to look up the versions.

 

Thanks,

Leila

0 Kudos

leilag
Novice
1,827 Views

Hello Noorjahan,

 

Thank you for taking the time and debugging my code. It did resolve the issue.

 

All the best,

Leila

0 Kudos
NoorjahanSk_Intel
Moderator
1,816 Views

Hi,

Thank you for accepting this as a solution.

As this issue has been resolved, we will no longer respond to this thread.

If you require any additional assistance from Intel, please start a new thread.

 

Thanks & Regards

Noorjahan.

 
