Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

Error -50 (CL_INVALID_ARG_VALUE) for Intel iGPU

H__Kamil
Beginner
5,328 Views

Hello, 
I have a problem with execution of the following code:
 

#include <CL/sycl.hpp>
#include <iostream>
#include <array>
#include <cstdio>
using namespace std;
using namespace cl::sycl;

#define SIZE 20

struct DeviceData
{
    queue q;
    
    int* A;
    int* d_A;
    
    void init()
    {
        q = queue(gpu_selector{});
        
        A = (int* )malloc(sizeof(int)*SIZE);
        for(int i=0; i<SIZE; ++i)
            A[i] = i+1;

        d_A = (int *)malloc_device(sizeof(int)*SIZE, q.get_device(), q.get_context());
    }
     
    void test()
    {
        cout << "Before" << "\n";
        for(int i=0; i<SIZE; ++i)
        {
            cout << A[i] << " ";
        }
        cout << "\n";
   
        q.submit([&](handler &h){
            h.memcpy(d_A, A, sizeof(int)*SIZE);
        });
        q.wait();

        q.submit([&](handler &h){
            h.parallel_for(range<1>{SIZE}, [=](id<1> i){
                d_A[i] += 10;
            });
        });
        q.wait();
       
        q.submit([&](handler &h){
            h.memcpy(A, d_A, sizeof(int)*SIZE);
        });
        q.wait();
        
        cout << "After" << "\n";
        for(int i=0; i<SIZE; ++i)
            cout << A[i] << " ";
        cout << "\n";
    }
    void free()
    {
       free(A);
       free(d_A, q.get_context());
    }
};

int main()
{
    DeviceData dev;
    dev.init();
    dev.test(); 
    dev.free();
    return 0;
}

The above code works well on the CPU (with cpu_selector{}), but fails on the GPU with the error: OpenCL API failed. OpenCL API returns: -50 (CL_INVALID_ARG_VALUE) -50 (CL_INVALID_ARG_VALUE). The failure occurs when the kernel inside the test() function is executed. This code is only a small reproduction of the problem; I am working on a more complex application that is implemented in the same manner (using structures). The code is executed on Intel DevCloud.

I will be grateful for any advice.
Thanks :)

0 Kudos
16 Replies
GouthamK_Intel
Moderator
5,328 Views

Hi Kamil,

Thanks for reaching out to us!

Could you please attach the error logs or screenshots of the output after running the above code on DevCloud, and mention the name of the node you are running on, so that we can investigate your issue further?

Please provide the logs for both cases: running your code with cpu_selector and with gpu_selector.

 

Thanks

Goutham.

0 Kudos
H__Kamil
Beginner
5,328 Views

Output for cpu_selector:

./main.exe
Before
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
After
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Output for gpu_selector:

./main.exe
Before
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  OpenCL API failed. OpenCL API returns: -50 (CL_INVALID_ARG_VALUE) -50 (CL_INVALID_ARG_VALUE)
Aborted

Node name: s001-n160

Thanks for help!

0 Kudos
GouthamK_Intel
Moderator
5,328 Views

Hi Kamil,

We are able to reproduce the error you are facing.

Please find a workaround for your code below.

#include <CL/sycl.hpp>
#include <iostream>
#include <array>
#include <cstdio>
using namespace std;
using namespace cl::sycl;

#define SIZE 20

struct DeviceData
{
    queue q;
    
    int* A;
    int* d_A;

    void init(int* A,int* d_A,queue q)
    {
        q = queue(gpu_selector{});
        
//        A = (int* )malloc(sizeof(int)*SIZE);
        for(int i=0; i<SIZE; ++i)
            A[i] = i+1;

  //      d_A = (int *)malloc_device(sizeof(int)*SIZE, q.get_device(), q.get_context());
    }
     
    void test(int* A,int* d_A,queue q)
    {
        cout << "Before" << "\n";
        for(int i=0; i<SIZE; ++i)
        {
            cout << A[i] << " ";
        }
        cout << "\n";
   
        q.submit([&](handler &h){
            h.memcpy(d_A, A, sizeof(int)*SIZE);
        });
        q.wait();

        q.submit([&](handler &h){
            h.parallel_for(range<1>{SIZE}, [=](id<1> i){
                d_A[i] += 10;
            });
        });
        q.wait();
       
        q.submit([&](handler &h){
            h.memcpy(A, d_A, sizeof(int)*SIZE);
        });
        q.wait();
        
        cout << "After" << "\n";
        for(int i=0; i<SIZE; ++i)
            cout << A[i] << " ";
        cout << "\n";
    }
    void free(int* A,int* d_A,queue q)
    {
       std::free(A);
       cl::sycl::free(d_A, q.get_context());
    }
};


int main()
{
    DeviceData dev;
    dev.d_A=(int *)malloc_device(sizeof(int)*SIZE, dev.q.get_device(), dev.q.get_context());
    
    dev.A=(int* )malloc(sizeof(int)*SIZE);
    dev.init(dev.A,dev.d_A,dev.q);
    dev.test(dev.A,dev.d_A,dev.q); 
    dev.free(dev.A,dev.d_A,dev.q);
    return 0;
}

 

 

Please let us know if this resolves your issue. 

 

Regards

Goutham

0 Kudos
GouthamK_Intel
Moderator
5,328 Views

Hi Kamil,

Could you please let us know if your issue is resolved or if you are facing any issues with the code?

 

Thanks

Goutham

 

 

0 Kudos
H__Kamil
Beginner
5,328 Views

Hi,

I apologize for the delay (spring holiday). Yes, the code you provided works fine.

But I have a question: why does my code not work properly? I don't know if I am right, but it seems that the queue should be created within the same block as the kernel and buffers.

Best regards,
Kamil

0 Kudos
GouthamK_Intel
Moderator
5,328 Views

Hi Kamil,

We are investigating your issue further.

We are working with the concerned team and we will get back to you with more information. 

 

Regards

Goutham

0 Kudos
H__Kamil
Beginner
5,328 Views

Hi,

I am waiting for more information :) Thank you for helping me solve the problem :)

Best regards 

0 Kudos
Anoop_M_Intel
Moderator
5,329 Views

Hi Kamil,

In your program, d_A is a device pointer, but it is also a member of DeviceData. Inside a member function, members are accessed through the implicitly passed "this" pointer, so any reference to d_A there resolves to this->d_A. A [=] lambda written inside a member function therefore captures "this" rather than a copy of d_A. The caveat is that "this" points to the memory where the class object is located, and this object is created in CPU memory. Hence, when d_A is used inside a DPC++ kernel that runs on the GPU, the expression this->d_A is invalid, because "this" is not a valid pointer on the GPU side. The same code works with cpu_selector because "this" is a valid pointer when the kernel executes on the CPU. Below is a way to work around the issue: copy this->d_A (the device pointer) into a local variable in the lambda capture, so the kernel receives the pointer value itself. I have also switched to the newer, queue-level USM syntax, which reduces code verbosity:

#include <CL/sycl.hpp>
#include <iostream>
#include <array>
#include <cstdio>
using namespace std;
using namespace cl::sycl;

#define SIZE 20

struct DeviceData
{
    queue q;

    int* A;
    int* d_A;

    void init()
    {
        q = queue(gpu_selector{});

        A = (int* )malloc(sizeof(int)*SIZE);
        for(int i=0; i<SIZE; ++i)
            A[i] = i+1;

        d_A = (int *)malloc_device(sizeof(int)*SIZE, q.get_device(), q.get_context());
    }

    void test()
    {
        cout << "Before" << "\n";
        for(int i=0; i<SIZE; ++i)
        {
            cout << A[i] << " ";
        }
        cout << "\n";

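        // A default queue is out of order, so wait on each USM operation
        // before submitting the next one that depends on it.
        q.memcpy(d_A, A, sizeof(int)*SIZE).wait();
        q.parallel_for(range<1>{SIZE}, [=,d_A_local=this->d_A](id<1> i){
                d_A_local[i] += 10;
        }).wait();
        q.memcpy(A, d_A, sizeof(int)*SIZE).wait();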

        cout << "After" << "\n";
        for(int i=0; i<SIZE; ++i)
            cout << A[i] << " ";
        cout << "\n";
    }
    void free()
    {
            std::free(A);
            sycl::free(d_A, q.get_context());
    }
};

int main()
{
    DeviceData dev;
    dev.init();
    dev.test();
    dev.free();
    return 0;
}

 

0 Kudos
GouthamK_Intel
Moderator
5,328 Views

Hi Anoop,

Thanks for providing a detailed explanation.

 

Hi Kamil,

Please let us know if the information provided is helpful.

Confirm if your issue is resolved and let us know whether we can close this thread.

 

Thanks

Goutham

0 Kudos
GouthamK_Intel
Moderator
5,328 Views

Hi Kamil,

Please confirm if the explanation provided helped.

Let us know if we can close this thread. 

 

Regards

Goutham

0 Kudos
H__Kamil
Beginner
5,328 Views

Hi,

Yes, this advice resolved my problem completely.

I have another question, but I don't want to create a new topic. I have a problem with math functions like cos(), sin(), etc. on the iGPU side. When I try to use them within a kernel, I get the error: undefined reference to `cos()'. 

Best regards,
Kamil

0 Kudos
GouthamK_Intel
Moderator
5,328 Views

Hi Kamil,

We are glad to know that the solution provided resolved your issue. 

 

Sure, we will help you with your other issue. Please provide your code if you can, so that we can investigate.

 

 

Regards

Goutham

0 Kudos
H__Kamil
Beginner
5,328 Views

I modified the code presented in this topic to show the problem:

#include <CL/sycl.hpp>
#include <iostream>
#include <array>
#include <cstdio>
#include <cmath>
using std::cout;
using std::endl;
using namespace cl::sycl;

#define SIZE 20

struct DeviceData
{
    queue q;

    double* A;
    double* d_A;

    void init()
    {
        q = queue(gpu_selector{});

        A = (double* )malloc(sizeof(double)*SIZE);
        for(int i=0; i<SIZE; ++i)
            A[i] = i+1;

        d_A = (double *)malloc_device(sizeof(double)*SIZE, q.get_device(), q.get_context());
    }

    void test()
    {
        cout << "Before" << "\n";
        for(int i=0; i<SIZE; ++i)
        {
            cout << A[i] << " ";
        }
        cout << "\n";

        q.memcpy(d_A, A, sizeof(double)*SIZE);
        q.parallel_for(range<1>{SIZE}, [=,d_A_local=this->d_A](id<1> i){
                d_A_local[i] = cos(i+1);
        });
        q.memcpy(A, d_A, sizeof(double)*SIZE).wait();

        cout << "After" << "\n";
        for(int i=0; i<SIZE; ++i)
            cout << A[i] << " ";
        cout << "\n";
    }
    void free()
    {
            std::free(A);
            sycl::free(d_A, q.get_context());
    }
};

int main()
{
    DeviceData dev;
    dev.init();
    dev.test();
    dev.free();
    return 0;
}

Output:
 

./main.exe
Before
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
terminate called after throwing an instance of 'cl::sycl::compile_program_error'
  what():  The program was built for 1 devices
Build program log for 'Intel(R) Gen9 HD Graphics NEO':

error: undefined reference to `cos()'

error: backend compiler failed build.
 0 (CL_SUCCESS)
Makefile:43: recipe for target 'run' failed
make: *** [run] Aborted

 

0 Kudos
GouthamK_Intel
Moderator
5,328 Views

Hi Kamil,

Please find below the code that resolves the issue. As per the SYCL standard, the built-in math functions (sin(), cos(), etc.) take float arguments, and optionally double. The kernel now calls cl::sycl::cos() with the index cast to float:

#include <CL/sycl.hpp>
#include <iostream>
#include <array>
#include <cstdio>
#include <cmath>
using std::cout;
using std::endl;
using namespace cl::sycl;

#define SIZE 20

struct DeviceData
{
    queue q;

    double* A;
    double* d_A;

    void init()
    {
        q = queue(gpu_selector{});

        A = (double* )malloc(sizeof(double)*SIZE);
        for(int i=0; i<SIZE; ++i)
            A[i] = i+1;

        d_A = (double *)malloc_device(sizeof(double)*SIZE, q.get_device(), q.get_context());
    }

    void test()
    {
        cout << "Before" << "\n";
        for(int i=0; i<SIZE; ++i)
        {
            cout << A[i] << " ";
        }
        cout << "\n";

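        // A default queue is out of order, so wait on each USM operation
        // before submitting the next one that depends on it.
        q.memcpy(d_A, A, sizeof(double)*SIZE).wait();
        q.parallel_for(range<1>{SIZE}, [=,d_A_local=this->d_A](id<1> i){
                d_A_local[i] = cl::sycl::cos((float)i+1); // cast the index so cos() receives a float
        }).wait();
        q.memcpy(A, d_A, sizeof(double)*SIZE).wait();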

        cout << "After" << "\n";
        for(int i=0; i<SIZE; ++i)
            cout << A[i] << " ";
        cout << "\n";
    }
    void free()
    {
            std::free(A);
            sycl::free(d_A, q.get_context());
    }
};

int main()
{
    DeviceData dev;
    dev.init();
    dev.test();
    dev.free();
    return 0;
}

 

 

Please confirm if this resolves the issue.

 

 

Regards

Goutham

0 Kudos
H__Kamil
Beginner
5,328 Views

Hi,

Yes, it works fine. Thanks for the help :)

0 Kudos
GouthamK_Intel
Moderator
5,328 Views

Hi Kamil,

Glad to hear that your issue got resolved. 

We are closing this thread.

Feel free to start a new thread if you need any further support. 

 

Regards

Goutham

0 Kudos