Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

Invalid work-group size error: DPC++ code running on Intel Arria 10 with oneAPI on DevCloud

amaltaha
New Contributor I
318 Views

Hello,
I am using DevCloud to run my DPC++ code on FPGA hardware for acceleration, on a node that runs oneAPI with an Arria 10. I was able to run the fpga_emu binary and the results were as expected. When I target FPGA hardware, it gives this error:

Caught a SYCL host exception:
Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
terminate called after throwing an instance of 'cl::sycl::nd_range_error'
what(): Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

 

I don't see any problem with the work-group sizes.

range<1> num_items{dataset.size()};
res.resize(dataset.size());

buffer dataset_buf(linear_dataset);
buffer curr_test_buf(curr_test);
buffer res_buf(res.data(), num_items);

std::cout << "submit a job" << std::endl;
{
    q.submit([&](handler& h) {
        accessor a(dataset_buf, h, read_only);
        accessor b(curr_test_buf, h, read_only);
        accessor dif(res_buf, h, read_write, no_init);

        h.parallel_for_work_group(range<1>(32), range<1>(500), [=](group<1> g) {
            g.parallel_for_work_item([&](h_item<1> item) {
                int i = item.get_global_id(0);
                for (int j = 0; j < 5; ++j) {
                    dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                }
            });
        });
    }).wait();
}

 

I previously used a plain parallel_for like this, but it took a huge amount of time to run on FPGA hardware and actually accelerated nothing, which is why I thought of work-groups:

range<1> num_items{dataset.size()};
std::vector<double> res;
res.resize(dataset.size());

buffer dataset_buf(linear_dataset);
buffer curr_test_buf(curr_test);
buffer res_buf(res.data(), num_items);

std::cout << "submit a job" << std::endl;
{
    q.submit([&](handler& h) {
        accessor a(dataset_buf, h, read_only);
        accessor b(curr_test_buf, h, read_only);
        accessor dif(res_buf, h, read_write, no_init);

        h.parallel_for(num_items, [=](auto i) {
            for (int j = 0; j < 5; ++j) {
                dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
            }
        });
    }).wait();
}

 Thanks a lot!

4 Replies
aikeu
Employee
268 Views

Hi amaltaha,


Can you share the project you are trying to run with me through email?

I can try to run it on my side and see.


Thanks.

Regards,

Aik Eu


amaltaha
New Contributor I
263 Views

Hello Aik Eu!

I wanted speed efficiency, so I tried to split the 16,000 samples (each containing 5 double-precision features) into smaller chunks, but it didn't work.

 

Thank you!

 

aikeu
Employee
259 Views

Hi amaltaha,


Do you mean the error is still there, or is it due to how your design handles the data?


Thanks.

Regards,

Aik Eu


aikeu
Employee
250 Views

Hi amaltaha,


I will close this thread if there are no further questions.


Thanks.

Regards,

Aik Eu

