Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

invalid work group size error, dpc++ code running on Intel Arria 10 oneAPI on devcloud

amaltaha
New Contributor I
1,425 Views

Hello,
I am using DevCloud to run my DPC++ code on FPGA hardware for acceleration, on a node with an Arria 10 and oneAPI. The fpga_emu build runs and the results are as expected. When I run on the FPGA hardware, it gives this error:

Caught a SYCL host exception:
Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
terminate called after throwing an instance of 'cl::sycl::nd_range_error'
what(): Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

 

I don't see any problem with the sizes of the work groups. 

    range<1> num_items{dataset.size()};

    res.resize(dataset.size());
    buffer dataset_buf(linear_dataset);
    buffer curr_test_buf(curr_test);
    buffer res_buf(res.data(), num_items);

    std::cout << "submit a job" << std::endl;
    //auto start = std::chrono::high_resolution_clock::now();
    {
        q.submit([&](handler& h) {
            accessor a(dataset_buf, h, read_only);
            accessor b(curr_test_buf, h, read_only);

            accessor dif(res_buf, h, read_write, no_init);
            h.parallel_for_work_group(range<1>(32), range<1>(500), [=](group<1> g) {
                g.parallel_for_work_item([&](h_item<1> item) {
                    int i = item.get_global_id(0);
                    for (int j = 0; j < 5; ++j) {
                        dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                    }
                    // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
                });
            });
        }).wait();
    }
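
The launch above requests 32 work-groups of 500 work-items each, i.e. 16,000 work-items, so the groups are uniform by construction. One possible explanation, offered as an assumption and not a confirmed diagnosis, is that the FPGA kernel was built with a maximum work-group size smaller than 500, and the runtime reports the mismatch with the same -54 (CL_INVALID_WORK_GROUP_SIZE) code. The plain C++ sketch below shows the two checks a launch has to pass. `check_launch` and `largest_uniform_wg_size` are hypothetical helpers, not SYCL APIs, and the limit of 256 used in the usage note is an assumed value that should be replaced with whatever the device or kernel actually reports.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of SYCL): reports why a launch of
// `global_size` work-items in groups of `wg_size` might be rejected.
// `max_wg_size` is the limit reported by the device or kernel (assumed input).
std::string check_launch(std::size_t global_size, std::size_t wg_size,
                         std::size_t max_wg_size) {
    if (wg_size == 0)
        return "work-group size must be non-zero";
    if (global_size % wg_size != 0)
        return "non-uniform: global size is not a multiple of the work-group size";
    if (wg_size > max_wg_size)
        return "work-group size exceeds the reported limit";
    return "ok";
}

// Largest work-group size not above `max_wg_size` that divides
// `global_size` evenly; falls back to 1 if nothing larger fits.
std::size_t largest_uniform_wg_size(std::size_t global_size,
                                    std::size_t max_wg_size) {
    for (std::size_t s = max_wg_size; s > 1; --s) {
        if (global_size % s == 0)
            return s;
    }
    return 1;
}
```

With an assumed limit of 256, `check_launch(16000, 500, 256)` flags the work-group size, while `largest_uniform_wg_size(16000, 256)` suggests 250, which would correspond to 64 groups of 250 work-items.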

 

I previously used a plain parallel_for like this, but it took a very long time to run on the FPGA hardware and accelerated nothing, which is why I thought of using work-groups:

    range<1> num_items{dataset.size()};
    std::vector<double> res;

    res.resize(dataset.size());
    buffer dataset_buf(linear_dataset);
    buffer curr_test_buf(curr_test);
    buffer res_buf(res.data(), num_items);

    std::cout << "submit a job" << std::endl;
    //auto start = std::chrono::high_resolution_clock::now();
    {
        q.submit([&](handler& h) {
            accessor a(dataset_buf, h, read_only);
            accessor b(curr_test_buf, h, read_only);

            accessor dif(res_buf, h, read_write, no_init);
            h.parallel_for(num_items, [=](auto i) {
                //  dif[i] = a[i].size() * 1.0;// a[i];
                for (int j = 0; j < 5; ++j) {
                    dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                }
                // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
            });
        }).wait();
    }
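
Both kernels compute, for each sample, the squared Euclidean distance between its 5 features and `curr_test`. A host-side reference in plain C++ can help when comparing emulator and hardware results. The sketch below is an illustration and not the original poster's code: `squared_distances` is a hypothetical name, and it assumes `query` holds at least `features` values. It accumulates into a local variable and writes each result once, which also avoids relying on the initial contents of an output buffer opened with `no_init` (the kernels above use `+=` on such a buffer).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: squared Euclidean distance from each
// sample to `query`. `linear_dataset` stores samples back to back, with
// `features` values per sample; `query` must hold at least `features` values.
std::vector<double> squared_distances(const std::vector<double>& linear_dataset,
                                      const std::vector<double>& query,
                                      std::size_t features) {
    const std::size_t n = features ? linear_dataset.size() / features : 0;
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;  // accumulate locally, then write the result once
        for (std::size_t j = 0; j < features; ++j) {
            const double d = query[j] - linear_dataset[i * features + j];
            sum += d * d;
        }
        out[i] = sum;
    }
    return out;
}
```

The same local-accumulator pattern could be used inside the device kernel, ending with a single `dif[i] = sum;` store per work-item.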

 Thanks a lot!

0 Kudos
4 Replies
aikeu
Employee
1,375 Views

Hi amaltaha,


Could you share the project you are trying to run with me by email?

I can try to run it on my side and see.


Thanks.

Regards,

Aik Eu


amaltaha
New Contributor I
1,370 Views

Hello Aik Eu!

I wanted better speed, so I tried to split the 16,000 samples (each with 5 double-precision features) into smaller chunks, but it didn't work.
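
For reference, 16,000 samples of 5 doubles amount to 16,000 × 5 × 8 = 640,000 bytes. A minimal sketch of the chunking arithmetic, in plain C++, is shown below. `make_chunks` is a hypothetical helper introduced only for illustration; it returns (offset, count) pairs and lets the last chunk be smaller when the total is not a multiple of the chunk size.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split `total` samples into (offset, count) chunks
// of at most `chunk` samples each. Returns an empty list if `chunk` is 0.
std::vector<std::pair<std::size_t, std::size_t>> make_chunks(std::size_t total,
                                                             std::size_t chunk) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    if (chunk == 0)
        return out;
    for (std::size_t off = 0; off < total; off += chunk) {
        // The final chunk may be shorter than `chunk`.
        out.emplace_back(off, std::min(chunk, total - off));
    }
    return out;
}
```

Calling `make_chunks(16000, 500)` yields 32 equal chunks, matching the 32 × 500 split attempted in the first post.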

 

Thank you!

 

aikeu
Employee
1,366 Views

Hi amaltaha,


Do you mean the error is still there, or that it is caused by how your design handles the data?


Thanks.

Regards,

Aik Eu


aikeu
Employee
1,357 Views

Hi amaltaha,


I will close this thread if there are no further questions.


Thanks.

Regards,

Aik Eu

