Invalid work group size error: DPC++ code running on Intel Arria 10 (oneAPI) on DevCloud

amaltaha · ‎05-17-2022

Hello,
I am using DevCloud to run my DPC++ code on FPGA hardware for acceleration. I am using a node that runs Arria 10 oneAPI. I was able to run the fpga_emu build and the results were as expected. When I run on the FPGA hardware, it gives this error:

Caught a SYCL host exception:
Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
terminate called after throwing an instance of 'cl::sycl::nd_range_error'
what(): Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

I don't see any problem with the sizes of the work groups.

    range<1> num_items{dataset.size()};

    res.resize(dataset.size());
    buffer dataset_buf(linear_dataset);
    buffer curr_test_buf(curr_test);
    buffer res_buf(res.data(), num_items);

    std::cout << "submit a job" << std::endl;
    //auto start = std::chrono::high_resolution_clock::now();
    {
        q.submit([&](handler& h) {
            accessor a(dataset_buf, h, read_only);
            accessor b(curr_test_buf, h, read_only);

            accessor dif(res_buf, h, read_write, no_init);
            h.parallel_for_work_group(range<1>(32), range<1>(500), [=](group<1> g) {
                g.parallel_for_work_item([&](h_item<1> item) {
                    int i = item.get_global_id(0);
                    for (int j = 0; j < 5; ++j) {
                        dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                    }
                    // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
                });
            });
        }).wait();
    }

I previously used a plain parallel_for like this, but it took a very long time to run on the FPGA hardware and accelerated nothing, which is why I thought of using work groups:

    range<1> num_items{dataset.size()};
    std::vector<double> res;

    res.resize(dataset.size());
    buffer dataset_buf(linear_dataset);
    buffer curr_test_buf(curr_test);
    buffer res_buf(res.data(), num_items);

    std::cout << "submit a job" << std::endl;
    //auto start = std::chrono::high_resolution_clock::now();
    {
        q.submit([&](handler& h) {
            accessor a(dataset_buf, h, read_only);
            accessor b(curr_test_buf, h, read_only);

            accessor dif(res_buf, h, read_write, no_init);
            h.parallel_for(num_items, [=](auto i) {
                // dif[i] = a[i].size() * 1.0;// a[i];
                for (int j = 0; j < 5; ++j) {
                    dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                }
                // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
            });
        }).wait();
    }
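
For reference, the computation that both kernels perform (a squared Euclidean distance from one test point to every sample in a flattened dataset) can be sketched on the host in plain C++. This is only an illustrative sketch: the function name `squared_distances` is hypothetical, and the number of features is taken from the size of `curr_test` rather than hard-coded to 5. It accumulates into a local variable and writes each result once, so it does not depend on the initial contents of the output. That may matter here, because the kernels above use `+=` on a buffer accessed with `no_init`, which allows the runtime to discard the buffer's previous contents.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not from the original post): squared
// Euclidean distance from curr_test to every sample in a row-major,
// flattened dataset.
std::vector<double> squared_distances(const std::vector<double>& linear_dataset,
                                      const std::vector<double>& curr_test) {
    // Precondition: at least one feature, and the dataset holds whole samples.
    assert(!curr_test.empty() && linear_dataset.size() % curr_test.size() == 0);

    const std::size_t features = curr_test.size();  // 5 in the post
    const std::size_t samples = linear_dataset.size() / features;

    std::vector<double> res(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        double sum = 0.0;  // local accumulator, written to res[i] once
        for (std::size_t j = 0; j < features; ++j) {
            const double d = curr_test[j] - linear_dataset[i * features + j];
            sum += d * d;
        }
        res[i] = sum;
    }
    return res;
}
```

Comparing the device output against a host reference like this on a small input is one way to check that the emulator and hardware runs agree.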

Thanks a lot!

aikeu · ‎05-23-2022

Hi amaltaha,

Could you share the project you are trying to run with me by email?

I can try to run on my side and see.

Thanks.

Regards,

Aik Eu

amaltaha · ‎05-23-2022

Hello Aik Eu!

I wanted better speed, so I tried to split the 16,000 samples (each containing 5 double-precision features) into smaller chunks, but it didn't work.
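
The arithmetic of that split can be checked with a small plain C++ sketch. The `parallel_for_work_group(range<1>(32), range<1>(500), ...)` call earlier in the thread requests 32 work-groups of 500 work-items, and 32 × 500 = 16,000, so the split covers the data evenly. One possible cause of the error, not confirmed anywhere in this thread, is that the group size of 500 exceeds the maximum the device accepts for the kernel. The helper `check_split` below is hypothetical and not part of SYCL; on a real device the limit would be queried with `get_info<sycl::info::device::max_work_group_size>()` rather than passed in by hand.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a SYCL API): reports whether splitting
// total_items into num_groups work-groups of group_size work-items is
// exact and stays within a given device limit.
std::string check_split(std::size_t total_items, std::size_t num_groups,
                        std::size_t group_size, std::size_t max_group_size) {
    if (num_groups == 0 || group_size == 0) {
        return "empty range";
    }
    if (group_size > max_group_size) {
        return "group size exceeds device limit";
    }
    if (num_groups * group_size != total_items) {
        return "split does not cover the data exactly";
    }
    return "ok";
}
```

For example, `check_split(16000, 32, 500, 1024)` returns "ok", while the same split with a limit of 256 (a placeholder value, not a measured Arria 10 limit) returns "group size exceeds device limit". Under that placeholder limit, 64 groups of 250 would pass.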

Thank you!

aikeu · ‎05-24-2022

Hi amaltaha,

Do you mean the error is still there, or that the problem comes from how your design handles the data?

Thanks.

Regards,

Aik Eu

aikeu · ‎05-29-2022

Hi amaltaha,

I will close this thread if there are no further questions.

Thanks.

Regards,

Aik Eu