Error executing basic test code for FPGA (oneAPI/OpenCL)

RN1 · 03-28-2022

Greetings,

We are running some experiments and have found that oneAPI for FPGA performs well below our previous pure OpenCL implementation (same system, same FPGA, etc.). We tested several codes, and even for a simple matrix multiplication the OpenCL code/runtime performs much better than the pure oneAPI version. So we decided to try OpenCL interoperability with oneAPI (as in this guide https://www.intel.com/content/www/us/en/developer/articles/technical/interoperability-dpcpp-sycl-opencl.html), to see whether DPC++ can avoid the overhead or wrong optimization steps of oneAPI by using something closer to the OpenCL implementation and its runtime optimizations. We don't know what oneAPI adds during the compilation/execution steps, but at the moment using the FPGA is not worthwhile.

The FPGA emulator works with the interoperability code, but the real FPGA does not. Here are our steps:

Error

user@node:~/interoperability$ ./ingest_fpga
Device: s10gx : Stratix 10 Reference Platform (acls10_ref0)
terminate called after throwing an instance of 'cl::sycl::feature_not_supported'
  what():  Online compilation is not supported by this device -3 (CL_COMPILER_NOT_AVAILABLE)
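
For readers decoding the message: the -3 is the raw OpenCL status code, and CL_COMPILER_NOT_AVAILABLE indicates that the device cannot build a program from source at run time. A minimal sketch of a lookup helper for a few build-related codes follows. The function name cl_error_name is hypothetical and not part of any SYCL or OpenCL API; the numeric values are the ones defined in the standard OpenCL headers.

```cpp
#include <string>

// Hypothetical helper (not part of OpenCL or SYCL): maps a few OpenCL
// status codes related to program builds to their symbolic names.
// The numeric values match the definitions in CL/cl.h.
std::string cl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -2:  return "CL_DEVICE_NOT_AVAILABLE";
        case -3:  return "CL_COMPILER_NOT_AVAILABLE";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -42: return "CL_INVALID_BINARY";
        case -43: return "CL_INVALID_BUILD_OPTIONS";
        case -44: return "CL_INVALID_PROGRAM";
        case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
        case -46: return "CL_INVALID_KERNEL_NAME";
        default:  return "UNKNOWN_OR_UNLISTED_CL_ERROR";
    }
}
```

Calling cl_error_name(-3) returns the same symbolic name that appears in the exception text above.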

Code

#include <CL/sycl.hpp>
#include <iostream>
#include <array>

#include <CL/sycl/INTEL/fpga_extensions.hpp>
#include "dpc_common.hpp"

using namespace cl::sycl;

int main()
{
        const size_t szKernelData = 32;
        std::array<float, szKernelData> kernelData;
        kernelData.fill(-99.f);
        range<1> r(szKernelData);
        sycl::INTEL::fpga_selector _device;
        queue q{_device};
        //queue q{fpga_selector()};
        std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";
        program p(q.get_context());

        // Compile the OpenCL C source at run time (online compilation).
        p.build_with_source(
          R"CLC(
            kernel void sinf_test(global float* data) {
              data[get_global_id(0)] = sin(get_global_id(0)*2*M_PI_F/get_global_size(0)) ;
            }
          )CLC", "-cl-std=CL1.2");
        {
                buffer<float, 1> b(kernelData.data(), r);

                q.submit([&](handler& cgh) {
                        auto b_accessor = b.get_access<access::mode::read_write>(cgh);
                        cgh.set_args(b_accessor);
                        cgh.parallel_for(r, p.get_kernel("sinf_test"));
                });
        }
        for(auto& elem : kernelData)
                std::cout << std::defaultfloat << elem << " " << std::hexfloat << elem << std::endl;
        return 0;
}
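
To check the values printed by a run (for example on the emulator), a host-side reference can help. The sketch below is plain C++ with no SYCL dependency. The names sinf_reference and max_abs_diff are hypothetical, and the constant stands in for OpenCL's M_PI_F. It mirrors the kernel expression sin(gid*2*M_PI_F/global_size), so device results should agree only up to floating-point tolerance, not bit for bit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host reference for the sinf_test kernel:
// out[i] = sin(i * 2 * pi / n), evaluated in single precision.
std::vector<float> sinf_reference(std::size_t n) {
    assert(n > 0);                       // avoid division by zero
    const float pi_f = 3.14159274f;      // stand-in for OpenCL's M_PI_F
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Same operation order as the kernel: (i*2) * pi / n
        out[i] = std::sin(static_cast<float>(i * 2) * pi_f / static_cast<float>(n));
    }
    return out;
}

// Largest absolute element-wise difference between two equally sized arrays.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

With szKernelData = 32, comparing kernelData against sinf_reference(32) via max_abs_diff and a small tolerance would flag a run where the buffer still holds the -99 fill value.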

Compilation

dpcpp -fintelfpga -Xshardware -fsycl-unnamed-lambda ingest.cpp -std=c++17 -o ingest_fpga
# Output
WARNING: No kernels specified.

Any idea?

BoonBengT_Intel · 03-31-2022

Hi @RN1,

Thank you for posting in the Intel community forum and for your interest in oneAPI. I hope all is well.

While we look into the interoperability error, would it be possible to share the test codes you mentioned, for both pure OpenCL and oneAPI, so that we can check further?

I have dropped a note in your mailbox, in case you would prefer a more secure way to share the code.

Hope to hear from you soon.

Best Wishes

BB

BoonBengT_Intel · 04-06-2022

Hi @RN1,

Good day, just following up on the previous clarification.

By any chance, did you manage to look into it?

Best Wishes

BB

RN1 · 04-10-2022

Good day.

I answered in the other thread.

We ran more tests with no change: the performance is still poor, and the results are the same as those attached in the other post.

We compared "System viewers->Graph viewer (beta)" in OpenCL with "Views->System viewer" in oneAPI. OpenCL coalesces the local accesses, while oneAPI does not; it uses many small units instead. We assume something is also wrong with the DSPs, but it is hard to tell because the generated graph is huge. Maybe it is not reducing/trimming the tree. We did nothing different from the OpenCL version, so we need support on how to optimize these local accesses to achieve performance at least similar to OpenCL (nothing about this is explained in the optimization guides).

Coming back to this thread: do you know why OpenCL-oneAPI interoperability does not work with the FPGA?

BoonBengT_Intel · 04-13-2022

Hi @RN1,

Noted on the steps taken for the performance clarification. This thread also seems to be related to the error when executing in DevCloud, which has been resolved and was due to node issues. We will therefore close this thread and follow up on the performance clarification in a separate thread, linked below:

https://community.intel.com/t5/forums/forumtopicpage/board-id/high-level-design/message-id/2395#M2395

Pleasure having you here.

Best Wishes

BB