Re: Error running oneAPI FPGA emulator

rlb1116 · ‎02-23-2021

Hello,

I converted a CUDA code to DPC++ with the DPCT tool, and I am trying to run it on an FPGA on the DevCloud. I am first testing functionality with the FPGA emulator, but I get the invalid binary error shown here:

u40772@s001-n088:~/cmt-fpga/pca$ dpcpp -fintelfpga CMT-bone-pca.dp.cpp -DFPGA_EMULATOR=1 -o cmt.out
u40772@s001-n088:~/cmt-fpga/pca$ ./cmt.out
TBB Warning: The number of workers is currently limited to 23. The request for 31 workers is ignored. Further requests for more workers will be silently ignored until the limit changes.

HOST MESSAGE : Memory Allocation took, 0.00456611 seconds
Max work group size: 4100
Native API failed. Native API returns: -42 (CL_INVALID_BINARY) -42 (CL_INVALID_BINARY)Exception caught at file:CMT-bone-pca.dp.cpp, line:650
u40772@s001-n088:~/cmt-fpga/pca$

As shown, the FPGA compile for emulator completes without error or warning, but execution of the output gives an invalid binary error that was caught in the code block that should call the accelerator device.

The only thing I could find with this specific error is here: https://community.intel.com/t5/Intel-High-Level-Design/CL-INVALID-BINARY-when-running-fast-recompile-example-from/td-p/1224496 which suggests that it is related to the hardware target for compilation not matching up with available resources. However, since I am targeting the FPGA emulator, I would think just the CPU device would be necessary, although I am trying this on an Arria 10 node, so that should be available too.

Any suggestions? Thanks

GouthamK_Intel · ‎02-23-2021

Hi Ryan,

Thanks for reaching out to us!

Could you please share the source code (both the CUDA code and the DPCT-migrated code) if possible?

Regards

Goutham

rlb1116 · ‎02-24-2021

Hi Goutham,

I have attached the original CUDA as well as the converted DPC++ code. (sorry for the messiness)

I am wondering if the dpct::get_current_device() is finding the Arria 10 FPGA rather than the FPGA emulator during compilation. However, I haven't been able to find great documentation on the priority of this function. The compilation actually failed when I tried it off the FPGA node, so maybe this makes sense, but I would think the compilation would take significantly longer if it was targeting a physical FPGA.

Please let me know if you have any suggestions, thanks!

GouthamK_Intel · ‎03-02-2021

Hi Ryan,

We tried opening the attachment you provided, but the folder is empty.

Kindly attach the code again.

Thanks & Regards

Goutham

rlb1116 · ‎03-02-2021

Oops... guess I forgot the -r. I have reattached the files.

In general though, this appears to be some kind of error with the requested vs available devices. Is there any documentation on the oneAPI get_current_device() function? I have not been able to find much that goes into its selection priority.

Thanks

GouthamK_Intel · ‎03-04-2021

Hi Ryan,

Thanks for the reproducer. We are working on your issue.

>>Is there any documentation on the oneAPI get_current_device() function?

Please refer to the link below for the DPCT documentation.

https://software.intel.com/content/www/us/en/develop/documentation/intel-dpcpp-compatibility-tool-user-guide/top/dpct-namespace-usage-guide.html

Have a Good day!

Thanks & Regards

Goutham

rlb1116 · ‎03-04-2021

Yes, I have seen that guide already. Unfortunately, it is not very informative.

I see that by default, the converted code uses the dpct::get_current_device() function in order to select its target, but there is no explanation as to how that function prioritizes its choice when there are potentially multiple different targets (e.g., CPU, FPGA, FPGA emulator). I also see that the dev_mgr can change the current device using select_device(), but there is no explanation on how to actually use this function.

Further documentation on how to properly select FPGA and FPGA emulator target devices with DPC++ would be much appreciated. Or, if this invalid binary error has nothing to do with device selection, that would be good to know. Thanks!

GouthamK_Intel · ‎03-10-2021

Hi Ryan,

We are working on your issue and will get back to you soon.

Regards

Goutham

rlb1116 · ‎03-11-2021

Thanks, Goutham. Could you provide any insight as to what the potential cause of this error might be? Am I on the right track in thinking it has to do with the OneAPI device_selector? Any information would be helpful.

By the way, I am attaching a slightly updated version of the previous code. If you were able to get past the invalid binary error there would likely be a seg fault, which is now fixed.

cw_intel · ‎03-26-2021

Hi Ryan,

I found that the previous solution was wrong. In the code, the macro ‘USE_GPU’ was set to 0, so the code ran serially and the kernel function was never used. I have passed your issue to an FPGA expert and am waiting for feedback. If I get any feedback, I will let you know.

rlb1116 · ‎03-26-2021

Thanks for the response. I too was initially fooled by the previous solution, since the FPGA emulator ran while the USE_GPU (accelerator in this case) flag was 0. That solution (which I no longer see here) did seem to solve the device issue, which is a step in the right direction as far as porting other CUDA codes to FPGAs with oneAPI.

However, now I am seeing this error from the FPGA emulator (with the accelerator flag set to 1):

"OpenCL API failed. OpenCL API returns: -59 (CL_INVALID_OPERATION) -59 (CL_INVALID_OPERATION)Exception caught at file:test_fpga.cpp, line:677"

Searching for this error, I found this forum post (https://community.intel.com/t5/GPU-Compute-Software/Can-I-mix-openCL-and-level-0-Native-API-returns-59-CL-INVALID/m-p/1255339#U1255339), which suggests it might be a memory issue between the kernel and host.

Does that appear to be on the right track in this case? Any suggestions? Thanks
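For reference when reading the two error messages quoted so far, here is a minimal sketch that maps the numeric status codes to their OpenCL names. The function name cl_status_name is hypothetical and is not part of OpenCL, SYCL, or dpct; it covers only the codes that appear in this thread, with values as defined in the Khronos CL/cl.h header.

```cpp
#include <string_view>

// Hypothetical helper: names for the OpenCL status codes quoted in this thread.
// The numeric values match the definitions in the Khronos CL/cl.h header.
constexpr std::string_view cl_status_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -42: return "CL_INVALID_BINARY";     // first error in the thread
        case -59: return "CL_INVALID_OPERATION";  // error seen after the device fix
        default:  return "unlisted status code";  // anything not covered here
    }
}
```

Calling cl_status_name(-42) gives the name printed in the first log, and cl_status_name(-59) gives the name in the second one. Neither name identifies the root cause by itself; it only tells you which OpenCL call category rejected the request.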

cw_intel · ‎03-29-2021

Hi Ryan,

We found that this code runs successfully on the GPU but fails on the CPU and the FPGA emulator, so we need to investigate further. I will let you know when we find the root cause.

Regards,

Chen

Viet_H_Intel · ‎04-09-2021

We had an FPGA expert look at this issue. The -DFPGA_EMULATOR flag is meant to choose a device selector, but d_selector isn’t actually used anywhere in the code. Instead, the main function gets its device and queue from a dpct class.

Changing the device selector and queue initialization to something more standard made it work. Attached is the completed source file.

Relevant snippets:

#if FPGA_EMULATOR
  INTEL::fpga_emulator_selector d_selector;  // defined here, but never passed to a queue
#else
  default_selector d_selector;
#endif

#include <dpct/dpct.hpp>

int main(int argc, char *argv[]) try {
  // The device and queue come from dpct, so d_selector is bypassed.
  dpct::device_ext &dev_ct1 = dpct::get_current_device();
  sycl::queue &q_ct1 = dev_ct1.default_queue();
  ...
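The attached file is not reproduced in this thread, so the following is only a toolchain-free model of the pattern described above, not the actual fix. Selector, Queue, make_selector, and make_queue are hypothetical stand-ins written in plain C++ so the sketch builds without oneAPI; they are not SYCL or dpct types. In real SYCL code the equivalent step would be constructing a sycl::queue from the selector object.

```cpp
#include <string>

// Hypothetical stand-ins, NOT SYCL or dpct types.
struct Selector {
    std::string target;  // which device this selector asks for
};

struct Queue {
    std::string device;  // which device the queue ended up on
    explicit Queue(const Selector& s) : device(s.target) {}
};

// Mirrors the -DFPGA_EMULATOR=1 build flag from the first post.
// Defaulted to 1 here only so the sketch compiles on its own.
#ifndef FPGA_EMULATOR
#define FPGA_EMULATOR 1
#endif

// The macro picks the selector at compile time.
inline Selector make_selector() {
#if FPGA_EMULATOR
    return Selector{"fpga_emulator"};
#else
    return Selector{"default"};
#endif
}

// The key step: the queue is built from the selector, so the
// macro's choice actually determines where kernels are submitted.
inline Queue make_queue() {
    return Queue(make_selector());
}
```

With FPGA_EMULATOR set to 1, make_queue().device is "fpga_emulator". The original code defined the selector but obtained its queue elsewhere, which is the situation this model avoids.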

Viet_H_Intel · ‎04-09-2021

CMT-bone-pca.dp.cpp

cw_intel · ‎08-08-2021

Hi,

I migrated the CUDA code and made some modifications. It now runs successfully on the FPGA emulator, CPU, and GPU.

To run on the FPGA emulator:

$ dpcpp CMT-bone-pca_workarounds.dp.cpp

$ export SYCL_DEVICE_TYPE=ACC

$ SYCL_PI_TRACE=1 ./a.out

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so
SYCL_PI_TRACE[all]: Selected device ->
SYCL_PI_TRACE[all]: platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)
SYCL_PI_TRACE[all]: device: Intel(R) FPGA Emulation Device
HOST MESSAGE : Memory Allocation took, 0.00134858 seconds
CUDA kernel avg duration: 0.00101494 seconds
CUDA kernel total duration: 0.97434193 seconds
Total kernel iterations: 960
Total time for grid dim 4 and element dim 5 : 0.980248
Cleanup: 0.00037760 seconds
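The run above relies on the SYCL_DEVICE_TYPE environment variable, and the trace shows that ACC led the runtime to pick the FPGA emulation device. As a rough aid for checking what is set before launching, here is a small sketch in standard C++. The function names are hypothetical, and the list of values (CPU, GPU, ACC, HOST) is an assumption based on the DPC++ runtime documentation from the time of this thread; newer toolchains may use a different variable.

```cpp
#include <cstdlib>
#include <string_view>

// Hypothetical helper: describe a SYCL_DEVICE_TYPE value.
// The accepted values listed here are an assumption, see the note above.
constexpr std::string_view describe_sycl_device_type(std::string_view value) {
    if (value == "CPU")  return "CPU device";
    if (value == "GPU")  return "GPU device";
    if (value == "ACC")  return "accelerator device (e.g. FPGA or FPGA emulator)";
    if (value == "HOST") return "host device";
    if (value.empty())   return "not set, default selection applies";
    return "unrecognized value";
}

// Read the variable from the current environment and describe it.
inline std::string_view current_sycl_device_type() {
    const char* raw = std::getenv("SYCL_DEVICE_TYPE");
    return describe_sycl_device_type(raw ? raw : "");
}
```

Printing current_sycl_device_type() at the start of main is one way to confirm that the export took effect in the shell that launches the binary. The SYCL_PI_TRACE output shown above remains the authoritative record of which device was selected.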

cw_intel · ‎08-25-2021

Hi Ryan,

Did the solution provided help you fix the issue? Please let us know if this is still an issue.

Thanks.

cw_intel · ‎09-23-2021

We haven't heard back from you for a long time, so we are assuming that the details provided helped you solve your problem. We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread.