Cycle-accurate simulation of an OpenCL kernel

UMinh · 02-19-2019

I want to quantify the bottlenecks in OpenCL kernel execution, and to do that I would like to simulate the kernel. I am aware this question has been asked before, and that members have replied that Intel does not support this and that it is tricky to achieve because DRAM access times vary. However, is there a way to simulate only the kernel, without considering DRAM access time? Or is there any other way to separate computation time from memory access time (other than the dynamic profiler)?

HRZ · 02-20-2019

You can probably convert your kernel to a C/C++ module, compile it with Intel's HLS compiler, and then simulate it in ModelSim as described in Intel's HLS documentation. Note, however, that because of the extremely low external memory bandwidth of FPGAs and the simplicity of the memory controller, your performance bottleneck will pretty much always be the external memory accesses, which you cannot simulate accurately (if at all).

UMinh · 02-20-2019

Thank you for your suggestion. I could do that for single work-item kernels, but some of my kernels are NDRange. It is fine if the kernels are memory-limited, but I have multiple kernels and I want to quantify how severe the memory bottleneck is in each one relative to the others. If I had the on-chip processing time from simulation, I could subtract it from the total execution time on the actual board to measure what percentage of the time is spent in memory accesses.
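
To make the subtraction concrete, here is a minimal Python sketch of the arithmetic. The kernel names, cycle counts, clock frequency, and board timings are made-up placeholders, and the helper names `cycles_to_seconds` and `memory_share` are my own, not part of any Intel tool. It assumes the simulation reports a stall-free cycle count and that the kernel clock frequency is known.

```python
def cycles_to_seconds(cycles: int, kernel_clock_hz: float) -> float:
    """Convert a simulated cycle count to seconds at the given kernel clock."""
    if kernel_clock_hz <= 0:
        raise ValueError("kernel_clock_hz must be positive")
    return cycles / kernel_clock_hz


def memory_share(total_board_s: float, simulated_compute_s: float) -> float:
    """Fraction of the measured run time not explained by on-chip compute."""
    if total_board_s <= 0:
        raise ValueError("total_board_s must be positive")
    if simulated_compute_s < 0:
        raise ValueError("simulated_compute_s must be non-negative")
    # Clamp at zero: the simulated time can exceed the measurement if the
    # assumed clock frequency or the measurement itself is slightly off.
    stall_s = max(total_board_s - simulated_compute_s, 0.0)
    return stall_s / total_board_s


# Hypothetical inputs: (measured time on board in s, simulated cycles).
KERNEL_CLOCK_HZ = 250e6  # assumed kernel clock, take the real one from the report
kernels = {
    "kernel_a": (12.0e-3, 750_000),
    "kernel_b": (8.0e-3, 1_500_000),
}

shares = {
    name: memory_share(total_s, cycles_to_seconds(cycles, KERNEL_CLOCK_HZ))
    for name, (total_s, cycles) in kernels.items()
}

# Rank kernels from most to least memory-limited.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%} of run time attributed to memory")
```

The result is only an estimate: it attributes everything that the stall-free simulation does not account for to memory, including host and launch overhead if those are part of the measured time.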

Dr_FPGA · 03-05-2019

@HRZ Not true. You can accurately simulate the entire OpenCL FPGA design with DDRx chip models. How do you think Altera/Intel engineers got it working? But no one should take on such a large effort unless they have no other choice.

@UMinh The DDR4 controller latency is pretty much fixed, and you can find it from simulations (see below) or SignalTap. What you are really interested in is the latency and throughput from the OpenCL kernel global bus masters, through the Avalon global bus interface, to the DDR4 controller. This interface is in top.v under <platform>/hardware/<board>, for example a10_ref/hardware/a10gx. You can replace the board's DDRx interface with an Avalon Bus Functional Model. Then you have to start the OpenCL kernel through its CRA registers, using register values that you can obtain (e.g. from SignalTap or from API/PCIe driver/MMD/HAL debug-level logs) for each of the kernels that you have. The CRA registers are in <kernel>/kernel_hld/<kernel>/<kernel>_function_cra_slave.sv. Contact me on LinkedIn https://www.linkedin.com/in/drfpga/ if you need help to get this going quickly.
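
Once such a simulation runs, the latency and throughput on the Avalon interface can be computed from the logged transactions. The Python sketch below only illustrates that post-processing. The `Txn` record, its fields, and the sample values are hypothetical: they assume you have logged, for each transaction, the cycle it was issued, the cycle it completed, and the number of bytes moved. The BFM does not produce this format by itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Txn:
    """One logged Avalon transaction (hypothetical trace record)."""
    issue_cycle: int  # cycle the master issued the request
    done_cycle: int   # cycle the last data beat was accepted or returned
    num_bytes: int    # payload size of the transaction


def summarize(txns: List[Txn], clock_hz: float) -> Tuple[float, float]:
    """Return (average latency in cycles, throughput in bytes per second)."""
    if not txns:
        raise ValueError("trace is empty")
    latencies = [t.done_cycle - t.issue_cycle for t in txns]
    # Throughput is measured over the whole window from the first issue to
    # the last completion, so overlapping transactions are counted correctly.
    span_cycles = max(t.done_cycle for t in txns) - min(t.issue_cycle for t in txns)
    if span_cycles <= 0:
        raise ValueError("trace spans zero cycles")
    total_bytes = sum(t.num_bytes for t in txns)
    avg_latency = sum(latencies) / len(latencies)
    throughput = total_bytes * clock_hz / span_cycles
    return avg_latency, throughput


# Made-up trace: four pipelined 64-byte reads on an assumed 200 MHz bus clock.
trace = [Txn(0, 40, 64), Txn(1, 42, 64), Txn(2, 44, 64), Txn(3, 46, 64)]
avg_latency, throughput = summarize(trace, clock_hz=200e6)
print(f"average latency: {avg_latency:.1f} cycles")
print(f"throughput: {throughput / 1e9:.2f} GB/s")
```

Comparing these two numbers across kernels shows which ones keep the interface busy and which ones mostly wait on individual accesses.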