Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
Intel Support hours are Monday-Fridays, 8am-5pm PST, except Holidays. Thanks to our community members who provide support during our down time or before we get to your questions. We appreciate you!

Need Forum Guidance? Click here
Search our FPGA Knowledge Articles here.
15330 Discussions

Achieve Low Latency in OpenCL implementation of a State Space Equation Solver

Honored Contributor II

Dear All,  


I am exploring OpenCL as an alternative to HDL to implement a simple accelerator, to solve state space equations of an Induction Motor, on an FPGA PCIe board. 

My aim is to have a low latency implementation, with a very modest bandwidth, but able to provide the results with possibly <50us latency (including memory transfers to the host, a Xeon CPU). 


I implemented my first kernel, following the Intel Programming and Best Practices guide.  


typedef union{ float f; float4 f4; }float4_t; typedef union{ float f; float8 f8; float4_t f4_t; }float8_t; typedef struct __attribute__((packed)){ t_float3 iabc; t_float wr; t_float Te; }ker_out_t; # define RANK 4 __attribute__((task)) __kernel void induction_machine( __constant float4_t* restrict vabc, __constant float* restrict theta, __global ker_out_t* restrict ker_out, __global float4_t* restrict AC, __global float4_t* restrict BD ){ int tid = 0; const float kpc = sqrt(2.0/3.0); const float sqr2b2 = sqrt(2.0)/2.0; const float pi2b3 = 2.0*pi/3.0; const float sqrt3b2 = sqrt(3.0)/2.0; const float f1b2 = 0.5; float4_t KP; float4_t iKP; float4_t iabc={}; float Te; float wr; float4_t vdqo = {}; float4_t x; float4_t u = {}; float4_t y = {}; float4_t xn = {}; float8_t ACx={}; float8_t BDu={}; float costh = cos(theta); float sinth = sin(theta); float costh_p_pi2b3 = -sqrt3b2*sinth -f1b2*costh; //cos(theta + pi2b3); float costh_m_pi2b3 = sqrt3b2*sinth -f1b2*costh;//cos(theta - pi2b3); float sinth_p_pi2b3 = sqrt3b2*costh -f1b2*sinth; //sin(theta + pi2b3); float sinth_m_pi2b3 = -sqrt3b2*costh -f1b2*sinth; //sin(theta - pi2b3); // park and inverse park coefficients KP.f4 =(float4) (kpc*costh,kpc*costh_m_pi2b3,kpc*costh_p_pi2b3,0); KP.f4 =(float4) (kpc*(-sinth),kpc*(-sinth_m_pi2b3),kpc*(-sinth_p_pi2b3),0); KP.f4 = (float4) (kpc*sqr2b2,kpc*sqr2b2,kpc*sqr2b2,0); iKP.f4 =(float4) (kpc*costh,-kpc*sinth,kpc*sqr2b2,0); iKP.f4 =(float4) (kpc*costh_m_pi2b3,kpc*(-sinth_m_pi2b3),kpc*sqr2b2,0); iKP.f4 =(float4) (kpc*costh_p_pi2b3,kpc*(-sinth_p_pi2b3),kpc*sqr2b2,0); // park transform for(int i=0;i<RANK;i++){ # pragma unroll for(int j=0;j<RANK;j++) vdqo.f+=KP->f*vabc->f; } u.f4.s01 = vdqo.f4.s01; //state space input BDu.f8 = (float8) (0,0,0,0,0,0,0,0); ACx.f8 = (float8) (0,0,0,0,0,0,0,0); //state solver# pragma unroll for(int i=0; i<2*RANK;i++){# pragma unroll for(int j=0; j<RANK;j++){ ACx.f += AC.f*x.f; BDu.f += BD.f*u.f; } } # pragma unroll for(int i=0;i<RANK;i++){ xn.f = x.f + h*(ACx.f + BDu.f); y.f = ACx.f + BDu.f; } //torque and speed output ker_out->Te = (3.0/2.0)*(P/2.0)*(x.f*y.f - x.f*y.f); ker_out->wr = ker_out->wr + (P/(2*J))*(ker_out->Te - Td)*h; //system update AC.f = ker_out->wr-w; AC.f = w-ker_out->wr; //output currents inverse park transform for(int i=0;i<RANK;i++){ # pragma unroll for(int j=0;j<RANK;j++) iabc.f+=iKP.f*y.f; } ker_out->iabc = iabc.f4.s012; //state update x = xn; }  


Profiling this code on an Arria10GX I noticed that it roughly takes 70us to execute.  


Do you think this result is reasonable? Is there a way to reduce this figure? 

Running the kernel several times inside a for loop in the host, executing using EnqueueTask, I also noticed in the output of aocl report that a lot of time is spent in between executions. Does that part represent the memory transfer ? 

A screenshot of the profiling timeline is in the attachments 


Any suggestion is appreciated.  


Thank you in advance,  


0 Kudos
3 Replies
Honored Contributor II

To be honest OpenCL is not suitable for low-latency design since the compiler is actually designed to sacrifice latency and maximize throughput by using a very deep pipeline that efficiently absorbs stalls from external memory. 


I haven't used the profiler in a very long time but from what I remember, the profiler should also show host to device transfers alongside with the kernel run time; this could be the reason for the gaps. Memory transfers between the FPGA and its external memory are included in the kernel run time. 


Honored Contributor II

Thank you for the quick reply,  


Would you suggest any other development flow apart from OpenCL and HDL to tackle this problem. The main issue with HDL in my case is the long development cycle, which is not suitable to implement algorithms that are frequently changed and refined in my case. 


Thanks again
Honored Contributor II

Even though I haven't used them myself, many are using System C or System Verilog these days as a higher-productivity alternative to VHDL and Verilog while retaining the low-level control over the pipeline latency.