Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- FPGAs and Programmable Solutions
- Intel® Quartus® Prime Software
- Achieve Low Latency in OpenCL implementation of a State Space Equation Solver

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Altera_Forum

Honored Contributor I

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

06-05-2018
09:23 AM

762 Views

Achieve Low Latency in OpenCL implementation of a State Space Equation Solver

Dear All,

I am exploring OpenCL as an alternative to HDL to implement a simple accelerator, to solve state space equations of an Induction Motor, on an FPGA PCIe board. My aim is to have a low latency implementation, with a very modest bandwidth, but able to provide the results with possibly <50us latency (including memory transfers to the host, a Xeon CPU). I implemented my first kernel, following the Intel Programming and Best Practices guide.```
typedef union{
float f;
float4 f4;
}float4_t;
typedef union{
float f;
float8 f8;
float4_t f4_t;
}float8_t;
typedef struct __attribute__((packed)){
t_float3 iabc;
t_float wr;
t_float Te;
}ker_out_t;
# define RANK 4
__attribute__((task))
__kernel void induction_machine(
__constant float4_t* restrict vabc,
__constant float* restrict theta,
__global ker_out_t* restrict ker_out,
__global float4_t* restrict AC,
__global float4_t* restrict BD
){
int tid = 0;
const float kpc = sqrt(2.0/3.0);
const float sqr2b2 = sqrt(2.0)/2.0;
const float pi2b3 = 2.0*pi/3.0;
const float sqrt3b2 = sqrt(3.0)/2.0;
const float f1b2 = 0.5;
float4_t KP;
float4_t iKP;
float4_t iabc={};
float Te;
float wr;
float4_t vdqo = {};
float4_t x;
float4_t u = {};
float4_t y = {};
float4_t xn = {};
float8_t ACx={};
float8_t BDu={};
float costh = cos(theta);
float sinth = sin(theta);
float costh_p_pi2b3 = -sqrt3b2*sinth -f1b2*costh; //cos(theta + pi2b3);
float costh_m_pi2b3 = sqrt3b2*sinth -f1b2*costh;//cos(theta - pi2b3);
float sinth_p_pi2b3 = sqrt3b2*costh -f1b2*sinth; //sin(theta + pi2b3);
float sinth_m_pi2b3 = -sqrt3b2*costh -f1b2*sinth; //sin(theta - pi2b3);
// park and inverse park coefficients
KP.f4 =(float4) (kpc*costh,kpc*costh_m_pi2b3,kpc*costh_p_pi2b3,0);
KP.f4 =(float4) (kpc*(-sinth),kpc*(-sinth_m_pi2b3),kpc*(-sinth_p_pi2b3),0);
KP.f4 = (float4) (kpc*sqr2b2,kpc*sqr2b2,kpc*sqr2b2,0);
iKP.f4 =(float4) (kpc*costh,-kpc*sinth,kpc*sqr2b2,0);
iKP.f4 =(float4) (kpc*costh_m_pi2b3,kpc*(-sinth_m_pi2b3),kpc*sqr2b2,0);
iKP.f4 =(float4) (kpc*costh_p_pi2b3,kpc*(-sinth_p_pi2b3),kpc*sqr2b2,0);
// park transform
for(int i=0;i<RANK;i++){
# pragma unroll
for(int j=0;j<RANK;j++)
vdqo.f+=KP->f*vabc->f;
}
u.f4.s01 = vdqo.f4.s01; //state space input
BDu.f8 = (float8) (0,0,0,0,0,0,0,0);
ACx.f8 = (float8) (0,0,0,0,0,0,0,0);
//state solver# pragma unroll
for(int i=0; i<2*RANK;i++){# pragma unroll
for(int j=0; j<RANK;j++){
ACx.f += AC.f*x.f;
BDu.f += BD.f*u.f;
}
}
# pragma unroll
for(int i=0;i<RANK;i++){
xn.f = x.f + h*(ACx.f + BDu.f);
y.f = ACx.f + BDu.f;
}
//torque and speed output
ker_out->Te = (3.0/2.0)*(P/2.0)*(x.f*y.f - x.f*y.f);
ker_out->wr = ker_out->wr + (P/(2*J))*(ker_out->Te - Td)*h;
//system update
AC.f = ker_out->wr-w;
AC.f = w-ker_out->wr;
//output currents inverse park transform
for(int i=0;i<RANK;i++){
# pragma unroll
for(int j=0;j<RANK;j++)
iabc.f+=iKP.f*y.f;
}
ker_out->iabc = iabc.f4.s012;
//state update
x = xn;
}
```

Profiling this code on an Arria10GX I noticed that it roughly takes 70us to execute. Do you think this result is reasonable? Is there a way to reduce this figure? Running the kernel several times inside a for loop in the host, executing using EnqueueTask, I also noticed in the output of aocl report that a lot of time is spent in between executions. Does that part represent the memory transfer ? A screenshot of the profiling timeline is in the attachments Any suggestion is appreciated. Thank you in advance, Peter https://alteraforum.com/forum/attachment.php?attachmentid=15477&stc=1
Link Copied

3 Replies

Altera_Forum

Honored Contributor I

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

06-05-2018
12:09 PM

27 Views

To be honest OpenCL is not suitable for low-latency design since the compiler is actually designed to sacrifice latency and maximize throughput by using a very deep pipeline that efficiently absorbs stalls from external memory.

I haven't used the profiler in a very long time but from what I remember, the profiler should also show host to device transfers alongside with the kernel run time; this could be the reason for the gaps. Memory transfers between the FPGA and its external memory are included in the kernel run time. ...
Altera_Forum

Honored Contributor I

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

06-05-2018
12:37 PM

27 Views

Thank you for the quick reply,

Would you suggest any other development flow apart from OpenCL and HDL to tackle this problem. The main issue with HDL in my case is the long development cycle, which is not suitable to implement algorithms that are frequently changed and refined in my case. Thanks again
Altera_Forum

Honored Contributor I

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

06-05-2018
02:27 PM

27 Views

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.