
Optimize Memory Accesses FPGA

Altera_Forum

Hi everyone. I am developing code that evaluates floating-point values from some input data. In my current approach, each work-item handles the evaluation for one cell of an int vector.

This int vector contains the indices used to access other data. That data has to be read from global memory and consists mainly of structures. The structures are used to compute the floating-point measure I need, which is then saved to a vector of floats, one entry per work-item ID. I'll provide a snippet of code. The problem is that this code scales very well on my GPU but not on the FPGA: as I double the number of work-items, the run time increases badly.

I am confused about how I should optimize the memory accesses. I can't find much online, and I keep struggling with this because the problem seems to appear in every kernel I write. Thanks to anyone who can clarify or explain the proper approach to optimizing the memory accesses.

 

int tid = get_global_id(0); // WORK-ITEM FOR EACH SOLUTION
int groupIndex = tid * b_sizes;
int indexWU = tid * b_sizes;
for (int i = 0; i < b_sizes; i++) {
    float totalC = 0;
    int rIndex = b_solutions;
    if (rIndex != -1) {
        struct SO o = b_o;
        struct SR r = b_r;
        float freC = 0;
        if (r.sL != CRF) {
            float tW = b_wU;
            freC = getWBC(tW, r, b_wB) / tW * o.w;
        }
        float whC = r.whC * o.d;
        totalC += freC + whC;
    }
    b_solPerf += totalC;
}
Altera_Forum

You are performing random indirect accesses, so the behavior you are observing is expected. The FPGA memory controller is extremely inefficient for random accesses, and given the low external memory bandwidth (25-35 GB/s), you cannot get much scaling by increasing the number of threads either. Furthermore, the FPGA doesn't have a smart cache that can properly handle redundant random accesses. In contrast, on GPUs you get over 10 times the external memory bandwidth, a much more efficient memory controller, and two levels of smart caches. I'm afraid there isn't much you can do to improve the performance of random indirect memory accesses. If you could at least make your accesses direct, it would probably help, but at the end of the day, if you want good memory performance on an FPGA, you need large, coalesced, and aligned memory accesses.
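
As a rough illustration of "making the accesses direct", one option is to gather the structures into index order on the host before launching the kernel, so that work-item i reads slot i instead of following an index. The host-side C sketch below assumes the index vector is already known on the host; `Rec`, `gather_records`, and `sum_direct` are hypothetical names, not part of the original code, and the real structures in the post are not shown. It only moves the random accesses to the host, so it may pay off only when the gather is cheap relative to the kernel or the packed data is reused.

```c
#include <stddef.h>

/* Hypothetical record type standing in for the structures in the post. */
typedef struct {
    float cost;
    float weight;
} Rec;

/* Gather records into index order on the host.
 * idx[i] == -1 marks an unused slot (as in the post); such slots are
 * zero-filled so they contribute nothing when read later.
 * Returns the number of valid (non-negative) indices gathered. */
size_t gather_records(const Rec *src, const int *idx, size_t n, Rec *dst)
{
    size_t valid = 0;
    for (size_t i = 0; i < n; i++) {
        if (idx[i] >= 0) {
            dst[i] = src[idx[i]];   /* the random access happens here, on the host */
            valid++;
        } else {
            dst[i].cost = 0.0f;
            dst[i].weight = 0.0f;
        }
    }
    return valid;
}

/* Stand-in for the kernel body: slot i is read directly and in order,
 * so consecutive iterations touch consecutive addresses. */
float sum_direct(const Rec *packed, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += packed[i].cost * packed[i].weight;
    return total;
}
```

The packed array would then be copied to the device buffer in place of the original structure array plus index vector, and the kernel would index it with the work-item ID and loop counter only.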
