Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
efficient global memory access for dynamic indexing

Altera_Forum

Hello,

My OpenCL task (no NDRange) has a dynamically indexed access inside a loop.

What is the maximum bandwidth I can expect from this code for the 'value' array? Will access coalescing work for it?

All arrays are very large and reside in global memory.

What I currently get is one 32-bit value (float) per 2 clock cycles, which I guess is suboptimal.

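
for (unsigned i = 0; i < n; i++) {
    acc = 0.0f;
    ei = end_index[i];    // assumed per-iteration bound; the post shows only "end_index"
    si = start_index[i];  // assumed per-iteration bound; the post shows only "start_index"
    for (unsigned j = si; j < ei; j++)
        acc += value[dyn_index[j]];  // dynamically indexed read from global memory
    next_value[i] = acc;
}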
Altera_Forum

I explained the math behind the memory bandwidth utilization in the other thread:

http://www.alteraforum.com/forum/showthread.php?t=58222

(And previously here: http://www.alteraforum.com/forum/showthread.php?t=57099&p=232613)

For such kernels I would recommend an NDRange implementation: instead of a fixed initiation interval (II), you get a runtime scheduler that tries to minimize the bubbles and stalls in the pipeline by varying the II at runtime. Furthermore, you can easily replicate your kernel using num_compute_units, which could provide some benefit for such a kernel. However, as I explained in the other thread, random access will result in very poor memory performance regardless of what you do. Pretty much the only things that can help are an efficient, complex memory controller and a sophisticated cache hierarchy, neither of which exists on current-generation FPGAs.
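
As a rough sketch of what the NDRange restructuring could look like, each work-item can take over one iteration of the outer loop. The code below is plain C so that it can be run on a host. In an actual OpenCL kernel, gid would come from get_global_id(0) and the pointers would be __global. The function names row_sum and run_all_work_items are hypothetical, and the per-row start_index[gid] / end_index[gid] layout is an assumption about the original data, not something the thread confirms.

```c
#include <stddef.h>

/* Body of one work-item: computes a single output element.
 * In an OpenCL NDRange kernel, gid would be get_global_id(0) and the
 * pointers would be __global. Per-row bounds in start_index/end_index
 * are an assumption about the data layout. */
void row_sum(size_t gid,
             const unsigned *start_index, const unsigned *end_index,
             const unsigned *dyn_index, const float *value,
             float *next_value)
{
    float acc = 0.0f;
    for (unsigned j = start_index[gid]; j < end_index[gid]; j++)
        acc += value[dyn_index[j]];  /* data-dependent (random) global read */
    next_value[gid] = acc;
}

/* Host-side stand-in for launching n work-items. On the FPGA, the
 * runtime scheduler would issue the work-items instead of this loop. */
void run_all_work_items(size_t n,
                        const unsigned *start_index, const unsigned *end_index,
                        const unsigned *dyn_index, const float *value,
                        float *next_value)
{
    for (size_t gid = 0; gid < n; gid++)
        row_sum(gid, start_index, end_index, dyn_index, value, next_value);
}
```

In the kernel itself, replication would be requested with __attribute__((num_compute_units(N))) on the kernel declaration, where N is a placeholder to tune. The reads of value[dyn_index[j]] remain random accesses either way, so this changes how stalls are scheduled, not the memory access pattern.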