Re: Fast N-Body application in embedded RAM (Cyclone 10 GX)

Altera_Forum · 01-26-2018

Hey everyone,

I'm working on a prototype for complex image processing tasks based on the new Cyclone 10 GX family. The pipeline I'm implementing consists of some pre-processing steps (smoother, transformation) and has to write each line individually into an SRAM line buffer before the next step can start. This next step is a kind of N-Body application: each pixel has to be compared to every other pixel within a relative range to find the line position where some specific properties match. The algorithm has to analyze the pixels sequentially, so this step cannot be done out of order.

In the current implementation, I wait until the line buffer is filled, then read the first pixel (the reference pixel) and all other pixels in range (the compare pixels) sequentially and push them into the pipeline. Now here's the problem: with images 1280 pixels wide and a relative range of 64 pixels, each line takes 1280 * 64 = 81,920 cycles to complete, which stalls the preceding pre-processing steps badly.

I see two solutions: I can either process lines in a multiplexed scheme, which duplicates the RAM and logic for each additional line pipeline, or I can read multiple compare pixels in parallel out of each line buffer to parallelize the pixel comparison step. Since I can't implement 64 multiplexed line pipelines to reach full performance, I need fine-grained parallelization as well. That leads to the next problem: to read multiple pixels in parallel out of a line buffer, I can either duplicate the line buffers to get multiple read pointers, or raise the RAM clock frequency to speed up the reads (and move the data into the slower clock domain in a second step). Both ways are risky, since SRAM is quite constrained and the base frequency is already 200 MHz.

Now the question: how would you solve such a problem?
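To put numbers on the bottleneck, here is a rough back-of-the-envelope sketch in Python. It assumes one reference pixel is handled at a time and that `parallel_reads` compare pixels can be fetched per cycle; the function names are made up for illustration, and pipeline latency and buffer fill time are ignored.

```python
WIDTH = 1280             # pixels per line (from the post)
SEARCH_RANGE = 64        # compare pixels per reference pixel (from the post)
CLOCK_HZ = 200_000_000   # base frequency (from the post)


def cycles_per_line(width: int, search_range: int, parallel_reads: int = 1) -> int:
    """Cycles to compare every reference pixel against its range.

    Assumes `parallel_reads` compare pixels are delivered per clock cycle.
    """
    # Ceiling division: a partially filled read still costs a full cycle.
    reads_per_reference = -(-search_range // parallel_reads)
    return width * reads_per_reference


def line_time_us(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count into microseconds at the given clock."""
    return cycles / clock_hz * 1e6


if __name__ == "__main__":
    for p in (1, 2, 4, 16, 64):
        c = cycles_per_line(WIDTH, SEARCH_RANGE, p)
        print(f"{p:2d} parallel reads: {c:6d} cycles, "
              f"{line_time_us(c, CLOCK_HZ):7.1f} us per line")
```

With a single read per cycle this gives the 81,920 cycles mentioned above; only with all 64 compare pixels available per cycle does the stage keep up with one pixel per clock.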

Altera_Forum · 01-27-2018

Can you describe your N-Body algorithm in more detail? Perhaps share some links to relevant papers, blogs, ...

Assuming that the processing module accepts all 64 pixels at the same time and is fully pipelined, all you have to do is create a ping-pong buffer where you write the pixel data one pixel (or perhaps 2 or 4) at a time on one side and read 64 pixels at a time on the other side. Assuming 8-bit pixel data, this would require 16 RAM blocks (64 × 8 = 512 bits per read, i.e. 32 bits from each block).
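As a rough behavioural sketch of that idea (plain Python, not HDL), the model below writes a few pixels at a time into one bank and reads 64-pixel words from the other. The class and function names are hypothetical, the 32-bit port width per RAM block is an assumption chosen to match the 16-block figure, and the reads are aligned to 64-pixel words; a window starting at an arbitrary pixel offset would need extra handling.

```python
import math


class PingPongLineBuffer:
    """Two line banks: narrow writes go to one, wide reads come from the other."""

    def __init__(self, line_width: int, read_width: int):
        self.line_width = line_width
        self.read_width = read_width
        self.banks = [[0] * line_width, [0] * line_width]
        self.write_bank = 0  # index of the bank currently being filled

    def write(self, addr: int, pixels: list) -> None:
        """Write 1, 2 or 4 pixels starting at pixel address `addr`."""
        if addr + len(pixels) > self.line_width:
            raise IndexError("write past end of line")
        self.banks[self.write_bank][addr:addr + len(pixels)] = pixels

    def swap(self) -> None:
        """Call once a full line is written: the banks exchange roles."""
        self.write_bank ^= 1

    def read_wide(self, word_addr: int) -> list:
        """Read `read_width` pixels at once from the bank not being written."""
        bank = self.banks[self.write_bank ^ 1]
        start = word_addr * self.read_width
        return bank[start:start + self.read_width]


def ram_blocks_for_read_port(pixels_per_read: int, bits_per_pixel: int,
                             block_port_bits: int = 32) -> int:
    """Blocks needed side by side to form the wide read port.

    block_port_bits=32 is an assumed per-block port width, not a datasheet value.
    """
    return math.ceil(pixels_per_read * bits_per_pixel / block_port_bits)


if __name__ == "__main__":
    buf = PingPongLineBuffer(line_width=1280, read_width=64)
    line = [i % 256 for i in range(1280)]
    for addr in range(0, 1280, 4):      # write side: 4 pixels per cycle
        buf.write(addr, line[addr:addr + 4])
    buf.swap()                          # line complete, hand it to the reader
    print(buf.read_wide(0)[:8])         # first pixels of the first 64-pixel word
    print(ram_blocks_for_read_port(64, 8))
```

While the reader consumes one bank, the pre-processing stages can keep filling the other, so neither side has to stall as long as the wide reads finish within one line time.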

Regards,

Josy