I am trying to write an OpenCL kernel that applies an affine transformation to an input image and writes the result as the output image.
My first approach walks the output image coordinates sequentially, applies the inverse affine transformation to each coordinate, and then reads the input image in DDR memory at the resulting (effectively random) address.
In this case, the “aocx” compiler automatically creates an implicit private cache in the FPGA, and this private cache seems to be used when the input image (DDR) is accessed randomly.
After trying several approaches, this implementation is the fastest I have so far, but I would like to make it much faster.
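For reference, the first approach can be sketched in plain C roughly as below. This is only an illustration, not the actual kernel: the `Affine2x3` struct, the function names, 8-bit grayscale pixels, nearest-neighbor sampling, and returning 0 for out-of-range coordinates are all assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical 2x3 inverse affine matrix: maps output (x, y) to input (u, v). */
typedef struct { float a, b, c, d, e, f; } Affine2x3;

/* Floor without libm, so negative coordinates round correctly. */
static int floor_to_int(float t)
{
    int i = (int)t;
    return (t < (float)i) ? i - 1 : i;
}

/* Apply the inverse affine map and round to the nearest input pixel. */
static void map_point(Affine2x3 m, float x, float y, int *u, int *v)
{
    *u = floor_to_int(m.a * x + m.b * y + m.c + 0.5f);
    *v = floor_to_int(m.d * x + m.e * y + m.f + 0.5f);
}

/* Approach 1: sequential walk over the output, random reads from the input. */
static void warp_direct(const unsigned char *src, unsigned char *dst,
                        int w, int h, Affine2x3 inv)
{
    assert(src && dst && w > 0 && h > 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int u, v;
            map_point(inv, (float)x, (float)y, &u, &v);
            /* Random access into the input image; 0 outside the image. */
            dst[y * w + x] = (u >= 0 && u < w && v >= 0 && v < h)
                                 ? src[v * w + u]
                                 : 0;
        }
    }
}
```

In a kernel, `src` and `dst` would be buffers in global (DDR) memory, and the read `src[v * w + u]` is the random access that the private cache appears to serve.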
My new approach is to split the output image into a mesh of rectangles and transform each of them with the inverse affine transformation. The corresponding part of the input image is then stored temporarily in the FPGA. (Refer to the attached image.)
The new approach is as follows:
1. Divide the output image into a mesh of rectangular areas.
2. Transform the center point of one rectangular area with the inverse affine transformation.
3. Use the transformed center point to determine which rectangular area of the input image corresponds to it.
4. Copy that rectangular area of the input image into a temporary input rectangle.
5. Reserve a temporary rectangle for the output image.
6. Transform each pixel position of the temporary output rectangle with the inverse affine transformation, and then read the corresponding pixel value from the temporary input rectangle.
7. Repeat this for all pixels in the temporary output rectangle.
8. Write the temporary output rectangle to the corresponding area of the output image.
9. Repeat this for all rectangular areas in the output image.
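The steps above can be sketched in plain C roughly as follows. Again this is an illustration under assumptions, not the actual kernel: `TILE`, `MARGIN`, the `Affine2x3` struct, and all function names are hypothetical, pixels are 8-bit grayscale, and sampling is nearest-neighbor. The fallback to a direct read when a pixel lands outside the temporary input rectangle is also an addition for the example and is not part of the steps listed above.

```c
#include <assert.h>

#define TILE   8                    /* hypothetical output tile edge */
#define MARGIN 4                    /* hypothetical border around the input window */
#define WIN    (TILE + 2 * MARGIN)  /* input window edge */

typedef struct { float a, b, c, d, e, f; } Affine2x3;

static int floor_to_int(float t)
{
    int i = (int)t;
    return (t < (float)i) ? i - 1 : i;
}

/* Inverse affine map, rounded to the nearest input pixel. */
static void map_point(Affine2x3 m, float x, float y, int *u, int *v)
{
    *u = floor_to_int(m.a * x + m.b * y + m.c + 0.5f);
    *v = floor_to_int(m.d * x + m.e * y + m.f + 0.5f);
}

/* Bounds-checked read from the input image; 0 outside the image. */
static unsigned char fetch(const unsigned char *src, int w, int h, int u, int v)
{
    return (u >= 0 && u < w && v >= 0 && v < h) ? src[v * w + u] : 0;
}

static void warp_tiled(const unsigned char *src, unsigned char *dst,
                       int w, int h, Affine2x3 inv)
{
    assert(src && dst && w > 0 && h > 0);
    unsigned char win[WIN * WIN];    /* step 4: temporary input rectangle */
    unsigned char out[TILE * TILE];  /* step 5: temporary output rectangle */

    for (int ty = 0; ty < h; ty += TILE) {          /* steps 1 and 9 */
        for (int tx = 0; tx < w; tx += TILE) {
            /* Steps 2-3: map the tile center and place the window around it. */
            int cu, cv;
            map_point(inv, tx + 0.5f * (TILE - 1), ty + 0.5f * (TILE - 1),
                      &cu, &cv);
            int u0 = cu - WIN / 2, v0 = cv - WIN / 2;

            /* Step 4: row-wise (burst-friendly) copy of the input window. */
            for (int j = 0; j < WIN; ++j)
                for (int i = 0; i < WIN; ++i)
                    win[j * WIN + i] = fetch(src, w, h, u0 + i, v0 + j);

            /* Steps 6-7: per-pixel inverse map, read from the window. */
            for (int j = 0; j < TILE; ++j) {
                for (int i = 0; i < TILE; ++i) {
                    int u, v;
                    map_point(inv, (float)(tx + i), (float)(ty + j), &u, &v);
                    int lu = u - u0, lv = v - v0;
                    out[j * TILE + i] =
                        (lu >= 0 && lu < WIN && lv >= 0 && lv < WIN)
                            ? win[lv * WIN + lu]
                            : fetch(src, w, h, u, v); /* assumed fallback */
                }
            }

            /* Step 8: write the tile back, clipping at the image border. */
            for (int j = 0; j < TILE && ty + j < h; ++j)
                for (int i = 0; i < TILE && tx + i < w; ++i)
                    dst[(ty + j) * w + tx + i] = out[j * TILE + i];
        }
    }
}
```

One point the sketch makes visible: the input window generally has to be larger than the output tile, because a rotated or scaled tile can cover a larger bounding box in the input image. Every tile therefore copies `WIN * WIN` pixels to produce only `TILE * TILE` outputs, and neighboring windows overlap, so part of the input is fetched more than once.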
I expected this method to be faster because it uses burst reads and then accesses the temporary rectangle area rather than the input image (DDR memory). Unfortunately, it turned out slower than the private cache method.
Does anyone have ideas for accelerating the affine transformation?