Implementing parallel reduction / finding the maximum on an FPGA?
I'm trying to port parallel algorithms from GPU to FPGA. One kernel finds the maximum using parallel reduction, but deploying it on an FPGA as-is would waste resources, since half of the hardware sits idle after the first stage. I tried multiple linear searches instead; that runs at a reasonable speed and doesn't take tons of resources. But one would think people have come up with a clever algorithm for finding the maximum/minimum on an FPGA, since it's such a commonly needed function. I did some searching, but most documents I found are about implementing linear search in RTL. Does anyone have an idea of how this can be done efficiently? Thanks!

I'm thinking of using several small linear searches in autorun kernels, taking data from a channel that streams it in from global memory. The problem is that this involves a lot of reads and writes to global memory, so I have to find a balance between compute latency and memory-access latency. I guess that's unavoidable, unless I use local memory, which would just take up all the memory resources.

https://alteraforum.com/forum/attachment.php?attachmentid=14181&stc=1
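To make the "several small linear searches" idea concrete, here is a rough software sketch in plain C, not OpenCL and not a tested FPGA design. It assumes the input is dealt out round-robin to a fixed number of lanes, each lane keeps its own running maximum, and the few lane results are combined at the end. `NUM_LANES` and `find_max_lanes` are made-up names for illustration; on the FPGA the inner lane loop would presumably be unrolled or mapped to separate autorun kernels fed by channels.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical number of parallel linear-search units (an assumption). */
#define NUM_LANES 4

/*
 * Find the maximum of data[0..n-1] using NUM_LANES independent running
 * maxima. Each outer iteration models one "beat" of streamed data, with
 * one element handed to each lane. Returns INT_MIN when n == 0.
 */
int find_max_lanes(const int *data, size_t n)
{
    assert(data != NULL || n == 0);

    int lane_max[NUM_LANES];
    for (int l = 0; l < NUM_LANES; ++l) {
        lane_max[l] = INT_MIN;
    }

    /* Stage 1: every lane does a small linear search on its own slice. */
    for (size_t i = 0; i < n; i += NUM_LANES) {
        for (int l = 0; l < NUM_LANES; ++l) {
            size_t idx = i + (size_t)l;
            if (idx < n && data[idx] > lane_max[l]) {
                lane_max[l] = data[idx];
            }
        }
    }

    /* Stage 2: combine the NUM_LANES partial results (a tiny reduction). */
    int result = INT_MIN;
    for (int l = 0; l < NUM_LANES; ++l) {
        if (lane_max[l] > result) {
            result = lane_max[l];
        }
    }
    return result;
}
```

In this arrangement every lane stays busy for the whole input, and only the final combine step over `NUM_LANES` values resembles the tree reduction, so the idle-hardware cost is limited to that small last stage. Each input element is read exactly once and only the single result needs to be written back, which may help with the global-memory traffic concern, though the real balance depends on the memory bandwidth and the lane count chosen.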