Hi everyone. I have compiled my kernel in profiling mode, and looking at the report after executing the program and kernel, it seems there is a bandwidth efficiency problem. By looking at the code you'll see that I am using matrices of data in the buffers, and I need the indexes to access the data. Can anyone tell me more about this problem? I am not sure how to improve the code to solve this issue. Thanks. Columns in the image, in order: Code | Attributes | Stall% | Occupancy% | Bandwidth https://preview.ibb.co/nebu1r/untitled.png (https://ibb.co/ijzso6)
Depending on the operating frequency of your kernel and of the external memory on your board, and also on the number of external memory banks, you need to read or write a certain number of bytes per clock cycle to fully utilize the external memory bandwidth. What the profiler is trying to tell you is that you are not fully utilizing the external memory bandwidth, since your accesses are too narrow. You can increase bandwidth efficiency by unrolling the loop that iterates over your external memory reads/writes. Note that you can also fully utilize the bandwidth with multiple narrow reads/writes, but this will result in a high number of collisions on the memory bus and lots of stalls in the pipeline.
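To make the unrolling suggestion concrete, here is a minimal sketch of what widening the accesses looks like in an Intel FPGA OpenCL kernel. The kernel and its argument names are made up for illustration; `#pragma unroll` is the mechanism the offline compiler uses to coalesce consecutive narrow accesses into one wide one.

```
// Hypothetical kernel: without unrolling, each iteration issues one
// 4-byte load and one 4-byte store. Unrolling by 8 lets the compiler
// coalesce the contiguous accesses into a single 32-byte load/store
// per (unrolled) iteration, using much more of the available bandwidth.
__kernel void copy(__global const float *restrict in,
                   __global float *restrict out,
                   const int n)
{
    #pragma unroll 8
    for (int i = 0; i < n; i++)
        out[i] = in[i];
}
```

The `restrict` qualifiers matter here: they tell the compiler the two buffers don't alias, which it needs in order to coalesce the accesses safely. The unroll factor (8 floats = 32 bytes in this sketch) should be chosen so that the width per clock matches what your board's memory interface can deliver.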
Does anyone know what the best rules to follow are when optimizing an algorithm for memory accesses? And how do you manually partition the blocks of memory to favor one buffer over another, for example by disabling memory interleaving? I can't find much information on the internet, and I really need to improve my algorithm because this memory access problem is slowing it down a lot. Thanks.
To maximize memory performance you should:

- Minimize the number of global memory accesses/ports. Global memory ports use quite a bit of area, and the more ports you have, the more contention there will be on the memory bus, resulting in stalls that can propagate all the way down the pipeline. If applicable, using structs can be beneficial here, since they let you fetch multiple different variables with one single memory port/access.
- Unroll loops that iterate over memory accesses, so that the compiler will coalesce the contiguous accesses into one larger access, enabling you to better utilize the memory bandwidth. Using vector types can also yield the same result.

The "Best Practices Guide" also includes some guidelines on this matter in Section 1.8.1.

Regarding disabling memory interleaving: doing so will pretty much never improve performance. The only case where I have seen it help is a very simple kernel with only one read and one write and a very short pipeline; when you have multiple accesses, interleaving pretty much always improves performance. However, if you want to favor some accesses over others by manual banking, you can put your more important buffers in different banks and distribute the less important ones in a balanced fashion between the banks. Or, if you have a few important buffers but many less important ones, you can put the first set in one bank and the second set in the other, so that accesses to the more important buffers get a bigger share of the bandwidth.
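A sketch of the struct and vector-type points above, plus the manual-banking setup. All kernel, buffer, and field names are invented for illustration; the channel flags are from the Intel FPGA SDK for OpenCL as I recall them, so check the Programming Guide for your SDK version before relying on them.

```
// Struct access: one memory port fetches three related variables at once,
// instead of three separate narrow loads from three separate buffers.
typedef struct {
    float a;
    float b;
    float c;
} point_t;

__kernel void process(__global const point_t *restrict pts,   // struct-of-values input
                      __global float4 *restrict out,          // vector type: one 16-byte store
                      const int n)
{
    for (int i = 0; i < n; i++) {
        point_t p = pts[i];                        // single access for a, b and c
        out[i] = (float4)(p.a, p.b, p.c, 0.0f);    // single wide store via float4
    }
}
```

For manual banking, the host side would look roughly like this after compiling the kernel with interleaving disabled (e.g. `aoc -no-interleaving default ...`). The flag names here are an assumption about your SDK version; older Altera releases used `CL_MEM_BANK_1_ALTERA`/`CL_MEM_BANK_2_ALTERA` instead:

```
/* Place the performance-critical buffer alone in bank 1, and the less
   important one in bank 2, so the hot buffer gets its bank's full bandwidth. */
cl_mem hot  = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_CHANNEL_1_INTELFPGA,
                             size, NULL, &err);
cl_mem cold = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_CHANNEL_2_INTELFPGA,
                             size, NULL, &err);
```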