Sorry for the delayed reply. If you have not seen it already, the optimization guide has some good tips in its "Memory Access Considerations" section. In general, loading vector types is more efficient than loading single values, because using vector types for memory transfers makes it easier for the compiler to generate efficient code. If the compiler recognizes the pattern in kernel2, it might combine the loads into something close enough to kernel1 that the performance is practically identical, but kernel2 isn't expected to have any advantage over kernel1.
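For anyone reading without the original kernels in front of them, here is a rough sketch of the two access patterns being compared. These are not the kernels from the original post: the bodies (a plain copy of four floats per work-item) and the argument names are assumptions made for illustration, and only the names kernel1 and kernel2 are reused. The block at the top is a hypothetical host-side stand-in so the same file can be built as ordinary C for a quick sanity check; an OpenCL C compiler skips it.

```c
/* Host-side stand-ins, used only when this file is built as plain C.
   An OpenCL C compiler defines __OPENCL_VERSION__ and skips this block. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
typedef struct { float x, y, z, w; } float4;
static size_t host_gid; /* hypothetical: work-item id emulated by the host */
static size_t get_global_id(unsigned int dim) { (void)dim; return host_gid; }
static float4 vload4(size_t offset, const float *p)
{
    float4 v = { p[4 * offset], p[4 * offset + 1],
                 p[4 * offset + 2], p[4 * offset + 3] };
    return v;
}
static void vstore4(float4 v, size_t offset, float *p)
{
    p[4 * offset]     = v.x;
    p[4 * offset + 1] = v.y;
    p[4 * offset + 2] = v.z;
    p[4 * offset + 3] = v.w;
}
#endif

/* Vector pattern: one vload4 per work-item reads in[4*i] .. in[4*i + 3]. */
__kernel void kernel1(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    float4 v = vload4(i, in);
    vstore4(v, i, out);
}

/* Scalar pattern: the same four elements, read one float at a time.
   The compiler may or may not merge these into a single wider access. */
__kernel void kernel2(__global const float *in, __global float *out)
{
    size_t base = 4 * get_global_id(0);
    float a = in[base];
    float b = in[base + 1];
    float c = in[base + 2];
    float d = in[base + 3];
    out[base]     = a;
    out[base + 1] = b;
    out[base + 2] = c;
    out[base + 3] = d;
}
```

Both kernels assume the buffer length is a multiple of 4 and that the global size is the element count divided by 4. They produce the same output; the sketch shows only how the two patterns differ in source form and says nothing about timing on a particular device.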
From asking around, my understanding is that vload is generally the better choice. The compiler may combine single loads, so in many cases there may not be a big difference, but vload is expected to be the best case. There may be a few corner cases with odd borders where single loads have a minor advantage, but I suspect these are rare. In the future I hope we will be able to publish more memory transfer optimization tutorials with deeper analysis, so the guidelines do not need to be so high level. Please watch for documentation and example improvements as we are able to prioritize them.
Thanks for pursuing this. I am particularly interested in avoiding bank conflicts. Is there some way of confirming that vload does not trigger more bank conflicts than separate loads?
Thanks so much,