As per the spec, the start address passed to vloadn for the float data type must be 4-byte aligned; it is not required to be 16-byte aligned. Please correct me if I am wrong. I would like to know the performance benefit of using vloadn in that scenario, when the buffer address is aligned on a float boundary but not on a float4 boundary.
Sorry, I misread your original post.
Yes, vloadn requires the address it reads from (address + offset*n) to be aligned to sizeof(gentype). If the data is already aligned to 16 bytes, I don't think there is any performance difference between the two approaches. If the data is only aligned to a float boundary, you have to use vload4, since float4 data types require 16-byte alignment.
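To make the alignment rule concrete, here is a rough host-side sketch in plain C (not OpenCL C). It only models the address arithmetic described above. The names float4_model, is_aligned and vload4_model are made up for illustration and are not part of any OpenCL API, and the memcpy is just a stand-in for whatever load instructions a real device compiler would emit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for OpenCL's float4; not the real OpenCL type. */
typedef struct { float s[4]; } float4_model;

/* Returns 1 if p is a multiple of `alignment` bytes, otherwise 0. */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}

/* Models vload4(offset, p): reads 4 floats starting at p + offset*4.
   Only float (4-byte) alignment is assumed, so the data is copied with
   memcpy instead of being read through a 16-byte-aligned vector pointer. */
static float4_model vload4_model(size_t offset, const float *p) {
    const float *src = p + offset * 4;
    float4_model v;
    assert(is_aligned(src, sizeof(float))); /* the sizeof(gentype) rule */
    memcpy(v.s, src, sizeof v.s);
    return v;
}
```

For example, if buf is a 16-byte-aligned float array, then buf + 1 is aligned to 4 bytes but not to 16, and vload4_model(1, buf + 1) returns the four floats starting at buf[5]. Reading that same address through a float4 pointer is the case that needs 16-byte alignment.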