Do the subgroup extension functions "intel_sub_group_block_readN/writeN" perform better than "vloadN/vstoreN"? I did some testing but don't see much difference between them. Can you elaborate on the read/write performance I should expect from each?
The driver can do a lot to optimize memory transfers, and with vload/vstore it can often get to similar results. However, if you are already going to the trouble of using the subgroup block reads and writes, they can help guarantee that you are getting optimal memory buffer bandwidth, which could give some advantage over vload/vstore, vector data types for memory I/O, and similar approaches.