Performance of "intel_sub_group_block_readN/writeN" vs "vloadN/vstoreN"


Does the subgroup extension API "intel_sub_group_block_readN/writeN" perform better than "vloadN/vstoreN"? I did some testing but don't see much difference between them. Can you elaborate on the read/write performance to expect from each?


The driver can help a lot with optimizing memory transfers, so vload/vstore often ends up with similar results. However, if you're already going to the trouble of using the subgroup reads and writes, they help guarantee you're getting optimal memory buffer bandwidth, which could give some advantage over vload/vstore, using vector data types for memory I/O, etc.

  • Subgroup read/write is closer to what the driver would try to optimize for in any case. If you've already arranged your data I/O to work this way, this should be an optimal data access approach (for linear buffers), touching a minimal number of cache lines to maximize cache efficiency.
  • This also reduces the calculations needed to compute addresses and the number of addresses that need to be passed to the driver. With subgroups, only the address of the first item in the block and a length are sent, vs. an address for every work item in the subgroup.