OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

work group with 1 work item using ~100 float8 vectors?

allanmac1

Will the Intel HD Graphics OpenCL compiler support work groups of a single work item that operates on float8 vectors?

Example:

__kernel
__attribute__((vec_type_hint(float8), reqd_work_group_size(1,1,1)))
// "my_kernel" is a placeholder name; __kernel is a qualifier and cannot also be the function name
void my_kernel(__global const float8* const restrict in, __global float8* const restrict out)
{
  ... // lots and lots of float8 vector registers
}

The goal is to occupy as many float8 registers as possible in a single work item. The kernel I'm designing can benefit from float4 swizzling ops, and I'm assuming float8 is the narrowest width that matches the 128x8 register file found in the Ivy Bridge and Haswell architectures.

Questions:

  • Does the HD Graphics OpenCL compiler support allocating as many as 128 registers on Ivy Bridge and Haswell?
  • If this isn't supported, why not?
  • If this isn't supported, what is the best work group size to acquire the most possible registers per work item?

Thanks, I'm very impressed with the HD Graphics architecture.  The EUs and sub-slices appear to have *huge* amounts of resources compared to other low power GPUs.

 

allanmac1

Just to be clear, my question is: will the compiler map a work group of 1 work item, vec_type_hint'ed as float8, onto an EU's 128x8 general register file?

The reason why I ask is that I have a kernel that is very SIMD and not very SIMT and would map perfectly onto an EU thread and its 128x8 register file.

Thanks, I'd really like to get an answer!

Raghupathi_M_Intel

IVB and HSW have 128 256-bit registers in the GRF, so a float8 should fit perfectly in one register. I don't think the compiler imposes any restriction on how many of these available registers a program can use. Also note that the 128 registers are per thread.
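
To make the arithmetic behind those figures concrete, here is a minimal C sketch. It only restates the numbers quoted in this reply (128 registers per thread, 256 bits each) and assumes a float8 is eight 32-bit floats. The constant and function names are made up for illustration and are not part of any Intel API.

```c
#include <stddef.h>

/* Figures quoted in this thread for IVB/HSW; the names are hypothetical. */
enum {
    GRF_REGISTERS_PER_THREAD = 128, /* registers available to one EU thread */
    GRF_REGISTER_BITS        = 256, /* width of one GRF register */
    FLOAT_BYTES              = 4,   /* an OpenCL float is 32 bits */
    FLOAT8_LANES             = 8
};

/* Size of one GRF register in bytes. */
size_t grf_register_bytes(void)
{
    return GRF_REGISTER_BITS / 8;
}

/* Size of one float8 value in bytes. */
size_t float8_bytes(void)
{
    return (size_t)FLOAT8_LANES * FLOAT_BYTES;
}

/* Total GRF bytes available to a single EU thread. */
size_t grf_bytes_per_thread(void)
{
    return (size_t)GRF_REGISTERS_PER_THREAD * grf_register_bytes();
}

/* Upper bound on float8 values that fit in one thread's GRF.
   This is a capacity bound only; it says nothing about what the
   compiler's register allocator will actually do. */
size_t max_float8_per_thread(void)
{
    return grf_bytes_per_thread() / float8_bytes();
}
```

With these numbers a float8 is exactly one register wide, and one thread's GRF is 4 KB, so at most 128 float8 values can be live at once. In practice some registers will presumably be needed for addresses and temporaries.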

I have asked the experts for more details, but if you see different behavior, please do let us know.

Thanks,
Raghu

allanmac1

Thanks Raghu, that's great news.

128 registers per EU thread is stunning, and more HD Graphics OpenCL devs should be made aware of why this is useful!

A quick napkin calculation shows that the HD5x00 series has an immense amount of resources:

  • 256 KB of shared local memory -- 4 sub-slices x 64 KB
  • 1120 KB of registers -- 4 sub-slices x 10 EUs x 7 threads x 128 registers x 8 lanes x 4 bytes
  • 320 ALUs -- issuing a max of 1 or 2 ops per clock

This is very good, and more resources than some entry-level discrete GPUs have.
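
For anyone who wants to check the napkin math, a small C sketch of the same calculation follows. The inputs are the figures from the list above. The one assumption I'm adding is that the 320 ALU count comes from 40 EUs x 8 lanes each. All names are hypothetical.

```c
/* Inputs taken from the napkin calculation above; names are hypothetical. */
enum {
    HD5X00_SUB_SLICES          = 4,
    HD5X00_EUS_PER_SUB_SLICE   = 10,
    HD5X00_THREADS_PER_EU      = 7,
    HD5X00_REGISTERS_PER_THREAD = 128,
    HD5X00_LANES_PER_REGISTER  = 8,
    HD5X00_BYTES_PER_LANE      = 4,  /* 32-bit lanes */
    HD5X00_SHARED_KB_PER_SUB_SLICE = 64,
    HD5X00_ALU_LANES_PER_EU    = 8   /* assumption: 8 ALU lanes per EU */
};

/* Total shared local memory across all sub-slices, in KB. */
unsigned hd5x00_shared_kb(void)
{
    return HD5X00_SUB_SLICES * HD5X00_SHARED_KB_PER_SUB_SLICE;
}

/* Total register file across all EU threads, in KB. */
unsigned hd5x00_register_kb(void)
{
    unsigned bytes_per_thread = HD5X00_REGISTERS_PER_THREAD
                              * HD5X00_LANES_PER_REGISTER
                              * HD5X00_BYTES_PER_LANE;
    unsigned threads = HD5X00_SUB_SLICES
                     * HD5X00_EUS_PER_SUB_SLICE
                     * HD5X00_THREADS_PER_EU;
    return threads * bytes_per_thread / 1024;
}

/* Total ALU lanes, under the 8-lanes-per-EU assumption. */
unsigned hd5x00_alu_lanes(void)
{
    return HD5X00_SUB_SLICES * HD5X00_EUS_PER_SUB_SLICE
         * HD5X00_ALU_LANES_PER_EU;
}
```

Each thread gets 128 x 8 x 4 = 4096 bytes, and there are 4 x 10 x 7 = 280 threads, which is where the 1120 KB comes from.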
