I see that an 18 EU Apollo Lake SKU like the N4200 may appear as either 3 sub-slices of 6 EUs (with 6 hardware threads per EU) *or* as 2 virtual sub-slices of 9 EUs.
As a developer, what should my mental model be when programming this device?
Does the Windows driver implement "Legacy Mode" 3x6 EU sub-slices or the new virtual 2x9 EU configuration?
How much SLM is available to each sub-slice in either mode?
This doc makes Legacy Mode sound ominous: https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bxt-vol06-3d_media_gpgpu.pdf
This doc is useful too: https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bxt-vol03-configurations.pdf
Sorry for the delay getting back to you. I am working on a longer article on this topic which should be available in a few days.
For now, though, the mental model should be 2x9. Legacy mode/MEDIA_POOL_STATE isn't accessible via OpenCL. We could not find any case where relaxing the pooling model made execution times worse. In extensive testing the relaxed rules at worst broke even with legacy mode, and in many cases the benefit approached 2x.
SLM, subgroups, etc. are tied to threads and their pools. Where they appear tied to the HW on other parts, that is a consequence of the thread pool configuration, not of any underlying physical constraint.
The biggest difference to expect is that optimal work group sizes will match the virtual 2x9 configuration, since things like EU thread occupancy are affected by the pool size, which in this special case differs from the physical HW.
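To make the occupancy point concrete, here is a rough Python sketch comparing how workgroups pack into a 9-EU pool versus a 6-EU pool. It assumes 6 HW threads per EU (the figure quoted earlier in this thread), a SIMD8 compile, and that a workgroup must fit entirely inside one pool. It ignores SLM and barrier limits, and the function name is made up for illustration.

```python
import math

HW_THREADS_PER_EU = 6  # figure quoted earlier in the thread for BXT/APL


def pool_occupancy(eus_per_pool, wg_size, simd_width=8):
    """Return (resident workgroups, fraction of pool HW threads in use).

    Simplified model (assumption): a workgroup needs
    ceil(wg_size / simd_width) HW threads, all from the same pool.
    SLM and barrier limits are ignored.
    """
    pool_threads = eus_per_pool * HW_THREADS_PER_EU
    threads_per_wg = math.ceil(wg_size / simd_width)
    resident = pool_threads // threads_per_wg
    return resident, resident * threads_per_wg / pool_threads


if __name__ == "__main__":
    for wg_size in (256, 216, 144, 72):
        for eus in (9, 6):
            resident, occ = pool_occupancy(eus, wg_size)
            print(f"{eus}-EU pool, WG {wg_size}: "
                  f"{resident} resident, {occ:.0%} of HW threads busy")
```

Under this model, two 216 work-item groups fill a 9-EU pool exactly, while a 6-EU pool can hold only one such group and leaves a quarter of its HW threads idle.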
I look forward to reading your article!
From the block diagrams it looks like there are 2 x 64KB SLM blocks on the slice? It would be great if your article could cover this.
One of the things I'm always interested in is how to fully utilize the GPU by selecting kernel SLM requirements and workgroup sizes that won't block another workgroup from being scheduled on spare EU threads.
For example, there is still a 256 work-item workgroup limit, so "covering" the entire 18 EU device (BXT has 6 HW threads per EU) with a SIMD8 kernel would require, at the largest, 4 workgroups of 216 work-items each, and each workgroup could have as much as 32 KB of SLM.
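The arithmetic in that example can be checked with a short Python sketch. The 256 work-item limit, 6 HW threads per EU, and SIMD8 come from the post. The 128 KB SLM total (2 x 64 KB) is the figure read off the block diagram and is treated here as an assumption, and the helper name is made up.

```python
import math


def cover_device(eus=18, hw_threads_per_eu=6, simd_width=8,
                 max_wg_size=256, slm_total_kb=128):
    """Fewest equal-sized workgroups that occupy every HW thread exactly once.

    Returns (number of workgroups, work-items per workgroup, SLM KB per workgroup).
    slm_total_kb=128 assumes 2 x 64 KB SLM blocks (not confirmed here).
    """
    total_work_items = eus * hw_threads_per_eu * simd_width
    num_groups = math.ceil(total_work_items / max_wg_size)
    # Grow the group count until the work-items divide evenly and each
    # group is a whole number of SIMD-wide HW threads.
    while (total_work_items % num_groups
           or (total_work_items // num_groups) % simd_width):
        num_groups += 1
    wg_size = total_work_items // num_groups
    return num_groups, wg_size, slm_total_kb / num_groups


print(cover_device())  # (4, 216, 32.0)
```

That is 18 x 6 = 108 HW threads, or 864 work-items at SIMD8, which does not fit in 3 groups of 256 and so splits into 4 groups of 216.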
Some benchmarks would be great... I've been performing some basic testing on an N4200 / HD505 / dual-channel 1866 RAM device and it's very peppy considering it's passively cooled and the processor is using single-digit watts.
Article is here: https://software.intel.com/en-us/articles/ocl-threadpools-apl
Yes, you'll get 2 SLM blocks for HD Graphics 505, since it is effectively 2x9, and 1 for HD Graphics 500, since it is 1x12.
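As a rough way to apply this when sizing kernels, the Python sketch below estimates how many workgroups can be resident in one pool when both HW threads and SLM are limits. The 64 KB per SLM block is taken from the question above and is an assumption here, as are SIMD8, 6 HW threads per EU, and the simplified packing model (no allocation granularity, no barrier limits). The function name is hypothetical.

```python
import math


def resident_workgroups(eus_per_pool, wg_size, slm_per_wg_kb,
                        slm_block_kb=64, hw_threads_per_eu=6, simd_width=8):
    """Workgroups resident in one pool: the tighter of the thread and SLM limits.

    slm_block_kb=64 is an assumed size for the pool's single SLM block.
    """
    threads_per_wg = math.ceil(wg_size / simd_width)
    by_threads = (eus_per_pool * hw_threads_per_eu) // threads_per_wg
    if slm_per_wg_kb <= 0:
        return by_threads  # kernel uses no SLM, so only threads limit it
    by_slm = int(slm_block_kb // slm_per_wg_kb)
    return min(by_threads, by_slm)


# HD Graphics 505 modeled as 2 pools of 9 EUs, HD Graphics 500 as 1 pool of 12 EUs.
for name, pools, eus in (("HD Graphics 505", 2, 9), ("HD Graphics 500", 1, 12)):
    per_pool = resident_workgroups(eus, wg_size=216, slm_per_wg_kb=32)
    print(name, "resident workgroups:", pools * per_pool)
```

In this model, raising the per-workgroup SLM request from 32 KB to 48 KB drops a 9-EU pool from two resident 216 work-item groups to one, which would leave half of its HW threads unused.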
Definitely agree that benchmarks would be nice to have. Any feedback on algorithms for benchmarks you would like to see could help our planning.