My target is as follows:
- Ubuntu 16.04LTS
- Intel Celeron J3160 (GPU Intel HD Graphics 400)
- Intel® SDK for OpenCL™ Applications 2016 R3
- Intel driver for OpenCL - intel-opencl-xxx-r3.0-57406.x86_64
- Intel graphics driver i915 v 4.7.0.intel.r3.0
I queried the device info for the preferred vector widths and got the following result:
Device Name = Intel(R) HD Graphics
Device Vendor = Intel(R) Corporation
Preferred vector width in chars: 16
Preferred vector width in shorts: 8
Preferred vector width in ints: 4
Preferred vector width in longs: 1
Preferred vector width in floats: 1
Preferred vector width in doubles: 0
Preferred vector width in halfs: 8
It appears that there is 128-bit vectorization, at least for chars, shorts, ints and halfs.
But strangely not for floats and longs?
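The arithmetic behind that observation can be sketched in a few lines of C. This is only an illustration: `preferred_vector_bits` and the `BITS_*` constants are hypothetical names, not part of the OpenCL API. The element sizes are those of the OpenCL C data types, and the widths are the values reported above.

```c
/* Element sizes in bits of the OpenCL C scalar data types. */
enum {
    BITS_CHAR   = 8,
    BITS_SHORT  = 16,
    BITS_INT    = 32,
    BITS_LONG   = 64,
    BITS_FLOAT  = 32,
    BITS_DOUBLE = 64,
    BITS_HALF   = 16
};

/* Hypothetical helper: total bits covered by a reported preferred width.
 * A reported width of 0 yields 0 bits. */
unsigned preferred_vector_bits(unsigned preferred_width, unsigned element_bits)
{
    return preferred_width * element_bits;
}
```

With the reported widths this gives 128 bits for chars (16 x 8), shorts (8 x 16), ints (4 x 32) and halfs (8 x 16), but only 64 bits for longs (1 x 64) and 32 bits for floats (1 x 32). The width of 0 for doubles is the value OpenCL reports when double precision is not supported.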
1) Am I right?
2) How can that be explained?
Also, we plan to move to the next generation with an Intel N3350 (HD Graphics 500).
3) Would it be the same there?
4) Where can I find documentation about this for HD Graphics 400 and HD Graphics 500?
It is definitely possible to use 128-bit vectors of floats and longs. However, the use of vector types matters less in terms of instruction efficiency than might be intuitive. If you check the Gen assembler output for a kernel implemented 1) with vector types and 2) with the same operations as scalars, the code produced is nearly identical in many cases. (However, "wider" work items, in terms of the vector sizes they handle, can be more efficient in terms of memory movement and thread scheduling.) I suspect this is part of the reason for reporting a preferred/native width of 1, but I will see if I can find more details.
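To make the scalar versus "wider work item" distinction concrete, here is a minimal host-side sketch in plain C. It is an emulation only, not OpenCL kernel code and not a performance model: the function names `scale_scalar` and `scale_vec4` are hypothetical, and the loop index `gid` merely stands in for what `get_global_id(0)` would return in a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates a scalar kernel: one work item per float.
 * Returns the number of work items that were "launched". */
size_t scale_scalar(const float *in, float *out, size_t n, float a)
{
    size_t work_items = 0;
    for (size_t gid = 0; gid < n; ++gid) {
        out[gid] = a * in[gid];          /* 4 bytes read per work item */
        ++work_items;
    }
    return work_items;
}

/* Emulates a float4-style kernel: one work item per four floats.
 * Assumes n is a multiple of 4. */
size_t scale_vec4(const float *in, float *out, size_t n, float a)
{
    assert(n % 4 == 0);
    size_t work_items = 0;
    for (size_t gid = 0; gid < n / 4; ++gid) {
        /* 16 contiguous bytes read per work item */
        for (size_t lane = 0; lane < 4; ++lane)
            out[4 * gid + lane] = a * in[4 * gid + lane];
        ++work_items;
    }
    return work_items;
}
```

Both functions produce the same output array. The difference is that the second covers the data with a quarter as many work items, each touching a larger contiguous chunk of memory, which is the kind of trade-off the parenthetical remark refers to. Whether that helps on a given device has to be measured.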
HD Graphics 500 uses the Gen9 processor graphics architecture. Architecturally, much is the same as HD Graphics 400 (Gen8).
ark.intel.com has info on EU count and frequency for each processor, which are important for performance. From here I can see that Celeron J3160 and N3350 both have 12 EUs and similar frequencies.
http://www.notebookcheck.net also has great info.
I'd also be interested in understanding whether there is any benefit to working with half8/short8/char16 types per sub_group work item.
I would've guessed half2/short2/char4 would've been the optimal choice so I'm curious why they are 8/8/16...
Thank you, Jeffrey, for your quick reply.
However, can you elaborate on what you mean by:
"wider" work item widths considering vector sizes can be more efficient in terms of memory movement and thread scheduling
Do you mean that using scalars (and so "wider" work item widths) instead of vectors would be more efficient in terms of memory movement and thread scheduling, or do you mean the opposite, or maybe something else entirely?
And so why is the preferred width 1 for longs and 4 for ints?
Also, I'm obviously interested if you find more details, as you mentioned.