OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.




I have a target as follows:

  • Ubuntu 16.04LTS
  • Intel Celeron J3160 (GPU Intel HD Graphics 400)
  • Intel® SDK for OpenCL™ Applications 2016 R3
  • Intel driver for OpenCL - intel-opencl-xxx-r3.0-57406.x86_64
  • Intel graphics driver i915 v

I queried the device info for the preferred vector widths and got the following results:

    Device Name = Intel(R) HD Graphics
    Device Vendor = Intel(R) Corporation
    Preferred vector width in chars: 16
    Preferred vector width in shorts: 8
    Preferred vector width in ints: 4
    Preferred vector width in longs: 1
    Preferred vector width in floats: 1
    Preferred vector width in doubles: 0
    Preferred vector width in halfs: 8

It clearly appears that there is 128-bit vectorization, at least for chars, shorts, ints and halfs.
But strangely not for floats and longs?
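The 128-bit reading can be checked by multiplying each reported width by its element size. Char, short, int and half all come out at 128 bits, while long (1 x 64) and float (1 x 32) do not; a double width of 0 is what the OpenCL spec requires from a device without double-precision support. A small C sketch of that arithmetic, with the widths copied from the output above (`vector_bits` and `print_reported_bits` are illustrative names, not part of any API):

```c
#include <stdio.h>

/* Bit sizes of the OpenCL C scalar types; these are fixed by the OpenCL spec. */
enum {
    BITS_CHAR = 8, BITS_SHORT = 16, BITS_INT = 32, BITS_LONG = 64,
    BITS_HALF = 16, BITS_FLOAT = 32, BITS_DOUBLE = 64
};

/* Total bits spanned by a vector of `width` elements of `elem_bits` each. */
static unsigned vector_bits(unsigned width, unsigned elem_bits)
{
    return width * elem_bits;
}

/* Prints the widths reported in the post alongside their size in bits. */
static void print_reported_bits(void)
{
    static const struct { const char *name; unsigned width, elem_bits; } reported[] = {
        {"char", 16, BITS_CHAR}, {"short", 8, BITS_SHORT}, {"int", 4, BITS_INT},
        {"long", 1, BITS_LONG},  {"float", 1, BITS_FLOAT}, {"double", 0, BITS_DOUBLE},
        {"half", 8, BITS_HALF},
    };
    for (size_t i = 0; i < sizeof reported / sizeof reported[0]; ++i)
        printf("%-6s %2u x %2u bits = %3u bits\n", reported[i].name,
               reported[i].width, reported[i].elem_bits,
               vector_bits(reported[i].width, reported[i].elem_bits));
}
```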

1) Am I right?
2) How can this be explained?

Also, we plan to move to the next generation with an Intel N3350 (HD Graphics 500).
3) Would it be the same there?

4) Where can I find documentation about this for HD Graphics 400 and HD Graphics 500?

Thank you


It is definitely possible to use 128-bit vectors of floats and longs. However, vector types matter less for instruction efficiency than intuition might suggest. If you compare the Gen assembly output for a kernel implemented 1) with vector types and 2) with the same operations written as scalars, the generated code is nearly identical in many cases. (However, "wider" work item widths considering vector sizes can be more efficient in terms of memory movement and thread scheduling.) I suspect this is part of the reason for reporting a preferred/native width of 1, but I will see if I can find more details.
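To make the vector-versus-scalar comparison concrete, here is a plain-C emulation of two hypothetical kernels computing `out = a + b`: in one, each work item handles a `float4`; in the other, each handles a single `float`. It is only a CPU-side model of per-work-item semantics, not Gen assembly. It shows that the `float4` version performs exactly the same four component adds, spread over a quarter as many work items. As I understand it, Intel's GPU compiler maps work items onto SIMD lanes and issues vector components as separate instructions, which is one way to see why the generated code often ends up nearly identical; treat that as an assumption to verify against the actual assembly. All function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates a hypothetical float4 kernel: work item `gid` adds 4 floats.
   A float4 add is just four independent component adds. */
static void add_vec4_item(size_t gid, const float *a, const float *b, float *out)
{
    for (size_t c = 0; c < 4; ++c)
        out[4 * gid + c] = a[4 * gid + c] + b[4 * gid + c];
}

/* Emulates a hypothetical scalar kernel: work item `gid` adds 1 float. */
static void add_scalar_item(size_t gid, const float *a, const float *b, float *out)
{
    out[gid] = a[gid] + b[gid];
}

/* "Dispatches" the float4 variant over n floats; returns work items launched. */
static size_t run_vec4(size_t n, const float *a, const float *b, float *out)
{
    assert(n % 4 == 0); /* this sketch does not handle a remainder */
    size_t items = n / 4;
    for (size_t gid = 0; gid < items; ++gid)
        add_vec4_item(gid, a, b, out);
    return items;
}

/* "Dispatches" the scalar variant over n floats; returns work items launched. */
static size_t run_scalar(size_t n, const float *a, const float *b, float *out)
{
    for (size_t gid = 0; gid < n; ++gid)
        add_scalar_item(gid, a, b, out);
    return n;
}
```

Both dispatches produce the same output; the only difference is how the same arithmetic is divided among work items, which is where the memory-movement and scheduling effects mentioned above come in.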

HD Graphics 500 uses the Gen9 processor graphics architecture. Architecturally, much of it is the same as HD Graphics 400 (Gen8).

Gen9 architecture guide:

Gen8: has info on EU count and frequency for each processor, which are important for performance.  From here I can see that Celeron J3160 and N3350 both have 12 EUs and similar frequencies. also has great info.
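Since EU count and clock are the performance inputs mentioned above, a rough peak FP32 estimate can be sketched as EUs x FLOP per EU per clock x clock (GHz). The 16 FLOP per clock figure is an assumption taken from Intel's Gen8/Gen9 architecture descriptions (two SIMD-4 FPUs per EU, a fused multiply-add counted as two operations). The clock is left as a parameter because the thread doesn't state it, and `peak_fp32_gflops` is a hypothetical helper name.

```c
/* ASSUMPTION: 16 FP32 FLOP per EU per clock (2 SIMD-4 FPUs, FMA = 2 ops),
   as described for Intel Gen8/Gen9 EUs. */
#define FP32_FLOP_PER_EU_PER_CLOCK 16.0

/* Theoretical peak FP32 throughput in GFLOPS for `eu_count` EUs running
   at `gpu_clock_ghz`. This is an upper bound, not a performance prediction. */
static double peak_fp32_gflops(unsigned eu_count, double gpu_clock_ghz)
{
    return eu_count * FP32_FLOP_PER_EU_PER_CLOCK * gpu_clock_ghz;
}
```

Because the J3160 and N3350 both have 12 EUs, their estimates differ only by the GPU clock you pass in.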





I'd also be interested in understanding whether there is any benefit to working with half8/short8/char16 types per sub_group work item.

I would have guessed half2/short2/char4 to be the optimal choice, so I'm curious why they are 8/8/16.


Thank you, Jeffrey, for your quick reply.

However, can you elaborate on what you mean by:

"wider" work item widths considering vector sizes can be more efficient in terms of memory movement and thread scheduling

Do you mean that using scalars (and therefore "wider" work item widths) instead of vectors would be more efficient in terms of memory movement and thread scheduling, the opposite, or something else entirely?

And why, then, is the preferred width 1 for longs but 4 for ints?

Also, I'm obviously interested in any further details you find, as you mentioned.