According to Gen8.pdf,
'These units can SIMD execute up to four 32-bit floating-point (or integer) operations, or SIMD execute up to eight 16-bit integer or 16-bit floating-point operations.'
This suggests that 16-bit integer operations can reach 2x the peak throughput of 32-bit integer operations.
The table in Gen8.pdf lists the 32-bit integer throughput of HD Graphics 5300 as 192 IOP/cyc. Does that mean the 16-bit integer throughput is 192*2 = 384 IOP/cyc?
Is my understanding correct?
I am asking because in my tests I see hardly any throughput increase when using the 'short' data type. Performance is almost the same when I switch from 'int' to 'short'.
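The arithmetic behind the question can be sketched in a few lines of C. This is only an illustration of the reasoning, assuming the peak rate scales with the number of lanes that fit in a fixed 128-bit datapath (four 32-bit lanes or eight 16-bit lanes, as in the quoted sentence). The function names are hypothetical and do not come from Gen8.pdf.

```c
#include <assert.h>

/* Lanes a SIMD unit can process at once for a given element width,
   for a datapath of datapath_bits bits. */
unsigned simd_lanes(unsigned datapath_bits, unsigned element_bits)
{
    assert(element_bits > 0 && datapath_bits % element_bits == 0);
    return datapath_bits / element_bits;
}

/* Scale a quoted 32-bit peak rate (in IOP/cyc) to another element width.
   Assumption: peak rate is proportional to lane count. */
unsigned scaled_peak_iops(unsigned iops_32bit, unsigned element_bits)
{
    const unsigned datapath_bits = 128; /* assumption: 4 lanes x 32 bits */
    return iops_32bit * simd_lanes(datapath_bits, element_bits)
                      / simd_lanes(datapath_bits, 32);
}
```

With the 192 IOP/cyc figure from the table, scaled_peak_iops(192, 16) evaluates to 384, which matches the 192*2 estimate. It is a theoretical ceiling only, so a kernel limited by something other than compute would not be expected to reach it.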
The first question I would ask is whether you are limited by bandwidth, by compute, or by thread launch overhead.
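One rough way to make the bandwidth-versus-compute part of that question concrete is to compare how many cycles a work item would need for its arithmetic against how many it would need for its memory traffic. The sketch below is a simplified model, not a profiler: the function name is hypothetical, the peak rates are inputs you would have to supply yourself, and thread launch overhead is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the memory traffic of a work item is estimated to take
   longer than its arithmetic, i.e. the kernel looks bandwidth limited.
     ops_per_item         : arithmetic operations per work item
     bytes_per_item       : bytes read plus bytes written per work item
     peak_ops_per_cycle   : peak compute rate (caller-supplied)
     peak_bytes_per_cycle : peak memory rate (caller-supplied) */
bool is_bandwidth_limited(double ops_per_item, double bytes_per_item,
                          double peak_ops_per_cycle,
                          double peak_bytes_per_cycle)
{
    assert(peak_ops_per_cycle > 0.0 && peak_bytes_per_cycle > 0.0);
    double compute_cycles = ops_per_item / peak_ops_per_cycle;
    double memory_cycles = bytes_per_item / peak_bytes_per_cycle;
    return memory_cycles > compute_cycles;
}
```

A kernel that does one operation per element while reading and writing that element will normally come out as bandwidth limited under this model, in which case a faster 16-bit compute rate alone would not be visible in the timings.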
For bandwidth, it is good to read/write one int per work item, but better to read/write 4 ints per work item (int4).
If you are just reading/writing one short per work item, your performance will suffer (probably from too much thread launch overhead). It is better to use short2, or better yet short8: with short8 or int4 you will get the best bandwidth.
You should be able to process short8s in the same amount of time as int4s, so you effectively double your OPS. But first, make sure that you are reading and writing short8s (possibly in a loop of 4, 8, or 16 iterations to further reduce thread launch overhead). Please see my videos and code samples on optimizing simple OpenCL kernels here: https://software.intel.com/en-us/articles/optimizing-simple-opencl-kernels .
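The advice above can be sketched as host-side C plus an illustrative kernel string. This is a minimal sketch, not the code from the linked article: the helper names, the kernel name add_short8, and the add-one operation are all made up for illustration, and the kernel text is untested on real hardware. It shows that short8 and int4 both move 16 bytes per access, and how the number of work items shrinks as each one handles more data.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved by one vector access: short8 (8 x 2) and int4 (4 x 4)
   both come to 16 bytes. */
size_t vector_bytes(size_t lanes, size_t element_bytes)
{
    return lanes * element_bytes;
}

/* Work items to launch when each one handles `iters` vectors of `lanes`
   elements. Assumes n_elements divides evenly, to keep the sketch short. */
size_t work_items_needed(size_t n_elements, size_t lanes, size_t iters)
{
    size_t per_item = lanes * iters;
    assert(per_item > 0 && n_elements % per_item == 0);
    return n_elements / per_item;
}

/* Hypothetical OpenCL C kernel source: each work item processes `iters`
   short8 vectors. Strided indexing keeps neighbouring work items on
   neighbouring vectors in every iteration. */
const char *add_short8_src =
    "__kernel void add_short8(__global const short8 *src,\n"
    "                         __global short8 *dst,\n"
    "                         int iters)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    size_t stride = get_global_size(0);\n"
    "    for (int i = 0; i < iters; ++i) {\n"
    "        size_t idx = gid + (size_t)i * stride;\n"
    "        dst[idx] = src[idx] + (short8)(1);\n"
    "    }\n"
    "}\n";
```

For a buffer of 1048576 shorts, one short per work item means 1048576 work items, while short8 with a loop of 16 needs only work_items_needed(1048576, 8, 16) = 8192. That value would be the global work size passed to clEnqueueNDRangeKernel, with the kernel string handed to clCreateProgramWithSource.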