OpenCL* for CPU

Performance of half2 vectors vs. half scalars per SIMD8 lane?


Basic question for GEN8+ experts:

In a SIMD8 kernel, does the GEN8+ EU achieve maximum fp16 throughput with a half2 vector per SIMD lane, or would independent half scalars perform better, worse, or the same?

I am also wondering why assigning constants to 4 half2 vectors results in 8 scalar half MOVs.

Given a struct made up of 4 half2 vectors:

       a.x = 0;
       a.y = 0;
       a.z = 0;
       a.w = 1;

this is what gets generated:

         mov      (8|M0)         r79.0<1>:hf   0x3C00:hf                       
         mov      (8|M0)         r79.8<1>:hf   0x3C00:hf                       
         mov      (8|M0)         r78.0<1>:hf   0x0:hf                          
         mov      (8|M0)         r78.8<1>:hf   0x0:hf                          
         mov      (8|M0)         r77.0<1>:hf   0x0:hf                          
         mov      (8|M0)         r77.8<1>:hf   0x0:hf                          
         mov      (8|M0)         r76.0<1>:hf   0x0:hf                          
         mov      (8|M0)         r76.8<1>:hf   0x0:hf                          

I was expecting to see a 32-bit MOV initializing each half2 member.
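
To make that expectation concrete, here is a small host-side C sketch (a model only, not GEN assembly and not compiler output). It assumes a 32-byte GRF register and that both components of a half2 member receive the same constant, as in the listing above. The helper names mov8_hf, mov8_ud and same_register_image are made up for illustration, and the :ud form of the MOV is hypothetical. Under those assumptions, two 8-wide :hf MOVs and one 8-wide 32-bit MOV leave the same bytes in the register:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GRF_BYTES 32 /* assumption: one GRF register holds 256 bits */

/* Models "mov (8|M0) rN.<off><1>:hf imm:hf": writes 8 consecutive halfs. */
static void mov8_hf(uint8_t *grf, int hf_offset, uint16_t imm)
{
    assert(hf_offset >= 0 && hf_offset + 8 <= GRF_BYTES / 2);
    for (int lane = 0; lane < 8; ++lane)
        memcpy(grf + 2 * (hf_offset + lane), &imm, sizeof imm);
}

/* Models a hypothetical "mov (8|M0) rN.0<1>:ud imm:ud": writes 8 dwords. */
static void mov8_ud(uint8_t *grf, uint32_t imm)
{
    for (int lane = 0; lane < 8; ++lane)
        memcpy(grf + 4 * lane, &imm, sizeof imm);
}

/* Returns 1 if two :hf MOVs and one :ud MOV give the same register image
 * when both half2 components are set to the fp16 bit pattern h. */
int same_register_image(uint16_t h)
{
    uint8_t two_hf_movs[GRF_BYTES];
    uint8_t one_ud_mov[GRF_BYTES];

    mov8_hf(two_hf_movs, 0, h); /* like r79.0<1>:hf */
    mov8_hf(two_hf_movs, 8, h); /* like r79.8<1>:hf */

    /* Same half repeated in the low and high 16 bits of each dword. */
    mov8_ud(one_ud_mov, ((uint32_t)h << 16) | h);

    return memcmp(two_hf_movs, one_ud_mov, GRF_BYTES) == 0;
}
```

Calling same_register_image(0x3C00) (1.0 in fp16) or same_register_image(0x0000) compares the two register images byte for byte. The equivalence only holds because both components get the same value. With different values per component, the layout in the listing (8 lanes of one component followed by 8 lanes of the other) would not match a repeated dword pattern.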


1 Reply

After thinking about it, I'm going to guess that it would be better to execute a SIMD16 kernel with one half per lane instead of a SIMD8 kernel with a half2 per lane.
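
The arithmetic behind that guess can be sketched in a few lines of host-side C. This is a back-of-the-envelope model, not a measurement. It assumes a 32-byte GRF register and that one :hf instruction processes exactly as many elements as the kernel's SIMD width, which matches the SIMD8 listing above but is only assumed for SIMD16. The function names are made up for illustration.

```c
#include <assert.h>

enum { GRF_BYTES = 32, HALF_BYTES = 2 }; /* assumption: 256-bit GRF register */

/* Number of halfs one variable occupies across all lanes of the kernel. */
int halfs_per_value(int simd_width, int vec_width)
{
    assert(simd_width > 0 && vec_width > 0);
    return simd_width * vec_width;
}

/* Registers needed to hold one variable across all lanes (rounded up). */
int grf_per_value(int simd_width, int vec_width)
{
    int bytes = halfs_per_value(simd_width, vec_width) * HALF_BYTES;
    return (bytes + GRF_BYTES - 1) / GRF_BYTES;
}

/* :hf instructions needed to touch every element of one variable, assuming
 * each instruction covers simd_width elements. */
int hf_instructions_per_value(int simd_width, int vec_width)
{
    return halfs_per_value(simd_width, vec_width) / simd_width;
}
```

Under these assumptions, SIMD8 with half2 (simd_width 8, vec_width 2) and SIMD16 with half (simd_width 16, vec_width 1) both fill exactly one register per variable, but the SIMD8 case needs two :hf instructions to do it and the SIMD16 case needs one. Whether that translates into higher fp16 throughput on the EU is still the open question in this thread.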
