allanmac1 · 12-16-2016

Basic question for GEN8+ experts:

In a SIMD8 kernel, does the GEN8+ EU achieve maximum fp16 throughput with half2 vectors per SIMD lane or are independent half scalars going to be better/worse/same?

I am also wondering why assigning constants to 4 half2 vectors results in 8 separate MOVs of type half?

Given a struct `a` made up of 4 half2 members (x, y, z, w), initialized like this:

       a.x = 0;
       a.y = 0;
       a.z = 0;
       a.w = 1;

this is what gets generated:

         mov      (8|M0)         r79.0<1>:hf   0x3C00:hf
         mov      (8|M0)         r79.8<1>:hf   0x3C00:hf
         mov      (8|M0)         r78.0<1>:hf   0x0:hf
         mov      (8|M0)         r78.8<1>:hf   0x0:hf
         mov      (8|M0)         r77.0<1>:hf   0x0:hf
         mov      (8|M0)         r77.8<1>:hf   0x0:hf
         mov      (8|M0)         r76.0<1>:hf   0x0:hf
         mov      (8|M0)         r76.8<1>:hf   0x0:hf

I was expecting to see a 32-bit MOV initializing each half2 member.
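
To make that expectation concrete, here is a small sketch of the bit arithmetic, written in Python purely to check the patterns (it is not GEN code, and the helper names are made up for illustration). It assumes little-endian storage and takes the 16 half elements per register from the listing above (r79.0 and r79.8, 8 elements each). Under those assumptions, 8 dword writes of 0x3C003C00 leave the same bytes as 16 half writes of 0x3C00:

```python
import struct

HALF_ONE = 0x3C00   # fp16 bit pattern for 1.0, as seen in the listing
HALF_ZERO = 0x0000  # fp16 bit pattern for 0.0


def pack_half_pair(lo, hi):
    """Combine two 16-bit patterns into one 32-bit word (lo in the low 16 bits)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def fill_with_halves(bits, count=16):
    """Bytes produced by `count` little-endian 16-bit writes of `bits`."""
    return struct.pack("<%dH" % count, *([bits] * count))


def fill_with_dwords(bits, count=8):
    """Bytes produced by `count` little-endian 32-bit writes of `bits`."""
    return struct.pack("<%dI" % count, *([bits] * count))


if __name__ == "__main__":
    # Sanity check that 0x3C00 really decodes to 1.0 as an IEEE half.
    print(struct.unpack("<e", struct.pack("<H", HALF_ONE))[0])

    one_pair = pack_half_pair(HALF_ONE, HALF_ONE)
    print(hex(one_pair))

    # Two 8-wide :hf writes vs. one 8-wide 32-bit write over the same 32 bytes.
    print(fill_with_halves(HALF_ONE) == fill_with_dwords(one_pair))
    print(fill_with_halves(HALF_ZERO) == fill_with_dwords(pack_half_pair(0, 0)))
```

This only shows that the two fills are byte-identical for a broadcast constant. It says nothing about whether the compiler can or should emit the 32-bit form, or whether it would be any faster.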

allanmac1 · 12-19-2016

After thinking about it, I'm going to guess that it would be better to execute a SIMD16 kernel with one half per lane instead of a SIMD8 kernel with a half2 per lane.

performance of half2 vector vs. half scalars per SIMD8 lane?