
performance of half2 vector vs. half scalars per SIMD8 lane?

Basic question for GEN8+ experts:

In a SIMD8 kernel, does a GEN8+ EU reach maximum fp16 throughput with a half2 vector per SIMD lane, or would independent half scalars perform better, worse, or the same?

I am also wondering why assigning constants to 4 half2 vectors results in 8 scalar half MOVs.

Given a struct made up of 4 half2 vectors:

       a.x = 0;
       a.y = 0;
       a.z = 0;
       a.w = 1;

this is what gets generated:

         mov      (8|M0)         r79.0<1>:hf   0x3C00:hf                       
         mov      (8|M0)         r79.8<1>:hf   0x3C00:hf                       
         mov      (8|M0)         r78.0<1>:hf   0x0:hf                          
         mov      (8|M0)         r78.8<1>:hf   0x0:hf                          
         mov      (8|M0)         r77.0<1>:hf   0x0:hf                          
         mov      (8|M0)         r77.8<1>:hf   0x0:hf                          
         mov      (8|M0)         r76.0<1>:hf   0x0:hf                          
         mov      (8|M0)         r76.8<1>:hf   0x0:hf                          

I was expecting to see a 32-bit MOV initializing each half2 member.
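
To make the expectation concrete, here is a minimal sketch in plain Python (not GEN code) of how two fp16 values would pack into one 32-bit immediate. The helper names half_bits and pack_half2 are made up for illustration, and the low/high ordering is an assumption. It only shows the bit patterns involved and says nothing about what the compiler or the EU would prefer.

```python
import struct

def half_bits(value):
    """Return the IEEE 754 binary16 bit pattern of value as an integer."""
    # 'e' is the struct format code for a 16-bit half-precision float.
    return struct.unpack("<H", struct.pack("<e", value))[0]

def pack_half2(lo, hi):
    """Pack two halves into one 32-bit word (assumed order: lo in bits 0..15)."""
    return (half_bits(hi) << 16) | half_bits(lo)

one = half_bits(1.0)                  # the 0x3C00:hf immediate in the listing
zero = half_bits(0.0)                 # the 0x0:hf immediate in the listing
packed_one = pack_half2(1.0, 1.0)     # candidate 32-bit immediate for a.w
packed_zero = pack_half2(0.0, 0.0)    # candidate 32-bit immediate for a.x, a.y, a.z

print(hex(one), hex(zero), hex(packed_one), hex(packed_zero))
```

So the 0x3C00:hf in the listing is fp16 1.0, and a 32-bit MOV per half2 member would carry a packed immediate instead of two separate half immediates.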

 

1 Reply

After thinking about it, my guess is that it would be better to run a SIMD16 kernel with one half per lane instead of a SIMD8 kernel with a half2 per lane.
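
A quick back-of-the-envelope check of the register footprint, as a sketch only. It assumes 32-byte (256-bit) GRF registers, and grf_per_value is a made-up helper name. It compares storage per value, not throughput.

```python
GRF_BYTES = 32   # assumption: one GRF register holds 256 bits
HALF_BYTES = 2   # an fp16 value is 16 bits

def grf_per_value(simd_width, halves_per_lane):
    """Number of GRF registers needed to hold one value across all lanes."""
    return simd_width * halves_per_lane * HALF_BYTES / GRF_BYTES

simd8_half2 = grf_per_value(8, 2)    # SIMD8 kernel, half2 per lane
simd16_half = grf_per_value(16, 1)   # SIMD16 kernel, half per lane
simd8_half = grf_per_value(8, 1)     # SIMD8 kernel, scalar half per lane

print(simd8_half2, simd16_half, simd8_half)
```

Under that assumption, SIMD8/half2 and SIMD16/half both fill exactly one register per value, while SIMD8 with a scalar half fills only half of one. That matches the listing above, where each half2 member occupies a full register written by two 8-wide MOVs (elements 0..7 and 8..15).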
