Basic question for GEN8+ experts:
In a SIMD8 kernel, does the GEN8+ EU reach maximum fp16 throughput with a half2 vector per SIMD lane, or would independent half scalars perform better, worse, or the same?
I am also wondering why assigning constants to 4 half2 vectors results in 8 scalar half MOVs.
Given a struct made up of 4 half2 vectors:
a.x = 0; a.y = 0; a.z = 0; a.w = 1;
this is what gets generated:
mov (8|M0) r79.0<1>:hf 0x3C00:hf
mov (8|M0) r79.8<1>:hf 0x3C00:hf
mov (8|M0) r78.0<1>:hf 0x0:hf
mov (8|M0) r78.8<1>:hf 0x0:hf
mov (8|M0) r77.0<1>:hf 0x0:hf
mov (8|M0) r77.8<1>:hf 0x0:hf
mov (8|M0) r76.0<1>:hf 0x0:hf
mov (8|M0) r76.8<1>:hf 0x0:hf
I was expecting to see a 32-bit MOV initializing each half2 member.
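To make that expectation concrete, here is a rough Python model of one 32-byte GRF register. It is only a sketch under stated assumptions: the subregister number in `r79.8<1>:hf` is taken to be in units of the element type, the helper names (`mov_simd8_hf`, `mov_simd8_ud`, `half_bits_to_float`) are made up for illustration, and the `:ud` form with immediate `0x3C003C00` is the hypothetical 32-bit MOV, not something the compiler emitted. Under those assumptions, the two SIMD8 `:hf` MOVs to `r79` and a single SIMD8 32-bit MOV leave the same bytes in the register.

```python
import struct

GRF_BYTES = 32  # one Gen8+ GRF register is 256 bits wide


def mov_simd8_hf(reg: bytearray, subreg: int, imm: int) -> None:
    """Model `mov (8|M0) rN.subreg<1>:hf imm:hf`: write 8 contiguous 16-bit elements."""
    for lane in range(8):
        struct.pack_into("<H", reg, (subreg + lane) * 2, imm)


def mov_simd8_ud(reg: bytearray, subreg: int, imm: int) -> None:
    """Model a hypothetical `mov (8|M0) rN.subreg<1>:ud imm:ud`: write 8 contiguous 32-bit elements."""
    for lane in range(8):
        struct.pack_into("<I", reg, (subreg + lane) * 4, imm)


def half_bits_to_float(bits: int) -> float:
    """Decode an IEEE 754 binary16 bit pattern."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


# What the compiler generated for a.w = 1: two SIMD8 half MOVs into r79.
as_emitted = bytearray(GRF_BYTES)
mov_simd8_hf(as_emitted, 0, 0x3C00)
mov_simd8_hf(as_emitted, 8, 0x3C00)

# What I expected: one SIMD8 32-bit MOV carrying both halves of the half2.
as_expected = bytearray(GRF_BYTES)
mov_simd8_ud(as_expected, 0, 0x3C003C00)

print(half_bits_to_float(0x3C00))
print(as_emitted == as_expected)
```

The two forms only coincide here because both components of the half2 receive the same constant; the model says nothing about which form the EU executes faster.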
1 Reply
After thinking about it, my guess is that it would be better to execute a SIMD16 kernel with one half per lane than a SIMD8 kernel with a half2 per lane.
