Solved: GPU float16 memory access efficiency

Georg_K_ · ‎12-23-2015

According to https://software.intel.com/en-us/node/540447, code like that below is inefficient as each work item in a subgroup (SIMD width) touches a different cache line (and only reads 4 bytes from each).

__global int*   myArray = ...;
int x;
x = myArray[ get_global_id(0) * 16 ];

However, what happens when you access 16-wide vectors instead? Will the compiler issue a 64-byte wide memory read per work item thus reading a full cache line per access, or will it treat each vector component individually, resulting in the same reduced memory bandwidth?

__global float16*   myArray = ...;
float16 x;
x = myArray[ get_global_id(0) ];

Robert_I_Intel · ‎01-12-2016

Georg,

Very interesting question: took me a while to research :)

I took the following code:

kernel void foo(global float* in, global float* out) { 
 int i = get_global_id(0);

 float f = in;
 float temp = 0.5f * f;
 out = temp;
}

kernel void foo2(global float2* in, global float2* out) { 
 int i = get_global_id(0);

 float2 f = in;
 float2 temp = 0.5f * f;
 out = temp;
}

kernel void foo4(global float4* in, global float4* out) { 
 int i = get_global_id(0);

 float4 f = in;
 float4 temp = 0.5f * f;
 out = temp;
}

kernel void foo8(global float8* in, global float8* out) { 
 int i = get_global_id(0);

 float8 f = in;
 float8 temp = 0.5f * f;
 out = temp;
}

kernel void foo16(global float16* in, global float16* out) { 
 int i = get_global_id(0);

 float16 f = in;
 float16 temp = 0.5f * f;
 out = temp;
}

and compiled it with our latest Intel SDK for OpenCL Applications:

OpenCL Intel(R) Graphics device was found!

Device name: Intel(R) Iris(TM) Graphics 6100

Device version: OpenCL 2.0

Device vendor: Intel(R) Corporation

Device profile: FULL_PROFILE

fcl build 1 succeeded.

fcl build 2 succeeded.

bcl build succeeded.

foo info:

Maximum work-group size: 256

Compiler work-group size: (0, 0, 0)

Local memory size: 0

Preferred multiple of work-group size: 32

Minimum amount of private memory: 0

foo2 info:

Maximum work-group size: 256

Compiler work-group size: (0, 0, 0)

Local memory size: 0

Preferred multiple of work-group size: 32

Minimum amount of private memory: 0

foo4 info:

Maximum work-group size: 256

Compiler work-group size: (0, 0, 0)

Local memory size: 0

Preferred multiple of work-group size: 32

Minimum amount of private memory: 0

foo8 info:

Maximum work-group size: 256

Compiler work-group size: (0, 0, 0)

Local memory size: 0

Preferred multiple of work-group size: 16

Minimum amount of private memory: 0

foo16 info:

Maximum work-group size: 256

Compiler work-group size: (0, 0, 0)

Local memory size: 0

Preferred multiple of work-group size: 16

Minimum amount of private memory: 0

Build succeeded!

To summarize: float, float2, and float4 versions were compiled SIMD32 and float8 and float16 versions were compiled SIMD16. Now, lets look at the assembly generated by the read instruction in all those kernels:

foo:

         send     (8|M0)        r27           r19               0xC       0x4146EFF                 // id:2
         send     (8|M8)        r28           r21               0xC       0x4146EFF                 // id:2
         send     (8|M16)       r29           r23               0xC       0x4146EFF                 // id:2
         send     (8|M24)       r30           r25               0xC       0x4146EFF                 // id:2

In SIMD32 mode, two send instructions act as one, so we actually have two sends here. One register holds 8 floats, so it takes 4 registers to keep the values of one read.

foo2:

         send     (8|M0)        r31           r19               0xC       0x4246CFF                 // id:2
         send     (8|M8)        r23           r21               0xC       0x4246CFF                 // id:2
         send     (8|M16)       r35           r25               0xC       0x4246CFF                 // id:2
         send     (8|M24)       r29           r27               0xC       0x4246CFF                 // id:2

Though those sends (loads) look similar to foo kernel, each actually loads two registers worth of data, so one instruction (two sends) loads 4 registers, for a total of 8 registers for the code above. If you look at the full assembly listing, you will notice several moves to get the loaded data in a shape acceptable to multiply instruction. Nevertheless, the sends are the expensive ones.

foo4:

         send     (8|M0)        r19           r15               0xC       0x44460FF                 // id:2
         send     (8|M8)        r23           r17               0xC       0x44460FF                 // id:2
         send     (8|M16)       r31           r27               0xC       0x44460FF                 // id:2
         send     (8|M24)       r35           r29               0xC       0x44460FF                 // id:2

Again, four sends here (actually two SIMD32 instructions), but now we load 8 registers worth of data. The number of moves to prepare data for multiplication increases even more, but the number of sends stays the same - we just load more data.

foo8:

         send     (8|M0)        r50           r72               0xC       0x44460FF                 // id:1
         send     (8|M8)        r76           r74               0xC       0x44460FF                 // id:1
         send     (8|M0)        r58           r68               0xC       0x44460FF                 // id:2
         send     (8|M8)        r64           r70               0xC       0x44460FF                 // id:2

Now were are in SIMD16 land, so we have 4 sends that are 4 SIMD16 instructions. Each loads 4 registers worth of data. In fact, the amount of data here is similar to the previous case, it is just that we have 2X data per SIMD lane.

foo16:

         send     (8|M0)        r82           r120              0xC       0x44460FF                 // id:2
         send     (8|M8)        r18           r122              0xC       0x44460FF                 // id:2
         send     (8|M0)        r90           r14               0xC       0x44460FF                 // id:2
         send     (8|M8)        r10           r16               0xC       0x44460FF                 // id:2
         send     (8|M0)        r98           r6                0xC       0x44460FF                 // id:2
         send     (8|M8)        r124          r8                0xC       0x44460FF                 // id:2
         send     (8|M0)        r106          r116              0xC       0x44460FF                 // id:2
         send     (8|M8)        r112          r118              0xC       0x44460FF                 // id:2

Now, we finally get to double the number of sends. Still each send loads 4 registers worth of data. Each register will have 8 floats. And there will be a whole lot of moves to prepare data for multiplication (compiler generates 16 mul operations, each working on two registers worth of floats).

So, each send is capable of bringing a maximum of 128 bytes, or 32 floats to the thread. For more efficient loading you will need to look into bulk media loads, which are part of intel simd shuffle extension.

Note that in terms of efficiency of loading the data float4 in SIMD32 case and float8 in SIMD16 case are equivalent.So you could say that we are sort of doing N-wide loads.

View solution in original post

Jeffrey_M_Intel1 · ‎12-30-2015

Sorry for the delayed reply.

In general, the compiler does a good job of optimizing memory I/O. In some quick experiments with the SimpleOptimizations performance was close for float, float4, and float8. As expected, the kernels based on vector types were faster. Performance with float16 was slower than expected.

experiment kernel runtime (ms)

float 3390
float4 3215
float8 3020
float16 4215

(Run on i5-4570, Linux Media Server Studio 2015 R6)

Investigating now. Will get back to you soon with more info.

Robert_I_Intel · ‎01-06-2016

Hi Georg,

The optimal data type for our architecture is typically float4 (or uint4, int4, or char16 or uchar16 - basically 16 bytes from a work-item). Occasionally, you may get good performance from float8 (typically, very short kernels as you can see above). Using float16 is not recommended for couple of reasons:

1) you typically end up using way too much private memory per work item so you will be bumped from SIMD32 to SIMD16 compilation for small kernels and from SIMD16 to SIMD8 for larger ones and if you are already at SIMD8, there is a high probability of spills to global memory, which you don't want;

2) You will probably end up with too much compute per hardware thread (typically you have 8 threads per EU (Ivy Bridge) or 7 threads per EU (Haswell and Broadwell and Skylake) and only 2 SIMD4 FPUs per EU, so when using float16, you are most likely will be compute limited for any reasonably sized kernel.

3) You are reading/writing much more data that the optimal spot of the architecture, so you are definitely bandwidth limited with float16.

With float16, your kernel will probably compile SIMD16 (it is very short) or SIMD8. In the first case you end up reading 16 * 4 * 16 = 1024 bytes of data from a hardware thread - 4 times the optimal amount (16 cache lines worth of data, as opposed to 4 or 8, which is optimal).

Georg_K_ · ‎01-11-2016

Hi Robert and Jeffrey,

Thanks for the detailed answer.

Re 1) I am working with a Broadwell/Iris 6100 system at the moment. My understanding is that Gen8 GPUs have 28 KiB of general purpose registers per Execution Unit, 4 KiB per hardware thread. In SIMD8 mode each work item should thus have 128 dwords of private memory available (64 and 32 dwords for SIMD16 and SIMD32, respectively). Of those I imagine some will be used for work item IDs, kernel args and other internal values (I seem to recall this from the Beignet source code), so the number available to user code is somewhat smaller. Still, using a couple of float16 values will only cause spilling if there's a significant amount of other live registers.

While I indeed intend to be careful not to spill to global memory, I don't think the documentation mentions any drawbacks of using the register space fully (since registers are assigned to threads on a fixed basis there is no register/thread occupancy trade-off as on other GPUs).

By the way, I think it's a shame the Intel OpenCL implementation seems to have no way of forcing a certain SIMD width, or, in fact, no way of directly using an EU thread as a single work item. While the cl_intel_subgroups extension does much to allow work items of the same subgroup to communicate efficiently, my kernel code would actually be quite a bit more readable (and possibly more efficient) if written as mostly using float16 operations on arrays spanning the whole 4 KiB of private memory.

Re 2) Whether I choose to use float16 or a smaller type will have no bearing on the computation my kernel needs to perform nor on the number of values I need to read from memory. The question is whether I will have to rearrange the values in memory to get good utilisation of bandwidth or whether it's ok to process 16 consecutive values in a single work item.

Ultimately my question comes down to whether the compiler can generate load ops that are effectively transposed, i.e. instead of generating a sequence of N loads which each read one float per work item in parallel, generate a sequence of N-wide loads, one per work item.

Robert_I_Intel · ‎01-12-2016