GPU Compute Software
Ask questions about Intel® Graphics Compute software technologies, such as the OpenCL* GPU driver and oneAPI Level Zero.

Performance of "cl_intel_subgroup_split_matrix_multiply_accumulate"?

allanmac
Novice

After looking at the cl_intel_subgroup_split_matrix_multiply_accumulate extension as well as the `dot product accumulate systolic wide` (dpasw) instruction, I'm wondering whether there is any performance penalty compared to the plain dpas instruction.

I understand that each 8-wide subgroup only needs to load half of the "A" (dpasw:src2) matrix into the GRF.
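
For concreteness, here is how I read the builtin signatures for one shape (int8 inputs, M=8, K=32), paraphrasing the two extension specs (the exact overloads should be double-checked against the published documents):

```c
// Non-split extension (cl_intel_subgroup_matrix_multiply_accumulate):
// each subgroup passes the full MxK (8x32) "a" tile.
int8 intel_sub_group_i8_i8_matrix_mad_k32(int8 a, int8 b, int8 acc);

// Split extension (cl_intel_subgroup_split_matrix_multiply_accumulate):
// each subgroup in the cooperating pair passes only its half of "a",
// so the "a" argument shrinks from int8 to int4.
int8 intel_sub_group_i8_i8_split_matrix_mad_k32(int4 a, int8 b, int8 acc);
```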

Does this also imply:

  • The two 8-wide subgroups (EU threads) only cooperate for the duration of dpasw?
  • An identically-valued matrix "B" has been loaded into both cooperating subgroups' GRFs?
  • While the "split_matmul" is executing, the lifetimes of the "Accumulator" and "Results" matrices in each subgroup's register footprint might also be implicitly halved?
  • Upon completion, the two EU threads exchange (and EU1 adjusts) their accumulators, resulting in the full-sized "Results" matrix in each subgroup?

So is this only useful under severe register pressure, or are there other benefits as well?

Thanks!

Allan

@Ben_A_Intel 

Ben_A_Intel
Employee

Hi Allan, good questions! Please note that we've been intentionally vague in the description of this feature because it's something we tried in one generation of GPUs and it didn't work out as well as we would have liked. We may drop support for this extension in future GPUs.

It's most helpful to understand how this extension (and "dpasw") works relative to the non-split cl_intel_subgroup_matrix_multiply_accumulate (and "dpas"):

  • Each subgroup computes an MxN result matrix in both cases.
  • Each subgroup loads and passes its own unique KxN "b" matrix and MxN "acc" matrix sources in both cases.
  • In the non-split "dpas" case, each subgroup passes its own MxK "a" matrix, but because of the way matrix multiplication works, this ends up being the same "a" matrix for both subgroups in the pair.
  • In the split "dpasw" case, each subgroup in the pair passes half of the MxK "a" matrix (see the sketch after this list). This has two benefits:
    • It lowers register pressure slightly, since the "a" matrix is effectively half as big. This can be beneficial for register-constrained kernels.
    • It lowers bandwidth requirements, since each subgroup effectively loads half as much "a" matrix data. We've generally found this to be the biggest benefit, because even matrix multiplication kernels are frequently memory-bound.
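
To make the two benefits concrete, here's a minimal, illustrative kernel fragment for the int8, M=8, K=32 shape. The kernel name, indexing, and plain vloads are simplified placeholders for readability (a real kernel would typically use block reads), so treat this as a sketch rather than recommended usage:

```c
// Requires: cl_intel_subgroup_split_matrix_multiply_accumulate
// Sketch: one 8x8 int32 output tile per subgroup, int8 inputs, K = 32.
// Per-subgroup "a" footprint: "dpas" keeps the full 8x32 B = 256 B tile
// in the GRF; "dpasw" keeps half (128 B), because the paired subgroup
// holds the other half and the two execute in lockstep.
__kernel void split_mad_sketch(__global const int* A,
                               __global const int* B,
                               __global int* C)
{
    const uint lid  = get_sub_group_local_id();
    const uint pair = get_sub_group_id() & 1;   // which half of the pair

    // Both subgroups in the pair read the same KxN "b" tile and keep
    // their own MxN accumulator, exactly as in the non-split case.
    int8 b   = vload8(lid, B);
    int8 acc = (int8)(0);

    // Split case: each subgroup reads only its half of the MxK "a"
    // tile, halving both the load traffic and the GRF footprint.
    int4 a_half = vload4(lid + pair * get_max_sub_group_size(), A);

    acc = intel_sub_group_i8_i8_split_matrix_mad_k32(a_half, b, acc);

    vstore8(acc, lid, C);
}
```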

This all works because of the way GPUs supporting "dpasw" execute subgroups: two subgroups execute together, in lockstep, so the two halves of the "a" matrix are always available as a pair. This means that "dpasw" may not always be better than "dpas", but it shouldn't be worse, either.

Hope this helps!

  -- Ben

allanmac
Novice

Thanks for the clarifications.

The EUs continue to be full of unique features!

Side note: looking forward to seeing some of these high-impact Intel extensions show up in Vulkan/GLSL compute, e.g. bfloat matmul and block_read/write.
