Solved: How many bits are required to support arbitrary shuffle operations?

Kaixi_H_ · ‎08-08-2014

Hi all,

Recently, I have some doubts about why the support of arbitrary shuffle operations is impractical while reading the "Intel Xeon Phi Coprocessor System Software Developers Guide" (link: https://software.intel.com/sites/default/files/article/334766/intel-xeon-phi-systemsoftwaredevelopersguide.pdf). On page 157, it mentions "To support fully arbitrary control sequences across all of the element muxes, however, would require 32 bits of immediate encoding on the shuffle instruction." Then, I know it is impractical because the immediate encoding bits are too long. However, I don't know why it is 32 bits. For my understanding, if there are 16 numbers in the 512-bit vector, the possible combinations should be 16^16=2^64 (including duplicate elements), which means 64 bits of immediate encoding are required. I am not sure how the 32 bits are derived?

Frances_R_Intel · ‎08-15-2014

I think what you mean by an arbitrary shuffle is not what the System Software Developers Guide means.

What you mean is the ability to put any of the 16 elements of the input vector into any of the 16 elements of the output vector (including dublications). You might picture this as having 16 muxes, one for each of the elements on the output vector. Each mux would have 16 inputs, one for each element of the input vector. You need 4 bits to set each mux (2^4=16). So for a completely and totally arbitrary shuffle, yes, you would need 64 bits - 16 muxes times 4 bits.

But this is not how the shuffle works.

Each vector is divided into 4 lanes where each lane is 128 bits of the vector, in other words each lane contains four 16 bit elements of the vector. Then there are two levels of muxes. The first level shuffles any of the 4 lanes from the input vector to any of the 4 lanes of a temporary vector (including duplications). In this case, you need 4 muxes each with 4 inputs. So that takes 4*2 bits to set the muxes, 4 muxes times 2 bits to represent the numbers from 0 to 3.

Then you have the second level of muxes, which shuffles elements. But you cannot shuffle elements across lane boundaries. There are still 16 muxes, one for each of the 16 elements of the output vector but each mux has only 4 inputs, one for each element of the temporary vector above that lies within the same lane as that element of the output vector. So if you could set each of the 16 muxes independently, you would need 16*2 bits, 16 muxes times the 2 bits it takes to represent the numbers 0 to 3. And there is your 32.

In reality, you do not set each of the muxes independently. Instead you specify the settings for each mux in the first lane of the output vector Then those same mux setting are duplicated across the other 3 lanes as well. So your immediate value is 4 muxes times 2 bits, i.e.8 bits.

View solution in original post

Frances_R_Intel · ‎08-15-2014

I think what you mean by an arbitrary shuffle is not what the System Software Developers Guide means.

What you mean is the ability to put any of the 16 elements of the input vector into any of the 16 elements of the output vector (including dublications). You might picture this as having 16 muxes, one for each of the elements on the output vector. Each mux would have 16 inputs, one for each element of the input vector. You need 4 bits to set each mux (2^4=16). So for a completely and totally arbitrary shuffle, yes, you would need 64 bits - 16 muxes times 4 bits.

But this is not how the shuffle works.

Each vector is divided into 4 lanes where each lane is 128 bits of the vector, in other words each lane contains four 16 bit elements of the vector. Then there are two levels of muxes. The first level shuffles any of the 4 lanes from the input vector to any of the 4 lanes of a temporary vector (including duplications). In this case, you need 4 muxes each with 4 inputs. So that takes 4*2 bits to set the muxes, 4 muxes times 2 bits to represent the numbers from 0 to 3.

Then you have the second level of muxes, which shuffles elements. But you cannot shuffle elements across lane boundaries. There are still 16 muxes, one for each of the 16 elements of the output vector but each mux has only 4 inputs, one for each element of the temporary vector above that lies within the same lane as that element of the output vector. So if you could set each of the 16 muxes independently, you would need 16*2 bits, 16 muxes times the 2 bits it takes to represent the numbers 0 to 3. And there is your 32.

In reality, you do not set each of the muxes independently. Instead you specify the settings for each mux in the first lane of the output vector Then those same mux setting are duplicated across the other 3 lanes as well. So your immediate value is 4 muxes times 2 bits, i.e.8 bits.

Kaixi_H_ · ‎08-15-2014

Thanks, Frances.

It makes sense to me now. My original understanding is that because the arbitrary data-reordering of 16 elements requires 64 bits, the Intel guys use two levels of data-reordering (permute and shuffle) to overcome this problem. And I believe the data-reordering is "shuffle". However, the "shuffle" in the System Software Developers Guide is always referred to the second level of the data-reordering and thus the arbitrary shuffle (second level data-reordering) needs 32 bits.