
Efficiently use vector registers

zhen_j_
Beginner

I am considering vectorizing an application on Xeon Phi. The calculation part of the program looks like this (only a part of the code is shown):

t[0] = (c[0]+c[16]+c[32]) + (c[4]+c[20]+c[36]) + (c[8]+c[24]+c[40]);
t[1] = (c[1]+c[17]+c[33]) + (c[5]+c[21]+c[37]) + (c[9]+c[25]+c[41]);
t[2] = (c[2]+c[18]+c[34]) + (c[6]+c[22]+c[38]) + (c[10]+c[26]+c[42]);
t[3] = (c[3]+c[19]+c[35]) + (c[7]+c[23]+c[39]) + (c[11]+c[27]+c[43]);

t[4] = (c[4]+c[20]+c[36]) - (c[8]+c[24]+c[40]) - (c[12]+c[28]+c[44]);
t[5] = (c[5]+c[21]+c[37]) - (c[9]+c[25]+c[41]) - (c[13]+c[29]+c[45]);
t[6] = (c[6]+c[22]+c[38]) - (c[10]+c[26]+c[42]) - (c[14]+c[30]+c[46]);
t[7] = (c[7]+c[23]+c[39]) - (c[11]+c[27]+c[43]) - (c[15]+c[31]+c[47]);

t[8] = (c[16]-c[32]-c[48]) + (c[20]-c[36]-c[52]) + (c[24]-c[40]-c[56]);
t[9] = (c[17]-c[33]-c[49]) + (c[21]-c[37]-c[53]) + (c[25]-c[41]-c[57]);
t[10] = (c[18]-c[34]-c[50]) + (c[22]-c[38]-c[54]) + (c[26]-c[42]-c[58]);
t[11] = (c[19]-c[35]-c[51]) + (c[23]-c[39]-c[55]) + (c[27]-c[43]-c[59]);

It loads data into an array c, then adds or subtracts the elements of c, and finally stores the results in the array t. Each element of c, like c[0], holds 16 floats; the data type of each element in c and t is __m512, and the length of c is 64. The problem is that Xeon Phi has only 32 vector registers, so I cannot keep all 64 elements of c in registers at once. I need to load part of the data, compute, then load more, and I may need to load some data several times to finish the whole calculation. I notice that some intermediate results can be reused several times, so the whole procedure is like a tree.
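
For example, t[0] and t[4] above both need the sums (c[4]+c[20]+c[36]) and (c[8]+c[24]+c[40]). A rough sketch of the reuse with explicit intrinsics (not my real code; I am assuming the usual _mm512_add_ps/_mm512_sub_ps intrinsics from immintrin.h):

#include <immintrin.h>

// Sketch only: compute each shared three-term sum once, then reuse it.
// s1 and s2 are needed by both t[0] and t[4].
__m512 s0 = _mm512_add_ps(_mm512_add_ps(c[0],  c[16]), c[32]);
__m512 s1 = _mm512_add_ps(_mm512_add_ps(c[4],  c[20]), c[36]);
__m512 s2 = _mm512_add_ps(_mm512_add_ps(c[8],  c[24]), c[40]);
__m512 s3 = _mm512_add_ps(_mm512_add_ps(c[12], c[28]), c[44]);

t[0] = _mm512_add_ps(_mm512_add_ps(s0, s1), s2);   // s0 + s1 + s2
t[4] = _mm512_sub_ps(_mm512_sub_ps(s1, s2), s3);   // s1 - s2 - s3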

So the question is: is there an algorithm that can efficiently decide when to load data, when to perform the computation, and when to store the intermediate data?

Another question: I need to interleave the computation and the memory operations so as to achieve better performance. Any suggestions?

1 Solution
gaston-hillar
Valued Contributor I

Hi Zhen Jia,

The following link provides a great video by Vadim Karpusenko and Andrey Vladimirov: https://software.intel.com/en-us/videos/episode-4-2-automatic-vectorization-and-array-notation
 
They discuss the automatic vectorization feature of the compilers, where it can be used, and how to diagnose it. I believe this video will provide you with valuable information.
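
As a quick illustration (mine, not from the video), automatic vectorization applies to simple loops with independent iterations, and the compiler's vectorization report tells you whether it succeeded:

// Minimal example of a loop the compiler can auto-vectorize.
// With the Intel compiler, something like
//   icc -O2 -qopt-report=2 -qopt-report-phase=vec   (or -vec-report on older versions)
// reports whether this loop was vectorized.
void sum3(float *restrict t, const float *restrict a,
          const float *restrict b, const float *restrict c, int n)
{
    for (int i = 0; i < n; i++)
        t[i] = a[i] + b[i] + c[i];   // independent iterations -> vectorizable
}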


jimdempseyatthecove
Honored Contributor III

Do you mean you have something like:

float TensorArray[nTensorX][nTensorY][nTensorZ];

And then c[] receives a 4x4x4 section of the above array?
Then, for any given section, it is used once?

Jim Dempsey

zhen_j_
Beginner

Hi Jim,

Yes, that is what I mean. In each 4*4*4 section, the computation is just like the code given in the previous post. Some elements in c[] will be used only once; some will be used two or three times.

Thanks!

jimdempseyatthecove
Honored Contributor III

Let me be clear. In post #22

nTensorX, nTensorY and nTensorZ could be numbers very much larger than 4

Whereas c has a linear dimension equivalent to 4x4x4 (64 floats).

Meaning at some point in your code you do something like

int o = 0;
for(int z=0; z<4; z++)
  for(int y=0; y<4; y++)
    for(int x=0; x<4; x++)
      c[o++] = TensorArray[zBase+z][yBase+y][xBase+x];

Then use c[] as discussed above (used once per outer loop).

If that c (subsection of TensorArray) is used once before you reload c from a different subsection of TensorArray, then it may be more efficient to reference the data directly in TensorArray as opposed to copying it into c.
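
In other words (a rough sketch using the index mapping of the copy loop above, where c[z*16 + y*4 + x] == TensorArray[zBase+z][yBase+y][xBase+x]):

// Instead of staging through c[]:
//   c[0] + c[16] + c[32]
// index the tensor directly:
float g0 = TensorArray[zBase+0][yBase+0][xBase+0]
         + TensorArray[zBase+1][yBase+0][xBase+0]
         + TensorArray[zBase+2][yBase+0][xBase+0];
// Same values, read where they live, without the extra write and re-read of c[].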

Jim Dempsey

zhen_j_
Beginner

Hi Jim,

The reason I load the data into c[] is that I want to use some SIMD operations, e.g., additions. If I just reference the data when I need it, will the compiler turn that into a load instruction that loads the data into a zmm register? If so, I think the two methods should be the same.

Zhen

jimdempseyatthecove
Honored Contributor III

>>If I just reference the data when I need it, will the compiler turn that into a load instruction that loads the data into a zmm register?

This depends on two things: how the data is organized in TensorArray, and the access patterns. At a minimum, you would get 4-wide vectors. An alternative approach would be a gather that references no more than 4 cache lines (the same overhead as the fetch portion of the loops in #24); see the sketch after the comparison below.

T1 Read equivalent of c[] from Tensor array
T2 Write to c[]
T3 Read from c[]
T4 16-wide SIMD use of c[] using entries of c[] once or twice
T5 16-wide SIMD write to output

versus

T1' gather 16-wide vectors as required (T1' == T1)
(no T2', no T3')
T4' 16-wide SIMD use of gathered elements of Tensor array (T4' == T4)
T5' 16-wide write to output
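
For T1', a sketch of the kind of gather I mean (my own index math; the stride and the hypothetical call below assume the row-major float TensorArray layout discussed earlier, so adjust them to your real layout):

#include <immintrin.h>

// Gather one 4x4 slice (16 floats) of the subsection into a zmm register.
// slice points at the first float of the slice; rowStride is the distance
// (in floats) between consecutive rows, e.g. the innermost dimension size.
__m512 gather_slice(const float *slice, int rowStride)
{
    // 4 rows of 4 consecutive floats: touches at most 4 cache lines
    // when each row stays within one line.
    __m512i idx = _mm512_set_epi32(
        3*rowStride+3, 3*rowStride+2, 3*rowStride+1, 3*rowStride,
        2*rowStride+3, 2*rowStride+2, 2*rowStride+1, 2*rowStride,
          rowStride+3,   rowStride+2,   rowStride+1,   rowStride,
                    3,             2,             1,           0);
    return _mm512_i32gather_ps(idx, slice, 4);   // scale 4 = sizeof(float)
}

// Hypothetical call for the z-th slice of the current subsection:
//   __m512 v = gather_slice(&TensorArray[zBase+z][yBase][xBase], nTensorZ);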

Jim Dempsey
