Solved: Efficiently use vector registers

zhen_j_ · ‎03-21-2017

I am considering to vectorize an application on Xeon Phi. The calculation part of the program looks like this (only a part of the code):

t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
t[1] = (c[1]+c[17]+ c[33]) + (c[5]+c[21]+c[37]) +(c[9]+c[25]+c[41]);
t[2] = (c[2]+c[18]+ c[34]) + (c[6]+c[22]+c[38]) +(c[10]+c[26]+c[42]);
t[3] = (c[3]+c[19]+ c[35]) + (c[7]+c[23]+c[39]) +(c[11]+c[27]+c[43]);

t[4] = (c[4]+c[20]+ c[36]) - (c[8]+c[24]+c[40]) -(c[12]+c[28]+c[44]);
t[5] = (c[5]+c[21]+ c[37]) - (c[9]+c[25]+c[41]) -(c[13]+c[29]+c[45]);
t[6] = (c[6]+c[22]+ c[38]) - (c[10]+c[26]+c[42]) -(c[14]+c[30]+c[46]);
t[7] = (c[7]+c[23]+ c[39]) - (c[11]+c[27]+c[43]) -(c[15]+c[31]+c[47]);

t[8] = (c[16]-c[32]- c[48]) + (c[20]-c[36]-c[52]) +(c[24]-c[40]-c[56]);
t[9] = (c[17]-c[33]- c[49]) + (c[21]-c[37]-c[53]) +(c[25]-c[41]-c[57]);
t[10] = (c[18]-c[34]- c[50]) + (c[22]-c[38]-c[54]) +(c[26]-c[42]-c[58]);
t[11] = (c[19]-c[35]- c[51]) + (c[23]-c[39]-c[55]) +(c[27]-c[43]-c[59]);

It loads data to an array c and then adds or substracts the elements in c; at last stores data in the array t. For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64. The problem is that in Xeon Phi, there is only 32 vector registers, so I can not put all the 64 elements of c in 32 vector registers. I need to load a part of data then calculate the results, then load some again. I may need to load some data several times in order to finsh the whole calculation. I notice that in the code, some intermediate data can be used several times. The whole procedure is like a tree.

So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and store the intermediate data?

Anthor question is that I need to interleave the computation and memsuggestionory operations so as to achieve better performance. Any suggestion?

gaston-hillar · ‎03-29-2017

Hi Zhen Jia,

The following link provides a great video by Vadim Karpusenko and Andrey Vladimirov: https://software.intel.com/en-us/videos/episode-4-2-automatic-vectorization-and-array-notation

They discuss automatic vectorization feature of the compilers, where it can be used, and how to diagnose it. I do believe this video will provide you valuable information.

View solution in original post

gaston-hillar · ‎03-29-2017

Hi Zhen Jia,

I think that the most appropriate tool to provide you an answer for vectorisation is Vectorization Advisor, included in Intel Advisor XE.

The following link provides videos and tutorials that will answer most of your questions related to vectorisation and the usage of Vectorization Advisor: https://software.intel.com/en-us/articles/vectorization-advisor-faq

zhen_j_ · ‎03-29-2017

Thanks Gaston! I will try.

gaston-hillar · ‎03-29-2017

Hi Zhen Jia,

The following link provides a great video by Vadim Karpusenko and Andrey Vladimirov: https://software.intel.com/en-us/videos/episode-4-2-automatic-vectorization-and-array-notation

They discuss automatic vectorization feature of the compilers, where it can be used, and how to diagnose it. I do believe this video will provide you valuable information.

gaston-hillar · ‎03-29-2017

Hi Zhen Jia,

I provided you tools and documentation that I do believe they will be useful for you to optimize not only the piece of code you provided but also future pieces of code. The tools rock, believe me, I've been using them for a long time and once you start using these tools, there is no way back. :)

zhen_j_ · ‎03-29-2017

Great! Gaston,Thanks so much!

For tools, you mean Vectorization Advisor?

gaston-hillar · ‎03-29-2017

Hi Zhen Jia,

Yeah, I mean Vectorization Advisor. In fact, it is a product with many tools included. Basically, you get advice on how to vectorize and it makes your life really easier in such a complex topic.

zhen_j_ · ‎03-29-2017

I see. I will ues it.

gaston-hillar · ‎03-29-2017

BTW, I forgot to mention a very important thing. You can download a free trial and check whether the tool is helpful for you. Here is the link: https://software.intel.com/en-us/intel-advisor-xe

zhen_j_ · ‎03-29-2017

Got it.

SergeyKostrov · ‎04-06-2017

>>...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and >>store the intermediate data? A gather-like approach is possible in your case and this is because to calculate: ... t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]); ... indices 0, 4, 8, 16, 20, 24, 32, 36, 40 need to be used to access elements of the array. I don't guarantee you a performance boost but there is no need to try to do what you want to do in a "one-shoot". It means, a performance boost could be achieved by refactoring of that code. For example, instead of one for-loop three independent for-loops could be used. Take a look at an article: What to do when Auto Vectorization fails? . https://software.intel.com/en-us/articles/what-to-do-when-auto-vectorization-fails

jimdempseyatthecove · ‎04-06-2017

>>For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.

The way I read this is you have

__m512 c[64];
__m512 t[16];
...
for(int i=0; i<4; ++i)
t = _mm512_store_ps(
    _mm512_add_ps(
      _mm512_add_ps(
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c),   _mm512_load_ps(&c[i+16])), _mm512_load_ps(&c[i+32])),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+4]), _mm512_load_ps(&c[i+20])), _mm512_load_ps(&c[i+36]))),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+8]), _mm512_load_ps(&c[i+24])), _mm512_load_ps(&c[i+40])))));
for(int i=5; i<8; ++I)
...

Jim Dempsey

zhen_j_ · ‎04-09-2017

Yes, exactly. That is what I want to say.

jimdempseyatthecove wrote:

>>For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.

The way I read this is you have

__m512 c[64];
__m512 t[16];
...
for(int i=0; i<4; ++i)
t = _mm512_store_ps(
    _mm512_add_ps(
      _mm512_add_ps(
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c),   _mm512_load_ps(&c[i+16])), _mm512_load_ps(&c[i+32])),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+4]), _mm512_load_ps(&c[i+20])), _mm512_load_ps(&c[i+36]))),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+8]), _mm512_load_ps(&c[i+24])), _mm512_load_ps(&c[i+40])))));
for(int i=5; i<8; ++I)
...

Jim Dempsey

zhen_j_ · ‎04-09-2017

Hi Sergey,

Yes, the question is what you said. For Xeon phi, the gather and scatter have long latency and low throughput. I think they may be hard to help. The logic of the program is simple, just load, calculate and store. There is no if-else. So will multiple loops help?

Sergey Kostrov wrote:

>>...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and
>>store the intermediate data?

A gather-like approach is possible in your case and this is because to calculate:
...
t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
...
indices 0, 4, 8, 16, 20, 24, 32, 36, 40 need to be used to access elements of the array. I don't guarantee you a performance boost but there is no need to try to do what you want to do in a "one-shoot". It means, a performance boost could be achieved by refactoring of that code. For example, instead of one for-loop three independent for-loops could be used.

Take a look at an article: What to do when Auto Vectorization fails?
.
https://software.intel.com/en-us/articles/what-to-do-when-auto-vectoriza...

jimdempseyatthecove · ‎04-10-2017

zhen jia,

You should be able to write operator overloads such that you can remove the _mm512_...'s
IOW to use the original expressions as in your first post.

Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice. If you are NOT using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.

Jim Dempsey

zhen_j_ · ‎04-10-2017

Hi Jim Dempsey,

By using the source data (in place) that went into c, you mean that I can give the legth of c like 48 (not 64) and reuse some of them?

jimdempseyatthecove wrote:

zhen jia,

You should be able to write operator overloads such that you can remove the _mm512_...'s
IOW to use the original expressions as in your first post.

Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice. If you are NOT using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.

Jim Dempsey

jimdempseyatthecove · ‎04-11-2017

It might be clearest to the readers here if you show the code that populates the array c.

Jim Dempsey

SergeyKostrov · ‎04-12-2017

>>...So will multiple loops help? You have codes and I think you could easily try that. I agree with Jim's point that a complete test case would help to everybody who is involved in that discussion.

zhen_j_ · ‎04-13-2017

HI Jim and Sergey,

The current computation part of the code looks like this. It depends on the compiler to optimize the peformance. I am wondering if there is other method to further optimize the code.

#define SIMD_ADD _mm512_add_ps
#define SIMD_SUB _mm512_sub_ps

#define SIMD_LOAD _mm512_load_ps
#define SIMD_STORE _mm512_store_ps

t[0] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[0],c[16]), c[32]) , SIMD_ADD(SIMD_ADD(c[4],c[20]) ,c[36])) ,SIMD_ADD(SIMD_ADD(c[8] ,c[24]),c[40]));
t[1] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[1],c[17]), c[33]) , SIMD_ADD(SIMD_ADD(c[5],c[21]) ,c[37])) ,SIMD_ADD(SIMD_ADD(c[9] ,c[25]),c[41]));
t[2] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[2],c[18]), c[34]) , SIMD_ADD(SIMD_ADD(c[6],c[22]) ,c[38])) ,SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42]));
t[3] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[3],c[19]), c[35]) , SIMD_ADD(SIMD_ADD(c[7],c[23]) ,c[39])) ,SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43]));

t[4] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[4],c[20]), c[36]) , SIMD_ADD(SIMD_ADD(c[8],c[24]) ,c[40])), SIMD_ADD(SIMD_ADD(c[12],c[28]),c[44]));
t[5] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[5],c[21]), c[37]) , SIMD_ADD(SIMD_ADD(c[9],c[25]) ,c[41])), SIMD_ADD(SIMD_ADD(c[13],c[29]),c[45]));
t[6] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[6],c[22]), c[38]) , SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42])), SIMD_ADD(SIMD_ADD(c[14],c[30]),c[46]));
t[7] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[7],c[23]), c[39]) , SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43])), SIMD_ADD(SIMD_ADD(c[15],c[31]),c[47]));

t[8] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[16],c[32]), c[48]) , SIMD_SUB(SIMD_SUB(c[20],c[36]),c[52])) ,SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56]));
t[9] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[17],c[33]), c[49]) , SIMD_SUB(SIMD_SUB(c[21],c[37]),c[53])) ,SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57]));
t[10] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[18],c[34]), c[50]) , SIMD_SUB(SIMD_SUB(c[22],c[38]),c[54])) ,SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58]));
t[11] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[19],c[35]), c[51]) , SIMD_SUB(SIMD_SUB(c[23],c[39]),c[55])) ,SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59]));

t[12] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[20],c[36]), c[52]), SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56])), SIMD_SUB(SIMD_SUB(c[28],c[44]),c[60]));
t[13] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[21],c[37]), c[53]), SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57])), SIMD_SUB(SIMD_SUB(c[29],c[45]),c[61]));
t[14] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[22],c[38]), c[54]), SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58])), SIMD_SUB(SIMD_SUB(c[30],c[46]),c[62]));
t[15] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[23],c[39]), c[55]), SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59])), SIMD_SUB(SIMD_SUB(c[31],c[47]),c[63]));

m[0] = SIMD_ADD(t[0],SIMD_ADD(t[1],t[2]));
m[1] = SIMD_SUB(SIMD_SUB(t[1],t[2]),t[3]);

m[2] = SIMD_ADD(SIMD_ADD(t[4],t[5]),t[6]);
m[3] = SIMD_SUB(SIMD_SUB(t[5],t[6]),t[7]);

m[4] = SIMD_ADD(SIMD_ADD(t[8],t[9]),t[10]);
m[5] = SIMD_SUB(SIMD_SUB(t[9],t[10]),t[11]);

m[6] = SIMD_ADD(SIMD_ADD(t[12],t[13]),t[14]);
m[7] = SIMD_SUB(SIMD_SUB(t[13],t[14]),t[15]);

SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[0]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[1]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[2]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[3]);

SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[4]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[5]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[6]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[7]);

jimdempseyatthecove · ‎04-14-2017

You did not answer the question: How did the data get placed into the array c[]?
And a related, equally important, question: How often is the data, or portions of data, in c[] reused without being updated?

The response may indicate if the array c is needed. IOW if you should use the data directly in its prior locations.

Jim Dempsey

zhen_j_ · ‎04-17-2017

Hi Jim,

Thanks for question. I see, in this code, c[] is a part of a tensor. c[] is 4*4*4 tensor, each time c[] strides 2 along each dimension.

If it is still not clear, please let me know, thanks!