topic Hi Jim, in Software Archive

Efficiently use vector registers

zhen_j_ — Tue, 21 Mar 2017 16:46:37 GMT

I am considering to vectorize an application on Xeon Phi. The calculation part of the program looks like this (only a part of the code):

t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
t[1] = (c[1]+c[17]+ c[33]) + (c[5]+c[21]+c[37]) +(c[9]+c[25]+c[41]);
t[2] = (c[2]+c[18]+ c[34]) + (c[6]+c[22]+c[38]) +(c[10]+c[26]+c[42]);
t[3] = (c[3]+c[19]+ c[35]) + (c[7]+c[23]+c[39]) +(c[11]+c[27]+c[43]);

t[4] = (c[4]+c[20]+ c[36]) - (c[8]+c[24]+c[40]) -(c[12]+c[28]+c[44]);
t[5] = (c[5]+c[21]+ c[37]) - (c[9]+c[25]+c[41]) -(c[13]+c[29]+c[45]);
t[6] = (c[6]+c[22]+ c[38]) - (c[10]+c[26]+c[42]) -(c[14]+c[30]+c[46]);
t[7] = (c[7]+c[23]+ c[39]) - (c[11]+c[27]+c[43]) -(c[15]+c[31]+c[47]);

t[8] = (c[16]-c[32]- c[48]) + (c[20]-c[36]-c[52]) +(c[24]-c[40]-c[56]);
t[9] = (c[17]-c[33]- c[49]) + (c[21]-c[37]-c[53]) +(c[25]-c[41]-c[57]);
t[10] = (c[18]-c[34]- c[50]) + (c[22]-c[38]-c[54]) +(c[26]-c[42]-c[58]);
t[11] = (c[19]-c[35]- c[51]) + (c[23]-c[39]-c[55]) +(c[27]-c[43]-c[59]);

It loads data to an array c and then adds or substracts the elements in c; at last stores data in the array t. For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64. The problem is that in Xeon Phi, there is only 32 vector registers, so I can not put all the 64 elements of c in 32 vector registers. I need to load a part of data then calculate the results, then load some again. I may need to load some data several times in order to finsh the whole calculation. I notice that in the code, some intermediate data can be used several times. The whole procedure is like a tree.

So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and store the intermediate data?

Anthor question is that I need to interleave the computation and memsuggestionory operations so as to achieve better performance. Any suggestion?

Hi Zhen Jia,

gaston-hillar — Thu, 30 Mar 2017 01:47:24 GMT

Hi Zhen Jia,

I think that the most appropriate tool to provide you an answer for vectorisation is Vectorization Advisor, included in Intel Advisor XE.

The following link provides videos and tutorials that will answer most of your questions related to vectorisation and the usage of Vectorization Advisor: https://software.intel.com/en-us/articles/vectorization-advisor-faq

Thanks Gaston! I will try.

zhen_j_ — Thu, 30 Mar 2017 01:50:50 GMT

Thanks Gaston! I will try.

Hi Zhen Jia,

gaston-hillar — Thu, 30 Mar 2017 02:03:00 GMT

Hi Zhen Jia,

The following link provides a great video by Vadim Karpusenko and Andrey Vladimirov: https://software.intel.com/en-us/videos/episode-4-2-automatic-vectorization-and-array-notation

They discuss automatic vectorization feature of the compilers, where it can be used, and how to diagnose it. I do believe this video will provide you valuable information.

Hi Zhen Jia,

gaston-hillar — Thu, 30 Mar 2017 02:05:12 GMT

Hi Zhen Jia,

I provided you tools and documentation that I do believe they will be useful for you to optimize not only the piece of code you provided but also future pieces of code. The tools rock, believe me, I've been using them for a long time and once you start using these tools, there is no way back. :)

Great! Gaston,Thanks so much!

zhen_j_ — Thu, 30 Mar 2017 02:08:31 GMT

Great! Gaston,Thanks so much!

For tools, you mean Vectorization Advisor?

Hi Zhen Jia,

gaston-hillar — Thu, 30 Mar 2017 02:13:27 GMT

Hi Zhen Jia,

Yeah, I mean Vectorization Advisor. In fact, it is a product with many tools included. Basically, you get advice on how to vectorize and it makes your life really easier in such a complex topic.

I see. I will ues it.

zhen_j_ — Thu, 30 Mar 2017 02:15:08 GMT

I see. I will ues it.

BTW, I forgot to mention a

gaston-hillar — Thu, 30 Mar 2017 02:15:22 GMT

BTW, I forgot to mention a very important thing. You can download a free trial and check whether the tool is helpful for you. Here is the link: https://software.intel.com/en-us/intel-advisor-xe

Got it.

zhen_j_ — Thu, 30 Mar 2017 02:20:17 GMT

Got it.

>>...So the question is that

SergeyKostrov — Thu, 06 Apr 2017 18:03:00 GMT

>>...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and >>store the intermediate data? A gather-like approach is possible in your case and this is because to calculate: ... t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]); ... indices 0, 4, 8, 16, 20, 24, 32, 36, 40 need to be used to access elements of the array. I don't guarantee you a performance boost but there is no need to try to do what you want to do in a "one-shoot". It means, a performance boost could be achieved by refactoring of that code. For example, instead of one for-loop three independent for-loops could be used. Take a look at an article: What to do when Auto Vectorization fails? . https://software.intel.com/en-us/articles/what-to-do-when-auto-vectorization-fails

>>For each element of c, like

jimdempseyatthecove — Thu, 06 Apr 2017 19:13:00 GMT

>>For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.

The way I read this is you have

__m512 c[64];
__m512 t[16];
...
for(int i=0; i<4; ++i)
t = _mm512_store_ps(
    _mm512_add_ps(
      _mm512_add_ps(
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c),   _mm512_load_ps(&c[i+16])), _mm512_load_ps(&c[i+32])),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+4]), _mm512_load_ps(&c[i+20])), _mm512_load_ps(&c[i+36]))),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+8]), _mm512_load_ps(&c[i+24])), _mm512_load_ps(&c[i+40])))));
for(int i=5; i<8; ++I)
...

Jim Dempsey

Yes, exactly. That is what I

zhen_j_ — Mon, 10 Apr 2017 02:43:33 GMT

Yes, exactly. That is what I want to say.

jimdempseyatthecove wrote:

>>For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.

The way I read this is you have

__m512 c[64];
__m512 t[16];
...
for(int i=0; i<4; ++i)
t = _mm512_store_ps(
    _mm512_add_ps(
      _mm512_add_ps(
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c),   _mm512_load_ps(&c[i+16])), _mm512_load_ps(&c[i+32])),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+4]), _mm512_load_ps(&c[i+20])), _mm512_load_ps(&c[i+36]))),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+8]), _mm512_load_ps(&c[i+24])), _mm512_load_ps(&c[i+40])))));
for(int i=5; i<8; ++I)
...

Jim Dempsey

Hi Sergey,

zhen_j_ — Mon, 10 Apr 2017 02:49:55 GMT

Hi Sergey,

Yes, the question is what you said. For Xeon phi, the gather and scatter have long latency and low throughput. I think they may be hard to help. The logic of the program is simple, just load, calculate and store. There is no if-else. So will multiple loops help?

Sergey Kostrov wrote:

>>...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and
>>store the intermediate data?

A gather-like approach is possible in your case and this is because to calculate:
...
t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
...
indices 0, 4, 8, 16, 20, 24, 32, 36, 40 need to be used to access elements of the array. I don't guarantee you a performance boost but there is no need to try to do what you want to do in a "one-shoot". It means, a performance boost could be achieved by refactoring of that code. For example, instead of one for-loop three independent for-loops could be used.

Take a look at an article: What to do when Auto Vectorization fails?
.
https://software.intel.com/en-us/articles/what-to-do-when-auto-vectoriza...

zhen jia,

jimdempseyatthecove — Mon, 10 Apr 2017 12:03:23 GMT

zhen jia,

You should be able to write operator overloads such that you can remove the _mm512_...'s
IOW to use the original expressions as in your first post.

Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice. If you are NOT using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.

Jim Dempsey

Hi Jim Dempsey,

zhen_j_ — Mon, 10 Apr 2017 13:55:02 GMT

Hi Jim Dempsey,

By using the source data (in place) that went into c, you mean that I can give the legth of c like 48 (not 64) and reuse some of them?

jimdempseyatthecove wrote:

zhen jia,

You should be able to write operator overloads such that you can remove the _mm512_...'s
IOW to use the original expressions as in your first post.

Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice. If you are NOT using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.

Jim Dempsey

It might be clearest to the

jimdempseyatthecove — Tue, 11 Apr 2017 14:09:56 GMT

It might be clearest to the readers here if you show the code that populates the array c.

Jim Dempsey

>>...So will multiple loops

SergeyKostrov — Wed, 12 Apr 2017 19:24:19 GMT

>>...So will multiple loops help? You have codes and I think you could easily try that. I agree with Jim's point that a complete test case would help to everybody who is involved in that discussion.

HI Jim and Sergey,

zhen_j_ — Fri, 14 Apr 2017 01:38:00 GMT

HI Jim and Sergey,

The current computation part of the code looks like this. It depends on the compiler to optimize the peformance. I am wondering if there is other method to further optimize the code.

#define SIMD_ADD _mm512_add_ps
#define SIMD_SUB _mm512_sub_ps

#define SIMD_LOAD _mm512_load_ps
#define SIMD_STORE _mm512_store_ps

t[0] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[0],c[16]), c[32]) , SIMD_ADD(SIMD_ADD(c[4],c[20]) ,c[36])) ,SIMD_ADD(SIMD_ADD(c[8] ,c[24]),c[40]));
t[1] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[1],c[17]), c[33]) , SIMD_ADD(SIMD_ADD(c[5],c[21]) ,c[37])) ,SIMD_ADD(SIMD_ADD(c[9] ,c[25]),c[41]));
t[2] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[2],c[18]), c[34]) , SIMD_ADD(SIMD_ADD(c[6],c[22]) ,c[38])) ,SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42]));
t[3] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[3],c[19]), c[35]) , SIMD_ADD(SIMD_ADD(c[7],c[23]) ,c[39])) ,SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43]));

t[4] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[4],c[20]), c[36]) , SIMD_ADD(SIMD_ADD(c[8],c[24]) ,c[40])), SIMD_ADD(SIMD_ADD(c[12],c[28]),c[44]));
t[5] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[5],c[21]), c[37]) , SIMD_ADD(SIMD_ADD(c[9],c[25]) ,c[41])), SIMD_ADD(SIMD_ADD(c[13],c[29]),c[45]));
t[6] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[6],c[22]), c[38]) , SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42])), SIMD_ADD(SIMD_ADD(c[14],c[30]),c[46]));
t[7] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[7],c[23]), c[39]) , SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43])), SIMD_ADD(SIMD_ADD(c[15],c[31]),c[47]));

t[8] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[16],c[32]), c[48]) , SIMD_SUB(SIMD_SUB(c[20],c[36]),c[52])) ,SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56]));
t[9] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[17],c[33]), c[49]) , SIMD_SUB(SIMD_SUB(c[21],c[37]),c[53])) ,SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57]));
t[10] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[18],c[34]), c[50]) , SIMD_SUB(SIMD_SUB(c[22],c[38]),c[54])) ,SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58]));
t[11] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[19],c[35]), c[51]) , SIMD_SUB(SIMD_SUB(c[23],c[39]),c[55])) ,SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59]));

t[12] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[20],c[36]), c[52]), SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56])), SIMD_SUB(SIMD_SUB(c[28],c[44]),c[60]));
t[13] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[21],c[37]), c[53]), SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57])), SIMD_SUB(SIMD_SUB(c[29],c[45]),c[61]));
t[14] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[22],c[38]), c[54]), SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58])), SIMD_SUB(SIMD_SUB(c[30],c[46]),c[62]));
t[15] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[23],c[39]), c[55]), SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59])), SIMD_SUB(SIMD_SUB(c[31],c[47]),c[63]));

m[0] = SIMD_ADD(t[0],SIMD_ADD(t[1],t[2]));
m[1] = SIMD_SUB(SIMD_SUB(t[1],t[2]),t[3]);

m[2] = SIMD_ADD(SIMD_ADD(t[4],t[5]),t[6]);
m[3] = SIMD_SUB(SIMD_SUB(t[5],t[6]),t[7]);

m[4] = SIMD_ADD(SIMD_ADD(t[8],t[9]),t[10]);
m[5] = SIMD_SUB(SIMD_SUB(t[9],t[10]),t[11]);

m[6] = SIMD_ADD(SIMD_ADD(t[12],t[13]),t[14]);
m[7] = SIMD_SUB(SIMD_SUB(t[13],t[14]),t[15]);

SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[0]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[1]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[2]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[3]);

SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[4]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[5]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[6]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[7]);

You did not answer the

jimdempseyatthecove — Fri, 14 Apr 2017 12:04:00 GMT

You did not answer the question: How did the data get placed into the array c[]?
And a related, equally important, question: How often is the data, or portions of data, in c[] reused without being updated?

The response may indicate if the array c is needed. IOW if you should use the data directly in its prior locations.

Jim Dempsey