Software Archive
Read-only legacy content
17061 Discussions

Efficiently use vector registers

zhen_j_
Beginner
1,071 Views

I am considering to vectorize an application on Xeon Phi. The calculation part of the program looks like this (only a part of the code):

t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
t[1] = (c[1]+c[17]+ c[33]) + (c[5]+c[21]+c[37]) +(c[9]+c[25]+c[41]);
t[2] = (c[2]+c[18]+ c[34]) + (c[6]+c[22]+c[38]) +(c[10]+c[26]+c[42]);
t[3] = (c[3]+c[19]+ c[35]) + (c[7]+c[23]+c[39]) +(c[11]+c[27]+c[43]);

t[4] = (c[4]+c[20]+ c[36]) - (c[8]+c[24]+c[40]) -(c[12]+c[28]+c[44]);
t[5] = (c[5]+c[21]+ c[37]) - (c[9]+c[25]+c[41]) -(c[13]+c[29]+c[45]);
t[6] = (c[6]+c[22]+ c[38]) - (c[10]+c[26]+c[42]) -(c[14]+c[30]+c[46]);
t[7] = (c[7]+c[23]+ c[39]) - (c[11]+c[27]+c[43]) -(c[15]+c[31]+c[47]);

t[8] = (c[16]-c[32]- c[48]) + (c[20]-c[36]-c[52]) +(c[24]-c[40]-c[56]);
t[9] = (c[17]-c[33]- c[49]) + (c[21]-c[37]-c[53]) +(c[25]-c[41]-c[57]);
t[10] = (c[18]-c[34]- c[50]) + (c[22]-c[38]-c[54]) +(c[26]-c[42]-c[58]);
t[11] = (c[19]-c[35]- c[51]) + (c[23]-c[39]-c[55]) +(c[27]-c[43]-c[59]);

It loads data to an array c and then adds or substracts the elements in c; at last stores data in the array t. For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64. The problem is that in Xeon Phi, there is only 32 vector registers, so I can not put all the 64 elements of c in 32 vector registers. I need to load a part of data then calculate the results, then load some again. I may need to load some data several times in order to finsh the whole calculation. I notice that in the code, some intermediate data can be used several times. The whole procedure is like a tree.

So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and store the intermediate data?

Anthor question is that I need to interleave the computation and memsuggestionory operations so as to achieve better performance. Any suggestion?

0 Kudos
1 Solution
gaston-hillar
Valued Contributor I
1,041 Views

Hi Zhen Jia,

The following link provides a great video by Vadim Karpusenko and Andrey Vladimirov: https://software.intel.com/en-us/videos/episode-4-2-automatic-vectorization-and-array-notation
 
They discuss automatic vectorization feature of the compilers, where it can be used, and how to diagnose it. I do believe this video will provide you valuable information.

View solution in original post

0 Kudos
25 Replies
gaston-hillar
Valued Contributor I
895 Views

Hi Zhen Jia,

I think that the most appropriate tool to provide you an answer for vectorisation is Vectorization Advisor, included in Intel Advisor XE.

The following link provides videos and tutorials that will answer most of your questions related to vectorisation and the usage of Vectorization Advisor: https://software.intel.com/en-us/articles/vectorization-advisor-faq

0 Kudos
zhen_j_
Beginner
895 Views

Thanks Gaston! I will try.

0 Kudos
gaston-hillar
Valued Contributor I
1,042 Views

Hi Zhen Jia,

The following link provides a great video by Vadim Karpusenko and Andrey Vladimirov: https://software.intel.com/en-us/videos/episode-4-2-automatic-vectorization-and-array-notation
 
They discuss automatic vectorization feature of the compilers, where it can be used, and how to diagnose it. I do believe this video will provide you valuable information.
0 Kudos
gaston-hillar
Valued Contributor I
895 Views

Hi Zhen Jia,

I provided you tools and documentation that I do believe they will be useful for you to optimize not only the piece of code you provided but also future pieces of code. The tools rock, believe me, I've been using them for a long time and once you start using these tools, there is no way back. :)

0 Kudos
zhen_j_
Beginner
895 Views

Great! Gaston,Thanks so much!

For tools, you mean Vectorization Advisor?

0 Kudos
gaston-hillar
Valued Contributor I
895 Views

Hi Zhen Jia,

Yeah, I mean Vectorization Advisor. In fact, it is a product with many tools included. Basically, you get advice on how to vectorize and it makes your life really easier in such a complex topic.

0 Kudos
zhen_j_
Beginner
895 Views

I see. I will ues it.

0 Kudos
gaston-hillar
Valued Contributor I
895 Views

BTW, I forgot to mention a very important thing. You can download a free trial and check whether the tool is helpful for you. Here is the link: https://software.intel.com/en-us/intel-advisor-xe

 

0 Kudos
zhen_j_
Beginner
895 Views

Got it.

0 Kudos
SergeyKostrov
Valued Contributor II
895 Views
>>...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and >>store the intermediate data? A gather-like approach is possible in your case and this is because to calculate: ... t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]); ... indices 0, 4, 8, 16, 20, 24, 32, 36, 40 need to be used to access elements of the array. I don't guarantee you a performance boost but there is no need to try to do what you want to do in a "one-shoot". It means, a performance boost could be achieved by refactoring of that code. For example, instead of one for-loop three independent for-loops could be used. Take a look at an article: What to do when Auto Vectorization fails? . https://software.intel.com/en-us/articles/what-to-do-when-auto-vectorization-fails
0 Kudos
jimdempseyatthecove
Honored Contributor III
895 Views

>>For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.

The way I read this is you have

__m512 c[64];
__m512 t[16];
...
for(int i=0; i<4; ++i)
  t = _mm512_store_ps(
    _mm512_add_ps(
      _mm512_add_ps(
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c),   _mm512_load_ps(&c[i+16])), _mm512_load_ps(&c[i+32])),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+4]), _mm512_load_ps(&c[i+20])), _mm512_load_ps(&c[i+36]))),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+8]), _mm512_load_ps(&c[i+24])), _mm512_load_ps(&c[i+40])))));
for(int  i=5; i<8; ++I)
...

Jim Dempsey

0 Kudos
zhen_j_
Beginner
895 Views

Yes, exactly. That is what I want to say.

jimdempseyatthecove wrote:

>>For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.

The way I read this is you have

__m512 c[64];
__m512 t[16];
...
for(int i=0; i<4; ++i)
  t = _mm512_store_ps(
    _mm512_add_ps(
      _mm512_add_ps(
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c),   _mm512_load_ps(&c[i+16])), _mm512_load_ps(&c[i+32])),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+4]), _mm512_load_ps(&c[i+20])), _mm512_load_ps(&c[i+36]))),
         _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+8]), _mm512_load_ps(&c[i+24])), _mm512_load_ps(&c[i+40])))));
for(int  i=5; i<8; ++I)
...

Jim Dempsey

0 Kudos
zhen_j_
Beginner
895 Views

Hi Sergey,

Yes, the  question is what you said. For Xeon phi, the gather and scatter have long latency  and low throughput. I think they may be hard to help. The logic of the program is simple, just load, calculate and store. There is no if-else.  So will multiple loops help?

Sergey Kostrov wrote:

>>...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and
>>store the intermediate data?

A gather-like approach is possible in your case and this is because to calculate:
...
t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
...
indices 0, 4, 8, 16, 20, 24, 32, 36, 40 need to be used to access elements of the array. I don't guarantee you a performance boost but there is no need to try to do what you want to do in a "one-shoot". It means, a performance boost could be achieved by refactoring of that code. For example, instead of one for-loop three independent for-loops could be used.

Take a look at an article: What to do when Auto Vectorization fails?
.
https://software.intel.com/en-us/articles/what-to-do-when-auto-vectoriza...

0 Kudos
jimdempseyatthecove
Honored Contributor III
895 Views

zhen jia,

You should be able to write operator overloads such that you can remove the _mm512_...'s
IOW to use the original expressions as in your first post.

Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice.  If you are NOT using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.

Jim Dempsey

0 Kudos
zhen_j_
Beginner
895 Views

Hi Jim Dempsey,

By using the source data (in place) that went into c, you mean that I can give the legth of c like 48 (not 64) and reuse some of them?

jimdempseyatthecove wrote:

zhen jia,

You should be able to write operator overloads such that you can remove the _mm512_...'s
IOW to use the original expressions as in your first post.

Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice.  If you are NOT using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
895 Views

It might be clearest to the readers here if you show the code that populates the array c.

Jim Dempsey

0 Kudos
SergeyKostrov
Valued Contributor II
895 Views
>>...So will multiple loops help? You have codes and I think you could easily try that. I agree with Jim's point that a complete test case would help to everybody who is involved in that discussion.
0 Kudos
zhen_j_
Beginner
895 Views

HI Jim and Sergey,

The current computation part of the code looks like this. It depends on the compiler to optimize the peformance. I am wondering if there is other method to further optimize the code.

#define SIMD_ADD    _mm512_add_ps
#define SIMD_SUB    _mm512_sub_ps

#define SIMD_LOAD   _mm512_load_ps
#define SIMD_STORE  _mm512_store_ps

 

 

                t[0]  =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[0],c[16]), c[32]) , SIMD_ADD(SIMD_ADD(c[4],c[20]) ,c[36])) ,SIMD_ADD(SIMD_ADD(c[8] ,c[24]),c[40]));
                t[1]  =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[1],c[17]), c[33]) , SIMD_ADD(SIMD_ADD(c[5],c[21]) ,c[37])) ,SIMD_ADD(SIMD_ADD(c[9] ,c[25]),c[41]));
                t[2]  =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[2],c[18]), c[34]) , SIMD_ADD(SIMD_ADD(c[6],c[22]) ,c[38])) ,SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42]));
                t[3]  =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[3],c[19]), c[35]) , SIMD_ADD(SIMD_ADD(c[7],c[23]) ,c[39])) ,SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43]));

                t[4]  =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[4],c[20]), c[36]) , SIMD_ADD(SIMD_ADD(c[8],c[24]) ,c[40])), SIMD_ADD(SIMD_ADD(c[12],c[28]),c[44]));
                t[5]  =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[5],c[21]), c[37]) , SIMD_ADD(SIMD_ADD(c[9],c[25]) ,c[41])), SIMD_ADD(SIMD_ADD(c[13],c[29]),c[45]));
                t[6]  =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[6],c[22]), c[38]) , SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42])), SIMD_ADD(SIMD_ADD(c[14],c[30]),c[46]));
                t[7]  =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[7],c[23]), c[39]) , SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43])), SIMD_ADD(SIMD_ADD(c[15],c[31]),c[47]));

                t[8]  =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[16],c[32]), c[48]) , SIMD_SUB(SIMD_SUB(c[20],c[36]),c[52])) ,SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56]));
                t[9]  =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[17],c[33]), c[49]) , SIMD_SUB(SIMD_SUB(c[21],c[37]),c[53])) ,SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57]));
                t[10] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[18],c[34]), c[50]) , SIMD_SUB(SIMD_SUB(c[22],c[38]),c[54])) ,SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58]));
                t[11] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[19],c[35]), c[51]) , SIMD_SUB(SIMD_SUB(c[23],c[39]),c[55])) ,SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59]));

                t[12] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[20],c[36]), c[52]), SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56])), SIMD_SUB(SIMD_SUB(c[28],c[44]),c[60]));
                t[13] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[21],c[37]), c[53]), SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57])), SIMD_SUB(SIMD_SUB(c[29],c[45]),c[61]));
                t[14] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[22],c[38]), c[54]), SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58])), SIMD_SUB(SIMD_SUB(c[30],c[46]),c[62]));
                t[15] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[23],c[39]), c[55]), SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59])), SIMD_SUB(SIMD_SUB(c[31],c[47]),c[63]));

m[0] = SIMD_ADD(t[0],SIMD_ADD(t[1],t[2]));
                m[1] = SIMD_SUB(SIMD_SUB(t[1],t[2]),t[3]);

                m[2] = SIMD_ADD(SIMD_ADD(t[4],t[5]),t[6]);
                m[3] = SIMD_SUB(SIMD_SUB(t[5],t[6]),t[7]);

                m[4] = SIMD_ADD(SIMD_ADD(t[8],t[9]),t[10]);
                m[5] = SIMD_SUB(SIMD_SUB(t[9],t[10]),t[11]);

                m[6] = SIMD_ADD(SIMD_ADD(t[12],t[13]),t[14]);
                m[7] = SIMD_SUB(SIMD_SUB(t[13],t[14]),t[15]);

               SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[0]);
               SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[1]);
               SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[2]);
               SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[3]);

               SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[4]);
               SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[5]);
               SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[6]);
               SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[7]);

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
895 Views

You did not answer the question: How did the data get placed into the array c[]?
And a related, equally important, question: How often is the data, or portions of data, in c[] reused without being updated?

The response may indicate if the array c is needed. IOW if you should use the data directly in its prior locations.

Jim Dempsey

0 Kudos
zhen_j_
Beginner
799 Views

Hi Jim,

Thanks for question. I see, in this code, c[] is a part of a tensor. c[] is 4*4*4 tensor, each time c[] strides 2 along each dimension

If it is still not clear, please let me know, thanks!

0 Kudos
Reply