Community
cancel
Showing results for
Did you mean:
Beginner
178 Views

## Efficiently use vector registers

I am considering to vectorize an application on Xeon Phi. The calculation part of the program looks like this (only a part of the code):

```t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
t[1] = (c[1]+c[17]+ c[33]) + (c[5]+c[21]+c[37]) +(c[9]+c[25]+c[41]);
t[2] = (c[2]+c[18]+ c[34]) + (c[6]+c[22]+c[38]) +(c[10]+c[26]+c[42]);
t[3] = (c[3]+c[19]+ c[35]) + (c[7]+c[23]+c[39]) +(c[11]+c[27]+c[43]);

t[4] = (c[4]+c[20]+ c[36]) - (c[8]+c[24]+c[40]) -(c[12]+c[28]+c[44]);
t[5] = (c[5]+c[21]+ c[37]) - (c[9]+c[25]+c[41]) -(c[13]+c[29]+c[45]);
t[6] = (c[6]+c[22]+ c[38]) - (c[10]+c[26]+c[42]) -(c[14]+c[30]+c[46]);
t[7] = (c[7]+c[23]+ c[39]) - (c[11]+c[27]+c[43]) -(c[15]+c[31]+c[47]);

t[8] = (c[16]-c[32]- c[48]) + (c[20]-c[36]-c[52]) +(c[24]-c[40]-c[56]);
t[9] = (c[17]-c[33]- c[49]) + (c[21]-c[37]-c[53]) +(c[25]-c[41]-c[57]);
t[10] = (c[18]-c[34]- c[50]) + (c[22]-c[38]-c[54]) +(c[26]-c[42]-c[58]);
t[11] = (c[19]-c[35]- c[51]) + (c[23]-c[39]-c[55]) +(c[27]-c[43]-c[59]);```

It loads data to an array c and then adds or substracts the elements in c; at last stores data in the array t. For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64. The problem is that in Xeon Phi, there is only 32 vector registers, so I can not put all the 64 elements of c in 32 vector registers. I need to load a part of data then calculate the results, then load some again. I may need to load some data several times in order to finsh the whole calculation. I notice that in the code, some intermediate data can be used several times. The whole procedure is like a tree.

So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and store the intermediate data?

Anthor question is that I need to interleave the computation and memsuggestionory operations so as to achieve better performance. Any suggestion?

1 Solution
Black Belt
148 Views

Hi Zhen Jia,

They discuss automatic vectorization feature of the compilers, where it can be used, and how to diagnose it. I do believe this video will provide you valuable information.
25 Replies
Black Belt
147 Views

Hi Zhen Jia,

I think that the most appropriate tool to provide you an answer for vectorisation is Vectorization Advisor, included in Intel Advisor XE.

Beginner
147 Views

Thanks Gaston! I will try.

Black Belt
149 Views

Hi Zhen Jia,

They discuss automatic vectorization feature of the compilers, where it can be used, and how to diagnose it. I do believe this video will provide you valuable information.
Black Belt
147 Views

Hi Zhen Jia,

I provided you tools and documentation that I do believe they will be useful for you to optimize not only the piece of code you provided but also future pieces of code. The tools rock, believe me, I've been using them for a long time and once you start using these tools, there is no way back. :)

Beginner
147 Views

Great! Gaston,Thanks so much!

For tools, you mean Vectorization Advisor?

Black Belt
147 Views

Hi Zhen Jia,

Yeah, I mean Vectorization Advisor. In fact, it is a product with many tools included. Basically, you get advice on how to vectorize and it makes your life really easier in such a complex topic.

Beginner
147 Views

I see. I will ues it.

Black Belt
147 Views

BTW, I forgot to mention a very important thing. You can download a free trial and check whether the tool is helpful for you. Here is the link: https://software.intel.com/en-us/intel-advisor-xe

Beginner
147 Views

Got it.

Valued Contributor II
147 Views
>>...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and >>store the intermediate data? A gather-like approach is possible in your case and this is because to calculate: ... t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]); ... indices 0, 4, 8, 16, 20, 24, 32, 36, 40 need to be used to access elements of the array. I don't guarantee you a performance boost but there is no need to try to do what you want to do in a "one-shoot". It means, a performance boost could be achieved by refactoring of that code. For example, instead of one for-loop three independent for-loops could be used. Take a look at an article: What to do when Auto Vectorization fails? . https://software.intel.com/en-us/articles/what-to-do-when-auto-vectorization-fails
Black Belt
147 Views

>>For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.

The way I read this is you have

__m512 c[64];
__m512 t[16];
...
for(int i=0; i<4; ++i)
t = _mm512_store_ps(
for(int  i=5; i<8; ++I)
...

Jim Dempsey

Beginner
147 Views

Yes, exactly. That is what I want to say.

jimdempseyatthecove wrote:

>>For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.

The way I read this is you have

__m512 c[64];
__m512 t[16];
...
for(int i=0; i<4; ++i)
t = _mm512_store_ps(
for(int  i=5; i<8; ++I)
...

Jim Dempsey

Beginner
147 Views

Hi Sergey,

Yes, the  question is what you said. For Xeon phi, the gather and scatter have long latency  and low throughput. I think they may be hard to help. The logic of the program is simple, just load, calculate and store. There is no if-else.  So will multiple loops help?

Sergey Kostrov wrote:

>>...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and
>>store the intermediate data?

A gather-like approach is possible in your case and this is because to calculate:
...
t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
...
indices 0, 4, 8, 16, 20, 24, 32, 36, 40 need to be used to access elements of the array. I don't guarantee you a performance boost but there is no need to try to do what you want to do in a "one-shoot". It means, a performance boost could be achieved by refactoring of that code. For example, instead of one for-loop three independent for-loops could be used.

Take a look at an article: What to do when Auto Vectorization fails?
.
https://software.intel.com/en-us/articles/what-to-do-when-auto-vectoriza...

Black Belt
147 Views

zhen jia,

You should be able to write operator overloads such that you can remove the _mm512_...'s
IOW to use the original expressions as in your first post.

Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice.  If you are NOT using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.

Jim Dempsey

Beginner
147 Views

Hi Jim Dempsey,

By using the source data (in place) that went into c, you mean that I can give the legth of c like 48 (not 64) and reuse some of them?

jimdempseyatthecove wrote:

zhen jia,

You should be able to write operator overloads such that you can remove the _mm512_...'s
IOW to use the original expressions as in your first post.

Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice.  If you are NOT using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.

Jim Dempsey

Black Belt
147 Views

It might be clearest to the readers here if you show the code that populates the array c.

Jim Dempsey

Valued Contributor II
147 Views
>>...So will multiple loops help? You have codes and I think you could easily try that. I agree with Jim's point that a complete test case would help to everybody who is involved in that discussion.
Beginner
147 Views

HI Jim and Sergey,

The current computation part of the code looks like this. It depends on the compiler to optimize the peformance. I am wondering if there is other method to further optimize the code.

#define SIMD_SUB    _mm512_sub_ps

#define SIMD_STORE  _mm512_store_ps

t[12] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[20],c[36]), c[52]), SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56])), SIMD_SUB(SIMD_SUB(c[28],c[44]),c[60]));
t[13] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[21],c[37]), c[53]), SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57])), SIMD_SUB(SIMD_SUB(c[29],c[45]),c[61]));
t[14] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[22],c[38]), c[54]), SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58])), SIMD_SUB(SIMD_SUB(c[30],c[46]),c[62]));
t[15] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[23],c[39]), c[55]), SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59])), SIMD_SUB(SIMD_SUB(c[31],c[47]),c[63]));

m[1] = SIMD_SUB(SIMD_SUB(t[1],t[2]),t[3]);

m[3] = SIMD_SUB(SIMD_SUB(t[5],t[6]),t[7]);

m[5] = SIMD_SUB(SIMD_SUB(t[9],t[10]),t[11]);

m[7] = SIMD_SUB(SIMD_SUB(t[13],t[14]),t[15]);

SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[0]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[1]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[2]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[3]);

SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[4]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[5]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[6]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[7]);

Black Belt
147 Views

You did not answer the question: How did the data get placed into the array c[]?
And a related, equally important, question: How often is the data, or portions of data, in c[] reused without being updated?

The response may indicate if the array c is needed. IOW if you should use the data directly in its prior locations.

Jim Dempsey

Beginner
51 Views

Hi Jim,

Thanks for question. I see, in this code, c[] is a part of a tensor. c[] is 4*4*4 tensor, each time c[] strides 2 along each dimension

If it is still not clear, please let me know, thanks!