I am considering vectorizing an application on Xeon Phi. The computation part of the program looks like this (only a part of the code):
t[0] = (c[0]+c[16]+c[32]) + (c[4]+c[20]+c[36]) + (c[8]+c[24]+c[40]);
t[1] = (c[1]+c[17]+c[33]) + (c[5]+c[21]+c[37]) + (c[9]+c[25]+c[41]);
t[2] = (c[2]+c[18]+c[34]) + (c[6]+c[22]+c[38]) + (c[10]+c[26]+c[42]);
t[3] = (c[3]+c[19]+c[35]) + (c[7]+c[23]+c[39]) + (c[11]+c[27]+c[43]);
t[4] = (c[4]+c[20]+c[36]) - (c[8]+c[24]+c[40]) - (c[12]+c[28]+c[44]);
t[5] = (c[5]+c[21]+c[37]) - (c[9]+c[25]+c[41]) - (c[13]+c[29]+c[45]);
t[6] = (c[6]+c[22]+c[38]) - (c[10]+c[26]+c[42]) - (c[14]+c[30]+c[46]);
t[7] = (c[7]+c[23]+c[39]) - (c[11]+c[27]+c[43]) - (c[15]+c[31]+c[47]);
t[8] = (c[16]-c[32]-c[48]) + (c[20]-c[36]-c[52]) + (c[24]-c[40]-c[56]);
t[9] = (c[17]-c[33]-c[49]) + (c[21]-c[37]-c[53]) + (c[25]-c[41]-c[57]);
t[10] = (c[18]-c[34]-c[50]) + (c[22]-c[38]-c[54]) + (c[26]-c[42]-c[58]);
t[11] = (c[19]-c[35]-c[51]) + (c[23]-c[39]-c[55]) + (c[27]-c[43]-c[59]);
It loads data into an array c, adds or subtracts the elements of c, and finally stores the results in the array t. Each element of c, such as c[0], holds 16 floats; the data type of each element in c and t is __m512, and the length of c is 64. The problem is that Xeon Phi has only 32 vector registers, so I cannot keep all 64 elements of c in registers. I need to load part of the data, compute the results, then load more; some data may have to be loaded several times to finish the whole calculation. I notice that some intermediate results in the code can be reused several times. The whole procedure is like a tree.
So the question is: is there an algorithm that can efficiently decide when to load data, perform computation, and store intermediate results?
Another question: I need to interleave computation and memory operations to achieve better performance. Any suggestions?
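Since every lane of a __m512 behaves identically under vertical add/subtract, the reuse structure of the expressions above can be illustrated with plain floats. Below is a minimal scalar sketch (the names `s`, `d`, and `compute_t` are mine, not from the original program) that hoists the shared three-term sums so each one is computed once and kept live instead of being recomputed per t[i]:

```cpp
#include <cassert>

// Scalar stand-in for one SIMD lane. Hoists the three-term sums that the
// unrolled expressions share: s[k] = c[k]+c[k+16]+c[k+32] feeds t[0..7],
// and d[k] = c[k+16]-c[k+32]-c[k+48] feeds t[8..11]. Each partial sum is
// computed once instead of up to three times.
static void compute_t(const float c[64], float t[12]) {
    float s[16], d[16];
    for (int k = 0; k < 16; ++k) {
        s[k] = c[k] + c[k + 16] + c[k + 32];
        d[k] = c[k + 16] - c[k + 32] - c[k + 48];
    }
    for (int i = 0; i < 4; ++i) {
        t[i]     = s[i]     + s[i + 4] + s[i + 8];   // t[0..3]
        t[i + 4] = s[i + 4] - s[i + 8] - s[i + 12];  // t[4..7]
        t[i + 8] = d[i]     + d[i + 4] + d[i + 8];   // t[8..11]
    }
}
```

In the real code each `float` would be an `__m512` and each `+`/`-` a `_mm512_add_ps`/`_mm512_sub_ps`; the point is only that the `s`/`d` temporaries, produced and consumed in small groups, need far fewer simultaneously live registers than all 64 elements of c.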
Hi Zhen Jia,
I think the most appropriate tool to answer your vectorization questions is the Vectorization Advisor, included in Intel Advisor XE.
The following link provides videos and tutorials that will answer most of your questions about vectorization and the usage of the Vectorization Advisor: https://software.intel.com/en-us/articles/vectorization-advisor-faq
Thanks Gaston! I will try.
Hi Zhen Jia,
I provided tools and documentation that I believe will be useful for optimizing not only the piece of code you posted but also future code. The tools rock, believe me; I've been using them for a long time, and once you start using them, there is no way back. :)
Great! Gaston, thanks so much!
By tools, do you mean the Vectorization Advisor?
Hi Zhen Jia,
Yes, I mean the Vectorization Advisor. In fact, it is a product with many tools included. Basically, you get advice on how to vectorize, and it makes your life much easier in such a complex topic.
I see. I will use it.
BTW, I forgot to mention a very important thing. You can download a free trial and check whether the tool is helpful for you. Here is the link: https://software.intel.com/en-us/intel-advisor-xe
Got it.
>>For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512. The length of c is 64.
The way I read this is you have
__m512 c[64];
__m512 t[16];
...
for(int i = 0; i < 4; ++i)
  _mm512_store_ps(&t[i],
    _mm512_add_ps(
      _mm512_add_ps(
        _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i]), _mm512_load_ps(&c[i+16])), _mm512_load_ps(&c[i+32])),
        _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+4]), _mm512_load_ps(&c[i+20])), _mm512_load_ps(&c[i+36]))),
      _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&c[i+8]), _mm512_load_ps(&c[i+24])), _mm512_load_ps(&c[i+40]))));
for(int i = 4; i < 8; ++i)
...
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes, exactly. That is what I meant.
Hi Sergey,
Yes, the question is what you described. On Xeon Phi, gather and scatter have long latency and low throughput, so I think they may not help much. The logic of the program is simple: just load, calculate, and store; there is no if-else. So will multiple loops help?
Sergey Kostrov wrote:
>>...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and
>>store the intermediate data?
A gather-like approach is possible in your case, because to calculate:
...
t[0] = (c[0]+c[16]+c[32]) + (c[4]+c[20]+c[36]) + (c[8]+c[24]+c[40]);
...
indices 0, 4, 8, 16, 20, 24, 32, 36, 40 need to be used to access elements of the array. I don't guarantee a performance boost, but there is no need to do what you want in one shot. A performance boost could be achieved by refactoring that code; for example, instead of one for-loop, three independent for-loops could be used. Take a look at the article "What to do when Auto Vectorization fails?":
https://software.intel.com/en-us/articles/what-to-do-when-auto-vectoriza...
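The "three independent for-loops" idea can be sketched in scalar form (one `float` stands in for each `__m512` lane; the helper name and loop grouping are my illustration, not from either post). Each group of four t's shares one index/sign pattern, so the unrolled statements collapse into three short, independent loops:

```cpp
#include <cassert>

// Re-rolls the unrolled t[0..11] statements into three loops, one per
// sign pattern. Scalar sketch: on Xeon Phi each float would be an __m512
// and each +/- an _mm512_add_ps/_mm512_sub_ps.
static void three_loops(const float c[64], float t[12]) {
    for (int i = 0; i < 4; ++i)        // t[0..3]: all additions
        t[i] = (c[i] + c[i+16] + c[i+32]) + (c[i+4] + c[i+20] + c[i+36])
             + (c[i+8] + c[i+24] + c[i+40]);
    for (int i = 0; i < 4; ++i)        // t[4..7]: outer subtractions
        t[i+4] = (c[i+4] + c[i+20] + c[i+36]) - (c[i+8] + c[i+24] + c[i+40])
               - (c[i+12] + c[i+28] + c[i+44]);
    for (int i = 0; i < 4; ++i)        // t[8..11]: inner subtractions
        t[i+8] = (c[i+16] - c[i+32] - c[i+48]) + (c[i+20] - c[i+36] - c[i+52])
               + (c[i+24] - c[i+40] - c[i+56]);
}
```

Each loop touches a bounded, regular slice of c, which gives the compiler (or a hand scheduler) a natural unit for deciding what to keep in registers between iterations.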
zhen jia,
You should be able to write operator overloads so that you can remove the _mm512_...'s.
IOW, use the original expressions as in your first post.
Not stated, but you may want to look at how you placed data into c[64]. Some elements are used once, others only twice. If you are NOT using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.
Jim Dempsey
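Jim's operator-overload suggestion can be sketched with a thin wrapper type. The struct below uses a plain `float` member as a stand-in so the sketch is testable anywhere; on Xeon Phi the member would be an `__m512` and the operator bodies would call `_mm512_add_ps` / `_mm512_sub_ps`. The type name `V` and helper `first_t` are mine:

```cpp
#include <cassert>

// Thin wrapper so arithmetic on vectors reads like the original scalar
// expressions. Stand-in member: on Xeon Phi, `float v` becomes `__m512 v`,
// `v + o.v` becomes `_mm512_add_ps(v, o.v)`, and likewise for operator-.
struct V {
    float v;
    V operator+(V o) const { return V{v + o.v}; }
    V operator-(V o) const { return V{v - o.v}; }
};

// With the overloads in place, the first post's expression compiles as-is:
static V first_t(const V c[64]) {
    return (c[0]+c[16]+c[32]) + (c[4]+c[20]+c[36]) + (c[8]+c[24]+c[40]);
}
```

The overloads cost nothing at runtime (they inline to the same intrinsics) while keeping the source readable and easy to refactor.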
Hi Jim Dempsey,
By using the source data (in place) that went into c, do you mean that I can make the length of c 48 (not 64) and reuse some of the elements?
It might be clearest to the readers here if you show the code that populates the array c.
Jim Dempsey
Hi Jim and Sergey,
The current computation part of the code looks like this. It relies on the compiler to optimize performance. I am wondering whether there is another method to further optimize the code.
#define SIMD_ADD _mm512_add_ps
#define SIMD_SUB _mm512_sub_ps
#define SIMD_LOAD _mm512_load_ps
#define SIMD_STORE _mm512_store_ps
t[0] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[0],c[16]), c[32]) , SIMD_ADD(SIMD_ADD(c[4],c[20]) ,c[36])) ,SIMD_ADD(SIMD_ADD(c[8] ,c[24]),c[40]));
t[1] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[1],c[17]), c[33]) , SIMD_ADD(SIMD_ADD(c[5],c[21]) ,c[37])) ,SIMD_ADD(SIMD_ADD(c[9] ,c[25]),c[41]));
t[2] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[2],c[18]), c[34]) , SIMD_ADD(SIMD_ADD(c[6],c[22]) ,c[38])) ,SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42]));
t[3] =SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[3],c[19]), c[35]) , SIMD_ADD(SIMD_ADD(c[7],c[23]) ,c[39])) ,SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43]));
t[4] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[4],c[20]), c[36]) , SIMD_ADD(SIMD_ADD(c[8],c[24]) ,c[40])), SIMD_ADD(SIMD_ADD(c[12],c[28]),c[44]));
t[5] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[5],c[21]), c[37]) , SIMD_ADD(SIMD_ADD(c[9],c[25]) ,c[41])), SIMD_ADD(SIMD_ADD(c[13],c[29]),c[45]));
t[6] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[6],c[22]), c[38]) , SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42])), SIMD_ADD(SIMD_ADD(c[14],c[30]),c[46]));
t[7] =SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[7],c[23]), c[39]) , SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43])), SIMD_ADD(SIMD_ADD(c[15],c[31]),c[47]));
t[8] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[16],c[32]), c[48]) , SIMD_SUB(SIMD_SUB(c[20],c[36]),c[52])) ,SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56]));
t[9] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[17],c[33]), c[49]) , SIMD_SUB(SIMD_SUB(c[21],c[37]),c[53])) ,SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57]));
t[10] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[18],c[34]), c[50]) , SIMD_SUB(SIMD_SUB(c[22],c[38]),c[54])) ,SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58]));
t[11] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[19],c[35]), c[51]) , SIMD_SUB(SIMD_SUB(c[23],c[39]),c[55])) ,SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59]));
t[12] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[20],c[36]), c[52]), SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56])), SIMD_SUB(SIMD_SUB(c[28],c[44]),c[60]));
t[13] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[21],c[37]), c[53]), SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57])), SIMD_SUB(SIMD_SUB(c[29],c[45]),c[61]));
t[14] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[22],c[38]), c[54]), SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58])), SIMD_SUB(SIMD_SUB(c[30],c[46]),c[62]));
t[15] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[23],c[39]), c[55]), SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59])), SIMD_SUB(SIMD_SUB(c[31],c[47]),c[63]));
m[0] = SIMD_ADD(t[0],SIMD_ADD(t[1],t[2]));
m[1] = SIMD_SUB(SIMD_SUB(t[1],t[2]),t[3]);
m[2] = SIMD_ADD(SIMD_ADD(t[4],t[5]),t[6]);
m[3] = SIMD_SUB(SIMD_SUB(t[5],t[6]),t[7]);
m[4] = SIMD_ADD(SIMD_ADD(t[8],t[9]),t[10]);
m[5] = SIMD_SUB(SIMD_SUB(t[9],t[10]),t[11]);
m[6] = SIMD_ADD(SIMD_ADD(t[12],t[13]),t[14]);
m[7] = SIMD_SUB(SIMD_SUB(t[13],t[14]),t[15]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[0]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[1]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[2]);
SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[3]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[4]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[5]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[6]);
SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[7]);
You did not answer the question: How did the data get placed into the array c[]?
And a related, equally important, question: How often is the data, or portions of data, in c[] reused without being updated?
The response may indicate whether the array c is needed at all. IOW, whether you should use the data directly in its prior locations.
Jim Dempsey
Hi Jim,
Thanks for the question. In this code, c[] is part of a tensor: c[] is a 4*4*4 tile, and each time c[] strides by 2 along each dimension.
If it is still not clear, please let me know. Thanks!
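If the 4*4*4 tile really does just slide by 2 along each dimension of a larger tensor, the copy into c[] could in principle be replaced by direct indexing into the source. The sketch below only illustrates that mapping; the dimension names `H`/`W`, the row-major layout, and the helper name are my assumptions, since the original packing code is not shown in the thread:

```cpp
#include <cassert>

// Maps c's linear index k (c[z*16 + y*4 + x] in a 4x4x4 tile) to the
// offset of the same element in a D x H x W row-major source tensor,
// for a tile whose origin (z0, y0, x0) advances in steps of 2.
// Assumed layout -- the real code's packing is not shown in the thread.
static int src_index(int z0, int y0, int x0, int k, int H, int W) {
    int z = k / 16;        // tile z coordinate, 0..3
    int y = (k / 4) % 4;   // tile y coordinate, 0..3
    int x = k % 4;         // tile x coordinate, 0..3
    return (z0 + z) * H * W + (y0 + y) * W + (x0 + x);
}
```

Because consecutive tiles overlap by half along each dimension, reading the source in place would also let overlapping data be reused across tiles instead of being re-packed into a fresh c[] each time.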
