I am considering vectorizing an application on Xeon Phi. The calculation part of the program looks like this (only part of the code):
t[0] = (c[0]+c[16]+c[32]) + (c[4]+c[20]+c[36]) + (c[8]+c[24]+c[40]);
t[1] = (c[1]+c[17]+c[33]) + (c[5]+c[21]+c[37]) + (c[9]+c[25]+c[41]);
t[2] = (c[2]+c[18]+c[34]) + (c[6]+c[22]+c[38]) + (c[10]+c[26]+c[42]);
t[3] = (c[3]+c[19]+c[35]) + (c[7]+c[23]+c[39]) + (c[11]+c[27]+c[43]);
t[4] = (c[4]+c[20]+c[36]) - (c[8]+c[24]+c[40]) - (c[12]+c[28]+c[44]);
t[5] = (c[5]+c[21]+c[37]) - (c[9]+c[25]+c[41]) - (c[13]+c[29]+c[45]);
t[6] = (c[6]+c[22]+c[38]) - (c[10]+c[26]+c[42]) - (c[14]+c[30]+c[46]);
t[7] = (c[7]+c[23]+c[39]) - (c[11]+c[27]+c[43]) - (c[15]+c[31]+c[47]);
t[8] = (c[16]-c[32]-c[48]) + (c[20]-c[36]-c[52]) + (c[24]-c[40]-c[56]);
t[9] = (c[17]-c[33]-c[49]) + (c[21]-c[37]-c[53]) + (c[25]-c[41]-c[57]);
t[10] = (c[18]-c[34]-c[50]) + (c[22]-c[38]-c[54]) + (c[26]-c[42]-c[58]);
t[11] = (c[19]-c[35]-c[51]) + (c[23]-c[39]-c[55]) + (c[27]-c[43]-c[59]);
It loads data into an array c, adds or subtracts the elements of c, and finally stores the results in the array t. Each element of c, like c[0], holds 16 floats; the data type of each element in c and t is __m512, and the length of c is 64. The problem is that Xeon Phi has only 32 vector registers, so I cannot keep all 64 elements of c in registers. I need to load part of the data, compute, then load more, and I may need to load some data several times to finish the whole calculation. I notice that in the code some intermediate results can be reused several times; the whole procedure is like a tree.
So the question is: is there an algorithm that can efficiently decide when to load data, perform computation, and store intermediate results?
Another question: I need to interleave the computation and memory operations to achieve better performance. Any suggestions?
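As a hedged sketch (not the author's code): the snippet above already shares triple sums across outputs, e.g. (c[4]+c[20]+c[36]) and (c[8]+c[24]+c[40]) both appear in t[0] and t[4]. Factoring the shared sums into intermediates, with hypothetical names s[] and d[], computes each of them once instead of once per t[] entry that uses it:

#include <immintrin.h>

/* Sketch only: s[i] and d[i] are hypothetical intermediates that factor
   the triple sums shared by the t[] expressions above.
   s[i] = c[i]    + c[i+16] + c[i+32]   feeds t[0..7]
   d[i] = c[i+16] - c[i+32] - c[i+48]   feeds t[8..11]                    */
void partial_transform(const __m512 *c, __m512 *t)
{
    __m512 s[16], d[12];
    for (int i = 0; i < 16; i++)
        s[i] = _mm512_add_ps(_mm512_add_ps(c[i], c[i + 16]), c[i + 32]);
    for (int i = 0; i < 12; i++)
        d[i] = _mm512_sub_ps(_mm512_sub_ps(c[i + 16], c[i + 32]), c[i + 48]);

    for (int i = 0; i < 4; i++) {
        t[i]     = _mm512_add_ps(_mm512_add_ps(s[i], s[i + 4]), s[i + 8]);      /* t[0..3]  */
        t[i + 4] = _mm512_sub_ps(_mm512_sub_ps(s[i + 4], s[i + 8]), s[i + 12]); /* t[4..7]  */
        t[i + 8] = _mm512_add_ps(_mm512_add_ps(d[i], d[i + 4]), d[i + 8]);      /* t[8..11] */
    }
}

The s[]/d[] intermediates may still spill out of the 32 zmm registers into L1, but every c[] element is touched at most twice for this fragment, and the tree of reuse described above becomes explicit in the two stages.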
Hi Zhen Jia,
Do you mean you have something like:
float TensorArray[nTensorX][nTensorY][nTensorZ];
And then c[] receives a 4x4x4 section of the above array?
Then, for any given section, it is used once?
Jim Dempsey
Hi Jim,
Yes, that is what I mean. In each 4*4*4 section, the computation is just like the code given in the previous post. Some elements of c[] are used only once; some are used two or three times.
Thanks!
Let me be clear: in post #22, nTensorX, nTensorY, and nTensorZ could be numbers very much larger than 4, whereas c has a linear dimension equivalent to 4x4x4 (64 floats).
Meaning at some point in your code you do something like
int o = 0;
for (int z = 0; z < 4; z++)
  for (int y = 0; y < 4; y++)
    for (int x = 0; x < 4; x++)
      c[o++] = TensorArray[zBase+z][yBase+y][xBase+x];
Then use c[] as discussed above (used once per outer loop).
If that c (a subsection of TensorArray) is used only once before you reload c from a different subsection of TensorArray, then it may be more efficient to reference the data directly in TensorArray rather than copying it into c.
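A minimal sketch of that direct referencing, assuming the zBase/yBase/xBase offsets and the TensorArray declaration from the posts above (the C_AT macro and t0 are hypothetical names):

/* Hypothetical: map the linear c[] index (x innermost, then y, then z,
   as in the copy loop above) straight back into TensorArray, so the
   arithmetic reads TensorArray in place instead of through c[]. */
#define C_AT(o) TensorArray[zBase + ((o) >> 4)][yBase + (((o) >> 2) & 3)][xBase + ((o) & 3)]

/* e.g. the first output of the transform, written without the staging copy: */
float t0 = (C_AT(0) + C_AT(16) + C_AT(32))
         + (C_AT(4) + C_AT(20) + C_AT(36))
         + (C_AT(8) + C_AT(24) + C_AT(40));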
Jim Dempsey
Hi Jim,
The reason I load the data into c[] is that I want to use SIMD operations, e.g., additions. If I just reference the data when I need it, will the compiler turn it into a load instruction that loads the data into a zmm register? If so, I think the two methods should be the same.
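For illustration only, a sketch that is not from the thread and that simplifies the layout so that 16 floats happen to be contiguous: both forms below put the data into a zmm register through a vector load; the direct form just skips the extra store to and reload from c[].

#include <immintrin.h>

__m512 c[64];                                 /* staging array from the original code     */
float  src[16] __attribute__((aligned(64)));  /* stand-in for 16 contiguous TensorArray
                                                 floats -- a simplified, assumed layout   */

__m512 via_copy(__m512 acc)
{
    c[0] = _mm512_load_ps(src);               /* stage through c[] first                  */
    return _mm512_add_ps(acc, c[0]);          /* then use the copy                        */
}

__m512 direct(__m512 acc)
{
    return _mm512_add_ps(acc, _mm512_load_ps(src));  /* the load still targets a zmm,     */
                                                     /* with no trip through c[]          */
}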
Zhen
>>If I just reference the data when I need it, will the compiler turn it into a load instruction that loads the data into a zmm register?
This depends on two things: how the data is organized in the TensorArray, and the access patterns. At a minimum, you would get 4-wide vectors. An alternate approach would be a gather that references no more than 4 cache lines (the same overhead as the fetch portion of the loops in #24).
T1 Read equivalent of c[] from Tensor array
T2 Write to c[]
T3 Read from c[]
T4 16-wide SIMD use of c[] using entries of c[] once or twice
T5 16-wide SIMD write to output
versus
T1' gather 16-wide vectors as required (T1' == T1)
(no T2', no T3')
T4' 16-wide SIMD use of gathered elements of Tensor array (T4' == T4)
T5' 16-wide write to output
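A rough sketch of the T1'/T4' path, under assumptions not stated in the thread (row-major float storage indexed [z][y][x], illustrative sizes NX/NY/NZ, and one 4x4 x-y tile per gather):

#include <immintrin.h>

enum { NX = 128, NY = 128, NZ = 128 };        /* illustrative sizes only              */
float TensorArray[NZ][NY][NX];                /* assumed row-major, indexed [z][y][x] */

/* Gather one 4x4 x-y tile (z fixed) straight out of TensorArray into a
   single __m512 -- the T1' step, with no staging copy through c[].
   When each 4-float row sits inside one cache line, the gather touches
   about 4 cache lines, i.e. the overhead discussed above. */
static __m512 gather_tile_xy(int zBase, int yBase, int xBase)
{
    int idx[16] __attribute__((aligned(64)));
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            idx[y * 4 + x] = (zBase * NY + (yBase + y)) * NX + (xBase + x);

    __m512i vindex = _mm512_load_epi32(idx);
    return _mm512_i32gather_ps(vindex, &TensorArray[0][0][0], 4 /* bytes per float */);
}

T4' then uses the gathered vectors directly in the additions and subtractions, so there is no T2'/T3' write and re-read of c[].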
Jim Dempsey