topic How vtune compute bandwith? in Software Tuning, Performance Optimization & Platform Monitoring
https://community.intel.com/t5/Software-Tuning-Performance/How-vtune-compute-bandwith/m-p/1053487#M4957
<P>Hi, I am analyzing the simulated annealing program (canneal) from PARSEC. The program often accesses elem data randomly, so it performs poorly. I added a prefetch instruction for elem, and I am glad to see that the time of the multithreaded parallel region dropped from 31 seconds to 15 seconds, which is a good result. I only prefetch the data one iteration ahead, and I hoped to get more improvement, but adjusting the prefetch parameter did not give a much better result. So I suspected that prefetching one iteration ahead already uses up all the bandwidth. I checked with the VTune bandwidth analysis after adding the prefetching and found that the bandwidth increased only slightly, from 3.004 GB/s to 3.268 GB/s (for a single package). This result does not look right to me. Adding prefetching does not increase the amount of data loaded, and the time is halved from 31 s to 15 s; since bandwidth should equal DATA_SIZE/TIME, the bandwidth should have doubled. Does anyone have ideas about this result, or know how VTune computes bandwidth?</P>
<P>I ran an experiment with the following program:</P>
<P>#include &lt;stdio.h&gt;<BR />
#include &lt;unistd.h&gt;<BR />
/* Parenthesized so the macro expands safely inside expressions. */<BR />
#define SIZE (100*1024*1024)<BR />
double a[SIZE];<BR />
double b[SIZE];<BR />
double c[SIZE];<BR />
int<BR />
main(void)<BR />
{<BR />
int i;<BR />
// initialization: touch every element of all three arrays<BR />
for (i = 0; i &lt; SIZE; i++) {<BR />
a[i] = 0.;<BR />
b[i] = 0.;<BR />
c[i] = 0.;<BR />
}<BR />
sleep(2);</P>
<P>// contiguous access<BR />
for (i = 0; i &lt; SIZE; i++) {<BR />
a[i] = b[i] + c[i];<BR />
}<BR />
sleep(2);</P>
<P>// strided access: every 8th double, i.e. one element per 64 bytes<BR />
for (i = 0; i &lt; SIZE; i += 8) {<BR />
a[i] = b[i] + c[i];<BR />
}<BR />
sleep(2);<BR />
return 0;<BR />
}</P>
<P>VTune shows that the bandwidth for the contiguous-access part is 4.16 GB/s and its running time is 0.7 s; I compute a bandwidth of 4.46 GB/s (100MB*4*8/0.7). For the strided access VTune shows 8.3 GB/s and the running time is 0.4 s; I compute 7.8 GB/s (100MB*4*8/0.4). So my computation method seems to be almost right.</P>
<P>HUIZHAN_Y_, Thu, 25 Jun 2015 13:22:47 GMT</P>
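<P>The hand calculation above can be sketched as a small C helper. This is only an illustration of the arithmetic in the post, not of how VTune measures bandwidth. The function names are hypothetical. The factor of 4 is taken from the post; reading it as "read b, read c, read a for write-allocate, write a back" is an assumption. The strided loop is given the same traffic on the assumption of 64-byte cache lines, so that touching every 8th double still touches every line. The results match the post's figures when "GB" is read as 2^30 bytes.</P>

```c
#include <stddef.h>

/* The post's hand-computed figures match binary units (2^30 bytes). */
#define BYTES_PER_GIB (1024.0 * 1024.0 * 1024.0)

/* Estimated memory traffic for a[i] = b[i] + c[i] over n doubles.
 * streams = 4 is the factor used in the post; interpreting it as
 * read b, read c, read a (write-allocate) and write a is an assumption. */
static double estimated_traffic_bytes(size_t n, int streams)
{
    return (double)n * (double)sizeof(double) * (double)streams;
}

/* Average bandwidth in GiB/s for a given byte count and elapsed time. */
static double bandwidth_gib_per_s(double bytes, double seconds)
{
    return bytes / BYTES_PER_GIB / seconds;
}
```

<P>For example, bandwidth_gib_per_s(estimated_traffic_bytes((size_t)100*1024*1024, 4), 0.7) gives the contiguous-loop estimate, and the same call with 0.4 gives the strided-loop estimate.</P>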
https://community.intel.com/t5/Software-Tuning-Performance/How-vtune-compute-bandwith/m-p/1053488#M4958
<P>I understand that there seems to be a problem. I think the DATA_SIZE may have changed after adding the prefetching. With prefetching, the cache may hold more useful data, so data locality improves. Without prefetching, the same data may have to be loaded from memory multiple times, whereas with prefetching some of it can be reused from the cache. So the improved data locality leads to better performance, while the memory bandwidth does not increase significantly.</P>
<P>HUIZHAN_Y_, Thu, 25 Jun 2015 20:06:42 GMT</P>
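<P>A rough check of this hypothesis is to multiply each reported bandwidth by the corresponding run time, which gives the total traffic each run implies. The sketch below does that with the numbers from the first post. It assumes the VTune figure is an average over the same parallel region whose time was measured, which the thread does not confirm, and the function names are hypothetical.</P>

```c
/* Total traffic in GB implied by an average bandwidth over a time window. */
static double implied_traffic_gb(double avg_bandwidth_gb_per_s, double seconds)
{
    return avg_bandwidth_gb_per_s * seconds;
}

/* Bandwidth that would be seen if the same traffic moved in a different time. */
static double bandwidth_if_traffic_constant(double traffic_gb, double seconds)
{
    return traffic_gb / seconds;
}
```

<P>With the reported values, implied_traffic_gb(3.004, 31.0) is about 93 GB and implied_traffic_gb(3.268, 15.0) is about 49 GB, while bandwidth_if_traffic_constant(93.124, 15.0) is about 6.2 GB/s. Under the stated assumption, the measurements are consistent with the total traffic dropping to roughly half rather than the bandwidth doubling.</P>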